Run Lizzy

This guide shows the most common ways to run Lizzy 7B from Hugging Face. Use the BF16 checkpoint when you want full precision, Transformers/vLLM serving, tensor parallelism, or fine-tuning. Use GGUF when you have a local runtime that supports the Lizzy GGUF architecture and want smaller files or local inference.

Before choosing a runtime, check Hardware Requirements for memory, disk, and long-context planning guidance.

Run with Transformers

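Use Python 3.10 or later and install PyTorch, Transformers 5.x, Accelerate, jinja2, and protobuf. Accelerate is required for the device_map="auto" placement used in the loading example, and Transformers 5.x includes the TokenizersBackend tokenizer class used by the Lizzy tokenizer metadata: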

pip install "transformers>=5,<6" torch accelerate jinja2 protobuf

Then load the base checkpoint with trust_remote_code=True:

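import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flwrlabs/Lizzy-7B"

# trust_remote_code=True allows Transformers to run the custom code shipped in the repo.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    dtype=torch.bfloat16,  # replaces the deprecated torch_dtype argument
    device_map="auto",  # automatic device placement; requires accelerate
)

messages = [
    {"role": "system", "content": "You are Lizzy 7B."},
    {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
]
# Render the conversation to a prompt string that ends with the assistant turn.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=False selects greedy decoding, so repeated runs give the same output.
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
)
# Decode only the newly generated tokens, skipping the prompt tokens.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1] :],
    skip_special_tokens=True,
)
print(response)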

This path was tested on macOS with Python 3.13, torch==2.12.0, transformers==5.9.0, and the BF16 checkpoint. AutoTokenizer, chat-template rendering, and default cached generation returned the expected short response.

Run with vLLM

vLLM exposes an OpenAI-compatible API for GPU serving. Use this path on a Linux GPU server or another environment supported by your vLLM version:

pip install vllm
vllm serve "flwrlabs/Lizzy-7B" --trust-remote-code

Then create request.json:

{
  "model": "flwrlabs/Lizzy-7B",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}

Then call the server:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data @request.json
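
The same request can also be sent from Python. The block below is a minimal sketch using only the standard library; it assumes the server started by vllm serve is listening on localhost:8000 as in the curl example. The helper names build_chat_request and extract_reply are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(
    prompt,
    model="flwrlabs/Lizzy-7B",
    base_url="http://localhost:8000",  # assumed address of the vLLM server
):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a parsed chat completions response."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, timeout=60):
    """Send one prompt to the running server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

Once the server is up, print(ask("What is the capital of France?")) performs the same call as the curl command.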

This path was smoke-tested with vllm==0.21.0 and torch==2.11.0+cu130 on single NVIDIA A40 and H100 GPUs. The OpenAI-compatible server responded on H100, and in-process generation worked on H100 up to max_model_len=32768 in short tests.

Tensor-parallel in-process generation also passed with tensor_parallel_size=2 on two H100 GPUs and tensor_parallel_size=4 on four A40 GPUs after Lizzy’s attention reshape logic was updated to use local tensor-parallel head counts. tensor_parallel_size=8 and OpenAI-compatible serving with tensor parallelism were not completed in this test pass, so validate the exact vLLM server configuration before relying on multi-GPU serving in production. Runtime support can vary by vLLM version, GPU generation, and backend.
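
For reference, in-process tensor-parallel generation of the kind described above might look like the following sketch. It is not the exact script used in the test pass: the helper names are hypothetical, the default values simply mirror the figures mentioned in the text, and vLLM is imported lazily so the module can be loaded on a machine without GPUs.

```python
def build_engine_kwargs(tensor_parallel_size=2, max_model_len=32768):
    """Collect the vLLM engine arguments used for a tensor-parallel run."""
    return {
        "model": "flwrlabs/Lizzy-7B",
        "trust_remote_code": True,
        "tensor_parallel_size": tensor_parallel_size,  # one shard per GPU
        "max_model_len": max_model_len,
    }


def generate_once(prompt, **engine_overrides):
    """Load the model in-process and return one greedy chat completion."""
    # Imported here because vLLM needs a supported GPU environment.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs(**engine_overrides))
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.chat(
        [{"role": "user", "content": prompt}],
        sampling_params=params,
    )
    return outputs[0].outputs[0].text
```

When tensor_parallel_size is greater than 1, vLLM starts worker processes, so call generate_once from inside an if __name__ == "__main__": guard in a script rather than at import time.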

Run the GGUF model with llama.cpp

The GGUF release provides quantized files for local runtimes that support the Lizzy GGUF architecture. The Hugging Face quick start uses the Q4_K_M variant.

The commands below clone the currently tested Lizzy-compatible llama.cpp fork and branch, build llama-completion and llama-server, and start the server. Use upstream llama.cpp or a packaged runtime instead once it includes support for general.architecture = lizzy:

git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-completion llama-server
./build/bin/llama-server \
  -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
  -c 1024

If your runtime reports unknown model architecture: 'lizzy', it does not include Lizzy GGUF support. Use a Lizzy-compatible build or run the BF16 checkpoint with Transformers instead. See Troubleshooting for details from smoke tests.

For direct terminal completion:

./build/bin/llama-completion \
  -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
  -p "Q: 2+2? A:" \
  -n 16 \
  -c 1024

In constrained macOS, virtualized, or sandboxed environments, Metal device initialization can fail. To force a small CPU-only smoke test, use:

./build/bin/llama-completion \
  -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
  -p "Q: 2+2? A:" \
  -n 16 \
  -c 1024 \
  -t 2 \
  -dev none \
  -ngl 0 \
  --no-op-offload
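
The check for the unknown model architecture message described earlier in this section can be scripted. The sketch below wraps the CPU-only llama-completion command in subprocess and classifies the result; the function names and the three result labels are illustrative, and the binary path assumes the build directory used above.

```python
import subprocess

# Message reported by runtimes without Lizzy GGUF support.
UNSUPPORTED_MARKER = "unknown model architecture: 'lizzy'"


def classify_run(returncode, log_text):
    """Map an exit code and combined log output to a coarse result label."""
    if UNSUPPORTED_MARKER in log_text:
        return "unsupported-architecture"
    return "ok" if returncode == 0 else "failed"


def smoke_test(
    binary="./build/bin/llama-completion",
    model="flwrlabs/Lizzy-7B-GGUF:Q4_K_M",
    timeout=600,
):
    """Run the small CPU-only completion and classify the outcome."""
    command = [
        binary, "-hf", model,
        "-p", "Q: 2+2? A:",
        "-n", "16", "-c", "1024",
        "-t", "2", "-dev", "none", "-ngl", "0", "--no-op-offload",
    ]
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,  # never wait for interactive input
        timeout=timeout,
        check=False,
    )
    return classify_run(result.returncode, result.stdout + result.stderr)
```

A result of "unsupported-architecture" means the build lacks Lizzy support, while "failed" points to some other problem such as a download or memory error.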

Run the GGUF model with llama-cpp-python

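This path requires a llama-cpp-python build linked against a llama.cpp version that supports the Lizzy GGUF architecture. It may also require llama-cpp-python support for Lizzy’s embedded chat template. If initialization fails with an unknown Jinja generation tag, use the llama.cpp server path or see Troubleshooting. The example below works around that error by patching Jinja2ChatFormatter so that the {% generation %} and {% endgeneration %} tags are stripped before the template is compiled: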

import llama_cpp.llama_chat_format as chat_format
from llama_cpp import Llama

# Keep a reference to the original constructor so the patch can delegate to it.
original_init = chat_format.Jinja2ChatFormatter.__init__

def lizzy_chat_template_init(
    self,
    template,
    eos_token,
    bos_token,
    add_generation_prompt=True,
    stop_token_ids=None,
):
    # Remove the generation-marker tags that the formatter cannot parse.
    template = template.replace("{% generation %}", "")
    template = template.replace("{% endgeneration %}", "")
    return original_init(
        self,
        template,
        eos_token,
        bos_token,
        add_generation_prompt,
        stop_token_ids,
    )

# Install the patch before creating Llama, which parses the embedded template.
chat_format.Jinja2ChatFormatter.__init__ = lizzy_chat_template_init

llm = Llama(
    model_path="/path/to/lizzy-7b-q4_k_m.gguf",
    n_ctx=1024,
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU only
)

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Reply with exactly: ok"},
    ],
    max_tokens=16,
    temperature=0,  # greedy decoding for a repeatable smoke test
)
print(output["choices"][0]["message"]["content"])

Run the GGUF model with Ollama

Ollama support is work in progress. Current public Ollama builds may not yet include backend support for the Lizzy GGUF architecture. Until that support is available, prefer the Lizzy-compatible llama.cpp path above. On a build that does include it, the Q4_K_M quantization can be pulled and run directly from Hugging Face:

ollama run hf.co/flwrlabs/Lizzy-7B-GGUF:Q4_K_M

Desktop GGUF apps such as LM Studio and Jan are also work in progress for Lizzy. To use Lizzy in a desktop app, first confirm that the app version ships with a llama.cpp backend that recognizes general.architecture = lizzy. If the app lets you choose a custom backend, point it at a Lizzy-compatible llama.cpp build and import one of the flwrlabs/Lizzy-7B-GGUF quantizations. If it reports unknown model architecture: 'lizzy', update the app backend or use llama.cpp directly.
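
Before importing a quantization into a desktop app, it can help to confirm what the file itself declares as general.architecture. The sketch below reads that key from a GGUF header using only the standard library. It assumes a little-endian GGUF file of version 2 or later (64-bit counts and string lengths); the function names are illustrative and not part of any llama.cpp tooling.

```python
import io
import struct

# GGUF metadata value type ids mapped to struct formats for fixed-size scalars.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read and unpack one fixed-size value."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """Read a length-prefixed UTF-8 string."""
    length = _read(stream, "<Q")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of GGUF header")
    return data.decode("utf-8")


def _read_value(stream, value_type):
    """Read one metadata value, recursing into arrays."""
    if value_type in _SCALARS:
        return _read(stream, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(stream)
    if value_type == _ARRAY:
        item_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type: {value_type}")


def read_gguf_architecture(stream):
    """Return the general.architecture value, or None if the key is absent."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(stream, "<I")  # format version (unused here)
    _read(stream, "<Q")  # tensor count (unused here)
    kv_count = _read(stream, "<Q")
    for _ in range(kv_count):
        key = _read_string(stream)
        value = _read_value(stream, _read(stream, "<I"))
        if key == "general.architecture":
            return value
    return None
```

To use it, open the downloaded file in binary mode, for example with open("/path/to/lizzy-7b-q4_k_m.gguf", "rb") as f, and pass f to read_gguf_architecture. A value of lizzy confirms the file is a Lizzy GGUF; whether the app can load it still depends on its llama.cpp backend.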

Choose a runtime

| Runtime | Use it when |
| --- | --- |
| Transformers | You need Python integration, full precision, custom model code, or fine-tuning workflows. |
| vLLM | You need GPU serving, OpenAI-compatible APIs, batching, or tensor parallelism. |
| llama.cpp | You have a build that supports Lizzy GGUF and need local inference, CPU support, flexible GPU offload, or a small deployment footprint. |
| Ollama, LM Studio, Jan | Work in progress. Use only after verifying that the app backend includes Lizzy GGUF support. |

Generation settings

The GGUF examples use temperature=0.6 and top_p=0.95. For documentation, coding, or extraction tasks that need more repeatable output, start with a lower temperature such as 0.2, or use temperature=0 for greedy decoding. For more conversational output, increase the temperature gradually and evaluate outputs for factuality and style.
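
With Transformers, these settings translate into keyword arguments for model.generate. The helper below is a small sketch of that mapping; generation_kwargs is a hypothetical name, and the defaults mirror the temperature and top_p values quoted above.

```python
def generation_kwargs(temperature=0.6, top_p=0.95, max_new_tokens=512):
    """Build model.generate keyword arguments for a given temperature."""
    if temperature <= 0:
        # Temperature 0 means greedy decoding, so sampling is switched off.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    return {
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
    }
```

For example, model.generate(**inputs, **generation_kwargs(temperature=0.2)) samples conservatively, while generation_kwargs(temperature=0) reproduces the greedy setting.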