Run Lizzy
This guide shows the most common ways to run Lizzy 7B from Hugging Face. Use the BF16 checkpoint when you want full precision, Transformers/vLLM serving, tensor parallelism, or fine-tuning. Use GGUF when you have a local runtime that supports the Lizzy GGUF architecture and want smaller files or local inference.
Before choosing a runtime, check Hardware Requirements for memory, disk, and long-context planning guidance.
Run with Transformers
Use Python 3.10 or later and install PyTorch, Transformers 5.x, accelerate,
jinja2, and protobuf. The device_map="auto" setting in the loading example
relies on accelerate, and Transformers 5.x provides the TokenizersBackend
tokenizer class referenced by the Lizzy tokenizer metadata:
pip install "transformers>=5,<6" torch accelerate jinja2 protobuf
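Before loading the model, it can help to confirm that the environment matches the requirements above. The following is a minimal sketch using only the standard library; the helper names (python_ok, transformers_ok, missing_packages) are hypothetical and not part of any Lizzy tooling.

```python
import sys
from importlib import metadata

# Packages named in the pip command above.
REQUIRED = ("torch", "transformers", "accelerate", "jinja2", "protobuf")


def python_ok(version_info=sys.version_info):
    """Return True for Python 3.10 or later."""
    return tuple(version_info[:2]) >= (3, 10)


def transformers_ok(version):
    """Return True only for a Transformers 5.x version string."""
    return version.split(".")[0] == "5"


def missing_packages(names=REQUIRED):
    """Return the names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def report():
    """Print a short preflight summary (assumed output format)."""
    print("Python >= 3.10:", python_ok())
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) or "none")
    if "transformers" not in missing:
        print("Transformers 5.x:", transformers_ok(metadata.version("transformers")))
```

Calling report() prints one line per check. It only inspects installed package metadata and does not download or load the model.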
Then load the base checkpoint with trust_remote_code=True:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flwrlabs/Lizzy-7B"

# Both loaders need trust_remote_code=True for the Lizzy checkpoint.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are Lizzy 7B."},
    {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
]

# Render the chat template as text, ending with the assistant generation prompt.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the output repeatable.
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
)

# Skip the prompt tokens so that only the newly generated text is decoded.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
This path was tested on macOS with Python 3.13, torch==2.12.0,
transformers==5.9.0, and the BF16 checkpoint. AutoTokenizer,
chat-template rendering, and default cached generation returned the expected
short response.
Run with vLLM
vLLM exposes an OpenAI-compatible API for GPU serving. Use this path on a Linux GPU server or another environment supported by your vLLM version:
pip install vllm
vllm serve "flwrlabs/Lizzy-7B" --trust-remote-code
Then create request.json:
{
  "model": "flwrlabs/Lizzy-7B",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
Then call the server:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data @request.json
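The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server from the vllm serve command is listening on localhost:8000; the helper names (build_chat_request, post_chat_completion, first_reply) are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


def build_chat_request(model, user_message, system_message=None):
    """Build the JSON body for /v1/chat/completions."""
    messages = []
    if system_message is not None:
        messages.append({"role": "system", "content": system_message})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages}


def post_chat_completion(payload, base_url="http://localhost:8000", timeout=60):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def first_reply(completion):
    """Extract the assistant text from an OpenAI-style completion dict."""
    return completion["choices"][0]["message"]["content"]
```

With the server running, print(first_reply(post_chat_completion(build_chat_request("flwrlabs/Lizzy-7B", "What is the capital of France?")))) prints the model's answer.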
This path was smoke-tested with vllm==0.21.0 and
torch==2.11.0+cu130 on single NVIDIA A40 and H100 GPUs. The
OpenAI-compatible server responded on H100, and in-process generation worked
on H100 up to max_model_len=32768 in short tests.
Tensor-parallel in-process generation also passed with tensor_parallel_size=2
on two H100 GPUs and tensor_parallel_size=4 on four A40 GPUs after Lizzy’s
attention reshape logic was updated to use local tensor-parallel head counts.
tensor_parallel_size=8 and OpenAI-compatible serving with tensor
parallelism were not completed in this test pass, so validate the exact vLLM
server configuration before relying on multi-GPU serving in production. Runtime
support can vary by vLLM version, GPU generation, and backend.
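The in-process runs described above could look roughly like the following sketch. It assumes vLLM's LLM and SamplingParams classes; engine_kwargs and generate_once are hypothetical helper names, and the sampling values are illustrative and not the settings used in the smoke tests.

```python
MODEL_ID = "flwrlabs/Lizzy-7B"


def engine_kwargs(tensor_parallel_size=1, max_model_len=32768):
    """Collect the engine options discussed above into one dict."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return {
        "model": MODEL_ID,
        "trust_remote_code": True,
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": max_model_len,
    }


def generate_once(user_message, max_tokens=64, **overrides):
    """Load the model in-process and answer a single chat message."""
    # Imported lazily so the helper above works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(**overrides))
    # Greedy decoding for a repeatable short test (an assumption).
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.chat(
        [{"role": "user", "content": user_message}],
        sampling_params=params,
    )
    return outputs[0].outputs[0].text
```

For example, generate_once("Reply with exactly: ok", tensor_parallel_size=2) would match the two-GPU configuration mentioned above. Each call reloads the model, so this is only suitable for smoke tests.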
Run the GGUF model with llama.cpp
The GGUF release provides quantized files for local runtimes that support the
Lizzy GGUF architecture. The Hugging Face quick start uses the Q4_K_M
variant, and the commands below do the same.
They build the currently tested Lizzy-compatible llama.cpp fork and branch,
then start llama-server. Once upstream llama.cpp or a packaged runtime includes
support for general.architecture = lizzy, use that instead:
git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-completion llama-server
./build/bin/llama-server \
  -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
  -c 1024
If your runtime reports unknown model architecture: 'lizzy', it does not
include Lizzy GGUF support. Use a Lizzy-compatible build or run the BF16
checkpoint with Transformers instead. See Troubleshooting for details
from smoke tests.
For direct terminal completion:
./build/bin/llama-completion \
  -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
  -p "Q: 2+2? A:" \
  -n 16 \
  -c 1024
In constrained macOS, virtualized, or sandboxed environments, Metal device initialization can fail. To force a small CPU-only smoke test, use:
./build/bin/llama-completion \
  -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
  -p "Q: 2+2? A:" \
  -n 16 \
  -c 1024 \
  -t 2 \
  -dev none \
  -ngl 0 \
  --no-op-offload
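To script these smoke tests, one option is to run llama-completion from Python and check its output for the architecture error quoted above. This is a minimal sketch: the flags are copied from the commands in this section, while the function names, the timeout, and the returned status labels are hypothetical.

```python
import subprocess

GGUF_REPO = "flwrlabs/Lizzy-7B-GGUF:Q4_K_M"
UNKNOWN_ARCH_MARKER = "unknown model architecture: 'lizzy'"


def smoke_test_command(binary="./build/bin/llama-completion", cpu_only=False):
    """Return the argv list for the short completion test."""
    command = [binary, "-hf", GGUF_REPO, "-p", "Q: 2+2? A:", "-n", "16", "-c", "1024"]
    if cpu_only:
        # CPU-only variant for environments where Metal initialization fails.
        command += ["-t", "2", "-dev", "none", "-ngl", "0", "--no-op-offload"]
    return command


def lacks_lizzy_support(log_text):
    """Return True if the runtime reported the unknown-architecture error."""
    return UNKNOWN_ARCH_MARKER in log_text


def run_smoke_test(cpu_only=False, timeout=600):
    """Run the test and return one of three assumed status labels."""
    result = subprocess.run(
        smoke_test_command(cpu_only=cpu_only),
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,
        timeout=timeout,
    )
    log_text = result.stdout + result.stderr
    if lacks_lizzy_support(log_text):
        return "unsupported-architecture"
    return "ok" if result.returncode == 0 else "failed"
```

A result of "unsupported-architecture" means the build does not include Lizzy GGUF support, so switch to a Lizzy-compatible build or to the BF16 checkpoint with Transformers.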
Run the GGUF model with llama-cpp-python
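This path requires a llama-cpp-python build linked against a llama.cpp version
that supports the Lizzy GGUF architecture. It may also require
llama-cpp-python support for Lizzy’s embedded chat template. If initialization
fails with an unknown Jinja generation tag, use the llama.cpp server path
or see Troubleshooting. The example below works around that failure by removing
the {% generation %} and {% endgeneration %} tags from the template before
llama-cpp-python parses it: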
import llama_cpp.llama_chat_format as chat_format
from llama_cpp import Llama

# Keep a reference to the original constructor so the patch can delegate to it.
original_init = chat_format.Jinja2ChatFormatter.__init__


def lizzy_chat_template_init(
    self,
    template,
    eos_token,
    bos_token,
    add_generation_prompt=True,
    stop_token_ids=None,
):
    # Remove the generation tags that the Jinja parser does not recognize.
    template = template.replace("{% generation %}", "")
    template = template.replace("{% endgeneration %}", "")
    return original_init(
        self,
        template,
        eos_token,
        bos_token,
        add_generation_prompt,
        stop_token_ids,
    )


# Apply the patch before creating the Llama instance.
chat_format.Jinja2ChatFormatter.__init__ = lizzy_chat_template_init

llm = Llama(
    model_path="/path/to/lizzy-7b-q4_k_m.gguf",
    n_ctx=1024,
    n_gpu_layers=-1,
)
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "Reply with exactly: ok"},
    ],
    max_tokens=16,
    temperature=0,
)
print(output["choices"][0]["message"]["content"])
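The tag removal in the patch can be tried on its own, without loading a model. The sketch below is a standalone variant that uses a regular expression so it also matches whitespace-control forms such as {%- generation -%}. Whether Lizzy's template uses those forms is an assumption, and strip_generation_tags is a hypothetical helper name.

```python
import re

# Matches {% generation %} and {% endgeneration %}, with optional "-" markers.
_GENERATION_TAG = re.compile(r"\{%-?\s*(?:end)?generation\s*-?%\}")


def strip_generation_tags(template: str) -> str:
    """Return the template with generation and endgeneration tags removed."""
    return _GENERATION_TAG.sub("", template)


example = (
    "{% for m in messages %}"
    "{% generation %}{{ m.content }}{% endgeneration %}"
    "{% endfor %}"
)
print(strip_generation_tags(example))
```

Other Jinja tags are left untouched. Removing a tag that carries a whitespace-control marker can change the surrounding whitespace in the rendered prompt, so compare the rendered output against the Transformers chat template if exact formatting matters.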
Run the GGUF model with Ollama
Ollama support is work in progress. Current public Ollama builds may not yet include backend support for the Lizzy GGUF architecture. Until that support is available, prefer the Lizzy-compatible llama.cpp path above. With an Ollama build that does support the architecture, run:
ollama run hf.co/flwrlabs/Lizzy-7B-GGUF:Q4_K_M
Desktop GGUF apps such as LM Studio and Jan are also work in progress for
Lizzy. To use Lizzy in a desktop app, first confirm that the app version ships
with a llama.cpp backend that recognizes general.architecture = lizzy. If
the app lets you choose a custom backend, point it at a Lizzy-compatible
llama.cpp build and import one of the flwrlabs/Lizzy-7B-GGUF
quantizations. If it reports unknown model architecture: 'lizzy', update
the app backend or use llama.cpp directly.
Choose a runtime
| Runtime | Use it when |
|---|---|
| Transformers | You need Python integration, full precision, custom model code, or fine-tuning workflows. |
| vLLM | You need GPU serving, OpenAI-compatible APIs, batching, or tensor parallelism. |
| llama.cpp | You have a build that supports Lizzy GGUF and need local inference, CPU support, flexible GPU offload, or a small deployment footprint. |
| Ollama, LM Studio, Jan | Work in progress. Use only after verifying that the app backend includes Lizzy GGUF support. |
Generation settings
The GGUF examples use temperature=0.6 and top_p=0.95.
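For documentation, coding, or extraction tasks that need more repeatable
output, start with a lower temperature such as 0.2. For more conversational
output, increase the temperature gradually and evaluate outputs for factuality
and style. The smoke tests in this guide that set do_sample=False or
temperature=0 use greedy decoding, which makes short checks repeatable.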
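One way to keep these settings consistent across requests is a small preset table that is merged into each chat-completion payload. This is a minimal sketch: the preset names and the with_sampling helper are hypothetical, and keeping top_p at 0.95 for the lower-temperature preset is an assumption.

```python
# Sampling presets based on the values mentioned above.
PRESETS = {
    "default": {"temperature": 0.6, "top_p": 0.95},
    # Lower temperature for documentation, coding, or extraction tasks.
    "precise": {"temperature": 0.2, "top_p": 0.95},
}


def with_sampling(payload, preset="default", **overrides):
    """Return a copy of the payload with sampling settings applied."""
    params = dict(PRESETS[preset])
    params.update(overrides)
    return {**payload, **params}


request = with_sampling(
    {"model": "flwrlabs/Lizzy-7B-GGUF", "messages": [{"role": "user", "content": "Hi"}]},
    preset="precise",
)
print(request["temperature"], request["top_p"])
```

Keyword overrides take precedence over the preset, so with_sampling(payload, temperature=0.8) raises the temperature for a single request without changing the table. The original payload is not modified.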