Lizzy 7B
Lizzy 7B is an open-weight Flower Labs language model. It is designed for general assistant use, reasoning, coding assistance, and UK-oriented language and knowledge. The model is available on Hugging Face in two formats:
- flwrlabs/Lizzy-7B: the original BF16 Safetensors checkpoint for Transformers, vLLM, SGLang, and other GPU-serving stacks.
- flwrlabs/Lizzy-7B-GGUF: GGUF quantizations for local inference with runtimes that support the Lizzy GGUF architecture. The lorenzo-dev branch of relogu/llama.cpp has been smoke-tested with the Q4_K_M file.
At a glance

| Property | Value |
|---|---|
| Publisher | Flower Labs |
| Model family | Lizzy |
| Parameter scale | 7B-class |
| Architecture | Decoder-only transformer |
| Context length | Up to 65,536 tokens, depending on runtime and serving profile |
| Primary language | English, with British English and UK-oriented behaviour enhancements |
| Original checkpoint | BF16 Safetensors |
| Quantized checkpoint | GGUF variants including Q4_K_M, Q5_K_M, Q6_K, Q8_0, and f16 |
| License | Apache-2.0 for the base model; GGUF redistribution terms refer back to the base model license |
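
As a rough guide to how the listed checkpoint formats differ in size, the sketch below estimates weight storage only. It rests on two assumptions that are not stated in the release: the "7B-class" scale is taken as exactly 7 billion parameters, and the GGUF bits-per-weight figures are approximate averages commonly quoted for llama.cpp quantization types. Actual file sizes will differ, and the estimate ignores the KV cache and runtime overhead.

```python
# Approximate bits per weight for each format. BF16 and f16 are exactly 16 bits.
# The GGUF figures are rough averages (assumption), not values from the release.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "f16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}

# Assumption: "7B-class" is treated as 7 billion parameters for illustration.
ASSUMED_PARAMS = 7.0e9


def weight_gib(fmt: str, n_params: float = ASSUMED_PARAMS) -> float:
    """Estimate weight storage in GiB, ignoring KV cache and runtime overhead."""
    total_bits = BITS_PER_WEIGHT[fmt] * n_params
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>7}: ~{weight_gib(fmt):.1f} GiB")
```

Under these assumptions the BF16 checkpoint needs roughly 13 GiB for weights alone, and each lower-bit GGUF variant needs proportionally less.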
Architecture and configuration
Lizzy 7B is a 32-layer decoder-only transformer with long-context support. The release uses 32 attention heads, sliding/local attention behaviour, custom chat/control tokens, and deployment-specific serving configuration.
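
To illustrate what sliding/local attention means in general, the sketch below builds a causal sliding-window mask for a toy sequence. It is not taken from the Lizzy implementation: the function name is hypothetical, the window here is 3 tokens for readability, and the convention that the window includes the current token is an assumption that varies between implementations.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query position q may attend to key k.

    Causal: a query never sees future keys (k <= q).
    Local: a query sees only the most recent `window` positions,
    counting itself (q - k < window). This convention is an assumption.
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed query/key pair, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row is one query position. Early rows see everything before them; later rows see only the last three positions, which is what keeps the attention cost per token bounded as the sequence grows.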
The GGUF release reports the following serving-oriented configuration:
- 32 layers with post-norm architecture
- hidden size 4096
- sliding-window attention with a 4096-token window plus full-attention behaviour
- YaRN RoPE scaling with factor 8.0 and original context 8192
- 100,278-token vocabulary
- 65,536-token context
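
The reported values above fit together arithmetically, which the following sketch checks. The class and field names are hypothetical and do not correspond to keys in any Lizzy configuration file. The head count comes from the preceding paragraph rather than the GGUF list, and the embedding figure counts a single vocabulary-by-hidden matrix without assuming whether input and output embeddings are tied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedConfig:
    """Values quoted in this section; field names are illustrative only."""
    n_layers: int = 32
    n_heads: int = 32
    hidden_size: int = 4096
    sliding_window: int = 4096
    rope_scaling_factor: float = 8.0
    original_context: int = 8192
    vocab_size: int = 100_278
    context_length: int = 65_536


cfg = ReportedConfig()

# Per-head dimension implied by the hidden size and head count.
head_dim = cfg.hidden_size // cfg.n_heads

# YaRN extends the original context by the scaling factor.
extended_context = int(cfg.original_context * cfg.rope_scaling_factor)

# Size of one token-embedding matrix (vocabulary x hidden size).
embedding_params = cfg.vocab_size * cfg.hidden_size

# Number of non-overlapping sliding windows that fit in the full context.
windows_per_context = cfg.context_length // cfg.sliding_window

print(f"head_dim            = {head_dim}")
print(f"extended_context    = {extended_context}")
print(f"embedding_params    = {embedding_params:,}")
print(f"windows_per_context = {windows_per_context}")
```

The YaRN factor of 8.0 applied to the original 8192-token context gives exactly the 65,536-token context reported for the release.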
Training approach
Lizzy 7B was produced through a multi-stage training process:
- pre-training on large-scale public text, document, code, math, and encyclopedic corpora
- supervised fine-tuning on instruction-following, dialogue, reasoning, and tool-use examples
- direct preference optimisation for helpfulness, style, and answer quality
- reinforcement learning with verifiable rewards for targeted behavioural refinement
Training data sources include broad public text and knowledge sources, instruction and preference data, and UK-specific examples and preference signals.
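
For readers unfamiliar with the last two stages, the sketch below shows the standard direct preference optimisation loss for a single preference pair and a toy verifiable reward. It is a generic illustration, not the Lizzy training code: the function names, the beta value, and the "Answer:" response format are all assumptions made for the example.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: illustrative value, not from the release
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable softplus form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def verifiable_reward(response: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer matches exactly.

    Assumes (hypothetically) that responses end with 'Answer: <value>'.
    """
    marker = "Answer:"
    if marker not in response:
        return 0.0
    predicted = response.rsplit(marker, 1)[1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


# A policy identical to the reference has no preference margin: loss = ln 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# A policy that favours the chosen response more than the reference does
# yields a lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
print(verifiable_reward("17 + 25 = 42. Answer: 42", "42"))
```

The preference loss needs only log-probabilities from two models, while the verifiable reward needs a task whose answers can be checked programmatically, such as math or code.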
Evaluation highlights
The release compares Lizzy 7B with EuroLLM 9B and Apertus 8B on UK-oriented benchmarks and broader public benchmarks.
| Benchmark | Lizzy 7B | EuroLLM 9B | Apertus 8B |
|---|---|---|---|
| Britishness MCQ | 71.0 | 77.6 | 80.8 |
| Britishness CoT | 80.1 | 72.1 | 31.7 |
| Britishness Domains | 89.9 | 69.0 | 32.6 |

| Benchmark | Lizzy 7B | EuroLLM 9B | Apertus 8B |
|---|---|---|---|
| MATH | 77.9 | 31.3 | 22.4 |
| MMLU | 67.9 | 57.4 | 63.4 |
| GPQA | 34.6 | 26.8 | 28.1 |
| HumanEvalPlus | 70.2 | 28.2 | 33.4 |
| MBPP+ | 52.5 | 41.7 | 42.3 |
| LiveCodeBench v3 | 39.1 | 6.3 | 8.5 |
| AIME | 35.8 | 0.2 | 0.6 |
| GSM8K | 91.8 | 64.7 | 64.7 |
Lizzy 7B trails both comparison models on Britishness MCQ, a recall-style probe (71.0 versus 77.6 and 80.8), but leads on Britishness CoT, Britishness Domains, and every listed reasoning, math, knowledge, and coding benchmark.
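
The comparison can be checked directly from the two tables. The sketch below copies the scores as listed and computes, for each benchmark, Lizzy 7B's margin over the better of the two comparison models. The variable and function names are illustrative.

```python
# Scores copied from the tables: (Lizzy 7B, EuroLLM 9B, Apertus 8B).
SCORES = {
    "Britishness MCQ": (71.0, 77.6, 80.8),
    "Britishness CoT": (80.1, 72.1, 31.7),
    "Britishness Domains": (89.9, 69.0, 32.6),
    "MATH": (77.9, 31.3, 22.4),
    "MMLU": (67.9, 57.4, 63.4),
    "GPQA": (34.6, 26.8, 28.1),
    "HumanEvalPlus": (70.2, 28.2, 33.4),
    "MBPP+": (52.5, 41.7, 42.3),
    "LiveCodeBench v3": (39.1, 6.3, 8.5),
    "AIME": (35.8, 0.2, 0.6),
    "GSM8K": (91.8, 64.7, 64.7),
}


def margin_over_best_baseline(row: tuple[float, float, float]) -> float:
    """Lizzy's score minus the best comparison score, in points."""
    lizzy, *baselines = row
    return round(lizzy - max(baselines), 1)


margins = {name: margin_over_best_baseline(row) for name, row in SCORES.items()}
leads = [name for name, m in margins.items() if m > 0]
trails = [name for name, m in margins.items() if m < 0]

for name, m in margins.items():
    print(f"{name:<20} {m:+.1f}")
print(f"Leads on {len(leads)} of {len(SCORES)} benchmarks; trails on: {trails}")
```

Britishness MCQ is the only benchmark with a negative margin; the narrowest lead is on MMLU and the widest on MATH.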
Safety and limitations
Lizzy 7B should be treated as an assistant model that can make mistakes. It can produce incorrect, outdated, or over-confident responses, and higher-risk workflows require human oversight, domain review, and downstream moderation. The UK-oriented tuning improves local style and cultural alignment, but it can also bias tone and assumptions toward UK conventions.
The safety-evaluation summary reports:
| Safety benchmark | Metric | Score |
|---|---|---|
| Overall safety average | | 66.7% |
| WildGuardTest | | 91.9% |
| HarmBench | | 57.5% |
| ToxiGen (tiny) | | 90.2% |
| XSTest | | 85.6% |
| StrongReject (logprobs) | | 78.8% |
| BBQ | | 66.5% |
| WMDP | | 47.5% |
Next steps

- To run Lizzy locally or on a GPU server, see Run Lizzy.
- To plan hardware and memory for Lizzy variants, see Hardware Requirements.
- To understand the training method, see Lizzy Training and Evaluation.
- To choose a quantized local-runtime model, see Lizzy GGUF.
- For product details, see the Lizzy model page.
- For Flower Labs research, see Flower Research.
- For enterprise deployments and custom work, see Enterprise.