Lizzy 7B

Lizzy 7B is an open-weight Flower Labs language model. It is designed for general assistant use, reasoning, coding assistance, and UK-oriented language and knowledge. The model is available on Hugging Face in two formats:

  • flwrlabs/Lizzy-7B: the original BF16 Safetensors checkpoint for Transformers, vLLM, SGLang, and other GPU-serving stacks.

  • flwrlabs/Lizzy-7B-GGUF: GGUF quantizations for local inference with runtimes that support the Lizzy GGUF architecture. The lorenzo-dev branch of relogu/llama.cpp has been smoke-tested with the Q4_K_M file.

At a glance

Property

Value

Publisher

Flower Labs

Model family

Lizzy

Parameter scale

7B-class

Architecture

Decoder-only transformer

Context length

Up to 65,536 tokens, depending on runtime and serving profile

Primary language

English, with British English and UK-oriented behaviour enhancements

Original checkpoint

BF16 Safetensors

Quantized checkpoint

GGUF variants including Q4_K_M, Q5_K_M, Q6_K, Q8_0, and f16

License

Apache-2.0 for the base model; GGUF redistribution terms refer back to the base model license

Architecture and configuration

Lizzy 7B is a 32-layer decoder-only transformer with long-context support. The release uses 32 attention heads, sliding/local attention behaviour, custom chat/control tokens, and deployment-specific serving configuration.

The GGUF release reports the following serving-oriented configuration:

  • 32 layers with post-norm architecture

  • hidden size 4096

  • sliding-window attention with a 4096-token window plus full-attention behaviour

  • YaRN RoPE scaling with factor 8.0 and original context 8192

  • 100,278-token vocabulary

  • 65,536-token context

Training approach

Lizzy 7B was produced through a multi-stage training process:

  • pre-training on large-scale public text, document, code, math, and encyclopedic corpora

  • supervised fine-tuning on instruction-following, dialogue, reasoning, and tool-use examples

  • direct preference optimisation for helpfulness, style, and answer quality

  • reinforcement learning with verifiable rewards for targeted behavioural refinement

Training data sources include broad public text and knowledge sources, instruction and preference data, and UK-specific examples and preference signals.

Evaluation highlights

The release compares Lizzy 7B with EuroLLM 9B and Apertus 8B on UK-oriented benchmarks and broader public benchmarks.

Benchmark

Lizzy 7B

EuroLLM 9B

Apertus 8B

Britishness MCQ

71.0

77.6

80.8

Britishness CoT

80.1

72.1

31.7

Britishness Domains

89.9

69.0

32.6

Benchmark

Lizzy 7B

EuroLLM 9B

Apertus 8B

MATH

77.9

31.3

22.4

MMLU

67.9

57.4

63.4

GPQA

34.6

26.8

28.1

HumanEvalPlus

70.2

28.2

33.4

MBPP+

52.5

41.7

42.3

LiveCodeBench v3

39.1

6.3

8.5

AIME

35.8

0.2

0.6

GSM8K

91.8

64.7

64.7

Lizzy 7B trails the comparison set on Britishness MCQ recall-style probing, but leads on Britishness CoT, Britishness domain reasoning, and most listed reasoning, math, knowledge, and coding benchmarks.

Safety and limitations

Lizzy 7B should be treated as an assistant model that can make mistakes. It can produce incorrect, outdated, or over-confident responses, and higher-risk workflows require human oversight, domain review, and downstream moderation. The UK-oriented tuning improves local style and cultural alignment, but it can also bias tone and assumptions toward UK conventions.

The safety-evaluation summary reports:

Safety benchmark

Metric

Score

Overall safety average

overall_safety_average

66.7%

WildGuardTest

inverted_micro_harm_lower

91.9%

HarmBench

inverted_micro_asr_lower

57.5%

ToxiGen (tiny)

safe_overall

90.2%

XSTest

overall_accuracy

85.6%

StrongReject (logprobs)

inverted_asr

78.8%

BBQ

accuracy

66.5%

WMDP

inverted_accuracy

47.5%

Next steps

Lizzy 7B pages