Lizzy Training and Evaluation

Lizzy 7B was produced through a multi-stage training and evaluation process. This page gives a high-level summary of the method and reported evaluation areas without depending on non-public implementation details.

Training method

The release uses four broad stages:

Pre-training

Large-scale public text, document, code, math, and encyclopedic corpora.

Supervised fine-tuning

Instruction-following, dialogue, reasoning, and tool-use examples.

Direct preference optimisation

Preference pairs used to improve helpfulness, style, and answer quality.
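
As an illustration of how preference pairs drive this stage, the following is a minimal sketch of the standard DPO loss for a single pair. It assumes the summed log-probabilities of each response under the trained policy and a frozen reference model are already available. The function names and the `beta` value are illustrative assumptions, not settings from the Lizzy release.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the release does not state one
) -> float:
    """Standard DPO loss for one (chosen, rejected) preference pair."""
    # How much more likely the policy makes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy favours the chosen response more strongly.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy matches the reference on both responses, the loss equals log 2. It falls below that as the policy shifts probability towards the chosen response and away from the rejected one.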

Reinforcement learning with verifiable rewards

Targeted behavioural refinement with verifiable reward signals.
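
To make "verifiable reward" concrete, here is a minimal sketch of one common form: a programmatic check of a final numeric answer against a known reference. The `Answer:` output convention and both function names are hypothetical. The release does not describe which reward checks were actually used.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the number after a trailing 'Answer:' marker, if present."""
    # Hypothetical convention: the completion ends with "Answer: <number>".
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    return match.group(1) if match else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        # Answers that cannot be parsed earn no reward.
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0
```

Because the reward comes from a deterministic check rather than a learned preference model, it can be recomputed and audited for any completion.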

Data mix

The release describes a mix of:

  • broad public text and knowledge sources

  • instruction and preference data

  • UK-specific examples and preference signals

The full training data mix is not redistributed with the public checkpoints.

Evaluation

The release reports UK-oriented benchmarks, general knowledge and reasoning benchmarks, coding benchmarks, math benchmarks, instruction-following evaluations, and safety evaluations.

The reported comparison set includes EuroLLM 9B and Apertus 8B. Within that local baseline set, Lizzy 7B leads on most of the reported reasoning, math, knowledge, and coding rows, but trails on the recall-style Britishness MCQ benchmark.

Safety evaluation

The release reports a task-level safety summary across WildGuardTest, HarmBench, ToxiGen, XSTest, StrongReject, BBQ, and WMDP. These numbers are evaluation signals, not guarantees. Production deployments should include human oversight where appropriate, policy checks, monitoring, and downstream moderation.

Release and serving workflow

The GGUF release provides llama.cpp-compatible files for local deployment. The BF16 Safetensors release is the target for full-precision GPU serving, fine-tuning, and Transformers or vLLM workflows.
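
The two paths can be sketched as follows. This is a minimal sketch that assumes the GGUF file runs with the stock `llama-cli` binary from llama.cpp and that the BF16 checkpoint loads through the standard Transformers `Auto` classes. Neither assumption is confirmed here. The helper names are hypothetical, and the caller must supply real file paths or checkpoint locations.

```python
from typing import List


def llama_cpp_command(gguf_path: str, prompt: str, max_tokens: int = 128) -> List[str]:
    """Build a llama.cpp CLI invocation for a local GGUF file."""
    # -m: model file, -p: prompt, -n: number of tokens to generate.
    return ["llama-cli", "-m", gguf_path, "-p", prompt, "-n", str(max_tokens)]


def generate_with_transformers(checkpoint: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Load a BF16 Safetensors checkpoint and generate a completion."""
    # Imported lazily so this module loads without torch or transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Assumes the checkpoint is compatible with AutoModelForCausalLM.
    model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

The command list returned by `llama_cpp_command` can be passed to `subprocess.run` to run the model locally, while `generate_with_transformers` suits a GPU host with the full-precision weights.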