Lizzy 7B
========

Lizzy 7B is an open-weight Flower Labs language model. It is designed for
general assistant use, reasoning, coding assistance, and UK-oriented language
and knowledge.

The model is available on Hugging Face in two formats:

- ``flwrlabs/Lizzy-7B``: the original BF16 Safetensors checkpoint for
  Transformers, vLLM, SGLang, and other GPU-serving stacks.
- ``flwrlabs/Lizzy-7B-GGUF``: GGUF quantizations for local inference with
  runtimes that support the Lizzy GGUF architecture. The ``lorenzo-dev``
  branch of ``relogu/llama.cpp`` has been smoke-tested with the Q4_K_M file.

At a glance
-----------

.. list-table::
   :header-rows: 1

   * - Property
     - Value
   * - Publisher
     - Flower Labs
   * - Model family
     - Lizzy
   * - Parameter scale
     - 7B-class
   * - Architecture
     - Decoder-only transformer
   * - Context length
     - Up to 65,536 tokens, depending on runtime and serving profile
   * - Primary language
     - English, with British English and UK-oriented behaviour enhancements
   * - Original checkpoint
     - BF16 Safetensors
   * - Quantized checkpoint
     - GGUF variants including Q4_K_M, Q5_K_M, Q6_K, Q8_0, and f16
   * - License
     - Apache-2.0 for the base model; GGUF redistribution terms refer back to
       the base model license

Architecture and configuration
------------------------------

Lizzy 7B is a 32-layer decoder-only transformer with long-context support. The
release uses 32 attention heads, sliding/local attention behaviour, custom
chat/control tokens, and deployment-specific serving configuration.

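
To make the sliding/local attention idea concrete, the toy example below builds
a causal sliding-window mask in plain Python. It is a minimal sketch, not
Lizzy's implementation: the names ``sliding_window_causal_mask`` and ``render``
are hypothetical, the sizes are deliberately tiny, and the convention that the
window includes the current token is an assumption made for illustration.

.. code-block:: python

   def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
       """Build a toy attention mask.

       mask[i][j] is True when query position i may attend to key position j.
       Assumed convention: the window counts the current token, so position i
       sees positions i - window + 1 through i (clipped at 0).
       """
       if seq_len < 0 or window < 1:
           raise ValueError("seq_len must be >= 0 and window must be >= 1")
       return [
           # Causal (j <= i) and local (distance below the window size).
           [j <= i and i - j < window for j in range(seq_len)]
           for i in range(seq_len)
       ]


   def render(mask: list[list[bool]]) -> str:
       """Render the mask as rows of 'x' (visible) and '.' (masked)."""
       return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


   if __name__ == "__main__":
       # Illustrative sizes only; real deployments use far larger values.
       print(render(sliding_window_causal_mask(seq_len=6, window=3)))

Each row of the rendered mask is one query position. A full-attention layer
would instead keep every ``j <= i`` entry visible, which corresponds to setting
``window`` to at least ``seq_len`` in this sketch.
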
The GGUF release reports the following serving-oriented configuration:

- 32 layers with post-norm architecture
- hidden size 4096
- sliding-window attention with a 4096-token window plus full-attention
  behaviour
- YaRN RoPE scaling with factor 8.0 and original context 8192
- 100,278-token vocabulary
- 65,536-token context

Training approach
-----------------

Lizzy 7B was produced through a multi-stage training process:

- pre-training on large-scale public text, document, code, math, and
  encyclopedic corpora
- supervised fine-tuning on instruction-following, dialogue, reasoning, and
  tool-use examples
- direct preference optimisation for helpfulness, style, and answer quality
- reinforcement learning with verifiable rewards for targeted behavioural
  refinement

Training data sources include broad public text and knowledge sources,
instruction and preference data, and UK-specific examples and preference
signals.

Evaluation highlights
---------------------

The release compares Lizzy 7B with EuroLLM 9B and Apertus 8B on UK-oriented
benchmarks and broader public benchmarks.

.. list-table::
   :header-rows: 1

   * - Benchmark
     - Lizzy 7B
     - EuroLLM 9B
     - Apertus 8B
   * - Britishness MCQ
     - 71.0
     - 77.6
     - 80.8
   * - Britishness CoT
     - 80.1
     - 72.1
     - 31.7
   * - Britishness Domains
     - 89.9
     - 69.0
     - 32.6

.. list-table::
   :header-rows: 1

   * - Benchmark
     - Lizzy 7B
     - EuroLLM 9B
     - Apertus 8B
   * - MATH
     - 77.9
     - 31.3
     - 22.4
   * - MMLU
     - 67.9
     - 57.4
     - 63.4
   * - GPQA
     - 34.6
     - 26.8
     - 28.1
   * - HumanEvalPlus
     - 70.2
     - 28.2
     - 33.4
   * - MBPP+
     - 52.5
     - 41.7
     - 42.3
   * - LiveCodeBench v3
     - 39.1
     - 6.3
     - 8.5
   * - AIME
     - 35.8
     - 0.2
     - 0.6
   * - GSM8K
     - 91.8
     - 64.7
     - 64.7

Lizzy 7B trails the comparison set on Britishness MCQ recall-style probing, but
leads on Britishness CoT, Britishness domain reasoning, and most listed
reasoning, math, knowledge, and coding benchmarks.

Safety and limitations
----------------------

Lizzy 7B should be treated as an assistant model that can make mistakes.
It can produce incorrect, outdated, or over-confident responses, and
higher-risk workflows require human oversight, domain review, and downstream
moderation. The UK-oriented tuning improves local style and cultural alignment,
but it can also bias tone and assumptions toward UK conventions.

The safety-evaluation summary reports:

.. list-table::
   :header-rows: 1

   * - Safety benchmark
     - Metric
     - Score
   * - Overall safety average
     - ``overall_safety_average``
     - 66.7%
   * - WildGuardTest
     - ``inverted_micro_harm_lower``
     - 91.9%
   * - HarmBench
     - ``inverted_micro_asr_lower``
     - 57.5%
   * - ToxiGen (tiny)
     - ``safe_overall``
     - 90.2%
   * - XSTest
     - ``overall_accuracy``
     - 85.6%
   * - StrongReject (logprobs)
     - ``inverted_asr``
     - 78.8%
   * - BBQ
     - ``accuracy``
     - 66.5%
   * - WMDP
     - ``inverted_accuracy``
     - 47.5%

Next steps
----------

- To run Lizzy locally or on a GPU server, see :doc:`how-to-run-lizzy`.
- To plan hardware and memory for Lizzy variants, see
  :doc:`hardware-requirements`.
- To understand the training method, see
  :doc:`lizzy-training-and-evaluation`.
- To choose a quantized local-runtime model, see :doc:`lizzy-gguf`.
- For product details, see the Lizzy model page.
- For Flower Labs research, see Flower Research.
- For enterprise deployments and custom work, see :doc:`enterprise`.

Lizzy 7B pages
--------------

.. toctree::
   :maxdepth: 1

   hardware-requirements