Lizzy GGUF

flwrlabs/Lizzy-7B-GGUF contains quantized GGUF versions of Lizzy 7B for efficient local inference. Use these files when you want CPU inference, flexible GPU offload, smaller model files, or fast loading with a runtime that supports the Lizzy GGUF architecture.

Available variants

Variant

Reported file size

Reported quality retention

Recommended use

Q4_K_M

4.2 GB

92%

Resource-constrained environments

Q5_K_M

4.8 GB

95%

Best balance of quality and size

Q6_K

5.6 GB

97%

Between Q5 and Q8

Q8_0

7.2 GB

99%

Near-lossless compression

f16

13.6 GB

100%

Maximum quality and benchmarking

The Hugging Face file browser can show slightly different sizes because rounded sizes and file-browser metadata use different views.

All listed GGUF files were smoke-tested with the Lizzy-compatible relogu/llama.cpp branch at commit 991a41b using CPU-only inference, n_ctx=512, and a short prompt. The quantized files loaded through the llama.cpp Hugging Face shortcut using Q4_K_M, Q5_K_M, Q6_K, and Q8_0 selectors. The f16 file, lizzy-final.gguf, was tested by downloading the file directly and loading it with -m.

Architecture details

The GGUF release reports:

  • base model: Lizzy 7B

  • layers: 32 with post-norm architecture

  • hidden size: 4096

  • attention: sliding window 4096 plus full attention

  • RoPE: YaRN scaling with factor 8.0 and original context 8192

  • vocabulary: 100,278 tokens

  • context: 65,536 tokens

  • tensors: 355, including attn_post_norm and ffn_post_norm

Reasoning behaviour

Lizzy 7B can emit reasoning tokens before the final answer. Applications that expose model output directly should decide whether to show, hide, or post-process reasoning traces according to product and safety requirements.

When to use GGUF

Use GGUF when:

  • you need CPU inference

  • you want flexible GPU layer offload

  • you need smaller model files

  • you are using a local runtime that supports the Lizzy GGUF architecture

  • fast local loading matters

Use the original BF16 Safetensors checkpoint when:

  • you need full precision

  • you are using Transformers or vLLM

  • you need serving features not available in your GGUF runtime

  • you want to fine-tune the model

Example

The example below requires a llama.cpp build that supports the Lizzy GGUF architecture. The lorenzo-dev branch of relogu/llama.cpp is the currently tested compatibility branch for Lizzy. Use upstream llama.cpp or a packaged runtime instead once it includes support for general.architecture = lizzy. If another runtime reports unknown model architecture: 'lizzy', see Troubleshooting.

git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-server
./build/bin/llama-server \
  -hf flwrlabs/Lizzy-7B-GGUF:Q5_K_M \
  -c 1024

Then use the llama.cpp web UI or OpenAI-compatible local API exposed by llama-server.