Lizzy GGUF¶

flwrlabs/Lizzy-7B-GGUF contains quantized GGUF versions of Lizzy 7B for efficient local inference. Use these files when you want CPU inference, flexible GPU offload, smaller model files, or fast loading with a runtime that supports the Lizzy GGUF architecture.

Available variants¶

Variant	Reported file size	Reported quality retention	Recommended use
Q4_K_M	4.2 GB	92%	Resource-constrained environments
Q5_K_M	4.8 GB	95%	Best balance of quality and size
Q6_K	5.6 GB	97%	Between Q5 and Q8
Q8_0	7.2 GB	99%	Near-lossless compression
f16	13.6 GB	100%	Maximum quality and benchmarking

The Hugging Face file browser can show slightly different sizes because rounded sizes and file-browser metadata use different views.

All listed GGUF files were smoke-tested with the Lizzy-compatible relogu/llama.cpp branch at commit 991a41b using CPU-only inference, n_ctx=512, and a short prompt. The quantized files loaded through the llama.cpp Hugging Face shortcut using Q4_K_M, Q5_K_M, Q6_K, and Q8_0 selectors. The f16 file, lizzy-final.gguf, was tested by downloading the file directly and loading it with -m.

Recommended default¶

Start with Q5_K_M when quality matters and you still want a compact local model. Use Q4_K_M when memory or disk is tight. Use Q8_0 or f16 for quality-sensitive benchmarking or when local resources are less constrained. See Hardware Requirements for local memory and disk planning guidance.

Architecture details¶

The GGUF release reports:

base model: Lizzy 7B
layers: 32 with post-norm architecture
hidden size: 4096
attention: sliding window 4096 plus full attention
RoPE: YaRN scaling with factor 8.0 and original context 8192
vocabulary: 100,278 tokens
context: 65,536 tokens
tensors: 355, including attn_post_norm and ffn_post_norm

Reasoning behaviour¶

Lizzy 7B can emit reasoning tokens before the final answer. Applications that expose model output directly should decide whether to show, hide, or post-process reasoning traces according to product and safety requirements.

When to use GGUF¶

Use GGUF when:

you need CPU inference
you want flexible GPU layer offload
you need smaller model files
you are using a local runtime that supports the Lizzy GGUF architecture
fast local loading matters

Use the original BF16 Safetensors checkpoint when:

you need full precision
you are using Transformers or vLLM
you need serving features not available in your GGUF runtime
you want to fine-tune the model

Example¶

The example below requires a llama.cpp build that supports the Lizzy GGUF architecture. The lorenzo-dev branch of relogu/llama.cpp is the currently tested compatibility branch for Lizzy. Use upstream llama.cpp or a packaged runtime instead once it includes support for general.architecture = lizzy. If another runtime reports unknown model architecture: 'lizzy', see Troubleshooting.

git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-server
./build/bin/llama-server \
  -hf flwrlabs/Lizzy-7B-GGUF:Q5_K_M \
  -c 1024

Then use the llama.cpp web UI or OpenAI-compatible local API exposed by llama-server.