Hardware Requirements¶

Lizzy 7B hardware needs depend on the model format, runtime, context length, and batching. Treat the numbers below as planning guidance, then validate on the exact runtime and prompt lengths you will use in production.

Recommended starting points¶

Use case	Practical starting point	Notes
Transformers BF16, short prompts	24 GB GPU memory	The BF16 weights are about 14 GB before runtime overhead and KV cache.
Transformers BF16, long context	40 GB or more GPU memory	Long prompts can add tens of GB of KV cache.
vLLM serving	24 GB or more GPU memory	Single-GPU A40/H100 smoke tests passed. Tensor-parallel in-process generation passed on 2x H100 and 4x A40; validate serving separately.
GGUF Q4_K_M	8 GB unified memory or RAM minimum, 16 GB recommended	Requires a runtime that supports the Lizzy GGUF architecture.
GGUF Q5_K_M or Q6_K	16 GB unified memory or RAM recommended	Use these when quality matters and local memory is available.
GGUF Q8_0	24 GB unified memory or RAM recommended	Near-lossless quantization needs more memory headroom.
GGUF f16	32 GB unified memory or RAM recommended	Use for local quality checks only when memory is ample.

Disk space¶

Plan for the model file, cache duplication, and temporary download files. The GGUF repository includes variants from about 4.5 GB for Q4_K_M to about 14.6 GB for the f16 file. The BF16 Safetensors checkpoint requires more disk space than the GGUF quantizations.

For a comfortable local setup, keep at least 2x the selected model size free. This leaves room for the Hugging Face cache, partial downloads, and runtime metadata.

Context length and KV cache¶

Long context increases memory use even after the model weights fit. Lizzy uses 32 layers, hidden size 4096, and 32 attention heads. With BF16 or FP16 KV cache, a rough upper bound is about 0.5 MB per token for batch size 1.

Approximate KV cache sizes:

Context length	KV cache estimate
4,096 tokens	About 2 GB
8,192 tokens	About 4 GB
32,768 tokens	About 16 GB
65,536 tokens	About 32 GB

Batching and concurrent requests multiply KV cache use. If a runtime uses KV cache quantization or paged attention, memory use can be lower, but validate the actual configuration before relying on it.

CPU and memory¶

For GGUF local inference, prefer a modern CPU with high memory bandwidth. Apple Silicon machines can use unified memory effectively when the runtime supports the model architecture and can access the Metal device. In virtualized, sandboxed, or otherwise constrained macOS environments, start with a CPU-only smoke test before enabling GPU offload. On CPU-only systems, generation speed depends heavily on memory bandwidth, quantization level, thread count, and context length.

On an Apple M3 Ultra Mac Studio with the Lizzy-compatible llama.cpp branch, the Q4_K_M GGUF file offloaded all 33 layers to Metal and completed short completion and server tests. Treat these numbers as a capability check rather than a throughput guarantee for other prompts, quantizations, or machines.

For GPU serving, prefer CUDA-capable Linux systems for vLLM. Transformers can also run on other accelerator backends, but deployment behaviour and memory headroom should be tested per environment.

In short H100 vLLM smoke tests with vllm==0.21.0 and gpu_memory_utilization=0.72, Lizzy loaded at max_model_len values up to 32768. Tensor-parallel in-process generation passed with tensor_parallel_size=2 on H100 and tensor_parallel_size=4 on A40. tensor_parallel_size=8 was not tested because a full 8-GPU allocation was not available. OpenAI-compatible vLLM serving with tensor parallelism should still be validated separately before production use.

Runtime support¶

Hardware is necessary but not sufficient for the GGUF files. The runtime must also support general.architecture = lizzy. If loading fails with an architecture error, see Troubleshooting.