Troubleshooting
Use this page when a documented runtime does not start cleanly or returns a load-time error.
GGUF runtime reports unknown architecture
Lizzy GGUF files report the model architecture as lizzy. Local GGUF
runtimes must understand that architecture before they can load the model.
If you see an error like this, the runtime does not currently support the Lizzy GGUF architecture:
unknown model architecture: 'lizzy'
The following smoke tests on macOS used the Q4_K_M GGUF file. The table
shows what happened with the tested runtime and what may become supported when
the runtime is built from, or linked against, a Lizzy-compatible llama.cpp
fork:
| Runtime | Tested path | Tested result | With the Lizzy llama.cpp fork |
|---|---|---|---|
| Stock llama.cpp 9330 | | Failed at model load with | Use |
| llama-cpp-python 0.3.23 | | Failed at model load with | A source build of the released |
| Ollama 0.23.4 | | Downloaded the model, then returned | Work in progress. Ollama needs a backend that includes Lizzy architecture support. This was not retested with a custom Ollama build. |
| Desktop GGUF apps | LM Studio, Jan, and similar apps | Work in progress; not smoke-tested in this pass. | Check the app’s llama.cpp backend version before importing Lizzy. The backend must recognize |
The lorenzo-dev branch of relogu/llama.cpp was smoke-tested at commit
991a41b with lizzy-7b-q4_k_m.gguf. It completed llama-completion runs,
-hf loading, Metal offload, CPU-only inference, and OpenAI-compatible
llama-server inference. On an Apple M3 Ultra Mac Studio, the Q4_K_M file
offloaded all 33 layers to Metal and generated at about 100 tokens per second
in a short completion test.
If your runtime fails with the error above, switch to a Lizzy-compatible GGUF
runtime or run the BF16 checkpoint with Transformers. If you maintain a local
runtime, verify support for general.architecture = lizzy before using the GGUF
files in production.
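To confirm what a given file declares before handing it to a runtime, the architecture string can be read straight from the GGUF header. The following is a minimal sketch, not part of the Lizzy tooling: it assumes a little-endian GGUF version 2 or 3 file (64-bit counts and string lengths), and the function name `gguf_architecture` is hypothetical.

```python
import struct

# Byte sizes of the fixed-width GGUF metadata value types
# (uint8, int8, uint16, int16, uint32, int32, float32, bool, uint64, int64, float64).
_SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f):
    # GGUF strings are a uint64 length followed by UTF-8 bytes.
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _read_or_skip_value(f, value_type):
    # Return string values; skip over everything else.
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        item_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _read_or_skip_value(f, item_type)
        return None
    f.seek(_SCALAR_SIZES[value_type], 1)
    return None


def gguf_architecture(path):
    """Return the general.architecture value, or None if the key is absent."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        # Assumption: GGUF v2+ layout (uint32 version, uint64 tensor and KV counts).
        _version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
        for _ in range(kv_count):
            key = _read_string(f)
            (value_type,) = struct.unpack("<I", f.read(4))
            value = _read_or_skip_value(f, value_type)
            if key == "general.architecture":
                return value
    return None
```

Calling `gguf_architecture("/path/to/lizzy-7b-q4_k_m.gguf")` should return the architecture name the runtime will be asked to load; whether the runtime accepts it still has to be tested with the runtime itself.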
To test the Lizzy-compatible llama.cpp branch directly:
git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-completion
./build/bin/llama-completion \
-m /path/to/lizzy-7b-q4_k_m.gguf \
-p "Q: 2+2? A:" \
-n 16 \
-c 1024
Metal fails to initialize on macOS
In constrained, virtualized, or sandboxed macOS environments, llama.cpp can fail before inference with an error such as:
ggml_metal_init: error: failed to create command queue
This is an environment or device-access issue, not an unsupported model architecture. Metal offload was verified on a real Apple M3 Ultra Mac Studio using the Lizzy-compatible llama.cpp branch. For a low-impact CPU-only smoke test, disable device offload and use a short context:
./build/bin/llama-completion \
-m /path/to/lizzy-7b-q4_k_m.gguf \
-p "Q: 2+2? A:" \
-n 16 \
-c 1024 \
-t 2 \
-dev none \
-ngl 0 \
--no-op-offload
Hugging Face shortcut does not use the expected cache
The llama.cpp -hf shortcut was verified with the Lizzy-compatible llama.cpp
branch on a real macOS host. It can read from its own cache as well as the
Hugging Face cache. If you are testing carefully and want to avoid unexpected
downloads or writes to your home directory, set temporary cache locations:
HF_HOME=/tmp/lizzy-hf-cache \
LLAMA_CACHE=/tmp/lizzy-llama-cache \
./build/bin/llama-completion \
-hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
-p "Q: Say hi. A:" \
-n 8 \
-c 1024
If offline mode reports that a small preset or metadata file is missing, rerun
without --offline once to populate the temporary llama.cpp cache, or use the
direct -m /path/to/lizzy-7b-q4_k_m.gguf form. A non-fatal HEAD failed,
status: 404 message can appear during probing; the model can still resolve
and load successfully afterward.
llama-cpp-python fails while parsing the chat template
If a custom llama-cpp-python build gets past model loading but fails with a
Jinja error such as:
Encountered unknown tag 'generation'
then the native llama.cpp library is likely compatible with Lizzy, but the
Python wrapper does not support every Hugging Face chat-template extension
stored in the GGUF metadata. A temporary compatibility shim is to strip the
generation and endgeneration block tags before llama-cpp-python
compiles the embedded template:
import llama_cpp.llama_chat_format as chat_format

# Keep a reference to the original constructor so the shim can delegate to it.
original_init = chat_format.Jinja2ChatFormatter.__init__

def lizzy_chat_template_init(
    self,
    template,
    eos_token,
    bos_token,
    add_generation_prompt=True,
    stop_token_ids=None,
):
    # Remove the block tags that the wrapper's Jinja environment rejects.
    template = template.replace("{% generation %}", "")
    template = template.replace("{% endgeneration %}", "")
    return original_init(
        self,
        template,
        eos_token,
        bos_token,
        add_generation_prompt,
        stop_token_ids,
    )

# Install the patched constructor.
chat_format.Jinja2ChatFormatter.__init__ = lizzy_chat_template_init
Apply the shim before constructing Llama. This was tested with a source
build of llama-cpp-python==0.3.23 linked against relogu/llama.cpp at
commit 991a41b; create_chat_completion then returned the expected
response on the Q4_K_M GGUF file.
vLLM installation or serving fails locally
vLLM is intended for supported accelerator environments, commonly Linux GPU servers. If installation or serving fails on a local laptop, test the same command on the target GPU environment before treating it as a model issue.
On single NVIDIA A40 and H100 GPUs, vllm==0.21.0 with
torch==2.11.0+cu130 loaded Lizzy with BF16 weights and completed short
generation tests. The A40 path used FlashAttention 2; the H100 path used
FlashAttention 3. The OpenAI-compatible vLLM server also started on H100 and
responded to a /v1/chat/completions request. vLLM used its Transformers
modeling backend for Lizzy, so validate throughput and feature coverage on the
exact serving configuration you plan to deploy.
Single-GPU H100 smoke tests passed at max_model_len values of 1024, 4096,
16384, and 32768. With gpu_memory_utilization=0.72, vLLM reported about
50.7 GiB available for KV cache and maximum concurrency of about 100x at 1024
tokens, 25x at 4096 tokens, 6x at 16384 tokens, and 4x at 32768 tokens. Treat
these as short smoke-test observations, not production sizing guarantees.
Tensor-parallel in-process generation now passes in the tested environments
after Lizzy’s attention implementation was updated to derive local query and
key/value head counts from the sharded projection tensors. The validated cases
are tensor_parallel_size=2 on two H100 GPUs and tensor_parallel_size=4
on four A40 GPUs. These tests used a fresh Hugging Face model snapshot with
max_position_embeddings=65536 and vllm==0.21.0.
tensor_parallel_size=8 is expected to be the next natural configuration for
Lizzy’s 32 query heads and 8 key/value heads, but it was not completed because
an 8-GPU allocation was not available during testing. OpenAI-compatible
vllm serve with tensor parallelism was also not confirmed in this pass.
Validate both before relying on them in production.
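The head-count reasoning above can be checked with a few lines of arithmetic. This is only a sketch under a simplifying assumption: each tensor-parallel rank should receive a whole number of both query heads and key/value heads. vLLM may support other layouts, so treat the result as a starting point for testing, not as a statement of what vLLM accepts. The function name is hypothetical.

```python
def compatible_tensor_parallel_sizes(num_query_heads, num_kv_heads, max_size):
    """List tensor-parallel sizes that divide both head counts evenly."""
    return [
        tp
        for tp in range(1, max_size + 1)
        if num_query_heads % tp == 0 and num_kv_heads % tp == 0
    ]


if __name__ == "__main__":
    # Lizzy's head counts as stated above: 32 query heads, 8 key/value heads.
    print(compatible_tensor_parallel_sizes(32, 8, max_size=8))  # [1, 2, 4, 8]
```

Sizes 2 and 4 correspond to the validated cases; 8 is the untested configuration mentioned above.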
Older Lizzy model snapshots can fail under vLLM tensor parallelism with an
attention reshape error such as shape '[1, 16384, 32, 128]' is invalid. If
you see that error, clear the Hugging Face cache or pin to a newer model
revision that includes local tensor-parallel head-count handling in
modeling_lizzy.py.
On the tested H100 Slurm node, FlashInfer’s sampling JIT required the
ninja executable to be available on PATH. Installing the Python
ninja package was not enough unless the virtual environment’s bin
directory was also on PATH before starting vLLM.
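A quick way to tell these two situations apart is to look for the ninja executable both on the current PATH and in the interpreter's own bin directory. The sketch below uses only the standard library; the helper name `locate_ninja` is hypothetical, and it assumes the virtual environment places executables next to the Python interpreter.

```python
import os
import shutil
import sys


def locate_ninja(path=None):
    """Return the full path of the ninja executable, or None if not found.

    With path=None the current PATH is searched; otherwise only `path` is.
    """
    return shutil.which("ninja", path=path)


if __name__ == "__main__":
    venv_bin = os.path.dirname(sys.executable)
    if locate_ninja() is not None:
        print("ninja is on PATH:", locate_ninja())
    elif locate_ninja(path=venv_bin) is not None:
        # Installed in the environment, but not visible to child processes.
        print(f"ninja is in {venv_bin}; add that directory to PATH before starting vLLM")
    else:
        print("ninja was not found; install it in this environment first")
```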
A fully fresh Python user-base installation on the H100 node loaded the model
but failed during vLLM engine profiling because Triton could not compile a
small CUDA utility with the node’s system gcc. If a clean environment fails
before generation with a Triton or CUDA utility compilation error, check the
compiler, CUDA driver libraries, Python headers, and TMPDIR before treating
the failure as a Lizzy model issue.
On a two-GPU NVIDIA V100 machine, vllm==0.21.0 installed successfully but
pulled torch==2.11.0+cu130, whose CUDA kernels did not support V100 compute
capability 7.0. A basic CUDA tensor operation failed with no kernel image is
available for execution on the device.
The same host completed a short Lizzy generation with vllm==0.10.2,
torch==2.8.0+cu128, transformers==5.9.0, dtype=float16,
max_model_len=1024, and VLLM_USE_V1=0. This path used vLLM’s
Transformers fallback and XFormers backend. It also needed a temporary
compatibility shim because vLLM 0.10.2 expects the Transformers tokenizer
attribute all_special_tokens_extended, while Lizzy’s Transformers 5
TokenizersBackend exposes all_special_tokens instead. Treat this as a
validated workaround for V100 testing, not the preferred production path.
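The source does not show the shim that was used, so the following is only one possible shape for it: alias the missing attribute to the one the tokenizer does expose. The helper name and the stand-in tokenizer class are hypothetical, and the alias returns plain strings, which may differ from what older Transformers versions returned for this attribute.

```python
def add_extended_special_tokens_alias(tokenizer):
    """Expose all_special_tokens_extended as an alias of all_special_tokens.

    Does nothing if the tokenizer already provides the attribute.
    """
    if not hasattr(tokenizer, "all_special_tokens_extended"):
        # Assumption: a read-only property on the class is enough for the caller.
        type(tokenizer).all_special_tokens_extended = property(
            lambda self: list(self.all_special_tokens)
        )
    return tokenizer


class StandInTokenizer:
    """Hypothetical stand-in with only the newer attribute, for illustration."""

    all_special_tokens = ["<s>", "</s>"]


if __name__ == "__main__":
    patched = add_extended_special_tokens_alias(StandInTokenizer())
    print(patched.all_special_tokens_extended)
```

With a real tokenizer, the alias would need to be applied before vLLM first reads the attribute.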
For production vLLM serving, prefer a newer NVIDIA GPU supported by the current
vLLM and PyTorch CUDA wheels, then validate vllm serve end to end with the
exact vLLM version, GPU type, context length, and batching settings you plan to
deploy.
Transformers AutoTokenizer fails
If AutoTokenizer.from_pretrained("flwrlabs/Lizzy-7B",
trust_remote_code=True) fails with:
Tokenizer class TokenizersBackend does not exist or is not currently imported
then your Transformers version does not include TokenizersBackend. Use
Python 3.10 or later and install Transformers 5.x:
pip install "transformers>=5,<6" jinja2 protobuf
AutoTokenizer and apply_chat_template were tested successfully with
transformers==5.9.0 on Python 3.13.
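Before reinstalling, it can help to confirm which versions are actually active in the environment. This is a small sketch using only the standard library; it checks the installed Transformers major version and the Python version against the requirements stated above. The helper names are hypothetical.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '5.9.0'."""
    return int(version_string.split(".")[0])


def installed_transformers_major():
    """Return the installed Transformers major version, or None if absent."""
    try:
        return major_version(version("transformers"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.10 or later:", sys.version_info >= (3, 10))
    major = installed_transformers_major()
    if major is None:
        print("transformers is not installed in this environment")
    else:
        print("Transformers 5.x installed:", major == 5)
```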
Transformers generation fails with token_type_ids or cache errors
The recommended Transformers 5.x path does not require either workaround described below. On
macOS with Python 3.13, torch==2.12.0, and transformers==5.9.0,
AutoTokenizer returned only input_ids and attention_mask, and
default cached generation completed successfully with the BF16 checkpoint.
If you are pinned to an older Transformers 4.x stack or using a manual
PreTrainedTokenizerFast workaround, generation may fail because extra
token_type_ids are passed to the model, or because cache handling raises an
error like 'int' object has no attribute 'shape'. In that older-stack case,
remove token_type_ids before generation and pass use_cache=False to
generate.
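For the older-stack case only, both adjustments can be applied in one helper that builds the keyword arguments for generate. This is a sketch: the function name is hypothetical, and it assumes the tokenizer output behaves like a mapping.

```python
def older_stack_generate_kwargs(encoded, **generate_kwargs):
    """Drop token_type_ids and disable the cache for an older Transformers stack.

    `encoded` is the mapping returned by the tokenizer; it is not modified.
    """
    inputs = {
        name: value
        for name, value in dict(encoded).items()
        if name != "token_type_ids"
    }
    return {**inputs, "use_cache": False, **generate_kwargs}


# Usage with a loaded model and tokenizer (not run here):
#   encoded = tokenizer(prompt, return_tensors="pt")
#   output = model.generate(**older_stack_generate_kwargs(encoded, max_new_tokens=16))
```

On the recommended Transformers 5.x path this helper is unnecessary and only costs generation speed, because it turns caching off.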
Transformers emits a RoPE scaling warning
When loading the Transformers config, the current model metadata can emit a warning about the explicit RoPE factor differing from the implicit factor. This warning comes from the model configuration. Validate generation quality on your target runtime and pin the model revision used for deployment.
Downloads are large
The smallest GGUF variant is several gigabytes, and the BF16 checkpoint is
larger. Check disk space and network stability before running the examples.
For constrained machines, start with Q4_K_M once your runtime supports
Lizzy GGUF. See Hardware Requirements for memory and disk planning
guidance.
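A disk-space check before downloading can be scripted with the standard library. In this sketch the required size is a parameter you supply, since this page does not state exact file sizes; the function name and the 10 GiB figure in the example are placeholders.

```python
import shutil


def has_free_space(path, required_gib):
    """Return True if the filesystem holding `path` has at least required_gib free."""
    free_gib = shutil.disk_usage(path).free / 1024**3
    return free_gib >= required_gib


if __name__ == "__main__":
    # Placeholder requirement; look up the real file size before relying on this.
    print("Enough space for a 10 GiB download:", has_free_space(".", 10))
```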
The curl example cannot connect
The curl example expects an OpenAI-compatible server listening on
localhost:8000 and serving the model name flwrlabs/Lizzy-7B. Start the
server first, confirm the port, and keep the model name in the request body
aligned with the served model.
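To separate "nothing is listening" from "the request is wrong", a plain TCP connection test is often enough. This sketch uses only the standard library; the function name is hypothetical, and the defaults mirror the host and port the curl example expects.

```python
import socket


def server_is_listening(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("A server is listening on localhost:8000; check the model name next")
    else:
        print("Nothing is listening on localhost:8000; start the server first")
```

A successful connection only shows that some process owns the port. It does not confirm that the process is an OpenAI-compatible server or that it serves flwrlabs/Lizzy-7B.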