Troubleshooting =============== Use this page when a documented runtime does not start cleanly or returns a load-time error. GGUF runtime reports unknown architecture ----------------------------------------- Lizzy GGUF files report the model architecture as ``lizzy``. Local GGUF runtimes must understand that architecture before they can load the model. If you see an error like this, the runtime does not currently support the Lizzy GGUF architecture: .. code-block:: text unknown model architecture: 'lizzy' The following smoke tests on macOS used the ``Q4_K_M`` GGUF file. The table shows what happened with the tested runtime and what may become supported when the runtime is built from, or linked against, a Lizzy-compatible llama.cpp fork: .. list-table:: :header-rows: 1 * - Runtime - Tested path - Tested result - With the Lizzy llama.cpp fork * - Stock llama.cpp 9330 - ``llama-cli`` and ``llama-server`` - Failed at model load with ``unknown model architecture: 'lizzy'``. - Use ``relogu/llama.cpp`` branch ``lorenzo-dev``. ``llama-completion``, the ``-hf`` shortcut, Metal offload, CPU-only inference, and ``llama-server`` were smoke-tested successfully at commit ``991a41b``. * - llama-cpp-python 0.3.23 - ``Llama.from_pretrained`` - Failed at model load with ``unknown model architecture: 'lizzy'``. - A source build of the released ``llama-cpp-python==0.3.23`` package with vendored ``relogu/llama.cpp`` imported successfully and began loading Lizzy with Metal offload. It then failed while parsing the embedded chat template because ``llama-cpp-python`` did not understand the ``generation`` Jinja tag. This path needs Python wrapper support for Lizzy's chat template, or a compatible template override, before it can be considered end-to-end supported. * - Ollama 0.23.4 - ``ollama run hf.co/flwrlabs/Lizzy-7B-GGUF:Q4_K_M`` - Downloaded the model, then returned ``unable to load model``. - Work in progress. Ollama needs a backend that includes Lizzy architecture support. This was not retested with a custom Ollama build. * - Desktop GGUF apps - LM Studio, Jan, and similar apps - Work in progress; not smoke-tested in this pass. - Check the app's llama.cpp backend version before importing Lizzy. The backend must recognize ``general.architecture = lizzy``. If the app supports custom backends, use a Lizzy-compatible llama.cpp build; if it reports ``unknown model architecture: 'lizzy'``, update the backend or run Lizzy with llama.cpp directly. The ``lorenzo-dev`` branch of ``relogu/llama.cpp`` was smoke-tested at commit ``991a41b`` with ``lizzy-7b-q4_k_m.gguf`` and completed ``llama-completion``, ``-hf`` loading, Metal offload, CPU-only inference, and OpenAI-compatible ``llama-server`` inference. On an Apple M3 Ultra Mac Studio, the Q4_K_M file offloaded all 33 layers to Metal and generated at about 100 tokens per second in a short completion test. Use a Lizzy-compatible GGUF runtime or run the BF16 checkpoint with Transformers. If you maintain a local runtime, verify support for ``general.architecture = lizzy`` before using the GGUF files in production. To test the Lizzy-compatible llama.cpp branch directly: .. code-block:: bash git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git cd llama.cpp cmake -B build -DCMAKE_BUILD_TYPE=Release cmake --build build --config Release --target llama-completion ./build/bin/llama-completion \ -m /path/to/lizzy-7b-q4_k_m.gguf \ -p "Q: 2+2? A:" \ -n 16 \ -c 1024 Metal fails to initialize on macOS ---------------------------------- In constrained macOS, virtualized, or sandboxed environments, llama.cpp can fail before inference with an error such as: .. code-block:: text ggml_metal_init: error: failed to create command queue This is an environment or device-access issue, not the same as an unsupported model architecture. Metal offload was verified on a real Apple M3 Ultra Mac Studio using the Lizzy-compatible llama.cpp branch. For a low-impact CPU-only smoke test, disable device offload and use a short context: .. code-block:: bash ./build/bin/llama-completion \ -m /path/to/lizzy-7b-q4_k_m.gguf \ -p "Q: 2+2? A:" \ -n 16 \ -c 1024 \ -t 2 \ -dev none \ -ngl 0 \ --no-op-offload Hugging Face shortcut does not use the expected cache ----------------------------------------------------- The llama.cpp ``-hf`` shortcut was verified with the Lizzy-compatible llama.cpp branch on a real macOS host. It can read from its own cache as well as the Hugging Face cache. If you are testing carefully and want to avoid unexpected downloads or writes to your home directory, set temporary cache locations: .. code-block:: bash HF_HOME=/tmp/lizzy-hf-cache \ LLAMA_CACHE=/tmp/lizzy-llama-cache \ ./build/bin/llama-completion \ -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \ -p "Q: Say hi. A:" \ -n 8 \ -c 1024 If offline mode reports that a small preset or metadata file is missing, rerun without ``--offline`` once to populate the temporary llama.cpp cache, or use the direct ``-m /path/to/lizzy-7b-q4_k_m.gguf`` form. A non-fatal ``HEAD failed, status: 404`` message can appear during probing; the model can still resolve and load successfully afterward. llama-cpp-python fails while parsing the chat template ------------------------------------------------------ If a custom ``llama-cpp-python`` build gets past model loading but fails with a Jinja error such as: .. code-block:: text Encountered unknown tag 'generation' then the native llama.cpp library is likely compatible with Lizzy, but the Python wrapper does not support every Hugging Face chat-template extension stored in the GGUF metadata. A temporary compatibility shim is to strip the ``generation`` and ``endgeneration`` block tags before ``llama-cpp-python`` compiles the embedded template: .. code-block:: python import llama_cpp.llama_chat_format as chat_format original_init = chat_format.Jinja2ChatFormatter.__init__ def lizzy_chat_template_init( self, template, eos_token, bos_token, add_generation_prompt=True, stop_token_ids=None, ): template = template.replace("{% generation %}", "") template = template.replace("{% endgeneration %}", "") return original_init( self, template, eos_token, bos_token, add_generation_prompt, stop_token_ids, ) chat_format.Jinja2ChatFormatter.__init__ = lizzy_chat_template_init Apply the shim before constructing ``Llama``. This was tested with a source build of ``llama-cpp-python==0.3.23`` linked against ``relogu/llama.cpp`` at commit ``991a41b``; ``create_chat_completion`` then returned the expected response on the Q4_K_M GGUF file. vLLM installation or serving fails locally ------------------------------------------ vLLM is intended for supported accelerator environments, commonly Linux GPU servers. If installation or serving fails on a local laptop, test the same command on the target GPU environment before treating it as a model issue. On single NVIDIA A40 and H100 GPUs, ``vllm==0.21.0`` with ``torch==2.11.0+cu130`` loaded Lizzy with BF16 weights and completed short generation tests. The A40 path used FlashAttention 2; the H100 path used FlashAttention 3. The OpenAI-compatible vLLM server also started on H100 and responded to a ``/v1/chat/completions`` request. vLLM used its Transformers modeling backend for Lizzy, so validate throughput and feature coverage on the exact serving configuration you plan to deploy. Single-GPU H100 smoke tests passed at ``max_model_len`` values of 1024, 4096, 16384, and 32768. With ``gpu_memory_utilization=0.72``, vLLM reported about 50.7 GiB available for KV cache and maximum concurrency of about 100x at 1024 tokens, 25x at 4096 tokens, 6x at 16384 tokens, and 4x at 32768 tokens. Treat these as short smoke-test observations, not production sizing guarantees. Tensor-parallel in-process generation now passes in the tested environments after Lizzy's attention implementation was updated to derive local query and key/value head counts from the sharded projection tensors. The validated cases are ``tensor_parallel_size=2`` on two H100 GPUs and ``tensor_parallel_size=4`` on four A40 GPUs. These tests used a fresh Hugging Face model snapshot with ``max_position_embeddings=65536`` and ``vllm==0.21.0``. ``tensor_parallel_size=8`` is expected to be the next natural configuration for Lizzy's 32 query heads and 8 key/value heads, but it was not completed because an 8-GPU allocation was not available during testing. OpenAI-compatible ``vllm serve`` with tensor parallelism was also not confirmed in this pass. Validate both before relying on them in production. Older Lizzy model snapshots can fail under vLLM tensor parallelism with an attention reshape error such as ``shape '[1, 16384, 32, 128]' is invalid``. If you see that error, clear the Hugging Face cache or pin to a newer model revision that includes local tensor-parallel head-count handling in ``modeling_lizzy.py``. On the tested H100 Slurm node, FlashInfer's sampling JIT required the ``ninja`` executable to be available on ``PATH``. Installing the Python ``ninja`` package was not enough unless the virtual environment's ``bin`` directory was also on ``PATH`` before starting vLLM. A fully fresh Python user-base installation on the H100 node loaded the model but failed during vLLM engine profiling because Triton could not compile a small CUDA utility with the node's system ``gcc``. If a clean environment fails before generation with a Triton or CUDA utility compilation error, check the compiler, CUDA driver libraries, Python headers, and ``TMPDIR`` before treating the failure as a Lizzy model issue. On a two-GPU NVIDIA V100 machine, ``vllm==0.21.0`` installed successfully but pulled ``torch==2.11.0+cu130``, whose CUDA kernels did not support V100 compute capability 7.0. A basic CUDA tensor operation failed with ``no kernel image is available for execution on the device``. The same host completed a short Lizzy generation with ``vllm==0.10.2``, ``torch==2.8.0+cu128``, ``transformers==5.9.0``, ``dtype=float16``, ``max_model_len=1024``, and ``VLLM_USE_V1=0``. This path used vLLM's Transformers fallback and XFormers backend. It also needed a temporary compatibility shim because vLLM 0.10.2 expects the Transformers tokenizer attribute ``all_special_tokens_extended``, while Lizzy's Transformers 5 ``TokenizersBackend`` exposes ``all_special_tokens`` instead. Treat this as a validated workaround for V100 testing, not the preferred production path. For production vLLM serving, prefer a newer NVIDIA GPU supported by the current vLLM and PyTorch CUDA wheels, then validate ``vllm serve`` end to end with the exact vLLM version, GPU type, context length, and batching settings you plan to deploy. Transformers AutoTokenizer fails -------------------------------- If ``AutoTokenizer.from_pretrained("flwrlabs/Lizzy-7B", trust_remote_code=True)`` fails with: .. code-block:: text Tokenizer class TokenizersBackend does not exist or is not currently imported then your Transformers version does not include ``TokenizersBackend``. Use Python 3.10 or later and install Transformers 5.x: .. code-block:: bash pip install "transformers>=5,<6" jinja2 protobuf ``AutoTokenizer`` and ``apply_chat_template`` were tested successfully with ``transformers==5.9.0`` on Python 3.13. Transformers generation fails with token_type_ids or cache errors ----------------------------------------------------------------- The recommended Transformers 5.x path does not require either workaround. On macOS with Python 3.13, ``torch==2.12.0``, and ``transformers==5.9.0``, ``AutoTokenizer`` returned only ``input_ids`` and ``attention_mask``, and default cached generation completed successfully with the BF16 checkpoint. If you are pinned to an older Transformers 4.x stack or using a manual ``PreTrainedTokenizerFast`` workaround, generation may fail because extra ``token_type_ids`` are passed to the model, or because cache handling raises an error like ``'int' object has no attribute 'shape'``. In that older-stack case, remove ``token_type_ids`` before generation and pass ``use_cache=False`` to ``generate``. Transformers emits a RoPE scaling warning ----------------------------------------- When loading the Transformers config, the current model metadata can emit a warning about the explicit RoPE factor differing from the implicit factor. This warning comes from the model configuration. Validate generation quality on your target runtime and pin the model revision used for deployment. Downloads are large ------------------- The smallest GGUF variant is several gigabytes, and the BF16 checkpoint is larger. Check disk space and network stability before running the examples. For constrained machines, start with ``Q4_K_M`` once your runtime supports Lizzy GGUF. See :doc:`hardware-requirements` for memory and disk planning guidance. The curl example cannot connect ------------------------------- The curl example expects an OpenAI-compatible server listening on ``localhost:8000`` and serving the model name ``flwrlabs/Lizzy-7B``. Start the server first, confirm the port, and keep the model name in the request body aligned with the served model.