Run Lizzy
=========

This guide shows the most common ways to run Lizzy 7B from Hugging Face.
Use the BF16 checkpoint when you want full precision, Transformers/vLLM
serving, tensor parallelism, or fine-tuning. Use GGUF when you have a local
runtime that supports the Lizzy GGUF architecture and want smaller files or
local inference.

Before choosing a runtime, check :doc:`hardware-requirements` for memory,
disk, and long-context planning guidance.

Run with Transformers
---------------------

Use Python 3.10 or later and install PyTorch, Transformers 5.x, Accelerate,
``jinja2``, and ``protobuf``. Accelerate is needed for ``device_map="auto"`` in
the loading example, and Transformers 5.x includes the ``TokenizersBackend``
tokenizer class used by the Lizzy tokenizer metadata:

.. code-block:: bash

    pip install "transformers>=5,<6" torch accelerate jinja2 protobuf

Then load the base checkpoint with ``trust_remote_code=True``:

.. code-block:: python

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "flwrlabs/Lizzy-7B"

    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        trust_remote_code=True,
        dtype=torch.bfloat16,  # replaces the deprecated torch_dtype argument
        device_map="auto",
    )

    messages = [
        {"role": "system", "content": "You are Lizzy 7B."},
        {"role": "user", "content": "Summarise why queue etiquette matters in the UK."},
    ]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )
    response = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[-1] :],
        skip_special_tokens=True,
    )
    print(response)

This path was tested on macOS with Python 3.13, ``torch==2.12.0``,
``transformers==5.9.0``, and the BF16 checkpoint. ``AutoTokenizer`` loading,
chat-template rendering, and default cached generation all worked, and the
model returned the expected short response.

Run with vLLM
-------------

vLLM exposes an OpenAI-compatible API for GPU serving. Use this path on a
Linux GPU server or another environment supported by your vLLM version:

.. code-block:: bash

    pip install vllm
    vllm serve "flwrlabs/Lizzy-7B" --trust-remote-code

Then create ``request.json``:

.. code-block:: json

    {
      "model": "flwrlabs/Lizzy-7B",
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of France?"
        }
      ]
    }

Then call the server:

.. code-block:: bash

    curl -X POST "http://localhost:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data @request.json

This path was smoke-tested with ``vllm==0.21.0`` and
``torch==2.11.0+cu130`` on single NVIDIA A40 and H100 GPUs. The
OpenAI-compatible server responded on H100, and in-process generation worked
on H100 up to ``max_model_len=32768`` in short tests.
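
The curl request above can also be sent from Python. The following is a minimal
sketch that uses only the standard library and is not part of the tested setup:
``build_chat_request`` and ``chat`` are hypothetical helper names, and the URL
assumes the same server address as the curl example.

.. code-block:: python

    import json
    import urllib.request

    # Assumption: the server started by `vllm serve` listens on this address.
    BASE_URL = "http://localhost:8000/v1"


    def build_chat_request(prompt, model="flwrlabs/Lizzy-7B", base_url=BASE_URL):
        """Build (but do not send) a POST request for /chat/completions."""
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }
        return urllib.request.Request(
            f"{base_url}/chat/completions",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )


    def chat(prompt, timeout=60):
        """Send the request and return the assistant's reply text."""
        request = build_chat_request(prompt)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.load(response)
        return body["choices"][0]["message"]["content"]

With the server running, ``print(chat("What is the capital of France?"))``
prints the model's reply. Keeping request construction separate from sending
makes the payload easy to inspect without a GPU.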

Tensor-parallel in-process generation also passed with ``tensor_parallel_size=2``
on two H100 GPUs and ``tensor_parallel_size=4`` on four A40 GPUs after Lizzy's
attention reshape logic was updated to use local tensor-parallel head counts.
``tensor_parallel_size=8`` and OpenAI-compatible serving with tensor
parallelism were not completed in this test pass, so validate the exact vLLM
server configuration before relying on multi-GPU serving in production. Runtime
support can vary by vLLM version, GPU generation, and backend.
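
In-process generation with the settings mentioned above might look like the
following sketch. It assumes vLLM's offline ``LLM`` class and its ``chat``
method. ``engine_kwargs`` and ``generate_once`` are hypothetical helper names,
and the sampling values are placeholders rather than tested settings.

.. code-block:: python

    REPO_ID = "flwrlabs/Lizzy-7B"


    def engine_kwargs(tensor_parallel_size=1, max_model_len=32768):
        """Collect the vllm.LLM constructor arguments discussed above."""
        if tensor_parallel_size < 1:
            raise ValueError("tensor_parallel_size must be at least 1")
        return {
            "model": REPO_ID,
            "trust_remote_code": True,
            "tensor_parallel_size": tensor_parallel_size,
            "max_model_len": max_model_len,
        }


    def generate_once(prompt, tensor_parallel_size=1):
        """Load the model in-process and return one short completion."""
        # Imported lazily so engine_kwargs can be used on machines without vLLM.
        from vllm import LLM, SamplingParams

        llm = LLM(**engine_kwargs(tensor_parallel_size))
        outputs = llm.chat(
            [{"role": "user", "content": prompt}],
            sampling_params=SamplingParams(temperature=0.0, max_tokens=64),
        )
        return outputs[0].outputs[0].text

For example, ``generate_once("Reply with exactly: ok", tensor_parallel_size=2)``
corresponds to the two-GPU configuration described above. Lower
``max_model_len`` if the KV cache does not fit in GPU memory.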

Run the GGUF model with llama.cpp
---------------------------------

The GGUF release provides quantized files for local runtimes that support the
Lizzy GGUF architecture. The Hugging Face quick start uses the ``Q4_K_M``
variant.

The commands below build the currently tested Lizzy-compatible llama.cpp fork
and branch, then start ``llama-server`` with that variant. Switch to upstream
llama.cpp or a packaged runtime once it supports
``general.architecture = lizzy``:

.. code-block:: bash

    git clone --branch lorenzo-dev https://github.com/relogu/llama.cpp.git
    cd llama.cpp
    cmake -B build -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release --target llama-completion llama-server
    ./build/bin/llama-server \
      -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
      -c 1024

If your runtime reports ``unknown model architecture: 'lizzy'``, it does not
include Lizzy GGUF support. Use a Lizzy-compatible build or run the BF16
checkpoint with Transformers instead. See :doc:`troubleshooting` for details
from smoke tests.
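
To check which architecture a downloaded GGUF file declares before loading it,
you can read the ``general.architecture`` key from the file header. The sketch
below is a minimal reader and not an official tool. It assumes a little-endian
GGUF file with the version 2 or later header layout, and ``read_architecture``
is a hypothetical helper name.

.. code-block:: python

    import struct

    # Byte sizes of the fixed-width GGUF metadata value types (type id -> size).
    _FIXED = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
    _STRING, _ARRAY = 8, 9


    def _read_string(f):
        """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
        (length,) = struct.unpack("<Q", f.read(8))
        return f.read(length).decode("utf-8")


    def _skip_value(f, vtype):
        """Advance the stream past one metadata value of the given type."""
        if vtype in _FIXED:
            f.seek(_FIXED[vtype], 1)
        elif vtype == _STRING:
            _read_string(f)
        elif vtype == _ARRAY:
            item_type, count = struct.unpack("<IQ", f.read(12))
            if item_type in _FIXED:
                f.seek(_FIXED[item_type] * count, 1)
            else:
                for _ in range(count):
                    _skip_value(f, item_type)
        else:
            raise ValueError(f"unknown GGUF value type {vtype}")


    def read_architecture(f):
        """Return general.architecture from an open binary GGUF stream, or None."""
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
        if version < 2:
            raise ValueError("GGUF v1 uses a different header layout")
        for _ in range(kv_count):
            key = _read_string(f)
            (vtype,) = struct.unpack("<I", f.read(4))
            if key == "general.architecture" and vtype == _STRING:
                return _read_string(f)
            _skip_value(f, vtype)
        return None

Open the file in binary mode and pass the handle, for example
``with open("/path/to/lizzy-7b-q4_k_m.gguf", "rb") as f: print(read_architecture(f))``.
If the file declares ``lizzy`` and your runtime still rejects it, the runtime
build is what lacks support, not the file.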

For direct terminal completion:

.. code-block:: bash

    ./build/bin/llama-completion \
      -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
      -p "Q: 2+2? A:" \
      -n 16 \
      -c 1024

In constrained macOS, virtualized, or sandboxed environments, Metal device
initialization can fail. To force a small CPU-only smoke test, use:

.. code-block:: bash

    ./build/bin/llama-completion \
      -hf flwrlabs/Lizzy-7B-GGUF:Q4_K_M \
      -p "Q: 2+2? A:" \
      -n 16 \
      -c 1024 \
      -t 2 \
      -dev none \
      -ngl 0 \
      --no-op-offload

Run the GGUF model with llama-cpp-python
----------------------------------------

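This path requires a llama-cpp-python build linked against a llama.cpp version
that supports the Lizzy GGUF architecture. It may also require
llama-cpp-python support for Lizzy's embedded chat template. If initialization
fails with an unknown Jinja ``generation`` tag, use the llama.cpp server path
or see :doc:`troubleshooting`. The example below works around that error by
stripping the ``{% generation %}`` and ``{% endgeneration %}`` tags from the
template before llama-cpp-python parses it: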

.. code-block:: python

    import llama_cpp.llama_chat_format as chat_format
    from llama_cpp import Llama

    # Keep the original constructor so the patched version can delegate to it.
    original_init = chat_format.Jinja2ChatFormatter.__init__

    def lizzy_chat_template_init(
        self,
        template,
        eos_token,
        bos_token,
        add_generation_prompt=True,
        stop_token_ids=None,
    ):
        # Remove the generation markers that trigger the unknown-tag error.
        # Only these exact spellings are matched.
        template = template.replace("{% generation %}", "")
        template = template.replace("{% endgeneration %}", "")
        return original_init(
            self,
            template,
            eos_token,
            bos_token,
            add_generation_prompt,
            stop_token_ids,
        )

    # Install the patch before constructing Llama, which parses the template.
    chat_format.Jinja2ChatFormatter.__init__ = lizzy_chat_template_init

    llm = Llama(
        model_path="/path/to/lizzy-7b-q4_k_m.gguf",
        n_ctx=1024,
        n_gpu_layers=-1,
    )

    output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "/no_think"},
            {"role": "user", "content": "Reply with exactly: ok"},
        ],
        max_tokens=16,
        temperature=0,
    )
    print(output["choices"][0]["message"]["content"])

Run the GGUF model with Ollama
------------------------------

Ollama support is a work in progress. Current public Ollama builds may not yet
include backend support for the Lizzy GGUF architecture, so prefer the
Lizzy-compatible llama.cpp path above until they do. With a build that supports
the architecture, the model can be pulled directly from Hugging Face:

.. code-block:: bash

    ollama run hf.co/flwrlabs/Lizzy-7B-GGUF:Q4_K_M

Support in desktop GGUF apps such as LM Studio and Jan is also a work in
progress. To use Lizzy in a desktop app, first confirm that the app version ships
with a llama.cpp backend that recognizes ``general.architecture = lizzy``. If
the app lets you choose a custom backend, point it at a Lizzy-compatible
llama.cpp build and import one of the ``flwrlabs/Lizzy-7B-GGUF``
quantizations. If it reports ``unknown model architecture: 'lizzy'``, update
the app backend or use llama.cpp directly.

Choose a runtime
----------------

.. list-table::
   :header-rows: 1

   * - Runtime
     - Use it when
   * - Transformers
     - You need Python integration, full precision, custom model code, or
       fine-tuning workflows.
   * - vLLM
     - You need GPU serving, OpenAI-compatible APIs, batching, or tensor
       parallelism.
   * - llama.cpp
     - You have a build that supports Lizzy GGUF and need local inference, CPU
       support, flexible GPU offload, or a small deployment footprint.
   * - Ollama, LM Studio, Jan
     - Work in progress. Use only after verifying that the app backend includes
       Lizzy GGUF support.

Generation settings
-------------------

The GGUF examples use ``temperature=0.6`` and ``top_p=0.95``, except for the
llama-cpp-python smoke test above, which sets ``temperature=0``.
For deterministic documentation, coding, or extraction tasks, start with a
lower temperature such as ``0.2``. For more conversational output, increase the
temperature gradually and evaluate outputs for factuality and style.
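
As an illustration, these settings can be passed as the ``temperature`` and
``top_p`` fields of an OpenAI-style chat request. The sketch below is one
possible arrangement. The preset names and ``chat_payload`` are hypothetical,
and keeping ``top_p=0.95`` for the lower-temperature preset is an assumption
rather than a tested recommendation.

.. code-block:: python

    # Temperatures come from the guidance above; the preset names are made up.
    PRESETS = {
        "default": {"temperature": 0.6, "top_p": 0.95},
        # Assumption: top_p is left unchanged when lowering the temperature.
        "deterministic": {"temperature": 0.2, "top_p": 0.95},
    }


    def chat_payload(prompt, preset="default", model="flwrlabs/Lizzy-7B"):
        """Build an OpenAI-style chat request body with explicit sampling settings."""
        if preset not in PRESETS:
            raise KeyError(f"unknown preset: {preset!r}")
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            **PRESETS[preset],
        }

The resulting dictionary can be serialized to JSON and posted to a
``/v1/chat/completions`` endpoint. Compare outputs across presets on your own
prompts before settling on a value.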