@flwrlabs/fed-legal-llm
Federated Legal LLM Fine-Tuning with Flower and LoRA
This example demonstrates how to perform federated fine-tuning of a large language model (LLM) using Flower, PyTorch, Transformers, and PEFT/LoRA. It uses an English-only 5-silo legal instruction-tuning dataset hosted on Hugging Face and supports both simulation and deployment workflows.
The project includes:
- A causal language model for instruction tuning
- LoRA-based parameter-efficient fine-tuning
- Data loading pipelines using Hugging Face Datasets
- Federated training with Flower
- Support for both simulation and deployment workflows
Fetch the App
Install Flower:
pip install flwr
Fetch the app:
flwr new @flwrlabs/fed-legal-llm
Then, install dependencies:
cd fed-legal-llm && pip install -e .
Project structure:
fed-legal-llm
├── fed_legal_llm
│   ├── __init__.py
│   ├── client_app.py   # Client-side LoRA fine-tuning logic
│   ├── server_app.py   # Server-side orchestration and evaluation
│   └── task.py         # Model, data loading, training, evaluation
├── pyproject.toml      # Dependencies and configuration
└── README.md
Run the App
This Flower App supports both simulation mode and deployment mode without code changes.
Run with the Simulation Engine
In simulation mode:
- The federated legal dataset is automatically downloaded from Hugging Face
- Data is partitioned across clients using the dataset’s natural client identifier
- Each client trains on its own legal silo using LoRA fine-tuning
Run with default configuration:
flwr run .
Override configuration (example):
flwr run . --run-config "num-server-rounds=5 local-epochs=1 batch-size=2"
Key configurable parameters (from pyproject.toml):
- model-name: pretrained base model to fine-tune
- dataset-id: Hugging Face dataset repo
- num-server-rounds: number of FL rounds
- local-epochs: local training epochs per client
- batch-size: local batch size
- learning-rate: local optimizer learning rate
- max-seq-length: maximum tokenized sequence length
- load-in-4bit: whether to use 4-bit quantized loading
- torch-dtype: model compute dtype
- lora-r: LoRA rank
- lora-alpha: LoRA alpha
- lora-dropout: LoRA dropout
- lora-target-modules: modules to adapt with LoRA
- fraction-train: fraction of clients sampled for training each round
- fraction-evaluate: fraction of clients sampled for client-side evaluation if enabled
Model
The default model is:
- HuggingFaceTB/SmolLM3-3B
Switching to another pre-trained causal LLM only requires changing the model name in pyproject.toml, for example:
- mistralai/Mistral-7B-Instruct-v0.3
The model is loaded with Hugging Face AutoModelForCausalLM and tokenized with AutoTokenizer. LoRA adapters are applied using PEFT, so only the adapter parameters are trained and federated.
Typical LoRA configuration includes:
- r
- alpha
- dropout
- target_modules
This keeps communication and GPU memory requirements much lower than full-model fine-tuning.
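As a rough sketch (not the exact code in fed_legal_llm/task.py), the base model and LoRA adapters described above could be assembled with Transformers and PEFT along these lines; the helper name and parameter values are illustrative and mirror the example configuration further down:

# Minimal sketch of loading the base model and attaching LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model


def load_model_and_tokenizer(model_name: str = "HuggingFaceTB/SmolLM3-3B"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    # After wrapping, only the adapter parameters remain trainable.
    model = get_peft_model(model, lora_config)
    return model, tokenizer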
Dataset
Data is loaded from a Hugging Face dataset containing three splits:
- train
- valid
- test
Each row contains:
- messages
- client_id
- task_type
- language
- example_id
The dataset is an English-only 5-silo legal instruction dataset with the following clients:
- 0: legal provision classification
- 1: case holding selection
- 2: unfair Terms of Service classification
- 3: Supreme Court issue classification
- 4: contract NLI / legal reasoning
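For a quick look at the silo structure, the dataset can be loaded and grouped by client_id. The snippet below is illustrative and uses the dataset id from the example configuration further down:

# Inspect the federated legal dataset; column names follow the schema above.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("flwrlabs/fed-legal", split="train")
print(ds.column_names)           # messages, client_id, task_type, language, example_id
print(Counter(ds["client_id"]))  # number of training examples per legal silo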
In simulation mode:
- clients are identified using the dataset’s client_id
- each Flower client trains only on its own silo
In deployment mode:
- each node loads a pre-partitioned local dataset from disk
Data Pipeline
The data pipeline:
- loads the Hugging Face dataset
- filters rows by client_id
- applies the tokenizer’s chat template to messages
- tokenizes the prompt/response sequence
- creates labels for supervised causal language modeling
Two modes are supported:
- Simulation mode → partitions the centralized Hugging Face dataset by client_id
- Deployment mode → loads locally stored partitions with load_from_disk(...)
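A simplified version of these steps, assuming the column names above (the exact implementation lives in fed_legal_llm/task.py, and the function name and defaults here are illustrative):

# Sketch of the per-client preprocessing pipeline in simulation mode.
from datasets import load_dataset


def load_client_partition(dataset_id, client_id, tokenizer, max_seq_length=2048):
    ds = load_dataset(dataset_id, split="train")
    # Keep only this client's legal silo.
    ds = ds.filter(lambda row: row["client_id"] == client_id)

    def tokenize(row):
        # Render the chat-formatted messages into one training string.
        text = tokenizer.apply_chat_template(row["messages"], tokenize=False)
        enc = tokenizer(text, truncation=True, max_length=max_seq_length)
        # Supervised causal language modeling: labels mirror the input ids.
        enc["labels"] = enc["input_ids"].copy()
        return enc

    return ds.map(tokenize, remove_columns=ds.column_names)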
The code is designed so that changing the underlying base model only requires updating configuration, as long as the model is compatible with Hugging Face transformers.
Training
Each client:
- receives the current global LoRA adapter weights
- reconstructs the base model and applies the LoRA adapters
- trains locally on its own legal training partition
- returns updated adapter weights to the server
Local training uses:
- causal language modeling loss
- AdamW optimizer
- LoRA fine-tuning through PEFT
Only the LoRA weights are exchanged during federation, not the full base model weights.
This makes the setup practical for large instruction-tuned models.
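As a sketch of how only the adapter weights could be serialized and restored on each client, using standard PEFT utilities (the surrounding Flower client code is omitted and the function names are illustrative):

# Exchange only the LoRA adapter tensors with the server.
import torch
from peft import get_peft_model_state_dict, set_peft_model_state_dict


def get_adapter_weights(model):
    # Collect just the trainable LoRA tensors, in a fixed key order,
    # casting to float32 so they serialize cleanly as NumPy arrays.
    state = get_peft_model_state_dict(model)
    return [t.detach().cpu().to(torch.float32).numpy() for t in state.values()]


def set_adapter_weights(model, weights):
    # Re-attach the aggregated LoRA tensors received from the server.
    keys = get_peft_model_state_dict(model).keys()
    state = {k: torch.tensor(v) for k, v in zip(keys, weights)}
    set_peft_model_state_dict(model, state)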
Training logic is defined in:
- fed_legal_llm/task.py
- fed_legal_llm/client_app.py
Evaluation
Server-side evaluation:
- uses the centralized test split
- evaluates the global model after each federated round
- reports text-generation metrics suitable for all legal silos
The evaluation metrics are:
- Exact Match (EM) — strict normalized match between prediction and reference
- Token-level F1 — overlap-based score between prediction and reference
- Macro client F1 — average token-level F1 across clients, giving equal weight to each client
- Macro client EM — average exact match across clients
Why these metrics:
- the dataset contains heterogeneous legal tasks
- all tasks are cast as text generation
- exact match is strict and interpretable
- token F1 gives partial credit when answers are close but not identical
- macro client scores prevent large silos from dominating the overall result
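For reference, exact match and token-level F1 can be computed roughly as follows; the normalization rules here are an assumption, not necessarily the app's exact ones:

# Illustrative exact-match and token-level F1 between a prediction and a reference.
import re
import string


def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = set(pred_tokens) & set(ref_tokens)
    overlap = sum(min(pred_tokens.count(t), ref_tokens.count(t)) for t in common)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)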
Run with the Deployment Engine
For deployment, you must provide local dataset partitions.
Step 1: Prepare data
Download and partition the dataset manually, then store one local partition per node.
Each local partition should contain examples from only one client_id, or otherwise represent that node’s local legal data.
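One possible way to produce per-node partitions (the output paths and the use of the train split are illustrative):

# Split the centralized dataset into one on-disk partition per legal silo.
from datasets import load_dataset

ds = load_dataset("flwrlabs/fed-legal", split="train")
for cid in sorted(set(ds["client_id"])):
    partition = ds.filter(lambda row, cid=cid: row["client_id"] == cid)
    # Each SuperNode later points its data-path at one of these directories.
    partition.save_to_disk(f"partitions/client_{cid}")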
Step 2: Start SuperNodes
Each node must point to its local data:
flower-supernode \
    --insecure \
    --superlink <SUPERLINK-FLEET-API> \
    --node-config="data-path=/path/to/local_partition"
Step 3: Run the federation
flwr run . <SUPERLINK-CONNECTION> --stream
Benchmarking and System Metrics
This app writes a benchmark summary next to the standard Flower result pickle:
result_<run-name>_communication.json
The summary includes per-round and total communication volume:
- total_comm_bytes
- comm_bytes_total per training round
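The summary is plain JSON and can be inspected directly; the run name below is a placeholder and the per-round layout may differ:

# Inspect the communication summary written next to the Flower result pickle.
import json

with open("result_my_run_communication.json") as f:  # placeholder run name
    summary = json.load(f)

print("total communication:", summary["total_comm_bytes"], "bytes")
print("available keys:", list(summary.keys()))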
Enable system metric tracking with:
flwr run . <SUPERLINK-CONNECTION> --stream --run-config "benchmark-system-metrics=true"
When enabled, the benchmark summary also includes:
- client_train_time_sec
- server_aggregation_time_sec
- round_wall_clock_sec
- client_peak_cpu_memory_mb
- client_peak_gpu_memory_mb
Server-side centralized evaluation can be disabled for benchmark-only runs:
flwr run . <SUPERLINK-CONNECTION> --stream --run-config "benchmark-run-server-eval=false"
Example Configuration
A typical setup in pyproject.toml may look like this:
model-name = "HuggingFaceTB/SmolLM3-3B" dataset-id = "flwrlabs/fed-legal" load-in-4bit = true torch-dtype = "bfloat16" lora-r = 16 lora-alpha = 32 lora-dropout = 0.05 lora-target-modules = "q_proj,k_proj,v_proj,o_proj" num-server-rounds = 5 local-epochs = 1 batch-size = 2 learning-rate = 2e-4 max-seq-length = 2048 fraction-train = 1.0 fraction-evaluate = 1.0
To switch models, update:
model-name = "mistralai/Mistral-7B-Instruct-v0.3"
and, if needed, adjust LoRA target modules and dtype.