@flwrlabs/sprind-llm


Publisher: @flwrlabs
Downloads: 2
Runs: 0

Quickstart

flwr new @flwrlabs/sprind-llm

Readme

SPRIN-D: Language Modeling

This Flower App federates the pre-training of a GPT-style LLM, with the ClientApps using nanoGPT's training loop. The dataset can be OpenWebText, C4, or a subset thereof. The model can be any of several Hugging Face models, such as EleutherAI/gpt-neo-1.3B.

The contents of this Flower App are as follows:

sprind-llm
├── flwrapp
│   ├── __init__.py
│   ├── client_app.py   # Defines your ClientApp
│   ├── server_app.py   # Defines your ServerApp
│   ├── strategy.py     # Defines a custom strategy for easy logging to W&B
│   ├── model.py        # Enables the configuration of GPT2-style models
│   ├── train.py        # Training loop adapted from nanoGPT backend
│   └── utils.py        # Various utility functions for this app
├── pyproject.toml      # Project metadata like dependencies and configs
└── README.md
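The pyproject.toml carries the app's default run settings in its [tool.flwr.app.config] section. A hedged sketch of what that section may look like, built only from the keys referenced elsewhere in this README (num-server-rounds, model.model-name, wandb-token); the actual file may contain additional keys and different defaults:

```toml
[tool.flwr.app.config]
num-server-rounds = 3   # overridable via --run-config
wandb-token = ""        # supply at run time; avoid committing secrets

[tool.flwr.app.config.model]
model-name = "EleutherAI/gpt-neo-125m"  # default model; swappable via --run-config
```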

Running the App

NOTE

This section assumes you have already deployed a Flower Federation with at least two SuperNodes. Please refer to the provided instructions on how to connect SuperNodes to a running SuperLink.

Before running the app, you need to configure it to point to the SuperLink. This only requires editing one line in the pyproject.toml in this directory: the address field found at the bottom of the file.

[tool.flwr.federations.sprind-federation]
address = "SUPERLINK-CONTROL-ADDRESS" # <--- Replace with the provided SuperLink IP:PORT
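If your SuperLink is deployed with TLS enabled, the federation entry can also point to the CA certificate used to verify the connection. A hedged sketch following Flower's federation configuration conventions; verify the exact keys and paths against your deployment:

```toml
[tool.flwr.federations.sprind-federation]
address = "203.0.113.10:9093"              # example IP:PORT; use the one you were given
root-certificates = "certificates/ca.crt"  # path to the SuperLink CA certificate
```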

To run the app with default settings, simply execute this command from the directory where this README.md lives:

# If you know your Weights & Biases token
flwr run . --run-config="wandb-token='<YOUR-WANDB-TOKEN>'" --stream

# If you don't have one
flwr run . --stream

Expected Output

On the terminal where you execute flwr run, you'll see output similar to the one below. Note this output was obtained when running with Weights & Biases (hence the first few log lines with the wandb prefix) in a federation of 3 SuperNodes. Each round, the ServerApp samples the connected SuperNodes for a round of training, with each SuperNode streaming the C4 dataset from the Hugging Face Hub. By default the app runs for three rounds using a GPT-Neo-125M model.

Loading project configuration...
Success
🎊 Successfully started run 7522963691491767233
INFO :      Starting logstream for run_id `7522963691491767233`
INFO :      Start `flwr-serverapp` process
wandb: Currently logged in as: YOUR-USERNAME to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.23.0
wandb: Run data is saved locally in <YOUR-LOCAL-FS>/wandb/run-20251125_174027-fnr1s6fq
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 7522963691491767233-ServerApp
wandb: ⭐️ View project at https://wandb.ai/YOUR-USERNAME/sprind-llm
wandb: 🚀 View run at https://wandb.ai/YOUR-USERNAME/sprind-llm/runs/fnr1s6fq
INFO :      Starting FedAvgWithWandB strategy:
INFO :          ├── Number of rounds: 3
INFO :          ├── ArrayRecord (0.00 MB)
INFO :          ├── ConfigRecord (train): (empty!)
INFO :          ├── ConfigRecord (evaluate): (empty!)
INFO :          ├──> Sampling:
INFO :          │       ├──Fraction: train (0.50) | evaluate ( 0.00)
INFO :          │       ├──Minimum nodes: train (1) | evaluate (0)
INFO :          │       └──Minimum available nodes: 1
INFO :          └──> Keys in records:
INFO :                  ├── Weighted by: 'optimizer_steps'
INFO :                  ├── ArrayRecord key: 'arrays'
INFO :                  └── ConfigRecord key: 'config'
INFO :
INFO :
INFO :      [ROUND 1/3]
INFO :      configure_train: Sampled 2 nodes (out of 2)
INFO :      aggregate_train: Received 2 results and 0 failures
INFO :          └──> Aggregated MetricRecord: {'train_loss': 16051.246813020638, 'valid_loss': 3605.4747982025146, 'valid_accuracy': 0.017324455082416534}
INFO :
INFO :      [ROUND 2/3]
INFO :      configure_train: Sampled 2 nodes (out of 2)
INFO :      aggregate_train: Received 2 results and 0 failures
INFO :          └──> Aggregated MetricRecord: {'train_loss': 15842.587494065094, 'valid_loss': 3606.9595260620126, 'valid_accuracy': 0.015316593460738659}
INFO :
INFO :      [ROUND 3/3]
INFO :      configure_train: Sampled 2 nodes (out of 2)
INFO :      aggregate_train: Received 2 results and 0 failures
INFO :          └──> Aggregated MetricRecord: {'train_loss': 15845.837125325066, 'valid_loss': 3583.5984764099117, 'valid_accuracy': 0.01753143221139908}
INFO :
INFO :      Strategy execution finished in 346.69s
INFO :
INFO :      Final results:
INFO :
INFO :          Global Arrays:
INFO :                  ArrayRecord (353.935 MB)
INFO :
INFO :          Aggregated ClientApp-side Train Metrics:
INFO :          { 1: { 'train_loss': '1.6051e+04',
INFO :                 'valid_accuracy': '1.7324e-02',
INFO :                 'valid_loss': '3.6055e+03'},
INFO :            2: { 'train_loss': '1.5843e+04',
INFO :                 'valid_accuracy': '1.5317e-02',
INFO :                 'valid_loss': '3.6070e+03'},
INFO :            3: { 'train_loss': '1.5846e+04',
INFO :                 'valid_accuracy': '1.7531e-02',
INFO :                 'valid_loss': '3.5836e+03'}}
INFO :
INFO :          Aggregated ClientApp-side Evaluate Metrics:
INFO :          {}
INFO :
INFO :          ServerApp-side Evaluate Metrics:
INFO :          {}
INFO :
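As the strategy banner above shows, aggregated metrics are weighted by each client's 'optimizer_steps'. A toy sketch of that weighted averaging, independent of Flower's actual implementation:

```python
def weighted_average(metrics: list[tuple[int, dict[str, float]]]) -> dict[str, float]:
    """Average metric dicts, weighting each client by its optimizer step count.

    `metrics` is a list of (optimizer_steps, metric_dict) pairs, one per client.
    """
    total = sum(steps for steps, _ in metrics)
    keys = metrics[0][1].keys()
    return {k: sum(steps * m[k] for steps, m in metrics) / total for k in keys}

# Two hypothetical clients: one ran twice as many optimizer steps as the other
clients = [(100, {"train_loss": 2.0}), (200, {"train_loss": 1.4})]
print(weighted_average(clients))  # {'train_loss': 1.6}
```

Weighting by optimizer steps means a client that processed more batches in a round contributes proportionally more to the reported averages.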

Override Run Config

You can also override the settings for your ClientApp and ServerApp defined in the [tool.flwr.app.config] section of the pyproject.toml by extending the list of arguments passed via --run-config to flwr run. For example:

# Run for 5 rounds
flwr run . --run-config="wandb-token='<YOUR-WANDB-TOKEN>' num-server-rounds=5" --stream

# Use GPT-Neo-1.3B (override model.model-name)
# The run might take longer to complete depending on the communication bandwidth of your SuperNodes
flwr run . --run-config="wandb-token='<YOUR-WANDB-TOKEN>' model.model-name='EleutherAI/gpt-neo-1.3B'" --stream
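Values passed via --run-config take precedence over the matching defaults in [tool.flwr.app.config]. A toy illustration of that merge semantics (this is not Flower's actual parser, just the precedence rule):

```python
def merge_run_config(defaults: dict, overrides: dict) -> dict:
    """Return defaults with any override keys replacing the matching defaults."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged

# Hypothetical defaults from pyproject.toml, plus one CLI override
defaults = {"num-server-rounds": 3, "model.model-name": "EleutherAI/gpt-neo-125m"}
overrides = {"num-server-rounds": 5}  # e.g. from --run-config="num-server-rounds=5"
print(merge_run_config(defaults, overrides))
# {'num-server-rounds': 5, 'model.model-name': 'EleutherAI/gpt-neo-125m'}
```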