Federated Learning with Hugging Face LeRobot and Flower (Quickstart Example)

This is an introductory example of using 🤗 LeRobot with 🌼 Flower. It demonstrates that a robotics AI model can be trained collaboratively across remote environments, each using its own local data, with the local updates then aggregated into a shared model.

In this example, we will federate the training of a Diffusion policy on the PushT dataset. The data will be downloaded and partitioned using Flower Datasets. This example runs best when a GPU is available.

Set up the project

Clone the project

Start by cloning the example project. We prepared a single-line command that you can copy into your shell, which will check out the example for you:

git clone --depth=1 https://github.com/adap/flower.git _tmp \
		&& mv _tmp/examples/quickstart-lerobot . \
		&& rm -rf _tmp && cd quickstart-lerobot

This will create a new directory called quickstart-lerobot containing the following files:

quickstart-lerobot
├── lerobot_example
│   ├── __init__.py
│   ├── client_app.py   # Defines your ClientApp
│   ├── server_app.py   # Defines your ServerApp
│   ├── task.py         # Defines your model, training and data loading
│   ├── lerobot_federated_dataset.py   # Defines the dataset
│   └── configs/        # Configuration files
│       ├── env/          # Gym environment config
│       ├── policy/       # Policy config
│       └── default.yaml  # Default config settings
│
├── pyproject.toml      # Project metadata like dependencies and configs
└── README.md

Install dependencies and project

Install the dependencies defined in pyproject.toml as well as the lerobot_example package.

pip install -e .

Choose training parameters

For an initial quick test you can keep the default parameters: the run lasts 50 rounds, sampling 4 clients per round. For best results, however, the total amount of training (server rounds × local epochs) should reach at least 100,000 steps. You can achieve this, for example, by setting num-server-rounds=500 and local-epochs=200 in the pyproject.toml.
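In pyproject.toml, these overrides would look roughly like the following. The exact key names are assumptions based on standard Flower app layouts (run-config values live under `[tool.flwr.app.config]`); check the file shipped with the example for the authoritative names and defaults.

```toml
# Assumed location of the run-config defaults in pyproject.toml
[tool.flwr.app.config]
num-server-rounds = 500   # total federated rounds
local-epochs = 200        # local training epochs per client per round
```

With 500 rounds and 200 local epochs, clients accumulate 100,000 training steps in total.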

Run the Example

You can run your Flower project in both simulation and deployment mode without making changes to the code. If you are new to Flower, we recommend starting with simulation mode, as it requires fewer components to be launched manually. By default, flwr run will make use of the Simulation Engine. You can read more about how the Simulation Engine works in the documentation.

Run with the Simulation Engine

[!TIP] This example runs faster when the ClientApps have access to a GPU. If your system has one, you can make use of it by configuring the backend.client-resources component in your Flower Configuration. Check the Simulation Engine documentation to learn more about Flower simulations and how to optimize them.

# Run with the default federation (CPU only)
flwr run .

You can add a new connection in your Flower Configuration (find it via flwr config list):

[superlink.local-gpu]
options.num-supernodes = 10
options.backend.client-resources.num-cpus = 4 # each ClientApp is assumed to use 4 CPUs
options.backend.client-resources.num-gpus = 0.5 # at most 2 ClientApps will run on a given GPU (lower it to increase parallelism)

And then run the app:

# Run with the `local-gpu` settings
flwr run . local-gpu

You can also override some of the settings for your ClientApp and ServerApp defined in pyproject.toml. For example:

flwr run . local-gpu --run-config "num-server-rounds=5 fraction-fit=0.1"

Result output

Training results for each client, as well as server logs, are written to the outputs/ directory. Each run gets its own subdirectory named after the date and time of the run. For example:

outputs/date_time/
├── evaluate              # Each subdirectory contains .mp4 renders generated by clients
│   ├── round_5           # Evaluations in a given round
│   │   ├── client_3
│   │   │   └── rollout_20241207-105418.mp4  # .mp4 render for a client at a given round
│   │   ...
│   │   └── client_1
│   ...
│   └── round_n           # local client model checkpoint
└── global_model          # Each subdirectory contains the global model of a round
    ├── round_1
    ...
    └── round_n

Run with the Deployment Engine

Follow this how-to guide to run the same app from this example with Flower’s Deployment Engine. After that, you might be interested in setting up secure TLS-enabled communications and SuperNode authentication in your federation.

If you are already familiar with how the Deployment Engine works, you may want to learn how to run it using Docker. Check out the Flower with Docker documentation.