Federated Learning with Hugging Face LeRobot and Flower (Quickstart Example)

This is an introductory example of using 🤗LeRobot with 🌼Flower. It demonstrates that it is feasible to collaboratively train a robotics AI model across remote environments, each training on its local data, and then aggregate the updates into a shared model.

In this example, we will federate the training of a Diffusion policy on the PushT dataset. The data will be downloaded and partitioned using Flower Datasets. This example runs best when a GPU is available.

Set up the project

Clone the project

Start by cloning the example project. We prepared a single-line command that you can copy into your shell which will check out the example for you:

git clone --depth=1 https://github.com/adap/flower.git _tmp \
		&& mv _tmp/examples/quickstart-lerobot . \
		&& rm -rf _tmp && cd quickstart-lerobot

This will create a new directory called quickstart-lerobot containing the following files:

quickstart-lerobot
├── lerobot_example
│   ├── __init__.py
│   ├── client_app.py   # Defines your ClientApp
│   ├── server_app.py   # Defines your ServerApp
│   ├── task.py         # Defines your model, training and data loading
│   ├── lerobot_federated_dataset.py   # Defines the dataset
│   └── configs/        # Configuration files
│       ├── env/        # Gym environment config
│       ├── policy/     # Policy config
│       └── default.yaml # Default config settings
│
├── pyproject.toml      # Project metadata like dependencies and configs
└── README.md

Install dependencies and project

Install the dependencies defined in pyproject.toml as well as the lerobot_example package.

pip install -e .

Choose training parameters

You can keep the default parameters for an initial quick test: the run will last 50 rounds, sampling 4 clients per round. For best results, however, the total amount of training (server rounds × local epochs) should be at least 100,000. You can achieve this, for example, by setting num-server-rounds=500 and local_epochs=200 in pyproject.toml.
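As a sketch, the relevant run-config entries in pyproject.toml would look like the following (the exact values and any additional keys may differ in your copy of the file; adjust accordingly):

```toml
[tool.flwr.app.config]
num-server-rounds = 500  # Number of federated rounds the server orchestrates
local_epochs = 200       # Epochs each sampled client trains locally per round
```

These values can also be overridden at launch time via `flwr run . --run-config`, as shown later in this example.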

Run the Example

You can run your Flower project in both simulation and deployment mode without making changes to the code. If you are starting with Flower, we recommend using the simulation mode as it requires fewer components to be launched manually. By default, flwr run will make use of the Simulation Engine. You can read more about how the Simulation Engine works in the documentation.

Run with the Simulation Engine

[!TIP] This example runs much faster when the ClientApps have access to a GPU. If your system has one, you might want to run the example with GPU support right away by using the local-simulation-gpu federation, as shown below.

# Run with the default federation (CPU only)
flwr run .

Run the project in the local-simulation-gpu federation, which gives CPU and GPU resources to each ClientApp. By default, at most two ClientApps (each using ~2 GB of VRAM) will run in parallel on each available GPU. Note that you can adjust the degree of parallelism by modifying the client-resources specification. With the settings as in the pyproject.toml, the run takes about 1 hour on a 2x RTX 3090 machine.

# Run with the `local-simulation-gpu` federation
flwr run . local-simulation-gpu
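For reference, a federation definition with a client-resources specification in pyproject.toml might look like the following sketch (the node count and resource values here are illustrative; check your copy of the file for the actual ones). Setting num-gpus to 0.5 is what allows two ClientApps to share a single GPU:

```toml
[tool.flwr.federations.local-simulation-gpu]
options.num-supernodes = 10
# Resources reserved per ClientApp; num-gpus = 0.5 lets two
# ClientApps run in parallel on each available GPU.
options.backend.client-resources.num-cpus = 4
options.backend.client-resources.num-gpus = 0.5
```

Increasing num-gpus reduces parallelism but gives each ClientApp more VRAM headroom; decreasing it packs more ClientApps onto each GPU.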

You can also override some of the settings for your ClientApp and ServerApp defined in pyproject.toml. For example:

flwr run . local-simulation-gpu --run-config "num-server-rounds=5 fraction-fit=0.1"

Result output

Per-client training results and server logs are stored under the outputs/ directory. Each run gets its own subdirectory named after the date and time of the run. For example:

outputs/date_time/
├── evaluate                # Each subdirectory contains .mp4 renders generated by clients
│   ├── round_5             # Evaluations in a given round
│   │   ├── client_3
│   │   │   ...
│   │   │   └── rollout_20241207-105418.mp4  # .mp4 render for a client in a given round
│   │   └── client_1
│   ...
│   └── round_n             # Local client model checkpoint
└── global_model            # Each subdirectory contains the global model of a round
    ├── round_1
    ...
    └── round_n

Run with the Deployment Engine

Follow this how-to guide to run the same app in this example but with Flower's Deployment Engine. After that, you might be interested in setting up secure TLS-enabled communications and SuperNode authentication in your federation.

If you are already familiar with how the Deployment Engine works, you may want to learn how to run it using Docker. Check out the Flower with Docker documentation.