Federated Learning with Hugging Face LeRobot and Flower (Quickstart Example)
This is an introductory example of using 🤗LeRobot with 🌼Flower. It demonstrates how a robotics AI model can be trained collaboratively across remote environments, each using its own local data, with the resulting updates then aggregated into a shared model.
In this example, we will federate the training of a Diffusion policy on the PushT dataset. The data will be downloaded and partitioned using Flower Datasets. This example runs best when a GPU is available.
Set up the project
Clone the project
Start by cloning the example project. We prepared a single-line command that you can copy into your shell, which will check out the example for you:
git clone --depth=1 https://github.com/adap/flower.git _tmp \
&& mv _tmp/examples/quickstart-lerobot . \
&& rm -rf _tmp && cd quickstart-lerobot
This will create a new directory called quickstart-lerobot containing the following files:
quickstart-lerobot
├── lerobot_example
│   ├── __init__.py
│   ├── client_app.py                  # Defines your ClientApp
│   ├── server_app.py                  # Defines your ServerApp
│   ├── task.py                        # Defines your model, training and data loading
│   ├── lerobot_federated_dataset.py   # Defines the dataset
│   └── configs/                       # Configuration files
│       ├── env/                       # Gym environment config
│       ├── policy/                    # Policy config
│       └── default.yaml               # Default config settings
├── pyproject.toml                     # Project metadata like dependencies and configs
└── README.md
Install dependencies and project
Install the dependencies defined in pyproject.toml as well as the lerobot_example package:
pip install -e .
Choose training parameters
You can leave the default parameters for an initial quick test. It will run for 50 rounds, sampling 4 clients per round. However, for best results the total amount of training (server rounds × local epochs) should be at least 100,000. You can achieve this, for example, by setting num-server-rounds=500 and local_epochs=200 in pyproject.toml.
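As a sketch of where these values live, the run-config section of pyproject.toml might look like the following (the section name follows Flower's standard app-config format; the values shown are the illustrative ones discussed above, not verified defaults):

```toml
# Hypothetical sketch of the run-config section in pyproject.toml.
# Key names assumed to match this example; values are illustrative.
[tool.flwr.app.config]
num-server-rounds = 500   # total federated rounds
local_epochs = 200        # local training epochs per round
```

Any key defined here can later be overridden at launch time with the --run-config flag of flwr run.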
Run the Example
You can run your Flower project in both simulation and deployment mode without making changes to the code. If you are new to Flower, we recommend starting with simulation mode, as it requires fewer components to be launched manually. By default, flwr run will make use of the Simulation Engine. You can read more about how the Simulation Engine works in the documentation.
Run with the Simulation Engine
[!TIP] This example runs much faster when the ClientApps have access to a GPU. If your system has one, you might want to try running the example with GPU right away by using the local-simulation-gpu federation as shown below.
# Run with the default federation (CPU only)
flwr run .
Run the project in the local-simulation-gpu federation, which gives CPU and GPU resources to each ClientApp. By default, at most 2 ClientApps (using ~2 GB of VRAM each) will run in parallel on each available GPU. Note that you can adjust the degree of parallelism by modifying the client-resources specification. Running with the settings in pyproject.toml takes about 1 hour on a 2x RTX 3090 machine.
# Run with the `local-simulation-gpu` federation
flwr run . local-simulation-gpu
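A sketch of what such a federation definition could look like in pyproject.toml (the section layout follows Flower's federation-config format; the supernode count and resource values here are illustrative assumptions, not this example's verified settings):

```toml
# Hypothetical federation definition; values are illustrative.
# num-gpus = 0.5 lets two ClientApps share one GPU, matching the
# "at most 2 per GPU" behavior described above.
[tool.flwr.federations.local-simulation-gpu]
options.num-supernodes = 10
options.backend.client-resources.num-cpus = 2
options.backend.client-resources.num-gpus = 0.5
```

Increasing num-gpus toward 1.0 reduces parallelism but gives each ClientApp more VRAM headroom; decreasing it packs more ClientApps onto each GPU.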
You can also override some of the settings for your ClientApp and ServerApp defined in pyproject.toml. For example:
flwr run . local-simulation-gpu --run-config "num-server-rounds=5 fraction-fit=0.1"
Result output
Training results for each client, as well as server logs, will be stored under the outputs/ directory. For each run there will be a subdirectory corresponding to the date and time of the run. For example:
outputs/date_time/
├── evaluate                 # Each subdirectory contains .mp4 renders generated by clients
│   ├── round_5              # Evaluations in a given round
│   │   ├── client_3
│   │   │   └── rollout_20241207-105418.mp4   # .mp4 render for a client at a given round
│   │   ├── client_1
│   │   ...
│   └── round_n              # local client model checkpoint
└── global_model             # Each subdirectory contains the global model of a round
    ├── round_1
    ...
    └── round_n
Run with the Deployment Engine
Follow this how-to guide to run the same app from this example with Flower's Deployment Engine. After that, you might be interested in setting up secure TLS-enabled communications and SuperNode authentication in your federation.
If you are already familiar with how the Deployment Engine works, you may want to learn how to run it using Docker. Check out the Flower with Docker documentation.