:og:description: Learn how to train a logistic regression on the Iris dataset using federated learning with Flower and scikit-learn in this step-by-step tutorial.

.. meta::
    :description: Learn how to train a logistic regression on the Iris dataset using federated learning with Flower and scikit-learn in this step-by-step tutorial.

.. _quickstart-pytorch:

.. |message_link| replace:: ``Message``

.. _message_link: ref-api/flwr.app.Message.html

.. |arrayrecord_link| replace:: ``ArrayRecord``

.. _arrayrecord_link: ref-api/flwr.app.ArrayRecord.html

.. |clientapp_link| replace:: ``ClientApp``

.. _clientapp_link: ref-api/flwr.clientapp.ClientApp.html

.. |fedavg_link| replace:: ``FedAvg``

.. _fedavg_link: ref-api/flwr.serverapp.strategy.FedAvg.html

.. |serverapp_link| replace:: ``ServerApp``

.. _serverapp_link: ref-api/flwr.serverapp.ServerApp.html

.. |strategy_start_link| replace:: ``start``

.. _strategy_start_link: ref-api/flwr.serverapp.Strategy.html#flwr.serverapp.Strategy.start

.. |strategy_link| replace:: ``Strategy``

.. _strategy_link: ref-api/flwr.serverapp.Strategy.html

#########################
 Quickstart scikit-learn
#########################

In this federated learning tutorial we will learn how to train a Logistic Regression
model on the Iris dataset using Flower and scikit-learn. We recommend creating a virtual
environment and running everything within it.

Let's use ``flwr new`` to create a complete Flower+scikit-learn project. It will
generate all the files needed to run a federation of 10 nodes using |fedavg_link|_. By
default, the generated app uses a local simulation profile that ``flwr run`` submits to
a managed local SuperLink, which then executes the run with the Flower Simulation
Runtime. The dataset will be partitioned using |flowerdatasets|_'s |iidpartitioner|_.

Now that we have a rough idea of what this example is about, let's get started. First,
install Flower in your new environment:

.. code-block:: shell

    # In a new Python environment
    $ pip install flwr[simulation]

Then, run the command below:

.. code-block:: shell

    $ flwr new @flwrlabs/quickstart-sklearn

After running it you'll notice a new directory named ``quickstart-sklearn`` has been
created. It should have the following structure:

.. code-block:: shell

    quickstart-sklearn
    ├── sklearnexample
    │   ├── __init__.py
    │   ├── client_app.py   # Defines your ClientApp
    │   ├── server_app.py   # Defines your ServerApp
    │   └── task.py         # Defines your model, training and data loading
    ├── pyproject.toml      # Project metadata like dependencies and configs
    └── README.md

If you haven't yet installed the project and its dependencies, you can do so with:

.. code-block:: shell

    # From the directory where your pyproject.toml is
    $ pip install -e .

To run the project, do:

.. code-block:: shell

    # Run with default arguments and stream logs
    $ flwr run . --stream

Plain ``flwr run .`` submits the run, prints the run ID, and returns without streaming
logs. For the full local workflow, see :doc:`how-to-run-flower-locally`.

With default arguments you will see streamed output like this:

.. code-block:: shell

    Successfully built flwrlabs.quickstart-sklearn.1-0-0.014c8eb3.fab
    Starting local SuperLink on 127.0.0.1:39093...
    Successfully started run 1859953118041441032
    INFO :      Starting FedAvg strategy:
    INFO :          ├── Number of rounds: 3
    INFO :      [ROUND 1/3]
    INFO :      configure_train: Sampled 10 nodes (out of 10)
    INFO :      aggregate_train: Received 10 results and 0 failures
    INFO :          └──> Aggregated MetricRecord: {'train_logloss': 1.3937176081476854}
    INFO :      configure_evaluate: Sampled 10 nodes (out of 10)
    INFO :      aggregate_evaluate: Received 10 results and 0 failures
    INFO :          └──> Aggregated MetricRecord: {'test_logloss': 1.23306, 'accuracy': 0.69154, 'precision': 0.68659, 'recall': 0.68046, 'f1': 0.65752}
    INFO :      [ROUND 2/3]
    INFO :      ...
    INFO :      [ROUND 3/3]
    INFO :      ...
    INFO :      Strategy execution finished in 17.87s
    INFO :      Final results:
    INFO :          ServerApp-side Evaluate Metrics:
    INFO :          {}

    Saving final model to disk...

You can also override the parameters defined in the ``[tool.flwr.app.config]`` section
in ``pyproject.toml`` like this:

.. code-block:: shell

    # Override some arguments
    $ flwr run . --run-config "num-server-rounds=5 local-epochs=2"

What follows is an explanation of each component in the project you just created: the
dataset partitioning, the model, the ``ClientApp`` and the ``ServerApp``.

**********
 The Data
**********

This tutorial uses |flowerdatasets|_ to easily download and partition the Iris dataset.
In this example you'll make use of the |iidpartitioner|_ to generate ``num_partitions``
partitions. You can choose |otherpartitioners|_ available in Flower Datasets. Each
``ClientApp`` will call this function to load the data that corresponds to its
partition. Note that only the columns listed in ``FEATURES`` are used as model inputs,
while the ``species`` column provides the labels.

.. code-block:: python

    from flwr_datasets import FederatedDataset
    from flwr_datasets.partitioner import IidPartitioner

    FEATURES = ["petal_length", "petal_width", "sepal_length", "sepal_width"]


    def load_data(partition_id: int, num_partitions: int):
        partitioner = IidPartitioner(num_partitions=num_partitions)
        fds = FederatedDataset(
            dataset="hitorilabs/iris", partitioners={"train": partitioner}
        )
        dataset = fds.load_partition(partition_id, "train").with_format("pandas")[:]
        X = dataset[FEATURES]
        y = dataset["species"]
        # Split the on-edge data: 80% train, 20% test
        X_train, X_test = X[: int(0.8 * len(X))], X[int(0.8 * len(X)) :]
        y_train, y_test = y[: int(0.8 * len(y))], y[int(0.8 * len(y)) :]
        return X_train.values, y_train.values, X_test.values, y_test.values

***********
 The Model
***********

We define the |logisticregression|_ model from scikit-learn in the
``create_log_reg_and_instantiate_parameters()`` function. This helper function also
initializes the model parameters using the ``set_initial_params()`` utility function in
the same file.

.. code-block:: python

    from sklearn.linear_model import LogisticRegression


    def create_log_reg_and_instantiate_parameters(penalty):
        model = LogisticRegression(
            penalty=penalty,
            max_iter=1,  # local epoch
            warm_start=True,  # prevent refreshing weights when fitting
            solver="saga",
        )
        # Setting initial parameters, akin to model.compile for keras models
        set_initial_params(model, n_features=len(FEATURES), n_classes=len(UNIQUE_LABELS))
        return model

***************
 The ClientApp
***************

The main changes needed to use scikit-learn with Flower concern converting the
|arrayrecord_link|_ received in the |message_link|_ into NumPy ndarrays and using them
to set the model parameters. After training, another auxiliary function extracts the
updated NumPy ndarrays so that they can be packed into the ``Message`` returned by the
``ClientApp``. We can make use of built-in methods in the ``ArrayRecord`` to make these
conversions:

.. code-block:: python

    @app.train()
    def train(msg: Message, context: Context):

        # Read the penalty from the run config
        penalty = context.run_config["penalty"]
        # Create LogisticRegression Model
        model = create_log_reg_and_instantiate_parameters(penalty)

        # Apply received parameters
        ndarrays = msg.content["arrays"].to_numpy_ndarrays()
        set_model_params(model, ndarrays)

        # Train the model
        ...

        # Extract the updated model parameters with auxiliary function
        ndarrays = get_model_params(model)
        # Pack the updated parameters into an ArrayRecord
        model_record = ArrayRecord(ndarrays)

The rest of the functionality is directly inspired by the centralized case. The
|clientapp_link|_ comes with three core methods (``train``, ``evaluate``, and
``query``) that we can implement for different purposes. For example: ``train`` to
train the received model using the local data; ``evaluate`` to assess the performance of
the received model on a validation set; and ``query`` to retrieve information about the
node executing the ``ClientApp``. In this tutorial we will only make use of ``train``
and ``evaluate``.
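The snippets in this tutorial rely on helpers such as ``set_initial_params``,
``get_model_params`` and ``set_model_params``, whose bodies are not shown here. The
following is a minimal sketch of how such helpers could look, assuming the model
parameters are simply the ``coef_`` and ``intercept_`` attributes of a scikit-learn
``LogisticRegression``. The versions generated in your project may differ, so treat this
as an illustration rather than the project's actual code:

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LogisticRegression


    def get_model_params(model):
        """Return the model parameters as a list of NumPy ndarrays."""
        if model.fit_intercept:
            return [model.coef_, model.intercept_]
        return [model.coef_]


    def set_model_params(model, params):
        """Overwrite the model parameters with the given ndarrays."""
        model.coef_ = params[0]
        if model.fit_intercept:
            model.intercept_ = params[1]
        return model


    def set_initial_params(model, n_features, n_classes):
        """Give an unfitted model zero-valued parameters of the right shape."""
        model.classes_ = np.arange(n_classes)
        model.coef_ = np.zeros((n_classes, n_features))
        if model.fit_intercept:
            model.intercept_ = np.zeros((n_classes,))


    # Usage: four Iris features and three species
    model = LogisticRegression(max_iter=1, warm_start=True, solver="saga")
    set_initial_params(model, n_features=4, n_classes=3)
    params = get_model_params(model)
    print([p.shape for p in params])  # [(3, 4), (3,)]

Keeping the parameters as a plain list of ndarrays is what makes the round trip through
``ArrayRecord(ndarrays)`` and ``to_numpy_ndarrays()`` straightforward.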
Let's see how the ``train`` method can be implemented. It receives a |message_link|_
from the ``ServerApp`` as input. By default it carries:

- an ``ArrayRecord`` with the arrays of the model to federate. By default they can be
  retrieved with key ``"arrays"`` when accessing the message content.
- a ``ConfigRecord`` with the configuration sent from the ``ServerApp``. By default it
  can be retrieved with key ``"config"`` when accessing the message content.

The ``train`` method also receives the ``Context``, giving access to configs for your
run and node. The run config hyperparameters are defined in the ``pyproject.toml`` of
your Flower App. The node config can only be set when running Flower with the Deployment
Runtime and is not directly configurable during simulations.

.. code-block:: python

    import warnings

    from flwr.app import ArrayRecord, Context, Message, MetricRecord, RecordDict
    from flwr.clientapp import ClientApp
    from sklearn.metrics import log_loss

    # Helpers defined in task.py
    from sklearnexample.task import (
        UNIQUE_LABELS,
        create_log_reg_and_instantiate_parameters,
        get_model_params,
        load_data,
        set_model_params,
    )

    app = ClientApp()


    @app.train()
    def train(msg: Message, context: Context):
        """Train the model on local data."""

        # Read the penalty from the run config
        penalty = context.run_config["penalty"]
        # Create LogisticRegression Model
        model = create_log_reg_and_instantiate_parameters(penalty)

        # Apply received parameters
        ndarrays = msg.content["arrays"].to_numpy_ndarrays()
        set_model_params(model, ndarrays)

        # Load the data
        partition_id = context.node_config["partition-id"]
        num_partitions = context.node_config["num-partitions"]
        X_train, y_train, _, _ = load_data(partition_id, num_partitions)

        # Ignore convergence failure due to low local epochs
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            # Train the model on local data
            model.fit(X_train, y_train)

        # Let's compute train loss
        y_train_pred_proba = model.predict_proba(X_train)
        train_logloss = log_loss(y_train, y_train_pred_proba, labels=UNIQUE_LABELS)
        accuracy = model.score(X_train, y_train)

        # Construct and return reply Message
        ndarrays = get_model_params(model)
        model_record = ArrayRecord(ndarrays)
        metrics = {
            "num-examples": len(X_train),
            "train_logloss": train_logloss,
            "train_accuracy": accuracy,
        }
        metric_record = MetricRecord(metrics)
        content = RecordDict({"arrays": model_record, "metrics": metric_record})
        return Message(content=content, reply_to=msg)

The ``@app.evaluate()`` method mirrors ``train`` but only evaluates the received model
on the local validation set. It returns a ``MetricRecord`` containing the evaluation
loss and accuracy and does not include the model weights, since they are not modified
during evaluation.

***************
 The ServerApp
***************

To construct a |serverapp_link|_ we define its ``@app.main()`` method. This method
receives as input arguments:

- a ``Grid`` object that will be used to interface with the nodes running the
  ``ClientApp`` to involve them in a round of train/evaluate/query or other.
- a ``Context`` object that provides access to the run configuration.

In this example we use the |fedavg_link|_ strategy and configure it with
``fraction_train`` and ``fraction_evaluate`` set to ``1.0``, so that all connected nodes
are sampled in every round. The number of rounds and the ``penalty`` are read from the
run config, whose default values are defined in the ``pyproject.toml``. Then, the
execution of the strategy is launched when invoking its |strategy_start_link|_ method.
To it we pass:

- the ``Grid`` object.
- an ``ArrayRecord`` carrying a randomly initialized model that will serve as the global
  model to federate.
- optionally, a ``ConfigRecord`` with the training hyperparameters to be sent to the
  clients (not passed in the code below). The strategy will also insert the current
  round number in this config before sending it to the participating nodes.
- the ``num_rounds`` parameter specifying how many rounds of ``FedAvg`` to perform.

.. code-block:: python

    import joblib
    from flwr.app import ArrayRecord, Context
    from flwr.serverapp import Grid, ServerApp
    from flwr.serverapp.strategy import FedAvg

    # Helpers defined in task.py
    from sklearnexample.task import (
        create_log_reg_and_instantiate_parameters,
        get_model_params,
        set_model_params,
    )

    app = ServerApp()


    @app.main()
    def main(grid: Grid, context: Context) -> None:
        """Main entry point for the ServerApp."""

        # Read run config
        num_rounds: int = context.run_config["num-server-rounds"]

        # Create LogisticRegression Model
        penalty = context.run_config["penalty"]
        model = create_log_reg_and_instantiate_parameters(penalty)
        # Construct ArrayRecord representation
        arrays = ArrayRecord(get_model_params(model))

        # Initialize FedAvg strategy
        strategy = FedAvg(fraction_train=1.0, fraction_evaluate=1.0)

        # Start strategy, run FedAvg for `num_rounds`
        result = strategy.start(
            grid=grid,
            initial_arrays=arrays,
            num_rounds=num_rounds,
        )

        # Save final model parameters
        print("\nSaving final model to disk...")
        ndarrays = result.arrays.to_numpy_ndarrays()
        set_model_params(model, ndarrays)
        joblib.dump(model, "logreg_model.pkl")

Congratulations! You've successfully built and run your first federated learning system
in scikit-learn on the Iris dataset using the new Message API.

.. tip::

    Check the :doc:`how-to-run-simulations` documentation to learn more about how to
    configure and run Flower simulations.

.. note::

    Check the source code of another Flower App using ``scikit-learn`` in the Flower
    GitHub repository.

.. |flowerdatasets| replace:: Flower Datasets

.. |iidpartitioner| replace:: ``IidPartitioner``

.. |logisticregression| replace:: ``LogisticRegression``

.. |otherpartitioners| replace:: other partitioners

.. |quickstart_sklearn_link| replace:: ``examples/sklearn-logreg-mnist``

.. _client: ref-api/flwr.client.Client.html#client

.. _fedavg: ref-api/flwr.server.strategy.FedAvg.html#flwr.server.strategy.FedAvg

.. _flowerdatasets: https://flower.ai/docs/datasets/

.. _iidpartitioner: https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.IidPartitioner.html#flwr_datasets.partitioner.IidPartitioner

.. _logisticregression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

.. _otherpartitioners: https://flower.ai/docs/datasets/ref-api/flwr_datasets.partitioner.html

.. meta::
    :description: Check out this Federated Learning quickstart tutorial for using Flower with scikit-learn to train a logistic regression model.