:og:description: Aggregate custom evaluation results from federated clients in Flower using a strategy that applies weighted averaging for metrics like accuracy.

.. meta::
    :description: Aggregate custom evaluation results from federated clients in Flower using a strategy that applies weighted averaging for metrics like accuracy.

Aggregate evaluation results
============================

The Flower server does not prescribe a way to aggregate evaluation results, but it
enables the user to fully customize how results are aggregated.

Aggregate Custom Evaluation Results
-----------------------------------

The same ``Strategy``-customization approach can be used to aggregate custom
evaluation results coming from individual clients. Clients can return custom metrics
to the server by returning a dictionary:

.. code-block:: python

    from flwr.client import NumPyClient


    class FlowerClient(NumPyClient):
        def fit(self, parameters, config):
            # ...
            pass

        def evaluate(self, parameters, config):
            """Evaluate parameters on the locally held test set."""

            # Update local model with global parameters
            self.model.set_weights(parameters)

            # Evaluate global model parameters on the local test data
            loss, accuracy = self.model.evaluate(self.x_test, self.y_test)

            # Return results, including the custom accuracy metric
            num_examples_test = len(self.x_test)
            return float(loss), num_examples_test, {"accuracy": float(accuracy)}

The server can then use a customized strategy to aggregate the metrics provided in
these dictionaries:

.. code-block:: python

    from typing import Dict, List, Optional, Tuple, Union

    from flwr.common import Context, EvaluateRes, Scalar
    from flwr.server import ServerApp, ServerAppComponents, ServerConfig
    from flwr.server.client_proxy import ClientProxy
    from flwr.server.strategy import FedAvg


    class AggregateCustomMetricStrategy(FedAvg):
        def aggregate_evaluate(
            self,
            server_round: int,
            results: List[Tuple[ClientProxy, EvaluateRes]],
            failures: List[Union[Tuple[ClientProxy, EvaluateRes], BaseException]],
        ) -> Tuple[Optional[float], Dict[str, Scalar]]:
            """Aggregate evaluation accuracy using weighted average."""

            if not results:
                return None, {}

            # Call aggregate_evaluate from base class (FedAvg) to aggregate loss and metrics
            aggregated_loss, aggregated_metrics = super().aggregate_evaluate(
                server_round, results, failures
            )

            # Weigh accuracy of each client by number of examples used
            accuracies = [r.metrics["accuracy"] * r.num_examples for _, r in results]
            examples = [r.num_examples for _, r in results]

            # Aggregate and print custom metric
            aggregated_accuracy = sum(accuracies) / sum(examples)
            print(
                f"Round {server_round} accuracy aggregated from client results: {aggregated_accuracy}"
            )

            # Return aggregated loss and metrics (i.e., aggregated accuracy)
            return float(aggregated_loss), {"accuracy": float(aggregated_accuracy)}


    def server_fn(context: Context) -> ServerAppComponents:
        # Read the number of federated rounds from the run config
        num_rounds = context.run_config["num-server-rounds"]
        config = ServerConfig(num_rounds=num_rounds)

        # Define strategy
        strategy = AggregateCustomMetricStrategy(
            # (same arguments as FedAvg here)
        )

        return ServerAppComponents(
            config=config,
            strategy=strategy,  # <-- pass the custom strategy here
        )


    # Create ServerApp
    app = ServerApp(server_fn=server_fn)
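
For completeness, the ``FlowerClient`` shown above would typically be wrapped in a
``ClientApp`` so that its ``evaluate`` results (including the custom ``accuracy``
metric) actually reach the server. The following is a minimal sketch; the
``client_fn`` shown here is illustrative, and how the model and test data are created
is application-specific and omitted:

.. code-block:: python

    from flwr.client import ClientApp
    from flwr.common import Context


    def client_fn(context: Context):
        # Construct the FlowerClient defined above and convert it to a Client.
        # Setting up self.model, self.x_test, and self.y_test is application-specific
        # and therefore not shown here.
        return FlowerClient().to_client()


    # Create ClientApp
    app = ClientApp(client_fn=client_fn)

With both apps in place, the aggregated accuracy is printed on the server side after
every evaluation round.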