@chongsource/federated_pft_classification
flwr new @chongsource/federated_pft_classification
Federated Learning with XGBoost and Flower: PFT Lung Function Classification
This project combines federated learning with pulmonary function test (PFT) data preprocessing to classify patients' lung function into seven categories: normal, obstruction, restriction, mixed restriction and obstruction, gas trapping, small airway obstruction, and restriction + small airway obstruction.
The model predicts lung function from spirometry metrics alone, which removes the need for a plethysmography machine at prediction time. Spirometry machines are portable and widely available, whereas plethysmography machines are found only in specialized hospitals. Plethysmography is nevertheless crucial for diagnosing restrictive lung diseases: with plethysmography results, a patient can be diagnosed using standardized decision trees (e.g. the UofT PFT guidelines). We therefore use both spirometry and plethysmography results to generate the labels with a decision tree, and then train a model to classify lung function from spirometry and biometric data only.
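The separation between label generation and training features can be sketched as follows. This is a minimal illustration rather than the project's code: `toy_labeler` is a hypothetical stand-in for the real decision tree (it only compares TLC with a made-up lower limit), `tlc_lln` is an illustrative column name, and the feature list is the one given for the training step later in this README.

```python
# Columns the model is allowed to see (biometrics + spirometry only).
FEATURE_COLUMNS = ["age", "height", "sex", "fev1", "fvc", "fev1_fvc"]


def toy_labeler(row: dict) -> int:
    """Hypothetical stand-in for the decision tree; NOT the UofT algorithm.

    It reads a plethysmography value (tlc) that the trained model never sees.
    """
    # Assumption: 'tlc_lln' is an illustrative, precomputed lower limit of normal.
    return 2 if row["tlc"] < row["tlc_lln"] else 0  # 2 = restriction, 0 = normal


def make_training_example(row: dict) -> tuple[list, int]:
    """Return (features, label): the label may use every column,
    the features keep only biometric and spirometry columns."""
    label = toy_labeler(row)
    features = [row[name] for name in FEATURE_COLUMNS]
    return features, label


patient = {
    "age": 54, "sex": 1, "height": 170.0,
    "fev1": 2.1, "fvc": 2.5, "fev1_fvc": 0.84,
    "tlc": 4.0, "rv": 1.5, "rv_tlc": 0.375,
    "tlc_lln": 4.6,  # made-up value for illustration
}
features, label = make_training_example(patient)
```

The point of the pattern is that plethysmography values influence only the label, so a client with spirometry data alone can still use the trained model.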
Data Processing Pipeline
Each client's patient data must contain the following metrics:
| Category | Columns |
|---|---|
| Biometrics | age, sex, height |
| Plethysmography | tlc, rv, rv_tlc |
| Spirometry | fev1, fvc, fev1_fvc, fef75 |
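A client could check its file header against this table before training. The sketch below uses only the default column names from the table; `missing_columns` is a hypothetical helper, not a function from this project.

```python
# Default column names, grouped as in the table above.
REQUIRED_COLUMNS = {
    "Biometrics": ["age", "sex", "height"],
    "Plethysmography": ["tlc", "rv", "rv_tlc"],
    "Spirometry": ["fev1", "fvc", "fev1_fvc", "fef75"],
}


def missing_columns(columns) -> dict:
    """Map each category to the required columns absent from `columns`."""
    present = set(columns)
    missing = {}
    for category, names in REQUIRED_COLUMNS.items():
        absent = [name for name in names if name not in present]
        if absent:
            missing[category] = absent
    return missing


# Example header that lacks rv_tlc and fef75.
header = ["age", "sex", "height", "tlc", "rv", "fev1", "fvc", "fev1_fvc"]
print(missing_columns(header))
```

An empty dictionary means every required column is present.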
At runtime, the pipeline automatically:
- Computes LLN (Lower Limit of Normal) and ULN (Upper Limit of Normal) reference values using GLI 2022 (spirometry) and ERS 2021 (lung volumes) equations
- Labels each patient using the Computer Aided Decision Tree Used in The Toronto General Pulmonary Function Laboratory
- Trains XGBoost using only the spirometry + biometric features (age, height, sex, fev1, fvc, fev1_fvc)
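For the first step, reference equations of this kind are commonly published in LMS form (median M, coefficient of variation S, skewness L), with the LLN and ULN taken at the 5th and 95th percentiles. The sketch below assumes that form and assumes L, M and S have already been computed for a patient; the numbers are made up, and the project's own calculators may be organized differently.

```python
import math

Z_LLN = -1.645  # 5th percentile
Z_ULN = 1.645   # 95th percentile


def lms_value(m: float, s: float, lam: float, z: float) -> float:
    """Measurement value at z-score `z` under the LMS model."""
    if lam == 0:
        # Limiting case of the formula below as lam approaches 0.
        return m * math.exp(s * z)
    return m * (1.0 + lam * s * z) ** (1.0 / lam)


def reference_limits(m: float, s: float, lam: float) -> tuple[float, float]:
    """Return (LLN, ULN) for one measurement."""
    return lms_value(m, s, lam, Z_LLN), lms_value(m, s, lam, Z_ULN)


# Made-up LMS values for illustration only.
lln, uln = reference_limits(m=3.0, s=0.1, lam=1.0)
```

A measured value below the LLN or above the ULN is then treated as abnormal by the decision tree.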
Label encoding
| Label | Diagnosis |
|---|---|
| 0 | N — normal |
| 1 | AO — obstruction |
| 2 | R — restriction |
| 3 | R+AO — mixed restriction + obstruction |
| 4 | GT — gas trapping |
| 5 | SAO — small airway obstruction |
| 6 | R+SAO — restriction + small airway obstruction |
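In code, the encoding above can be kept in one place so that predictions are decoded consistently. The names `LungFunction`, `DIAGNOSIS_NAMES` and `decode` are illustrative and not taken from the project.

```python
from enum import IntEnum


class LungFunction(IntEnum):
    """Integer labels from the table above."""
    N = 0
    AO = 1
    R = 2
    R_AO = 3
    GT = 4
    SAO = 5
    R_SAO = 6


DIAGNOSIS_NAMES = {
    LungFunction.N: "normal",
    LungFunction.AO: "obstruction",
    LungFunction.R: "restriction",
    LungFunction.R_AO: "mixed restriction + obstruction",
    LungFunction.GT: "gas trapping",
    LungFunction.SAO: "small airway obstruction",
    LungFunction.R_SAO: "restriction + small airway obstruction",
}


def decode(label: int) -> str:
    """Turn an integer class label into its diagnosis name."""
    return DIAGNOSIS_NAMES[LungFunction(label)]


# multi:softmax returns class indices as floats, so cast before decoding.
print(decode(int(3.0)))
```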
Model
We use XGBoost as the primary model because it performs well on tabular data, offers clinical interpretability through feature importance, and handles small to medium datasets well (datasets are usually of size 1,000 to 8,000).
This example provides two federated training strategies:
- Bagging aggregation: Each client trains a new tree on its local data each round; all trees are aggregated on the server. With M clients and R rounds, the global model contains M × R trees.
- Cyclic training: Clients train one at a time in a round-robin fashion, passing the model sequentially.
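The arithmetic behind the two strategies can be sketched as below. Both helpers are hypothetical; the first counts one tree per client per round as the text does, and the second assumes a fixed client order with 1-based round numbers.

```python
def bagging_tree_count(num_clients: int, num_rounds: int) -> int:
    """Trees in the global model under bagging: M clients x R rounds."""
    return num_clients * num_rounds


def cyclic_client_for_round(server_round: int, num_clients: int) -> int:
    """Index of the client that trains in a given round (round-robin)."""
    return (server_round - 1) % num_clients


print(bagging_tree_count(num_clients=5, num_rounds=3))
print([cyclic_client_for_round(r, num_clients=3) for r in range(1, 6)])
```

With bagging, every client contributes in every round; with cyclic training, only one client trains per round and the model is handed to the next client afterwards.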
Project Structure
federated-pft-classification
├── federated_pft_classification
│   ├── __init__.py
│   ├── client_app.py                    # ClientApp: simulation/deployment dispatch + training
│   ├── server_app.py                    # ServerApp: aggregation strategy + model saving
│   ├── task.py                          # Data loaders, PFT preprocessing, DMatrix conversion
│   └── data_processing/
│       ├── decision_tree.py             # UofT PFT decision tree
│       ├── gli22_calc.py                # GLI 2022 spirometry reference values
│       ├── ers21_lung_volumes_calc.py   # ERS 2021 lung volume reference values
│       ├── fef75_calc.py                # FEF75 reference values
│       └── main.py                      # Combined reference value calculator
├── generate_sim_data.py                 # Script to generate a synthetic simulation dataset
├── data/
│   └── simulation_data.xlsx             # Combined patient file used in simulation mode
├── pyproject.toml                       # Dependencies and app configuration
└── README.md
Set up the project
Install dependencies
Create and activate a conda environment, then install the project:
conda create -n flwr_pft python=3.12 -y
conda activate flwr_pft
pip install -e .
Prepare simulation data
The app expects a single combined Excel file at data/simulation_data.xlsx (configurable via sim-data-path in pyproject.toml). Each row is a patient with the columns listed above.
A synthetic dataset generator is included for testing:
python generate_sim_data.py
This creates data/simulation_data.xlsx with 270 synthetic patients across all 7 diagnostic categories.
Run the project
The app runs on both the Simulation Engine and the Deployment Engine without code changes. If you are new to Flower, we recommend simulation mode, as it requires fewer components to be launched manually. By default, flwr run uses the Simulation Engine.
Run with the Simulation Engine
TIP
Check the Simulation Engine documentation to learn more about Flower simulations, how to use more virtual SuperNodes, and how to configure CPU/GPU usage in your ClientApp.

Before running flwr run ., make sure no SuperLink process from an earlier run is still alive. Such a process can be orphaned, with its input and output file descriptors invalidated, while its network socket stays open. It can then keep accepting gRPC requests and fail silently, so the CLI still reports success:

🎊 Successfully started run 7678037348555877013

To avoid this, kill any leftover SuperLink process before starting the run:
pkill -f "flower-superlink"; flwr run . --stream
The path to the simulation data you created above is set in pyproject.toml (sim-data-path), so it is injected into context.run_config for every ClientApp call.
You can override settings defined in pyproject.toml. For example:
# Run 10 rounds with cyclic training
flwr run . --run-config "train-method='cyclic' num-server-rounds=10"

# Use a different simulation dataset
flwr run . --run-config "sim-data-path='data/my_dataset.xlsx'"
Run with the Deployment Engine
In deployment, each SuperNode loads its own local .xlsx file. Pass the path via node-config:
flower-supernode \
    --insecure \
    --superlink <SUPERLINK-FLEET-API> \
    --node-config="data-path=/path/to/hospital_data.xlsx"
Then launch the run pointing to your SuperLink:
flwr run . <SUPERLINK-CONNECTION> --stream
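One way a ClientApp could choose between the two data sources is sketched below. It assumes the dispatch is keyed on whether data-path is present in the node config, which this README does not confirm; `resolve_data_path` is a hypothetical helper, and plain dictionaries stand in for context.node_config and context.run_config.

```python
def resolve_data_path(node_config: dict, run_config: dict) -> tuple[str, str]:
    """Return (mode, path) for the current ClientApp call.

    Assumption: a 'data-path' entry in the node config means deployment;
    otherwise the shared simulation file from the run config is used.
    """
    if "data-path" in node_config:
        return "deployment", str(node_config["data-path"])
    return "simulation", str(run_config["sim-data-path"])


run_config = {"sim-data-path": "data/simulation_data.xlsx"}

# Simulation: the node config carries only partition information.
print(resolve_data_path({"partition-id": 0, "num-partitions": 3}, run_config))

# Deployment: the SuperNode was started with --node-config data-path=...
print(resolve_data_path({"data-path": "/path/to/hospital_data.xlsx"}, run_config))
```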
TIP
Follow this how-to guide to run the app with Flower's Deployment Engine. After that, you might be interested in setting up secure TLS-enabled communications and SuperNode authentication in your federation.
Configuration reference
Key settings in pyproject.toml:
| Key | Default | Description |
|---|---|---|
| train-method | bagging | bagging or cyclic |
| num-server-rounds | 3 | Number of federated rounds |
| local-epochs | 1 | Trees added per client per round |
| test-fraction | 0.2 | Fraction of each client's data held out for validation |
| sim-data-path | data/simulation_data.xlsx | Combined Excel file for simulation |
| col-age … col-rv-tlc | age … rv_tlc | Column names in the Excel files |
| params.objective | multi:softmax | XGBoost multiclass objective |
| params.num-class | 7 | Number of lung function categories |
| params.eval-metric | mlogloss | Evaluation metric (multiclass log loss) |
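The params.* keys use hyphens, while XGBoost expects underscored names such as num_class and eval_metric. A minimal sketch of the conversion follows, assuming the run config arrives as a flat dictionary with dotted keys; `xgb_params_from_run_config` is a hypothetical helper, not a function from this project.

```python
def xgb_params_from_run_config(run_config: dict) -> dict:
    """Collect 'params.*' entries and rename them for XGBoost."""
    prefix = "params."
    return {
        key[len(prefix):].replace("-", "_"): value
        for key, value in run_config.items()
        if key.startswith(prefix)
    }


# Defaults from the table above, as a flat run config.
run_config = {
    "train-method": "bagging",
    "num-server-rounds": 3,
    "params.objective": "multi:softmax",
    "params.num-class": 7,
    "params.eval-metric": "mlogloss",
}
print(xgb_params_from_run_config(run_config))
```

Keys outside the params. namespace, such as train-method, are left for the app itself and are not passed to XGBoost.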