@bloodcounts/nhanes-t2d-fedmap-fl

NHANES T2D Federated Task

This repository runs a federated learning experiment that detects Type 2 Diabetes (T2D) from NHANES clinical features using a small PyTorch MLP and the FedMAP method (MAP-style federated aggregation with learned Input-Convex Neural Network priors; see https://arxiv.org/abs/2405.19000). The project uses Flower (flwr) and Flower Datasets to partition the public NHANES dataset into federated sites and run simulated federated rounds.

nhanes_diabetes/nhanes_data.py: dataset constants and build_Xy() to extract features/labels from a partition.
nhanes_diabetes/model.py: MLP model and local training utilities.
nhanes_diabetes/client_app.py: Flower ClientApp implementing FedMAP local training and evaluation logic.
nhanes_diabetes/server_app.py: Flower ServerApp wiring the FedMAP strategy and server-side ICNN modules.

Type 2 diabetes affects roughly 462 million people worldwide and is a leading cause of cardiovascular disease, kidney failure, lower-limb amputation, and preventable blindness. A substantial fraction of cases remain undiagnosed, because the condition can progress silently for years before symptoms prompt clinical investigation.

Early identification of T2D (or its precursor states, impaired fasting glucose and impaired glucose tolerance) enables interventions such as lifestyle modification, metformin, or GLP-1 receptor agonists that can delay or prevent progression to overt diabetes and its microvascular and macrovascular complications. Risk stratification models built from routine clinical measurements (anthropometry, blood pressure, lipid panels, fasting glucose, HbA1c) are therefore of direct clinical value, particularly in primary-care or population-screening settings where a specialist laboratory workup (e.g. oral glucose tolerance testing) is not always feasible.

This task uses NHANES data, which captures exactly the kind of measurements available in a standard clinical encounter: age, BMI, waist circumference, blood pressure, fasting glucose, glycated haemoglobin (HbA1c), fasting insulin, HDL and LDL cholesterol, triglycerides, serum creatinine, estimated glomerular filtration rate (eGFR), and C-reactive protein. The label (Diabetes_Type) is derived from self-reported physician diagnosis and laboratory criteria, providing a realistic (if imperfect) ground truth representative of how T2D is ascertained in epidemiological studies.

FedMAP in this Context

Standard federated averaging (FedAvg) applies the same aggregation rule to every client, combining their updates as a mean weighted by local sample count. FedMAP extends this by learning a per-parameter prior using Input-Convex Neural Networks (ICNNs), which captures the distribution of model parameters across clients. Each client's local objective then includes a MAP (maximum a posteriori) penalty that regularises its updates toward this learned prior. This is especially relevant in healthcare settings, where sites genuinely differ in their data distributions.
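To make the contrast concrete, here is a minimal pure-Python sketch of the two ideas in the paragraph above: a sample-size-weighted mean of client parameters, and a local objective with an added prior penalty. The function names (fedavg, map_objective) and the strength parameter are hypothetical, and the quadratic penalty is only a Gaussian-prior stand-in. FedMAP learns its penalty with an ICNN, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]],
           num_examples: Sequence[int]) -> List[float]:
    """Sample-size-weighted mean of client parameter vectors (FedAvg-style)."""
    total = sum(num_examples)
    dim = len(client_params[0])
    return [
        sum(n * params[i] for params, n in zip(client_params, num_examples)) / total
        for i in range(dim)
    ]


def map_objective(data_loss: float,
                  theta: Sequence[float],
                  prior_mean: Sequence[float],
                  strength: float) -> float:
    """Local data loss plus a penalty that pulls theta toward prior_mean.

    The quadratic term is a stand-in (Gaussian prior), not the learned
    ICNN prior that FedMAP uses.
    """
    penalty = 0.5 * strength * sum((t - m) ** 2 for t, m in zip(theta, prior_mean))
    return data_loss + penalty


# Two clients with 100 and 300 examples; the larger client dominates the mean.
global_params = fedavg([[1.0, 2.0], [3.0, 6.0]], num_examples=[100, 300])
print(global_params)  # [2.5, 5.0]

# A client whose parameters sit away from the prior mean pays an extra penalty.
print(map_objective(0.4, [3.0, 6.0], global_params, strength=0.1))
```

With a fixed quadratic penalty, every parameter is pulled toward the prior mean with the same shape of penalty; a learned prior can instead adapt the penalty to how parameters actually vary across sites.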

Simulating Multi-Site Heterogeneity

The underlying dataset (Hugging Face rtweera/nhanes-data-converted) is a single pooled collection derived from multiple NHANES survey cycles. Because the original cycle identifiers have been removed during preprocessing, we cannot partition by survey cycle to obtain naturally distinct "sites".

Instead, we use Flower Datasets' DirichletPartitioner with alpha = 0.5 to split the data into N_PARTITIONS synthetic federated clients with heterogeneous label distributions. Lower alpha values produce greater label skew between clients, simulating the kind of prevalence differences one might see between, say, an endocrinology referral centre (high T2D prevalence) and a general primary care practice (lower prevalence). This is a standard approach in the FL benchmarking literature.
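The effect of alpha can be illustrated without Flower. The sketch below is a simplified, standard-library-only version of Dirichlet label-skew partitioning: for each class, it draws client shares from a symmetric Dirichlet(alpha) and splits that class's rows accordingly. It is not the DirichletPartitioner implementation (which, among other things, can enforce a minimum partition size), and the helper names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dirichlet_sample(alpha: float, k: int, rng: random.Random) -> List[float]:
    """Draw one symmetric Dirichlet(alpha) sample via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_partition(labels: Sequence[int], num_partitions: int,
                        alpha: float, seed: int = 0) -> Dict[int, List[int]]:
    """Assign row indices to partitions with per-class Dirichlet shares."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    partitions: Dict[int, List[int]] = {pid: [] for pid in range(num_partitions)}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet_sample(alpha, num_partitions, rng)
        start, cumulative = 0, 0.0
        for pid, share in enumerate(shares):
            cumulative += share
            # The last partition takes the remainder so every row is assigned once.
            if pid == num_partitions - 1:
                end = len(indices)
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            partitions[pid].extend(indices[start:end])
            start = end
    return partitions


# Toy data: 20% positive labels overall, split across 4 synthetic sites.
labels = [0] * 80 + [1] * 20
parts = dirichlet_partition(labels, num_partitions=4, alpha=0.5, seed=1)
for pid, idxs in parts.items():
    positives = sum(labels[i] for i in idxs)
    print(f"site {pid}: n={len(idxs)}, positives={positives}")
```

Rerunning with a larger alpha (for example 100) should give sites whose prevalence sits much closer to the pooled 20%, which is the behaviour the alpha parameter is meant to control.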

Key Files

File	Purpose
nhanes_diabetes/nhanes_data.py	Dataset constants, feature specification, and build_Xy() to extract features and labels from a partition DataFrame.
nhanes_diabetes/model.py	MLP model and local training utilities with early stopping.
nhanes_diabetes/fedmap_loss.py	FedMAP loss function, ICNN prior modules, and convexity enforcement.
nhanes_diabetes/fedmap_strategy.py	Flower Strategy subclass implementing FedMAP server-side aggregation and ICNN training.
nhanes_diabetes/client_app.py	Flower ClientApp implementing FedMAP local training and evaluation.
nhanes_diabetes/server_app.py	Flower ServerApp wiring the FedMAP strategy, global evaluation, and ICNN initialisation.
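The table mentions convexity enforcement for the ICNN prior. As a rough illustration of what that can mean, the sketch below builds a one-hidden-layer input-convex network in pure Python and keeps it convex by clamping the weights applied to the hidden activations to be non-negative. The class TinyICNN and its method names are hypothetical, and the actual modules in fedmap_loss.py may be structured differently (for instance as PyTorch modules with more layers).

```python
import math
from typing import List, Sequence


def softplus(v: float) -> float:
    """Numerically stable log(1 + exp(v)); convex and non-decreasing."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


class TinyICNN:
    """f(x) = w_z . softplus(W_in x + b_in) + w_skip . x, convex in x when w_z >= 0."""

    def __init__(self, w_in: List[List[float]], b_in: List[float],
                 w_z: List[float], w_skip: List[float]) -> None:
        self.w_in, self.b_in, self.w_z, self.w_skip = w_in, b_in, w_z, w_skip

    def enforce_convexity(self) -> None:
        # Non-negative weights on convex activations keep the sum convex.
        self.w_z = [max(w, 0.0) for w in self.w_z]

    def __call__(self, x: Sequence[float]) -> float:
        hidden = [
            softplus(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w_in, self.b_in)
        ]
        convex_part = sum(wz * h for wz, h in zip(self.w_z, hidden))
        linear_part = sum(ws * xi for ws, xi in zip(self.w_skip, x))
        return convex_part + linear_part


net = TinyICNN(
    w_in=[[1.0, -2.0], [0.5, 0.3]],
    b_in=[0.0, -1.0],
    w_z=[0.7, -0.4],   # the negative entry would break convexity
    w_skip=[0.1, -0.2],
)
net.enforce_convexity()  # typically applied after each optimiser step
print(net.w_z)  # [0.7, 0.0]
```

The input weights and the skip connection are left unconstrained because an affine map followed by a convex, non-decreasing activation is still convex in the input; only the weights that combine those activations need to stay non-negative.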

Features Used

The model uses 14 clinical features drawn from three domains:

Demographic: age (years).

Physical examination: BMI, waist circumference, resting pulse.

Laboratory: fasting plasma glucose, glycated haemoglobin (HbA1c), fasting insulin, HDL cholesterol, LDL cholesterol, triglycerides, total cholesterol, serum creatinine, estimated GFR, and C-reactive protein.

Missing values are imputed with per-column medians. Rows with the label value "Skipped" are excluded; the binary task is T2D vs. not diabetic.
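The preprocessing described above can be sketched as follows. This is not the repository's build_Xy(); it is a standard-library illustration that assumes rows arrive as dictionaries, that missing values are None, and that medians are computed over the rows kept after filtering. The column names "age" and "bmi" and the label strings "T2D" and "No diabetes" are placeholders; only Diabetes_Type and "Skipped" come from the description above.

```python
from statistics import median
from typing import Dict, List, Sequence, Tuple


def build_xy_sketch(rows: Sequence[Dict[str, object]],
                    feature_names: Sequence[str],
                    label_key: str = "Diabetes_Type",
                    positive_label: str = "T2D",  # placeholder label string
                    ) -> Tuple[List[List[float]], List[int]]:
    """Drop 'Skipped' rows, median-impute missing features, binarise the label."""
    kept = [r for r in rows if r[label_key] != "Skipped"]

    # Per-column medians over observed (non-missing) values only.
    medians = {
        name: median(r[name] for r in kept if r[name] is not None)
        for name in feature_names
    }

    X = [
        [r[name] if r[name] is not None else medians[name] for name in feature_names]
        for r in kept
    ]
    y = [1 if r[label_key] == positive_label else 0 for r in kept]
    return X, y


rows = [
    {"age": 50, "bmi": 30.0, "Diabetes_Type": "T2D"},
    {"age": None, "bmi": 24.0, "Diabetes_Type": "No diabetes"},
    {"age": 40, "bmi": None, "Diabetes_Type": "No diabetes"},
    {"age": 70, "bmi": 35.0, "Diabetes_Type": "Skipped"},
]
X, y = build_xy_sketch(rows, feature_names=["age", "bmi"])
print(X)  # [[50, 30.0], [45.0, 24.0], [40, 27.0]]
print(y)  # [1, 0, 0]
```

Whether medians are computed per partition or over the pooled data matters in a federated setting, since pooled statistics would not be available to a real site; the sketch leaves that choice to the caller by operating on whatever rows it is given.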

Quick Start

Install project dependencies:

pip install -e .

Run the simulated federated experiment locally:

flwr run .

Configuration

Key parameters are set in pyproject.toml under [tool.flwr.app.config]:

Parameter	Default	Description
num-server-rounds	100	Number of federated rounds
fraction-evaluate	1.0	Fraction of clients used for evaluation
local-epochs	5	Local training epochs per round
learning-rate	0.001	Client-side learning rate
batch-size	32	Client-side mini-batch size

Partitioning parameters (N_PARTITIONS, DIRICHLET_ALPHA) are defined in nhanes_data.py.

Data Source and Licensing

The dataset is rtweera/nhanes-data-converted on Hugging Face (Apache-2.0 licence). It is derived from the CDC's National Health and Nutrition Examination Survey (NHANES), which is collected under the authority of the National Center for Health Statistics and released as public-use data. NHANES data are produced by a US government agency and are therefore in the public domain under 17 U.S.C. § 105.