Federated Analytics with OMOP CDM using Flower¶

View on GitHub

This example will show you how you can use Flower to run federated analytics workloads on distributed SQL databases, which are applicable to many biostatistics and healthcare use cases. You will use an artificial dataset generated in adherence to the OMOP Common Data Model, which uses the OHDSI standardized vocabularies widely adopted in clinical domains. You will also run this example with Flower’s Deployment Engine to demonstrate how each SuperNode can be configured to connect to different PostgreSQL databases, respectively, while the database connection will be handled using SQLAlchemy and the psycopg adapter which is the latest implementation of the PostgreSQL adapter for Python.

Set up the project¶

Prerequisites¶

This example will make use of PostgreSQL databases running in containers. Make sure you have Docker and Docker Compose v2 installed.

Clone the project¶

After cloning the project, this will create a new directory called federated-analytics containing the following files:

federated-analytics
├── federated-analytics
│   ├── client_app.py   # Defines your ClientApp
│   ├── server_app.py   # Defines your ServerApp
│   └── task.py         # Defines your database connection and data loading
├── pyproject.toml      # Project metadata like dependencies and configs
└── README.md

Install dependencies and project¶

Install the dependencies defined in pyproject.toml as well as the federated_analytics package.

# From a new python environment, run:
pip install -e .

Start PostgreSQL databases¶

First, create a database initialization script that defines the OMOP CDM table structure and generates sample data. Create a file named db_init.sh:

#!/usr/bin/env bash

SEED=${DB_SEED:-0.42}

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE TABLE person_measurements (
        person_id         INTEGER PRIMARY KEY,
        age               INTEGER,
        bmi               FLOAT,
        systolic_bp       INTEGER,
        diastolic_bp      INTEGER,
        ldl_cholesterol   FLOAT,
        hba1c             FLOAT
    );

    SELECT setseed($SEED);

    INSERT INTO person_measurements
    SELECT
        gs AS person_id,
        20 + (random() * 60)::INT        AS age,
        18 + (random() * 15)             AS bmi,
        100 + (random() * 40)::INT       AS systolic_bp,
        60 + (random() * 25)::INT        AS diastolic_bp,
        70 + (random() * 120)            AS ldl_cholesterol,
        4.5 + (random() * 4)             AS hba1c
    FROM generate_series(1, 100) gs;
EOSQL

Make it executable:

chmod +x db_init.sh

Next, create a script to start the PostgreSQL containers. Create a file named db_start.sh:

#!/usr/bin/env bash

set -e

N=${1:-2}   # number of PostgreSQL databases (default = 2)
BASE_PORT=5433

{
    echo "services:"

    for i in $(seq 1 "$N"); do
        PORT=$((BASE_PORT + i - 1))
        # Set a seed for each of the database for producing different random data
        SEED=$(echo "scale=2; $i / 100" | bc)
        cat <<EOF
  postgres_$i:
    image: postgres:18
    container_name: postgres_$i
    environment:
      POSTGRES_USER: flwrlabs
      POSTGRES_PASSWORD: flwrlabs
      POSTGRES_DB: flwrlabs
      DB_SEED: $SEED
    ports:
      - "$PORT:5432"
    volumes:
      - ./db_init.sh:/docker-entrypoint-initdb.d/init.sh:ro

EOF
    done
} | docker compose -f - up -d

Make it executable and run it to start two PostgreSQL databases:

chmod +x db_start.sh
./db_start.sh

[!NOTE] To start more than two databases, pass the desired number as the first argument to the script, e.g. ./db_start.sh 3.

Update Flower Config¶

SuperLink connections are defined in the Flower Configuration file. If it doesn’t exist, this file is created automatically for you when you use the Flower CLI command. Open the config.toml file (usually located at $HOME/.flwr) and add a new SuperLink connection at the end:

[superlink.local-deployment]
address = "127.0.0.1:9093"
insecure = true

Run with the Deployment Engine¶

For a basic execution of this federated analytics app, activate your environment and start the SuperLink process in insecure mode:

flower-superlink --insecure

Next, start two Supernodes and connect them to the SuperLink. You will need to specify different --node-config values so that each SuperNode will connect to different PostgreSQL databases.

The db-url parameter accepts a standard PostgreSQL connection string in the format postgresql+psycopg://username:password@host:port/database. Each SuperNode should point to a different database port (5433 and 5434) corresponding to the containers started by db_start.sh. In this example, the database URLs are set individually and passed to the --node-config parameter:

flower-supernode \
     --insecure \
     --superlink 127.0.0.1:9092 \
     --clientappio-api-address 127.0.0.1:9094 \
     --node-config="db-url='postgresql+psycopg://flwrlabs:flwrlabs@localhost:5433/flwrlabs'"
flower-supernode \
     --insecure \
     --superlink 127.0.0.1:9092 \
     --clientappio-api-address 127.0.0.1:9095 \
     --node-config="db-url='postgresql+psycopg://flwrlabs:flwrlabs@localhost:5434/flwrlabs'"

Finally, run the Flower App and follow the ServerApp logs to track the execution of the run:

flwr run . local-deployment --stream

You can also override some of the settings for your ClientApp and ServerApp defined in pyproject.toml. For example:

flwr run . local-deployment --run-config "selected-features='age,bmi'" --stream

The steps above are adapted from this how-to guide. After that, you might be intersted in setting up secure TLS-enabled communications and SuperNode authentication in your federation.

If you are already familiar with how the Deployment Engine works, you may want to learn how to run it using Docker. Check out the Flower with Docker documentation.