
@sainzpardo/ai4os-fedllm-medical

flwr new @sainzpardo/ai4os-fedllm-medical

FlowerTune LLM on Medical Dataset

Evaluation on the three baseline datasets:

|         | PubMedQA | MedMCQA | MedQA | Avg   |
|---------|----------|---------|-------|-------|
| Acc (%) | 72.60    | 58.64   | 63.39 | 64.88 |

Communication budget: 1040.31 MB used (as of the 5th round).

Evaluation of the proposed baseline model:

|         | PubMedQA | MedMCQA | MedQA | Avg   |
|---------|----------|---------|-------|-------|
| Acc (%) | 59.00    | 23.69   | 27.10 | 36.60 |
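As a sanity check, the Avg column in both accuracy tables is the unweighted mean of the three per-dataset accuracies. A quick sketch, with the values copied from the tables above:

```python
# Per-dataset accuracies (%) copied from the tables above.
finetuned = {"PubMedQA": 72.60, "MedMCQA": 58.64, "MedQA": 63.39}
baseline = {"PubMedQA": 59.00, "MedMCQA": 23.69, "MedQA": 27.10}

def avg(scores):
    """Unweighted mean accuracy, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)

print(avg(finetuned))  # 64.88
print(avg(baseline))   # 36.6
```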

Introduction

This directory conducts federated instruction tuning with a pretrained Mistral-7B model on a Medical dataset. We use Flower Datasets to download, partition, and preprocess the dataset. Flower's Simulation Engine simulates the LLM fine-tuning process in a federated way, which allows users to run the training on a single GPU.
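To illustrate what an IID partition means here (this is a stand-in sketch, not the actual flwr-datasets API): example indices are shuffled and split into equally sized shards, one per client.

```python
import random

def iid_partition(num_examples, num_partitions, seed=42):
    """Shuffle example indices and split them into equally sized
    shards -- a minimal stand-in for IID partitioning as done by
    Flower Datasets."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    size = num_examples // num_partitions
    return [indices[i * size:(i + 1) * size] for i in range(num_partitions)]

shards = iid_partition(num_examples=1000, num_partitions=20)
print(len(shards), len(shards[0]))  # 20 50
```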

Changes from baseline

  • Following the gains reported by the approach presented by the Gachon Cognitive Computing Lab, we use the fine-tuned ContactDoctor/Bio-Medical-Llama-3-8B model as the base model.
  • We train the model for 5 rounds. Although we set num-server-rounds = 20, we take the checkpoint obtained in round 5 (peft_5).
  • We train the model locally for 5 epochs: train.training-arguments.num-train-epochs = 5.
  • We use the FedAvgOpt aggregation function.
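The two configuration values mentioned above live in pyproject.toml. A sketch of the relevant fragment, assuming the key names and `[tool.flwr.app.config]` table used by the standard FlowerTune LLM template (verify against the actual file):

```toml
# Sketch only -- key names follow the standard FlowerTune template.
[tool.flwr.app.config]
num-server-rounds = 20
train.training-arguments.num-train-epochs = 5
```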

Methodology

This baseline performs federated LLM fine-tuning with LoRA using the 🤗PEFT library. The clients' models are aggregated with the FedAvg strategy. This provides a baseline performance for the Medical challenge leaderboard.
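FedAvg aggregates client updates as an example-count-weighted average. A self-contained sketch over plain parameter vectors (illustrative, not Flower's actual strategy code):

```python
def fedavg(client_updates):
    """Example-count-weighted average of client parameter vectors.

    client_updates: list of (num_examples, params) pairs, where params
    is a flat list of floats (e.g. flattened LoRA adapter weights).
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * params[i] for n, params in client_updates) / total_examples
        for i in range(dim)
    ]

# A client with 3x the data pulls the average 3x as hard.
print(fedavg([(100, [1.0, 2.0]), (300, [3.0, 4.0])]))  # [2.5, 3.5]
```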

Environments setup

Project dependencies are defined in pyproject.toml. Install them in an activated Python environment with:

pip install -e .

Experimental setup

The dataset is divided into 20 partitions in an IID fashion, and a partition is assigned to each ClientApp. We randomly sample a fraction (0.1) of the total nodes to participate in each round, for a total of 20 rounds (though we keep the checkpoint from round 5). All settings are defined in pyproject.toml.
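With 20 nodes and a sampling fraction of 0.1, each round selects 20 × 0.1 = 2 clients. A minimal sketch of that per-round sampling (illustrative only, not Flower's internal client manager):

```python
import random

def sample_clients(num_nodes=20, fraction=0.1, seed=None):
    """Randomly pick the fraction of nodes that train this round."""
    num_sampled = max(1, int(num_nodes * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_nodes), num_sampled))

print(len(sample_clients(seed=0)))  # 2
```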

Running the experiment

First, log in to Hugging Face:

huggingface-cli login

Then, run the experiment:

flwr run .

Evaluation on the three baseline datasets:

python eval.py --base-model-name-path="ContactDoctor/Bio-Medical-Llama-3-8B" --peft-path="peft_5" --batch-size=16 --quantization=4 --datasets=pubmedqa,medmcqa,medqa