FLANDERS: Protecting Federated Learning from Extreme Model Poisoning Attacks via Multidimensional Time Series Anomaly Detection

Note: If you use this baseline in your work, please remember to cite the original authors of the paper as well as the Flower paper.

Paper: arxiv.org/abs/2303.16668

Authors: Edoardo Gabrielli, Gabriele Tolomei, Dimitri Belli, Vittorio Miori

Abstract: Current defense mechanisms against model poisoning attacks in federated learning (FL) systems have proven effective up to a certain threshold of malicious clients. In this work, we introduce FLANDERS, a novel pre-aggregation filter for FL resilient to large-scale model poisoning attacks, i.e., when malicious clients far exceed legitimate participants. FLANDERS treats the sequence of local models sent by clients in each FL round as a matrix-valued time series. Then, it identifies malicious client updates as outliers in this time series by comparing actual observations with estimates generated by a matrix autoregressive forecasting model maintained by the server. Experiments conducted in several non-iid FL setups show that FLANDERS significantly improves robustness across a wide spectrum of attacks when paired with standard and robust existing aggregation methods.

About this baseline

What’s implemented: The code in this directory replicates the results of FLANDERS+[baseline] on MNIST and Fashion-MNIST under all attack settings: Gaussian, LIE, OPT, and AGR-MM, with \(r=[0.2,0.6,0.8]\) (the fraction of malicious clients). Specifically, it reproduces Tables 1, 3, 10, 11, 15, 17, 19, 20 and Figure 3.

Datasets: MNIST, FMNIST

Hardware Setup: AMD Ryzen 9, 64 GB of RAM, and an NVIDIA RTX 4090 GPU with 24 GB of VRAM.

Estimated time to run: On the setup above, experiments without attacks take about 2 minutes on MNIST and 3 minutes on Fashion-MNIST. On an Apple M2 Pro with 16 GB of RAM, each MNIST experiment with 10 clients takes about 24 minutes. Note that experiments with OPT (fang) and AGR-MM (minmax) attacks can be up to 5x slower.

Contributors: Edoardo Gabrielli, Sapienza University of Rome (GitHub, Scholar)

Experimental Setup

Please check out Appendices F and G of the paper for a comprehensive overview of the hyperparameter setup; a summary is given below.

Task: Image classification

Models (a code sketch of both architectures follows this list):

MNIST (multiclass classification, fully connected, feed-forward NN):

  • Multilayer Perceptron (MLP)

  • minimizing multiclass cross-entropy loss with the Adam optimizer

  • input: 784

  • hidden layer 1: 128

  • hidden layer 2: 256

Fashion-MNIST (multiclass classification, fully connected, feed-forward NN):

  • Multilayer Perceptron (MLP)

  • minimizing multiclass cross-entropy loss with the Adam optimizer

  • input: 784

  • hidden layer 1: 256

  • hidden layer 2: 128

  • hidden layer 3: 64
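
For reference, the two architectures above can be sketched in PyTorch roughly as follows. This is an illustration only: the number of output classes (10), the ReLU activations, and the dropout placement are assumptions, and the actual models live in the flanders package and may differ.

```python
import torch
import torch.nn as nn

class MnistMLP(nn.Module):
    """Sketch of the MNIST MLP: 784 -> 128 -> 256 -> 10 (dropout placement assumed)."""
    def __init__(self, dropout: float = 0.2, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class FmnistMLP(nn.Module):
    """Sketch of the Fashion-MNIST MLP: 784 -> 256 -> 128 -> 64 -> 10."""
    def __init__(self, dropout: float = 0.2, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Training objective from the hyperparameter table below:
# multiclass cross-entropy minimized with Adam, learning rate 1e-3.
model = MnistMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```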

Dataset: Each dataset is partitioned into two disjoint sets: 80% for training and 20% for testing. The training set is distributed across all 100 clients using a Dirichlet distribution with \(\alpha=0.5\), simulating a highly non-i.i.d. scenario, while the testing set is uniform and held by the server to evaluate the global model centrally (a sketch of such a Dirichlet split is given after the table below).

| Description | Default Value |
| --- | --- |
| Partitions | 100 |
| Evaluation | centralized |
| Training set | 80% |
| Testing set | 20% |
| Distribution | Dirichlet |
| \(\alpha\) | 0.5 |
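
The label-based Dirichlet split with \(\alpha=0.5\) over 100 partitions can be illustrated with the minimal NumPy sketch below. It is an example of the idea only, not the partitioning code used in this baseline.

```python
import numpy as np

def dirichlet_partition(labels: np.ndarray, num_partitions: int = 100,
                        alpha: float = 0.5, seed: int = 0) -> list[np.ndarray]:
    """Split sample indices into `num_partitions` client shards.

    For each class, draw proportions from Dirichlet(alpha) and assign that
    class's samples to clients accordingly, which is a common way to simulate
    non-i.i.d. federated data.
    """
    rng = np.random.default_rng(seed)
    partitions: list[list[int]] = [[] for _ in range(num_partitions)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_partitions))
        # Cut points that split this class's indices according to the proportions.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for pid, shard in enumerate(np.split(idx, cuts)):
            partitions[pid].extend(shard.tolist())
    return [np.array(p) for p in partitions]
```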

Training Hyperparameters:

| Dataset | # of clients | Clients per round | # of rounds | Batch size | Learning rate | Optimizer | Dropout | Alpha | Beta | # of clients to keep | Sampling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MNIST | 100 | 100 | 50 | 32 | \(10^{-3}\) | Adam | 0.2 | 0.0 | 0.0 | \(m - b\) | 500 |
| FMNIST | 100 | 100 | 50 | 32 | \(10^{-3}\) | Adam | 0.2 | 0.0 | 0.0 | \(m - b\) | 500 |

Here, \(m\) is the number of clients participating in the n-th round and \(b\) is the number of malicious clients, so \(m - b\) clients are kept after filtering. The sampling value specifies how many model parameters the MAR forecasting model analyzes (500 in these experiments).
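
To make these hyperparameters concrete, below is a heavily simplified sketch of the pre-aggregation filtering idea from the paper: the server keeps a history of (sampled) client parameter matrices, forecasts the next observation per client, and keeps the \(m - b\) clients whose updates are closest to their forecasts. It is an illustration only; the forecast here is a naive placeholder (last observation) rather than the paper's matrix autoregressive (MAR) estimator, and all names are hypothetical.

```python
import numpy as np

def select_clients(history: np.ndarray, current: np.ndarray, to_keep: int) -> np.ndarray:
    """Toy pre-aggregation filter in the spirit of FLANDERS.

    history: past parameter matrices, shape (rounds, clients, sampled_params)
    current: this round's parameter matrix, shape (clients, sampled_params)
    to_keep: number of clients to keep, i.e. m - b

    Returns the indices of the `to_keep` least anomalous clients.
    """
    # Placeholder forecast: repeat each client's last observation.
    # FLANDERS instead fits a matrix autoregressive (MAR) model on `history`.
    forecast = history[-1]
    # Anomaly score = distance between observed and forecast parameters.
    scores = np.linalg.norm(current - forecast, axis=1)
    return np.argsort(scores)[:to_keep]

# Hypothetical usage: 100 clients, 500 sampled parameters, keep m - b = 80 clients.
rng = np.random.default_rng(0)
history = rng.normal(size=(5, 100, 500))
current = rng.normal(size=(100, 500))
kept = select_clients(history, current, to_keep=80)
```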

Environment Setup

# Use a version of Python >=3.9 and <3.12.0.
pyenv local 3.10.12
poetry env use 3.10.12

# Install everything from the toml
poetry install

# Activate the env
poetry shell

Running the Experiments

Ensure that the environment is properly set up, then run:

python -m flanders.main

This executes a single experiment with the default values in conf/base.yaml.

To run custom experiments, you can override the default values like this:

python -m flanders.main dataset=mnist server.attack_fn=lie server.num_malicious=1

To run multiple custom experiments:

python -m flanders.main --multirun dataset=mnist,fmnist server.attack_fn=gaussian,lie,fang,minmax server.num_malicious=0,1,2,3,4,5

Expected Results

To run all the experiments from the paper (for MNIST and Fashion-MNIST), I’ve set up a script:

sh run.sh

Running it will write all results to outputs/all_results.csv. To generate the plots and tables shown below, use the notebook in the plotting/ directory.
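
For a quick look at the results without opening the notebook, a minimal pandas sketch along these lines can be used; the column names (dataset, attack_fn, num_malicious, accuracy) are assumptions and may not match the actual schema of all_results.csv.

```python
import pandas as pd

# Load the aggregated results produced by run.sh.
df = pd.read_csv("outputs/all_results.csv")

# Hypothetical column names: adjust them to the actual CSV schema.
summary = (
    df.groupby(["dataset", "attack_fn", "num_malicious"])["accuracy"]
    .max()
    .unstack("num_malicious")
)
print(summary)
```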

Accuracy over multiple rounds

(left) MNIST, FLANDERS+FedAvg with 80% of malicious clients (b = 80); (right) Vanilla FedAvg in the same setting:

[Figure: accuracy over rounds]

Precision and Recall of FLANDERS

b = 20: [figure]

b = 60: [figure]

b = 80: [figure]

Accuracy w.r.t. number of attackers:

b = 0: [figure]

b = 20: [figure]

b = 60: [figure]

b = 80: [figure]