Register or Log In
Published

Federated Datasets in Research

Photo of Adam Narożniak
Adam Narożniak
Machine Learning Engineer at Flower Labs

Share this post

In this blog post, we present an overview of the datasets used for FL research in ICLR 2024, NeurIPS 2023, and ICML 2024, and in the final section, we include a brief example showing how to use these datasets in Python with Flower Datasets.

Data collection

The search query included just the "federated" keyword. It resulted in:

  • ICLR 2024: 51 papers.
  • NeurIPS 2023: 59 papers.
  • ICML 2024: 60 papers.

It gave 170 papers in total from which we extracted the names of the datasets. Five of the papers were concerned with vertical FL; the rest were about horizontal FL, and in this analysis, we do not distinguish between them.

Data overview

In total, 217 unique datasets were used. However, 167 (which is about 75%) of them were used only once! Such high usage of unique datasets makes the comparison between different FL papers hard. Below, we present a summary of this data with distinction between the specific conferences:

Venue# FL PapersUnique Datasets# Datasets Used 1# Datasets Used >1
Combined17021716156
ICRL2024511189226
ICML202460917714
NeurIPS202359725418

Out of the 56 datasets that were used more than once, we will present the top 20 datasets now. The datasets presented there were used at least 4 times. See the plot below:

The vast majority of the datasets are in the image domain, and the most popular are the CIFARs dataset. 15 papers used custom synthetic datasets (not necessarily the same), and 8 papers were concerned only with theory and did not use datasets. The most popular tabular dataset was adult (aka census income dataset), and the most popular text dataset was sentiment140 (aka twitter sentiment dataset). The most popular audio dataset - speech commands - was below the top 20, with 3 usages.

In the table below, we show the dataset count information once again next to the links to them on 🤗 Hugging Face. Each of these datasets can be used in the Flower Datasets library with various partitioning schemes.

DatasetCountHF Name
cifar10101uoft-cs/cifar10
cifar10048uoft-cs/cifar100
mnist44ylecun/mnist
fashionmnist36zalando-datasets/fashion_mnist
svhn18ufldl-stanford/svhn
tinyimagenet18zh-plus/tiny-imagenet
femnist18flwrlabs/femnist
pacs8flwrlabs/pacs
domainnet7wltjr1007/DomainNet
shakespeare6flwrlabs/shakespeare
celeba7flwrlabs/celeba
cinic106flwrlabs/cinic10
officecaltech105-
fedisic20195flwrlabs/fed-isic2019
cifar10c4-
adult4meghana/adult_income_dataset
sentiment1404stanfordnlp/sentiment140

Below, we also present the top 10 datasets used in each conference.

Partitioning Schemes

The information about the partitioning schemes in the analyzed paper was not always complete. This leaves it up to interpretation of how the datasets were exactly partitioned, which can lead to various differences between the methods. Therefore, we don't provide the numerical results but rather present the most common choices more informally.

However, the most popular simulated partitioning schemes include:

  1. Dirichlet-based partitioning scheme,
  2. Pathological partitioning scheme (constraining the number of classes per client) or the approximation of pathological - shard partitioning.

We noticed a few other ways of dataset division, e.g., a mixture of pathological partitioning scheme (such that p of the samples belong to a selected number of classes) and iid (for the remaining 1-p samples), or in terms of time series data a division such that each client has data for only a specific day. Also, other types of heterogeneity were simulated beyond the label skew, such as image rotations or flips.

Getting started with Flower Datasets

If you'd like to know more about how to use these datasets in your Federated Learning experiments, visit Flower Datasets documentation.

Here's the basic code that can enable you to match what was done in the research:


from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import DirichletPartitioner

partitioner = DirichletPartitioner(
  num_partitions=10,
  partition_by="label",
  alpha=0.3,
)
fds = FederatedDataset(
  dataset="cifar10",
  partitioners={"train": partitioner},
)
partition = fds.load_partition(partition_id=0)

And you don't even need to know how to code to create it, visit this website to interactively create the dataset division you need.


Share this post