FederatedDataset

class FederatedDataset(*, dataset: str, subset: str | None = None, preprocessor: Callable[[DatasetDict], DatasetDict] | dict[str, tuple[str, ...]] | None = None, partitioners: dict[str, Partitioner | int], shuffle: bool = True, seed: int | None = 42, **load_dataset_kwargs: Any)[source]

Bases: object

Representation of a dataset for federated learning/evaluation/analytics.

Download, partition data among clients (edge devices), or load full dataset.

Partitions are created on a per-split basis using Partitioners from flwr_datasets.partitioner, as specified in partitioners (see the partitioners parameter for more information).

Parameters:
  • dataset (str) – The name of the dataset in the Hugging Face Hub.

  • subset (Optional[str]) – Secondary information regarding the dataset, most often the subset or version (passed as name to datasets.load_dataset).

  • preprocessor (Optional[Union[Preprocessor, Dict[str, Tuple[str, ...]]]]) – Callable that transforms DatasetDict by resplitting, removing features, creating new features, performing any other preprocessing operation, or configuration dict for Merger. Applied after shuffling. If None, no operation is applied.

  • partitioners (Dict[str, Union[Partitioner, int]]) – A dictionary mapping the Dataset split (a str) to a Partitioner or an int (representing the number of IID partitions that this split should be partitioned into, i.e., using the default partitioner IidPartitioner). One or multiple Partitioner objects can be specified in that manner, but at most, one per split.

  • shuffle (bool) – Whether to randomize the order of samples. Applied prior to the preprocessing operations, separately to each split present in the dataset. It uses the seed argument. Defaults to True.

  • seed (Optional[int]) – Seed used for dataset shuffling. It has no effect if shuffle is False. The seed cannot be set in the later stages. If None, then fresh, unpredictable entropy will be pulled from the OS. Defaults to 42.

  • load_dataset_kwargs (Any) – Additional keyword arguments passed to the datasets.load_dataset function. The parameters currently in use map as follows: dataset => path (in load_dataset), subset => name (in load_dataset). You can pass, e.g., num_proc=4 or trust_remote_code=True. Do not pass any parameters that change the return type, such that something other than a DatasetDict is returned.
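To build intuition for the int shorthand in partitioners together with shuffle and seed, the following sketch shows the idea of IID partitioning: a seeded shuffle followed by contiguous equal-size chunks. iid_partition is a hypothetical helper for illustration only, not the library's IidPartitioner (which, among other things, also distributes remainder samples):

```python
import random

def iid_partition(num_samples: int, num_partitions: int, seed: int = 42) -> list[list[int]]:
    """Illustrative sketch of IID partitioning (NOT the real IidPartitioner).

    Shuffle all sample indices with a fixed seed, then cut them into
    contiguous, equal-size chunks; any remainder is dropped for simplicity.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # mirrors shuffle=True with a fixed seed
    size = num_samples // num_partitions
    return [indices[pid * size:(pid + 1) * size] for pid in range(num_partitions)]
```

With a fixed seed the partitioning is reproducible across runs, which matches the role of the seed parameter above.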

Examples

Use MNIST dataset for Federated Learning with 100 clients (edge devices):

>>> from flwr_datasets import FederatedDataset
>>>
>>> fds = FederatedDataset(dataset="mnist", partitioners={"train": 100})
>>> # Load partition for a client with ID 10.
>>> partition = fds.load_partition(10)
>>> # Use test split for centralized evaluation.
>>> centralized = fds.load_split("test")

Use CIFAR10 dataset for Federated Learning with 10 clients:

>>> from flwr_datasets import FederatedDataset
>>> from flwr_datasets.partitioner import DirichletPartitioner
>>>
>>> partitioner = DirichletPartitioner(num_partitions=10, partition_by="label",
>>>                                    alpha=0.5, min_partition_size=10)
>>> fds = FederatedDataset(dataset="cifar10", partitioners={"train": partitioner})
>>> partition = fds.load_partition(partition_id=0)
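The label skew that DirichletPartitioner produces can be sketched in plain Python. dirichlet_partition below is a hypothetical illustration, not the library's implementation (it ignores min_partition_size, for instance): for each class, per-partition proportions are drawn from a Dirichlet(alpha) distribution via normalized Gamma samples, so smaller alpha yields more heterogeneous label distributions:

```python
import random

def dirichlet_partition(labels, num_partitions, alpha, seed=42):
    """Illustrative sketch of Dirichlet label-skew partitioning (NOT the real one)."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(num_partitions)]
    for c in sorted(set(labels)):
        idxs = [i for i, y in enumerate(labels) if y == c]
        rng.shuffle(idxs)
        # Dirichlet(alpha) draw via normalized Gamma(alpha, 1) samples
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_partitions)]
        total = sum(draws)
        props = [d / total for d in draws]
        # Split this class's indices according to the drawn proportions
        start = 0
        for pid in range(num_partitions):
            if pid < num_partitions - 1:
                take = round(props[pid] * len(idxs))
            else:
                take = len(idxs) - start  # last partition gets the remainder
            partitions[pid].extend(idxs[start:start + take])
            start += take
    return partitions
```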

Visualize the partitioned datasets:

>>> from flwr_datasets.visualization import plot_label_distributions
>>>
>>> _ = plot_label_distributions(
>>>     partitioner=fds.partitioners["train"],
>>>     label_name="label",
>>>     legend=True,
>>> )

Methods

load_partition(partition_id[, split])

Load the partition specified by partition_id in the selected split.

load_split(split)

Load the full split of the dataset.

Attributes

partitioners

Dictionary mapping each split to its associated partitioner.

load_partition(partition_id: int, split: str | None = None) Dataset[source]

Load the partition specified by partition_id in the selected split.

The dataset is downloaded only when the first call to load_partition or load_split is made.

Parameters:
  • partition_id (int) – Partition index for the selected split, idx in {0, …, num_partitions - 1}.

  • split (Optional[str]) – Name of the (partitioned) split (e.g. “train”, “test”). You can skip this parameter if there is only one partitioner for the dataset. The name will be inferred automatically. For example, if partitioners={“train”: 10}, you do not need to provide this argument, but if partitioners={“train”: 10, “test”: 100}, you need to set it to differentiate which partitioner should be used. The split names you can choose from vary from dataset to dataset. You need to check the dataset on the `Hugging Face Hub <https://huggingface.co/datasets>`_ to see which splits are available. You can resplit the dataset by using the preprocessor parameter (to rename, merge, divide, etc. the available splits).

Returns:

partition – Single partition from the dataset split.

Return type:

Dataset

load_split(split: str) Dataset[source]

Load the full split of the dataset.

The dataset is downloaded only when the first call to load_partition or load_split is made.

Parameters:

split (str) – Split name of the downloaded dataset (e.g. “train”, “test”). The split names you can choose from vary from dataset to dataset. You need to check the dataset on the `Hugging Face Hub <https://huggingface.co/datasets>`_ to see which splits are available. You can resplit the dataset by using the preprocessor parameter (to rename, merge, divide, etc. the available splits).

Returns:

dataset_split – Part of the dataset identified by its split name.

Return type:

Dataset
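The download-on-first-call behavior noted for load_partition and load_split amounts to simple memoization. A minimal sketch, assuming a LazyDownload wrapper that is purely illustrative and not the library's internals:

```python
class LazyDownload:
    """Sketch of download-on-first-use: the loader runs once, on first access."""

    def __init__(self, download_fn):
        self._download_fn = download_fn
        self._dataset = None

    def get(self):
        if self._dataset is None:
            # Triggered only by the first load_partition / load_split call;
            # subsequent calls reuse the cached result.
            self._dataset = self._download_fn()
        return self._dataset
```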

property partitioners: dict[str, Partitioner]

Dictionary mapping each split to its associated partitioner.

The returned partitioners have the splits of the dataset assigned to them.