PathologicalPartitioner

class PathologicalPartitioner(num_partitions: int, partition_by: str, num_classes_per_partition: int, class_assignment_mode: Literal['random', 'deterministic', 'first-deterministic'] = 'random', shuffle: bool = True, seed: int | None = 42)

Bases: Partitioner

Partition dataset such that each partition has a chosen number of classes.

Implementation based on Federated Learning on Non-IID Data Silos: An Experimental Study https://arxiv.org/pdf/2102.02079.

The algorithm first determines which classes are assigned to which partitions. For each partition, num_classes_per_partition classes are sampled in the way specified by class_assignment_mode. Once the classes required by each partition are known, the algorithm determines into how many parts the samples of each class must be divided. This division is performed for every class.
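The division step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `split_samples_by_class`, the `partition_to_classes` dictionary, and the choice of near-equal contiguous chunks are all hypothetical.

```python
from collections import defaultdict


def split_samples_by_class(labels, partition_to_classes):
    """Return {partition_id: [sample indices]} for a given class assignment.

    Hypothetical helper for illustration only.
    """
    # Group sample indices by their label.
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    # Find which partitions requested each class.
    partitions_by_class = defaultdict(list)
    for partition_id, classes in partition_to_classes.items():
        for cls in classes:
            partitions_by_class[cls].append(partition_id)

    # Divide each class's samples into as many parts as there are
    # partitions that need that class (assumed: near-equal chunks).
    partition_to_indices = {pid: [] for pid in partition_to_classes}
    for cls, partition_ids in partitions_by_class.items():
        indices = indices_by_class[cls]
        base, extra = divmod(len(indices), len(partition_ids))
        start = 0
        for position, partition_id in enumerate(partition_ids):
            # The first `extra` parts receive one additional sample.
            size = base + (1 if position < extra else 0)
            partition_to_indices[partition_id].extend(indices[start:start + size])
            start += size
    return partition_to_indices


# Nine samples, three classes, three partitions with two classes each.
labels = [0, 0, 0, 0, 1, 1, 2, 2, 2]
partition_to_classes = {0: [0, 1], 1: [1, 2], 2: [2, 0]}
result = split_samples_by_class(labels, partition_to_classes)
```

In this toy setup every class is requested by two partitions, so each class's samples are split into two parts and every sample ends up in exactly one partition.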

Parameters:
  • num_partitions (int) – The total number of partitions that the data will be divided into.

  • partition_by (str) – Column name of the labels (targets) based on which partitioning works.

  • num_classes_per_partition (int) – The (exact) number of unique classes that each partition will have.

  • class_assignment_mode (Literal["random", "deterministic", "first-deterministic"]) –

    How the classes are assigned to the partitions. The default is "random". The possible values are:

    • "random": Randomly assign classes to the partitions. For each partition, choose num_classes_per_partition classes without replacement.

    • "first-deterministic": Assign the first class of each partition in a deterministic way (the class id is partition_id % num_unique_classes). The remaining classes are assigned randomly. If the number of partitions is smaller than the number of unique classes, not all classes will be used in the first iteration; otherwise, all classes will be used (so each class is present in at least one partition).

    • "deterministic": Assign all the classes to the partitions in a deterministic way. The partition with id partition_id has the classes identified by the indices (partition_id + i) % num_unique_classes, where i is in {0, …, num_classes_per_partition - 1}. So, partition 0 will have classes 0, 1, 2, …, num_classes_per_partition - 1, partition 1 will have classes 1, 2, 3, …, num_classes_per_partition, and so on.

    The list of unique labels is sorted in ascending order. If the labels are numbers starting from zero, the class id corresponds to the number itself. class_assignment_mode="first-deterministic" was used in the original paper; the other modes are provided as additional options.

  • shuffle (bool) – Whether to randomize the order of samples. Shuffling is applied after the samples are assigned to partitions.

  • seed (int | None) – Seed used for dataset shuffling. It has no effect if shuffle is False.
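The three class_assignment_mode values described above can be sketched for a single partition as follows. This is an illustration of the documented rules only: the function `assign_classes`, its `rng` argument, and the use of Python's `random` module are assumptions for the example and do not reflect how the library implements the sampling.

```python
import random


def assign_classes(partition_id, unique_classes, num_classes_per_partition,
                   mode="random", rng=None):
    """Pick the classes for one partition (hypothetical helper)."""
    if rng is None:
        rng = random.Random(42)
    # The unique labels are sorted in ascending order, as documented.
    classes = sorted(unique_classes)
    n = len(classes)
    if mode == "deterministic":
        # (partition_id + i) % num_unique_classes for i = 0, ..., k - 1.
        return [classes[(partition_id + i) % n]
                for i in range(num_classes_per_partition)]
    if mode == "first-deterministic":
        # First class is fixed; the rest are drawn without replacement.
        first = classes[partition_id % n]
        remaining = [c for c in classes if c != first]
        return [first] + rng.sample(remaining, num_classes_per_partition - 1)
    if mode == "random":
        return rng.sample(classes, num_classes_per_partition)
    raise ValueError(f"Unknown class_assignment_mode: {mode!r}")


digits = range(10)
deterministic = [assign_classes(pid, digits, 2, "deterministic")
                 for pid in range(3)]
wrapped = assign_classes(9, digits, 2, "deterministic")
first_det = assign_classes(3, digits, 2, "first-deterministic",
                           rng=random.Random(0))
fully_random = assign_classes(3, digits, 2, "random", rng=random.Random(0))
```

With ten classes and two classes per partition, the deterministic mode gives partitions 0, 1, and 2 the class pairs (0, 1), (1, 2), and (2, 3), and partition 9 wraps around to (9, 0). In the first-deterministic mode only the first class of partition 3 is fixed (class 3); the second depends on the random generator.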

Examples

To mimic the original behavior of the paper, use the setup below (class_assignment_mode="first-deterministic"):

>>> from flwr_datasets.partitioner import PathologicalPartitioner
>>> from flwr_datasets import FederatedDataset
>>>
>>> partitioner = PathologicalPartitioner(
...     num_partitions=10,
...     partition_by="label",
...     num_classes_per_partition=2,
...     class_assignment_mode="first-deterministic",
... )
>>> fds = FederatedDataset(dataset="mnist", partitioners={"train": partitioner})
>>> partition = fds.load_partition(0)
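To check that a partition contains the expected number of classes, the label column can be counted. The sketch below uses a small hypothetical helper, `label_counts`, and a plain dictionary as a stand-in for a loaded partition so that it runs without downloading any data. With the setup above, one would presumably pass `partition` and the `partition_by` column name instead.

```python
from collections import Counter


def label_counts(partition, label_column="label"):
    """Count how often each label occurs in a partition (hypothetical helper)."""
    return Counter(partition[label_column])


# Stand-in for a loaded partition: column name -> list of values.
toy_partition = {"label": [3, 3, 7, 3, 7]}
counts = label_counts(toy_partition)
num_classes = len(counts)
```

For a partition created with num_classes_per_partition=2, `num_classes` is expected to be 2, as it is for the toy data here.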

Methods

is_dataset_assigned()

Check if a dataset has been assigned to the partitioner.

load_partition(partition_id)

Load a partition based on the partition index.

Attributes

dataset

Dataset property.

num_partitions

Total number of partitions.

property dataset: Dataset

Dataset property.

is_dataset_assigned() → bool

Check if a dataset has been assigned to the partitioner.

This method returns True if a dataset is already set for the partitioner; otherwise, it returns False.

Returns:

dataset_assigned – True if a dataset is assigned, otherwise False.

Return type:

bool

load_partition(partition_id: int) → Dataset

Load a partition based on the partition index.

Parameters:

partition_id (int) – The index that corresponds to the requested partition.

Returns:

dataset_partition – Single partition of a dataset.

Return type:

Dataset

property num_partitions: int

Total number of partitions.