ContinuousPartitioner

class ContinuousPartitioner(num_partitions: int, partition_by: str, strictness: float, shuffle: bool = True, seed: int | None = 42)

Bases: Partitioner

Partitioner based on a real-valued dataset property with adjustable strictness.

This partitioner enables non-IID partitioning by sorting the dataset according to a continuous (i.e., real-valued, not categorical) property and introducing controlled noise to adjust the level of heterogeneity.

To interpolate between IID and non-IID partitioning, a strictness parameter (σ ∈ [0, 1]) blends a standardized property vector (z ∈ ℝⁿ) with Gaussian noise (ε ~ 𝒩(0, I)), producing blended scores:

\[b = \sigma \cdot z + (1 - \sigma) \cdot \varepsilon\]

Samples are then sorted by b to assign them to partitions. When strictness is 0, partitioning is purely random (IID), while a value of 1 strictly follows the property ranking (strongly non-IID).
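The blending and sorting described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name `blended_partition`, the use of the population standard deviation for standardization, and the split into contiguous, near-equal chunks are all assumptions.

```python
import random
import statistics


def blended_partition(values, num_partitions, strictness, seed=42):
    """Hypothetical helper: group row indices by blended score b."""
    rng = random.Random(seed)

    # Standardize the property to z-scores (assumes the values are not all equal).
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    z = [(v - mean) / std for v in values]

    # Draw one standard-normal noise value per sample.
    eps = [rng.gauss(0.0, 1.0) for _ in values]

    # b = sigma * z + (1 - sigma) * eps
    b = [strictness * zi + (1.0 - strictness) * ei for zi, ei in zip(z, eps)]

    # Sort indices by blended score, then cut into contiguous chunks
    # (assumed splitting rule; sizes differ by at most one).
    order = sorted(range(len(values)), key=b.__getitem__)
    n = len(order)
    return [
        order[k * n // num_partitions:(k + 1) * n // num_partitions]
        for k in range(num_partitions)
    ]


values = [float(i) for i in range(100)]
strict = blended_partition(values, num_partitions=5, strictness=1.0)
random_like = blended_partition(values, num_partitions=5, strictness=0.0)
```

With `strictness=1.0` the noise term vanishes, so each chunk holds a contiguous range of the property; with `strictness=0.0` the ordering depends on the noise alone.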

Parameters:
  • num_partitions (int) – Number of partitions to create.

  • partition_by (str) – Name of the continuous feature to partition the dataset on.

  • strictness (float) – Controls how strongly the feature influences partitioning (0 = IID, 1 = strongly non-IID).

  • shuffle (bool) – Whether to shuffle the indices within each partition (default: True).

  • seed (Optional[int]) – Random seed for reproducibility. Used for initializing the random number generator (RNG), which affects the generation of the Gaussian noise (related to the strictness parameter) and dataset shuffling (if shuffle is True).
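As a rough illustration of why a single seed is enough for reproducibility, the sketch below uses one seeded RNG for both the noise and the within-partition shuffling. The helper `noisy_shuffled_chunks` is hypothetical, uses only the standard library, and does not reflect the library's internals.

```python
import random


def noisy_shuffled_chunks(n, num_partitions, seed, shuffle=True):
    """Hypothetical helper: one seeded RNG drives noise and shuffling."""
    rng = random.Random(seed)

    # Noise is drawn first, so it does not depend on the shuffle flag.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    order = sorted(range(n), key=noise.__getitem__)
    chunks = [
        order[k * n // num_partitions:(k + 1) * n // num_partitions]
        for k in range(num_partitions)
    ]

    # Shuffling only reorders indices inside each chunk.
    if shuffle:
        for chunk in chunks:
            rng.shuffle(chunk)
    return chunks


first = noisy_shuffled_chunks(50, 5, seed=42)
second = noisy_shuffled_chunks(50, 5, seed=42)
unshuffled = noisy_shuffled_chunks(50, 5, seed=42, shuffle=False)
```

Here `first` and `second` are identical because they share a seed, and `unshuffled` contains the same indices per chunk, only in sorted-by-noise order.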

Examples

>>> from datasets import Dataset
>>> import numpy as np
>>> import pandas as pd
>>> from flwr_datasets.partitioner import ContinuousPartitioner
>>> import matplotlib.pyplot as plt
>>>
>>> # Create synthetic data
>>> df = pd.DataFrame({
...     "continuous": np.linspace(0, 10, 10_000),
...     "category": np.random.choice([0, 1, 2, 3], size=10_000),
... })
>>> hf_dataset = Dataset.from_pandas(df)
>>>
>>> # Partition dataset
>>> partitioner = ContinuousPartitioner(
...     num_partitions=5,
...     partition_by="continuous",
...     strictness=0.7,
...     shuffle=True,
... )
>>> partitioner.dataset = hf_dataset
>>>
>>> # Plot the distribution of the continuous property in each partition
>>> plt.figure(figsize=(10, 6))
>>> for i in range(5):
...     plt.hist(
...         partitioner.load_partition(i)["continuous"],
...         bins=64,
...         alpha=0.5,
...         label=f"Partition {i}",
...     )
>>> plt.legend()
>>> plt.xlabel("Continuous Value")
>>> plt.ylabel("Frequency")
>>> plt.title("Partition distributions")
>>> plt.grid(True)
>>> plt.show()

Methods

is_dataset_assigned()

Check if a dataset has been assigned to the partitioner.

load_partition(partition_id)

Load a single partition based on the partition index.

Attributes

dataset

The dataset assigned to the partitioner.

num_partitions

Total number of partitions.

partition_id_to_indices

Mapping from partition ID to dataset indices.

property dataset: Dataset

The dataset assigned to the partitioner.

is_dataset_assigned() → bool

Check if a dataset has been assigned to the partitioner.

This method returns True if a dataset has already been set for the partitioner; otherwise it returns False.

Returns:

dataset_assigned – True if a dataset is assigned, otherwise False.

Return type:

bool

load_partition(partition_id: int) → Dataset

Load a single partition based on the partition index.

Parameters:

partition_id (int) – The index that corresponds to the requested partition.

Returns:

dataset_partition – A single dataset partition.

Return type:

Dataset

property num_partitions: int

Total number of partitions.

property partition_id_to_indices: dict[int, list[int]]

Mapping from partition ID to dataset indices.
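A mapping of this shape can be sanity-checked with a few lines of plain Python. The sketch below assumes that partitions are meant to be disjoint and to cover every row of the dataset, which this page does not state explicitly; `is_disjoint_cover` is a hypothetical helper, not part of the library.

```python
def is_disjoint_cover(partition_id_to_indices, num_rows):
    """Hypothetical check: every row index appears in exactly one partition."""
    seen = [
        index
        for indices in partition_id_to_indices.values()
        for index in indices
    ]
    # A disjoint cover lists each index in range(num_rows) exactly once.
    return sorted(seen) == list(range(num_rows))


# Toy mapping in the documented shape: dict[int, list[int]].
toy_mapping = {0: [2, 0], 1: [1, 3]}
ok = is_disjoint_cover(toy_mapping, num_rows=4)
```

For a real partitioner, `num_rows` would be the length of the assigned dataset.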