ContinuousPartitioner

class ContinuousPartitioner(num_partitions: int, partition_by: str, strictness: float, shuffle: bool = True, seed: int | None = 42)

Bases: Partitioner

Partitioner based on a real-valued dataset property with adjustable strictness.

This partitioner enables non-IID partitioning by sorting the dataset according to a continuous (i.e., real-valued, not categorical) property and introducing controlled noise to adjust the level of heterogeneity.

To interpolate between IID and non-IID partitioning, a strictness parameter (𝜎 ∈ [0, 1]) blends a standardized property vector (z ∈ ℝⁿ) with Gaussian noise (ε ~ 𝒩(0, I)), producing blended scores:

\[b = \sigma \cdot z + (1 - \sigma) \cdot \varepsilon\]

Samples are then sorted by b to assign them to partitions. When strictness is 0, partitioning is purely random (IID), while a value of 1 strictly follows the property ranking (strongly non-IID).
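The blending step described above can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: the helper name blended_partition_indices is hypothetical, the standardization uses the population standard deviation, and the sorted samples are assumed to be cut into near-equal contiguous chunks, which this reference does not spell out.

```python
import random
import statistics


def blended_partition_indices(values, num_partitions, strictness, seed=42):
    """Hypothetical helper illustrating the blending idea (not the library code)."""
    if not 0.0 <= strictness <= 1.0:
        raise ValueError("strictness must be in [0, 1]")
    rng = random.Random(seed)

    # Standardize the property: z = (x - mean) / std.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("property has zero variance")
    z = [(v - mean) / std for v in values]

    # Blend with standard Gaussian noise: b = sigma * z + (1 - sigma) * eps.
    b = [strictness * zi + (1.0 - strictness) * rng.gauss(0.0, 1.0) for zi in z]

    # Sort sample indices by blended score.
    order = sorted(range(len(values)), key=b.__getitem__)

    # Assumption: split the sorted order into near-equal contiguous chunks.
    base, extra = divmod(len(order), num_partitions)
    partitions, start = [], 0
    for pid in range(num_partitions):
        size = base + (1 if pid < extra else 0)
        partitions.append(order[start:start + size])
        start += size
    return partitions


values = [i / 100 for i in range(1000)]
strict = blended_partition_indices(values, num_partitions=5, strictness=1.0)
loose = blended_partition_indices(values, num_partitions=5, strictness=0.0)

# Per-partition mean of the property under each setting.
strict_means = [statistics.fmean(values[i] for i in part) for part in strict]
loose_means = [statistics.fmean(values[i] for i in part) for part in loose]
```

With strictness set to 1 the per-partition means in this sketch are ordered and far apart, because each partition covers one contiguous slice of the property range. With strictness set to 0 the assignment depends only on the noise, so the means cluster around the overall mean.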

Parameters:

num_partitions (int) – Number of partitions to create.
partition_by (str) – Name of the continuous feature to partition the dataset on.
strictness (float) – Controls how strongly the feature influences partitioning, in the range [0, 1] (0 = IID, 1 = strongly non-IID).
shuffle (bool) – Whether to shuffle the indices within each partition (default: True).
seed (Optional[int]) – Random seed for reproducibility (default: 42).
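As a rough guide to choosing strictness, the following back-of-the-envelope calculation relates 𝜎 to how closely the blended scores track the property. It assumes that z has zero mean and unit variance and is independent of ε; these assumptions are not stated in this reference, so the result is only indicative:

\[\operatorname{Var}(b) = \sigma^2 + (1 - \sigma)^2, \qquad \operatorname{corr}(b, z) = \frac{\sigma}{\sqrt{\sigma^2 + (1 - \sigma)^2}}\]

Under those assumptions, 𝜎 = 0.5 gives a correlation of about 0.71 and 𝜎 = 0.7 about 0.92, so the ordering already follows the property fairly closely at mid-range strictness values.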

Examples

>>> from datasets import Dataset
>>> import numpy as np
>>> import pandas as pd
>>> from flwr_datasets.partitioner import ContinuousPartitioner
>>> import matplotlib.pyplot as plt
>>>
>>> # Create synthetic data
>>> df = pd.DataFrame({
...     "continuous": np.linspace(0, 10, 10_000),
...     "category": np.random.choice([0, 1, 2, 3], size=10_000)
... })
>>> hf_dataset = Dataset.from_pandas(df)
>>>
>>> # Partition dataset
>>> partitioner = ContinuousPartitioner(
...     num_partitions=5,
...     partition_by="continuous",
...     strictness=0.7,
...     shuffle=True
... )
>>> partitioner.dataset = hf_dataset
>>>
>>> # Plot the distribution of the continuous property in each partition
>>> plt.figure(figsize=(10, 6))
>>> for i in range(5):
...     plt.hist(
...         partitioner.load_partition(i)["continuous"],
...         bins=64,
...         alpha=0.5,
...         label=f"Partition {i}"
...     )
>>> plt.legend()
>>> plt.xlabel("Continuous Value")
>>> plt.ylabel("Frequency")
>>> plt.title("Partition distributions")
>>> plt.grid(True)
>>> plt.show()

Methods

`is_dataset_assigned`()	Check if a dataset has been assigned to the partitioner.
`load_partition`(partition_id)	Load a single partition based on the partition index.

Attributes

`dataset`	Dataset property.
`num_partitions`	Total number of partitions.
`partition_id_to_indices`	Mapping from partition ID to dataset indices.

property dataset: Dataset

Dataset property.

is_dataset_assigned() → bool

Check if a dataset has been assigned to the partitioner.

This method returns True if a dataset has already been set for the partitioner, and False otherwise.

Returns: dataset_assigned – True if a dataset is assigned, otherwise False.
Return type: bool

load_partition(partition_id: int) → Dataset

Load a single partition based on the partition index.

Parameters: partition_id (int) – The index that corresponds to the requested partition.
Returns: dataset_partition – A single dataset partition.
Return type: Dataset

property num_partitions: int

Total number of partitions.

property partition_id_to_indices: dict[int, list[int]]

Mapping from partition ID to dataset indices.