ContinuousPartitioner
- class ContinuousPartitioner(num_partitions: int, partition_by: str, strictness: float, shuffle: bool = True, seed: int | None = 42)
Bases: Partitioner
Partitioner based on a real-valued dataset property with adjustable strictness.
This partitioner enables non-IID partitioning by sorting the dataset according to a continuous (i.e., real-valued, not categorical) property and introducing controlled noise to adjust the level of heterogeneity.
To interpolate between IID and non-IID partitioning, a strictness parameter (𝜎 ∈ [0, 1]) blends a standardized property vector (z ∈ ℝⁿ) with Gaussian noise (ε ~ 𝒩(0, I)), producing blended scores:
\[b = \sigma \cdot z + (1 - \sigma) \cdot \varepsilon\]
Samples are then sorted by b to assign them to partitions. When strictness is 0, partitioning is purely random (IID), while a value of 1 strictly follows the property ranking (strongly non-IID).
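The blending rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper name `blended_partition` is hypothetical, the property is standardized with the population standard deviation, the sorted indices are cut into contiguous, near-equal chunks, and the standard-library `random` module stands in for whatever RNG the library uses (so the resulting indices will not match those of `ContinuousPartitioner`). The optional within-partition shuffle is left out.

```python
import random
import statistics


def blended_partition(values, num_partitions, strictness, seed=42):
    """Split sample indices into partitions by sorting on blended scores.

    Hypothetical helper for illustration; not part of flwr_datasets.
    """
    if not 0.0 <= strictness <= 1.0:
        raise ValueError("strictness must be in [0, 1]")
    rng = random.Random(seed)

    # Standardize the property: z = (x - mean) / std.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("property has zero variance")
    z = [(v - mean) / std for v in values]

    # Blend with standard Gaussian noise: b = sigma * z + (1 - sigma) * eps.
    b = [strictness * zi + (1.0 - strictness) * rng.gauss(0.0, 1.0) for zi in z]

    # Sort sample indices by their blended score.
    order = sorted(range(len(values)), key=b.__getitem__)

    # Cut the sorted indices into contiguous, near-equal chunks
    # (an assumed splitting rule).
    base, extra = divmod(len(order), num_partitions)
    partitions, start = [], 0
    for pid in range(num_partitions):
        size = base + (1 if pid < extra else 0)
        partitions.append(order[start:start + size])
        start += size
    return partitions


values = [i / 10 for i in range(100)]
# strictness = 1: partitions follow the property ranking exactly.
strict = blended_partition(values, num_partitions=5, strictness=1.0)
# strictness = 0: the ordering depends on the noise alone.
random_like = blended_partition(values, num_partitions=5, strictness=0.0)
```

With `strictness=1.0` the noise term vanishes, so every value in one partition is no larger than every value in the next. With `strictness=0.0` the property is ignored and membership depends only on the seeded noise.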
- Parameters:
num_partitions (int) – Number of partitions to create.
partition_by (str) – Name of the continuous feature to partition the dataset on.
strictness (float) – Value in [0, 1] that controls how strongly the feature influences partitioning (0 = IID, 1 = strongly non-IID).
shuffle (bool) – Whether to shuffle the indices within each partition (default: True).
seed (Optional[int]) – Random seed for reproducibility. Used for initializing the random number generator (RNG), which affects the generation of the Gaussian noise (related to the strictness parameter) and dataset shuffling (if shuffle is True).
Examples
>>> from datasets import Dataset
>>> import numpy as np
>>> import pandas as pd
>>> from flwr_datasets.partitioner import ContinuousPartitioner
>>> import matplotlib.pyplot as plt
>>>
>>> # Create synthetic data
>>> df = pd.DataFrame({
...     "continuous": np.linspace(0, 10, 10_000),
...     "category": np.random.choice([0, 1, 2, 3], size=10_000)
... })
>>> hf_dataset = Dataset.from_pandas(df)
>>>
>>> # Partition dataset
>>> partitioner = ContinuousPartitioner(
...     num_partitions=5,
...     partition_by="continuous",
...     strictness=0.7,
...     shuffle=True
... )
>>> partitioner.dataset = hf_dataset
>>>
>>> # Plot partitions
>>> plt.figure(figsize=(10, 6))
>>> for i in range(5):
...     plt.hist(
...         partitioner.load_partition(i)["continuous"],
...         bins=64,
...         alpha=0.5,
...         label=f"Partition {i}"
...     )
>>> plt.legend()
>>> plt.xlabel("Continuous Value")
>>> plt.ylabel("Frequency")
>>> plt.title("Partition distributions")
>>> plt.grid(True)
>>> plt.show()
Methods
is_dataset_assigned() – Check if a dataset has been assigned to the partitioner.
load_partition(partition_id) – Load a single partition based on the partition index.
Attributes
dataset – Dataset property.
num_partitions – Total number of partitions.
partition_id_to_indices – Mapping from partition ID to dataset indices.
- property dataset: Dataset
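Dataset property. Holds the dataset to be partitioned; it is set by assignment, as in the example above (partitioner.dataset = hf_dataset).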
- is_dataset_assigned() → bool
Check if a dataset has been assigned to the partitioner.
This method returns True if a dataset has already been set for the partitioner and False otherwise.
- Returns:
dataset_assigned – True if a dataset is assigned, otherwise False.
- Return type:
bool
- load_partition(partition_id: int) → Dataset
Load a single partition based on the partition index.
- Parameters:
partition_id (int) – The index that corresponds to the requested partition.
- Returns:
dataset_partition – A single dataset partition.
- Return type:
Dataset
- property num_partitions: int
Total number of partitions.
- property partition_id_to_indices: dict[int, list[int]]
Mapping from partition ID to dataset indices.
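A mapping of this shape (dict[int, list[int]]) is easy to invert when the partition of a given sample is needed. The sketch below uses a hand-written, hypothetical mapping rather than one produced by the partitioner, and the helper name `invert_partition_mapping` is an assumption for illustration. It also raises an error if any index appears in more than one partition.

```python
def invert_partition_mapping(partition_id_to_indices):
    """Return {dataset_index: partition_id}; raise if an index is repeated.

    Hypothetical helper for illustration; not part of flwr_datasets.
    """
    index_to_partition = {}
    for pid, indices in partition_id_to_indices.items():
        for idx in indices:
            if idx in index_to_partition:
                raise ValueError(
                    f"index {idx} is in partitions "
                    f"{index_to_partition[idx]} and {pid}"
                )
            index_to_partition[idx] = pid
    return index_to_partition


# Hypothetical mapping for a 6-sample dataset split into 3 partitions.
example_mapping = {0: [4, 1], 1: [0, 5], 2: [3, 2]}
index_to_partition = invert_partition_mapping(example_mapping)
```

In practice the argument would be the `partition_id_to_indices` attribute of a partitioner that already has a dataset assigned.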