ContinuousPartitioner
- class ContinuousPartitioner(num_partitions: int, partition_by: str, strictness: float, shuffle: bool = True, seed: int | None = 42)
Bases: Partitioner
Partitioner based on a real-valued dataset property with adjustable strictness.
This partitioner enables non-IID partitioning by sorting the dataset according to a continuous (i.e., real-valued, not categorical) property and introducing controlled noise to adjust the level of heterogeneity.
To interpolate between IID and non-IID partitioning, a strictness parameter (σ ∈ [0, 1]) blends a standardized property vector (z ∈ ℝⁿ) with Gaussian noise (ε ~ 𝒩(0, I)), producing blended scores:
\[b = \sigma \cdot z + (1 - \sigma) \cdot \varepsilon\]
Samples are then sorted by b to assign them to partitions. When strictness is 0, partitioning is purely random (IID), while a value of 1 strictly follows the property ranking (strongly non-IID).
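The blending rule above can be illustrated with a minimal standalone sketch. It uses only the Python standard library and is not the library's implementation: the helper names `blended_scores` and `split_by_score` are hypothetical, and the near-equal contiguous split of the sorted order is an assumption made for illustration.

```python
import random
import statistics


def blended_scores(values, strictness, seed=42):
    """Compute b = strictness * z + (1 - strictness) * eps for each value."""
    if not 0.0 <= strictness <= 1.0:
        raise ValueError("strictness must be in [0, 1]")
    rng = random.Random(seed)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("values must not all be identical")
    z = [(v - mean) / std for v in values]  # standardized property
    eps = [rng.gauss(0.0, 1.0) for _ in values]  # Gaussian noise
    return [strictness * zi + (1.0 - strictness) * ei for zi, ei in zip(z, eps)]


def split_by_score(scores, num_partitions):
    """Sort indices by score and cut them into near-equal contiguous chunks.

    Assumption: the real partitioner may split or break ties differently.
    """
    order = sorted(range(len(scores)), key=scores.__getitem__)
    base, extra = divmod(len(order), num_partitions)
    parts, start = [], 0
    for pid in range(num_partitions):
        size = base + (1 if pid < extra else 0)
        parts.append(order[start:start + size])
        start += size
    return parts


values = [i / 100 for i in range(1000)]  # already sorted, 0.00 .. 9.99

# strictness = 1: the noise term vanishes, so partitions follow the ranking.
strict = split_by_score(blended_scores(values, 1.0), 5)

# strictness = 0: the property term vanishes, so the order is pure noise.
loose = split_by_score(blended_scores(values, 0.0), 5)
```

With these inputs, `strict[0]` holds the 200 smallest values, while each partition in `loose` draws from the whole value range. Intermediate strictness values give partially overlapping partitions.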
- Parameters:
num_partitions (int) – Number of partitions to create.
partition_by (str) – Name of the continuous feature to partition the dataset on.
strictness (float) – Value in [0, 1] controlling how strongly the feature influences partitioning (0 = IID, 1 = strongly non-IID).
shuffle (bool) – Whether to shuffle the indices within each partition (default: True).
seed (Optional[int]) – Random seed for reproducibility (default: 42).
Examples
>>> from datasets import Dataset
>>> import numpy as np
>>> import pandas as pd
>>> from flwr_datasets.partitioner import ContinuousPartitioner
>>> import matplotlib.pyplot as plt
>>>
>>> # Create synthetic data
>>> df = pd.DataFrame({
...     "continuous": np.linspace(0, 10, 10_000),
...     "category": np.random.choice([0, 1, 2, 3], size=10_000)
... })
>>> hf_dataset = Dataset.from_pandas(df)
>>>
>>> # Partition dataset
>>> partitioner = ContinuousPartitioner(
...     num_partitions=5,
...     partition_by="continuous",
...     strictness=0.7,
...     shuffle=True
... )
>>> partitioner.dataset = hf_dataset
>>>
>>> # Plot partitions
>>> plt.figure(figsize=(10, 6))
>>> for i in range(5):
...     plt.hist(
...         partitioner.load_partition(i)["continuous"],
...         bins=64,
...         alpha=0.5,
...         label=f"Partition {i}"
...     )
>>> plt.legend()
>>> plt.xlabel("Continuous Value")
>>> plt.ylabel("Frequency")
>>> plt.title("Partition distributions")
>>> plt.grid(True)
>>> plt.show()
Methods
is_dataset_assigned() – Check if a dataset has been assigned to the partitioner.
load_partition(partition_id) – Load a single partition based on the partition index.
Attributes
dataset – Dataset property.
num_partitions – Total number of partitions.
partition_id_to_indices – Mapping from partition ID to dataset indices.
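- property dataset: Dataset
The dataset to be partitioned. As in the example above, assign it (partitioner.dataset = hf_dataset) before loading partitions.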
- is_dataset_assigned() -> bool
Check if a dataset has been assigned to the partitioner.
This method returns True if a dataset has already been set for the partitioner, and False otherwise.
- Returns:
dataset_assigned – True if a dataset is assigned, otherwise False.
- Return type:
bool
- load_partition(partition_id: int) -> Dataset
Load a single partition based on the partition index.
- Parameters:
partition_id (int) – The index that corresponds to the requested partition.
- Returns:
dataset_partition – A single dataset partition.
- Return type:
Dataset
- property num_partitions: int
Total number of partitions.
- property partition_id_to_indices: dict[int, list[int]]
Mapping from partition ID to dataset indices.
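A mapping of this shape (dict[int, list[int]]) can be sanity-checked with a small helper. The sketch below is hypothetical and not part of the library: `summarize_mapping` is an illustrative name, and the toy dictionary stands in for the real `partition_id_to_indices` value. It assumes that partitions are meant to be disjoint and to cover every row, which this page does not state explicitly.

```python
from collections import Counter


def summarize_mapping(mapping, num_rows):
    """Return per-partition sizes plus any missing or repeated row indices."""
    counts = Counter(idx for indices in mapping.values() for idx in indices)
    sizes = {pid: len(indices) for pid, indices in mapping.items()}
    missing = [i for i in range(num_rows) if counts[i] == 0]
    repeated = sorted(i for i, c in counts.items() if c > 1)
    return sizes, missing, repeated


# Hand-written stand-in for partitioner.partition_id_to_indices (assumption).
toy_mapping = {0: [4, 0, 2], 1: [1, 5, 3]}
sizes, missing, repeated = summarize_mapping(toy_mapping, num_rows=6)
```

Empty `missing` and `repeated` lists indicate that every row index appears in exactly one partition.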