Use with Local Data
===================
You can partition your local files and Python objects using any available
``Partitioner`` in the ``Flower Datasets`` library.
This guide details how to create a Hugging Face ``Dataset``, which is the required input type for the partitioners.
We will cover:
* local files: CSV, JSON, image, audio,
* in-memory data: dictionary, list, pd.DataFrame, np.ndarray.
General Overview
----------------
All-in-one dataset preparation (downloading, preprocessing, partitioning) happens
in ``FederatedDataset``. However, this guide uses only a ``Partitioner``, since the data is already locally accessible.
The rest of this guide explains how to create a ``Dataset``
from local files and from existing (in-memory) Python objects.
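Whatever the source, the workflow stays the same: build a ``Dataset``, assign it to a partitioner, and load the partitions you need. Below is a minimal sketch of that pattern using ``IidPartitioner`` and toy in-memory data (any other ``Partitioner`` is used the same way).
.. code-block:: python

    from datasets import Dataset
    from flwr_datasets.partitioner import IidPartitioner

    # Toy in-memory data, used only to illustrate the assign-then-load pattern
    dataset = Dataset.from_dict({"features": list(range(10)), "labels": [0, 1] * 5})

    partitioner = IidPartitioner(num_partitions=2)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)
    print(partition)  # a datasets.Dataset holding roughly half of the rows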
Local Files
-----------
CSV
^^^
.. code-block:: python
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
# Single file
data_files = "path-to-my-file.csv"
# Multiple Files
data_files = [ "path-to-my-file-1.csv", "path-to-my-file-2.csv", ...]
dataset = load_dataset("csv", data_files=data_files)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
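If your CSV files are already split (for example into train and test files), you can pass a dictionary as ``data_files`` and pick the split to partition. A sketch with placeholder file names:
.. code-block:: python

    from datasets import load_dataset
    from flwr_datasets.partitioner import ChosenPartitioner

    # The split names and file paths below are placeholders
    data_files = {
        "train": "path-to-my-train-file.csv",
        "test": "path-to-my-test-file.csv",
    }
    dataset_dict = load_dataset("csv", data_files=data_files)

    # Partition one split, e.g. "train"
    partitioner = ChosenPartitioner(...)
    partitioner.dataset = dataset_dict["train"]
    partition = partitioner.load_partition(partition_id=0)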
JSON
^^^^
.. code-block:: python
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
# Single file
data_files = "path-to-my-file.json"
# Multiple Files
data_files = [ "path-to-my-file-1.json", "path-to-my-file-2.json", ...]
dataset = load_dataset("json", data_files=data_files)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
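If your JSON is nested and the records live under a top-level key, the ``json`` loader accepts a ``field`` argument. A sketch, assuming the records sit under a hypothetical ``"data"`` key:
.. code-block:: python

    from datasets import load_dataset
    from flwr_datasets.partitioner import ChosenPartitioner

    # "data" is a placeholder; use the key that actually holds your records
    dataset_dict = load_dataset("json", data_files="path-to-my-file.json", field="data")
    dataset = dataset_dict["train"]

    partitioner = ChosenPartitioner(...)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)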
Image
^^^^^
You can create an image dataset in two ways:
1) give a path to the directory
The directory needs to be structured in the following way: ``dataset-name/split/class/name``. For example:
.. code-block::
mnist/train/1/unique_name_1.png
mnist/train/1/unique_name_2.png
mnist/train/2/unique_name_3.png
...
mnist/test/1/unique_name_4.png
mnist/test/1/unique_name_5.png
mnist/test/2/unique_name_6.png
Then, the path you can pass is ``./mnist``.
.. code-block:: python
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
# Directly from a directory
dataset_dict = load_dataset("imagefolder", data_dir="/path/to/folder")
# Note that load_dataset returns a DatasetDict; we need to choose a single split
# and assign it to partitioner.dataset,
# e.g. the "train" split, but that depends on the structure of your directory
dataset = dataset_dict["train"]
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
2) create a dataset from a CSV/JSON file and cast the path column to Image.
.. code-block:: python
from datasets import Image, load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
dataset = load_dataset(...)
dataset = dataset.cast_column("path", Image())
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
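As a concrete sketch of the second method, assume a CSV file (here called ``images.csv``, an illustrative name) with a ``path`` column holding image file paths and a ``label`` column:
.. code-block:: python

    from datasets import Image, load_dataset
    from flwr_datasets.partitioner import ChosenPartitioner

    # images.csv is assumed to have the columns: path,label
    dataset = load_dataset("csv", data_files="images.csv")["train"]

    # Casting the "path" column to Image makes the files load as images on access
    dataset = dataset.cast_column("path", Image())

    partitioner = ChosenPartitioner(...)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)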
Audio
^^^^^
Analogously to the image datasets, there are two methods here:
1) give a path to the directory
.. code-block:: python
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
dataset_dict = load_dataset("audiofolder", data_dir="/path/to/folder")
# Note that load_dataset returns a DatasetDict; we need to choose a single split
# and assign it to partitioner.dataset,
# e.g. the "train" split, but that depends on the structure of your directory
dataset = dataset_dict["train"]
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
2) create a dataset from a CSV/JSON file and cast the path column to Audio.
.. code-block:: python
from datasets import Audio, load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
dataset = load_dataset(...)
dataset = dataset.cast_column("path", Audio())
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
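If your model expects a specific sampling rate, ``Audio`` accepts a ``sampling_rate`` argument and the audio is resampled when accessed. A sketch under the same assumptions as above (a ``path`` column containing audio file paths):
.. code-block:: python

    from datasets import Audio, load_dataset
    from flwr_datasets.partitioner import ChosenPartitioner

    dataset = load_dataset(...)
    # Resample to 16 kHz on access (pick the rate your model expects)
    dataset = dataset.cast_column("path", Audio(sampling_rate=16_000))

    partitioner = ChosenPartitioner(...)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)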
In-Memory
---------
From dictionary
^^^^^^^^^^^^^^^
.. code-block:: python
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
dataset = Dataset.from_dict(data)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
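If you want to control the column types, for example to mark the labels as class labels rather than plain integers, you can pass an explicit ``Features`` object to ``Dataset.from_dict``. A sketch with the same toy data (the class names are illustrative):
.. code-block:: python

    from datasets import ClassLabel, Dataset, Features, Value
    from flwr_datasets.partitioner import ChosenPartitioner

    data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
    features = Features(
        {
            "features": Value("int64"),
            # Placeholder label names; 0 maps to the first name, 1 to the second
            "labels": ClassLabel(names=["negative", "positive"]),
        }
    )
    dataset = Dataset.from_dict(data, features=features)

    partitioner = ChosenPartitioner(...)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)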
From list
^^^^^^^^^
.. code-block:: python
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
my_list = [
{"features": 1, "labels": 0},
{"features": 2, "labels": 0},
{"features": 3, "labels": 1}
]
dataset = Dataset.from_list(my_list)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
From pd.DataFrame
^^^^^^^^^^^^^^^^^
.. code-block:: python
import pandas as pd
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
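Note that ``Dataset.from_pandas`` may carry the DataFrame index over as an extra ``__index_level_0__`` column when the index is not a default range (e.g. after filtering rows); passing ``preserve_index=False`` drops it. A sketch:
.. code-block:: python

    import pandas as pd
    from datasets import Dataset
    from flwr_datasets.partitioner import ChosenPartitioner

    data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
    # A non-default index, only to illustrate the behaviour
    df = pd.DataFrame(data, index=[10, 20, 30])

    # Drop the index so it does not become an extra column in the Dataset
    dataset = Dataset.from_pandas(df, preserve_index=False)

    partitioner = ChosenPartitioner(...)
    partitioner.dataset = dataset
    partition = partitioner.load_partition(partition_id=0)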
From np.ndarray
^^^^^^^^^^^^^^^
The ``np.ndarray`` will first be converted to a ``pd.DataFrame``.
.. code-block:: python
import numpy as np
import pandas as pd
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = np.array([[1, 2, 3], [0, 0, 1]]).T
# You can add the column names by passing columns=["features", "labels"]
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
Partitioner Details
-------------------
Partitioning is triggered automatically during the first ``load_partition`` call;
you do not need to call any "do_partitioning" method.
The ``Partitioner`` abstraction is designed to allow a single dataset assignment:
.. code-block:: python
partitioner.dataset = your_dataset  # your_dataset must be of type datasets.Dataset
If you need to apply the same partitioning to a different dataset, create a new
``Partitioner`` for it, e.g.:
.. code-block:: python
from flwr_datasets.partitioner import IidPartitioner
iid_partitioner_for_mnist = IidPartitioner(num_partitions=10)
iid_partitioner_for_mnist.dataset = mnist_dataset
iid_partitioner_for_cifar = IidPartitioner(num_partitions=10)
iid_partitioner_for_cifar.dataset = cifar_dataset
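Continuing the example above, each partitioner can now serve its partitions by id; the split into ten partitions is computed once, during the first ``load_partition`` call, and reused afterwards:
.. code-block:: python

    # Load every MNIST partition; the partitioning itself happens only once
    partitions = [
        iid_partitioner_for_mnist.load_partition(partition_id=pid) for pid in range(10)
    ]
    print([len(partition) for partition in partitions])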
More Resources
--------------
If you are looking for more details, or the format you need is not covered here, please visit the `Hugging Face Datasets docs <https://huggingface.co/docs/datasets>`_.
This guide is based on the following pages of those docs:
* General Information
* Tabular Data
* Image Data
* Audio Data