Use with Local Data =================== You can partition your local files and Python objects in ``Flower Datasets`` library using any available ``Partitioner``. This guide details how to create a `Hugging Face `_ `Dataset `_ which is the required type of input for Partitioners. We will cover: * local files: CSV, JSON, image, audio, * in-memory data: dictionary, list, pd.DataFrame, np.ndarray. General Overview ---------------- An all-in-one dataset preparation (downloading, preprocessing, partitioning) happens using `FederatedDataset `_. However, we will use only the `Partitioner` here since we use locally accessible data. The rest of this guide will explain how to create a `Dataset `_ from local files and existing (in memory) Python objects. Local Files ----------- CSV ^^^ .. code-block:: python from datasets import load_dataset from flwr_datasets.partitioner import ChosenPartitioner # Single file data_files = "path-to-my-file.csv" # Multiple Files data_files = [ "path-to-my-file-1.csv", "path-to-my-file-2.csv", ...] dataset = load_dataset("csv", data_files=data_files) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) JSON ^^^^ .. code-block:: python from datasets import load_dataset from flwr_datasets.partitioner import ChosenPartitioner # Single file data_files = "path-to-my-file.json" # Multiple Files data_files = [ "path-to-my-file-1.json", "path-to-my-file-2.json", ...] dataset = load_dataset("json", data_files=data_files) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) Image ^^^^^ You can create an image dataset in two ways: 1) give a path the directory The directory needs to be structured in the following way: dataset-name/split/class/name. For example: .. code-block:: mnist/train/1/unique_name.png mnist/train/1/unique_name.png mnist/train/2/unique_name.png ... mnist/test/1/unique_name.png mnist/test/1/unique_name.png mnist/test/2/unique_name.png Then, the path you can give is `./mnist`. .. code-block:: python from datasets import load_dataset from flwr_datasets.partitioner import ChosenPartitioner # Directly from a directory dataset_dict = load_dataset("imagefolder", data_dir="/path/to/folder") # Note that what we just loaded is a DatasetDict, we need to choose a single split # and assign it to the partitioner.dataset # e.g. "train" split but that depends on the structure of your directory dataset = dataset_dict["train"] partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) 2) create a dataset from a CSV/JSON file and cast the path column to Image. .. code-block:: python from datasets import Image, load_dataset from flwr_datasets.partitioner import ChosenPartitioner dataset = load_dataset(...) dataset = dataset.cast_column("path", Image()) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) Audio ^^^^^ Analogously to the image datasets, there are two methods here: 1) give a path to the directory .. code-block:: python from datasets import load_dataset from flwr_datasets.partitioner import ChosenPartitioner dataset_dict = load_dataset("audiofolder", data_dir="/path/to/folder") # Note that what we just loaded is a DatasetDict, we need to choose a single split # and assign it to the partitioner.dataset # e.g. "train" split but that depends on the structure of your directory dataset = dataset_dict["train"] partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) 2) create a dataset from a CSV/JSON file and cast the path column to Audio. .. code-block:: python from datasets import Audio, load_dataset from flwr_datasets.partitioner import ChosenPartitioner dataset = load_dataset(...) dataset = dataset.cast_column("path", Audio()) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) In-Memory --------- From dictionary ^^^^^^^^^^^^^^^ .. code-block:: python from datasets import Dataset from flwr_datasets.partitioner import ChosenPartitioner data = {"features": [1, 2, 3], "labels": [0, 0, 1]} dataset = Dataset.from_dict(data) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) From list ^^^^^^^^^ .. code-block:: python from datasets import Dataset from flwr_datasets.partitioner import ChosenPartitioner my_list = [ {"features": 1, "labels": 0}, {"features": 2, "labels": 0}, {"features": 3, "labels": 1} ] dataset = Dataset.from_list(my_list) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) From pd.DataFrame ^^^^^^^^^^^^^^^^^ .. code-block:: python from datasets import Dataset from flwr_datasets.partitioner import ChosenPartitioner data = {"features": [1, 2, 3], "labels": [0, 0, 1]} df = pd.DataFrame(data) dataset = Dataset.from_pandas(df) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) From np.ndarray ^^^^^^^^^^^^^^^ The np.ndarray will be first transformed to pd.DataFrame .. code-block:: python from datasets import Dataset from flwr_datasets.partitioner import ChosenPartitioner data = np.array([[1, 2, 3], [0, 0, 1]]).T # You can add the column names by passing columns=["features", "labels"] df = pd.DataFrame(data) dataset = Dataset.from_pandas(df) partitioner = ChosenPartitioner(...) partitioner.dataset = dataset partition = partitioner.load_partition(partition_id=0) Partitioner Details ------------------- Partitioning is triggered automatically during the first ``load_partition`` call. You do not need to call any “do_partitioning” method. Partitioner abstraction is designed to allow for a single dataset assignment. .. code-block:: python partitioner.dataset = your_dataset # (your_dataset must be of type dataset.Dataset) If you need to do the same partitioning on a different dataset, create a new Partitioner for that, e.g.: .. code-block:: python from flwr_datasets.partitioner import IidPartitioner iid_partitioner_for_mnist = IidPartitioner(num_partitions=10) iid_partitioner_for_mnist.dataset = mnist_dataset iid_partitioner_for_cifar = IidPartitioner(num_partitions=10) iid_partitioner_for_cifar.dataset = cifar_dataset More Resources -------------- If you are looking for more details or you have not found the format you are looking for, please visit the `HuggingFace Datasets docs `_. This guide is based on the following ones: * `General Information `_ * `Tabular Data `_ * `Image Data `_ * `Audio Data `_