Quickstart#

Start with Flower Datasets as fast as possible by learning the essentials.

Install Flower Datasets#

[ ]:

! pip install -q "flwr-datasets[vision]"

If you want to use audio datasets, install:

! pip install -q "flwr-datasets[audio]"

Choose the dataset#

To choose the dataset, go to the Hugging Face Datasets Hub and search for your dataset by name. You will pass that name to the dataset parameter of FederatedDataset. Note that the name is case-sensitive.

[Figure: choosing a dataset on the Hugging Face Datasets Hub]

Note that once a dataset is available on the Hugging Face Hub, it can be used in Flower Datasets immediately (no approval from the Flower team and no custom code are needed).

Here is how it looks for the CIFAR10 dataset.

[Figure: the CIFAR10 dataset on the Hugging Face Datasets Hub]

Partition the dataset#

To partition a dataset (in a basic scenario), you need to choose two things: 1) a dataset (identified by its name), and 2) a partitioning scheme (either one of the supported partitioning schemes, see all of them here, or a custom partitioning scheme).

1) Dataset choice

We will pass the name of the dataset to FederatedDataset(dataset="some-name", other-parameters). In this example it will be: FederatedDataset(dataset="uoft-cs/cifar10", other-parameters)

2) Partitioner choice

We will partition the dataset in an IID manner using IidPartitioner (link to the docs). Only the train split of the dataset will be processed. In general, we do FederatedDataset(dataset="some-name", partitioners={"split-name": partitioning_scheme}), which for this example looks like:

[ ]:

from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import IidPartitioner

fds = FederatedDataset(
    dataset="uoft-cs/cifar10", partitioners={"train": IidPartitioner(num_partitions=10)}
)

# Load the first partition of the "train" split
partition = fds.load_partition(0, "train")
# You can access the whole "test" split of the base dataset (it hasn't been partitioned)
centralized_dataset = fds.load_split("test")

Now we have 10 partitions created from the train split of the CIFAR10 dataset and the test split for the centralized evaluation. Later we will convert the type of the dataset from Hugging Face’s Dataset type to the format required by PyTorch/TensorFlow frameworks.
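To build intuition for what an IID partition is, here is a minimal pure-Python sketch. It is not the library's implementation: it simply shuffles sample indices with a fixed seed and cuts them into disjoint chunks of (nearly) equal size. The helper name `iid_partition_indices` and the seed value are assumptions made for illustration.

```python
import random


def iid_partition_indices(num_samples, num_partitions, seed=42):
    """Split range(num_samples) into shuffled, disjoint, near-equal chunks.

    Hypothetical helper for illustration only; not part of flwr_datasets.
    """
    indices = list(range(num_samples))
    # Shuffling first makes every chunk a random sample of the whole dataset.
    random.Random(seed).shuffle(indices)
    base, extra = divmod(num_samples, num_partitions)
    partitions, start = [], 0
    for pid in range(num_partitions):
        # The first `extra` chunks receive one additional sample.
        size = base + (1 if pid < extra else 0)
        partitions.append(indices[start:start + size])
        start += size
    return partitions


# The CIFAR10 train split has 50,000 samples, so 10 partitions hold 5,000 each.
parts = iid_partition_indices(num_samples=50_000, num_partitions=10)
print([len(p) for p in parts])
```

Because every index lands in exactly one chunk, the partitions are disjoint and together cover the whole split.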

Investigate the partition#

Features#

Now we will determine the names of the features of your dataset (alternatively, you can look them up directly on the Hugging Face website). The names can vary across datasets, e.g. “img” or “image”, “label” or “labels”. Additionally, if the label column is of ClassLabel type, we will also see the names of the labels.

[ ]:

# Note this dataset has
partition.features

{'img': Image(mode=None, decode=True, id=None),
 'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'], id=None)}
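The integer stored in the label column is an index into the list of names shown in the output above. The sketch below copies that list into a plain Python list to make the mapping explicit; the variable and function names are illustrative assumptions. On a real partition, the `ClassLabel` feature of the Hugging Face `datasets` library offers `int2str` and `str2int` for the same lookups.

```python
# Class names copied from the `partition.features` output above.
label_names = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def label_to_name(label):
    """Map an integer label to its class name (illustrative helper)."""
    return label_names[label]


# Reverse lookup: class name -> integer label.
name_to_label = {name: idx for idx, name in enumerate(label_names)}

print(label_to_name(1))       # automobile
print(name_to_label["frog"])  # 6
```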

Indexing#

To see the first sample of the partition, we can index it like a Python list.

[ ]:

partition[0]

{'img': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
 'label': 1}

We can then select a specific column of that sample.

[ ]:

partition[0]["label"]

We can also use slicing (take a few samples). Let’s take the first 3 samples of the first partition:

[ ]:

partition[:3]

{'img': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
  <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
  <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>],
 'label': [1, 2, 6]}

We get a dictionary where the keys are the names of the columns and the values are lists of the corresponding values from each row of the dataset. So to take the first 3 labels we can do:

[ ]:

partition[:3]["label"]

[1, 2, 6]

Note that indexing by column first is also possible but discouraged, because the whole column will be loaded into memory.

[ ]:

partition["label"][:3]

[1, 2, 6]

You can also select a subset of the dataset and keep the same type (datasets.Dataset) instead of receiving a dictionary of values.

[ ]:

partition.select([0, 1, 2])

Dataset({
    features: ['img', 'label'],
    num_rows: 3
})

And this dataset contains the same samples as we saw before.

[ ]:

partition.select([0, 1, 2])[:]

{'img': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
  <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>,
  <PIL.PngImagePlugin.PngImageFile image mode=RGB size=32x32>],
 'label': [1, 2, 6]}
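Slicing returns a column-oriented dictionary (one list per column). When row-oriented records are more convenient, the columns can be zipped back into rows. The following is a minimal sketch on a stand-in dictionary in which placeholder strings replace the PIL images; `columns_to_rows` is a hypothetical helper, not part of any library.

```python
# Stand-in for the dictionary returned by partition[:3];
# placeholder strings replace the PIL images.
batch = {
    "img": ["<image 0>", "<image 1>", "<image 2>"],
    "label": [1, 2, 6],
}


def columns_to_rows(columns):
    """Turn a dict of equally long lists into a list of per-row dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


rows = columns_to_rows(batch)
for row in rows:
    print(row)
```

Each resulting row has the same shape as the single-sample dictionary returned by `partition[0]`.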

Use with PyTorch/NumPy/TensorFlow#

For more detailed instructions, go to:

PyTorch#

Transform the Dataset into a DataLoader and apply PyTorch transforms (Compose and all the others can be used).

[ ]:

! pip install -q torch torchvision

[ ]:

from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

transforms = ToTensor()


def apply_transforms(batch):
    # For CIFAR-10 the "img" column contains the images we want to apply the transforms to
    batch["img"] = [transforms(img) for img in batch["img"]]
    return batch


partition_torch = partition.with_transform(apply_transforms)
dataloader = DataLoader(partition_torch, batch_size=64)

The DataLoader created this way does not return a tuple when iterating over it but a dict with the column names as keys and the features as values. See the example below.

[ ]:

for batch in dataloader:
    print(f"Return type when iterating over a dataloader: {type(batch)}")
    print(batch["img"].shape)
    print(batch["label"].shape)
    break

Return type when iterating over a dataloader: <class 'dict'>
torch.Size([64, 3, 32, 32])
torch.Size([64])
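Training loops written for tuple-style batches can be adapted by unpacking the dictionary by column name. Here is a minimal sketch on a stand-in batch made of plain lists; with the DataLoader above, the values would be torch tensors. The helper `unpack_batch` and its parameter names are assumptions made for illustration.

```python
def unpack_batch(batch, feature_key="img", label_key="label"):
    """Return (features, labels) from a dict-style batch (illustrative helper)."""
    return batch[feature_key], batch[label_key]


# Stand-in batch with two samples; real batches would hold torch tensors.
example_batch = {"img": [[0.1, 0.2], [0.3, 0.4]], "label": [1, 2]}

images, labels = unpack_batch(example_batch)
print(len(images), labels)
```

Passing the column names as parameters keeps the helper usable for datasets whose columns are named differently, e.g. “image” instead of “img”.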

NumPy#

NumPy arrays can be used as input to TensorFlow and scikit-learn models. The transformation is very simple.

[ ]:

partition_np = partition.with_format("numpy")
X_train, y_train = partition_np["img"], partition_np["label"]

TensorFlow Dataset#

Transformation to TensorFlow Dataset is a one-liner.

[ ]:

! pip install -q tensorflow

[ ]:

tf_dataset = partition.to_tf_dataset(
    columns="img", label_cols="label", batch_size=64, shuffle=True
)

Final remarks#

Congratulations, you now know the basics of Flower Datasets and are ready to perform basic dataset preparation for Federated Learning.

Next#

This is the first quickstart tutorial from the Flower Datasets series. See other tutorials: