Open in Colab

Visualize Label Distribution#

Learn how to visualize and compare partitioned datasets when applying different Partitioners or parameters.

If you partition datasets to simulate heterogeneity through label skew and/or size skew, you can now effortlessly visualize the partitioned dataset using flwr-datasets.

All the described visualization functions are compatible with all Partitioner you can find in flwr_datasets.partitioner

Install Flower Datasets#

[ ]:
! pip install -q "flwr-datasets[vision]"

Plot Label Distribution#

Bar plot#

Let’s visualize the result of DirichletPartitioner. We will create a FederatedDataset and assign DirichletPartitioner to the train split:

[ ]:
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import DirichletPartitioner
from flwr_datasets.visualization import plot_label_distributions


fds = FederatedDataset(
    dataset="cifar10",
    partitioners={
        "train": DirichletPartitioner(
            num_partitions=10,
            partition_by="label",
            alpha=0.3,
            seed=42,
            min_partition_size=0,
        ),
    },
)

partitioner = fds.partitioners["train"]

Once we have the partitioner with the dataset assigned, we are ready to pass it to the plotting function:

[ ]:
fig, ax, df = plot_label_distributions(
    partitioner,
    label_name="label",
    plot_type="bar",
    size_unit="absolute",
    partition_id_axis="x",
    legend=True,
    verbose_labels=True,
    title="Per Partition Labels Distribution",
)
_images/tutorial-visualize-label-distribution_9_0.png

You can configure many details directly using the function parameters. The ones that can interest you the most are:

  • size_unit to have the sizes normalized such that they sum up to 1 and express the fraction of the data in each partition,

  • legend and verbose_labels in case the dataset has more descriptive names and not numbers,

  • cmap to change the values of the bars (for an overview of the available colors, have a look at link; check out cmap="tab20b")

And for even greater control, you can specify plot_kwargs and legend_kwargs as Dict, which will be further passed to the plot and legend functions.

You can also inspect the exact numbers that were used to create this plot. Three objects are returned (see reference here). Let’s inspect the returned DataFrame.

[ ]:
df
airplane automobile bird cat deer dog frog horse ship truck
Partition ID
0 817 794 1462 2123 432 25 456 384 14 9
1 1416 6 97 5 3 0 3409 0 3 868
2 0 4 11 2 454 3 511 15 84 21
3 762 159 1100 51 120 166 2 1982 1351 2175
4 2 43 714 2 19 2400 425 1 151 477
5 67 79 170 25 2552 477 27 44 590 0
6 422 2 4 486 380 92 90 380 50 6
7 122 2811 597 2174 1038 1727 1 682 515 4
8 256 29 342 75 1 84 8 1511 2240 1417
9 1136 1073 503 57 1 26 71 1 2 23

Each row represents a unique partition ID, and the columns represent unique labels (either in the verbose version if verbose_labels=True or typically int values otherwise, representing the partition IDs). That you can index the DataFrame df[partition_id, label_id] to get the number of samples in partition_id for the specified label_id.

[ ]:
df.loc[4, "bird"]
714

Let’s see a plot with size_unit="percent", which is another excellent way to understand the partitions. In this mode, the number of datapoints for each class in a given partition are normalized, so they sum up to 100.

[ ]:
fig, ax, df = plot_label_distributions(
    partitioner,
    label_name="label",
    plot_type="bar",
    size_unit="percent",
    partition_id_axis="x",
    legend=True,
    verbose_labels=True,
    cmap="tab20b",
    title="Per Partition Labels Distribution",
)
_images/tutorial-visualize-label-distribution_16_0.png

Heatmap#

You might want to visualize the results of partitioning as a heatmap, which can be especially useful for binary labels. Here’s how:

[ ]:
fig, ax, df = plot_label_distributions(
    partitioner,
    label_name="label",
    plot_type="heatmap",
    size_unit="absolute",
    partition_id_axis="x",
    legend=True,
    verbose_labels=True,
    title="Per Partition Labels Distribution",
    plot_kwargs={"annot": True},
)
_images/tutorial-visualize-label-distribution_19_0.png

Note: we used the plot_kwargs={"annot": True} to add the number directly to the plot.

If you are a pandas fan, then you might be interested that a similar heatmap can be created with the DataFrame object for visualization in jupyter notebook:

[ ]:
df.style.background_gradient(axis=None, cmap="Greens", vmin=0)
  airplane automobile bird cat deer dog frog horse ship truck
Partition ID                    
0 817 794 1462 2123 432 25 456 384 14 9
1 1416 6 97 5 3 0 3409 0 3 868
2 0 4 11 2 454 3 511 15 84 21
3 762 159 1100 51 120 166 2 1982 1351 2175
4 2 43 714 2 19 2400 425 1 151 477
5 67 79 170 25 2552 477 27 44 590 0
6 422 2 4 486 380 92 90 380 50 6
7 122 2811 597 2174 1038 1727 1 682 515 4
8 256 29 342 75 1 84 8 1511 2240 1417
9 1136 1073 503 57 1 26 71 1 2 23

Plot Comparison of Label Distributions#

Now, once you know how to visualize a single partitioned dataset, you’ll learn how to compare a few of them on a single plot.

Let’s compare:

  • IidPartitioner,

  • DirichletPartitioner,

  • ShardPartitioner still using the cifar10 dataset.

We need to create a list of partitioners. Each partitioner needs to have a dataset assigned to it (it does not have to be the same dataset so you can also compare the same partitioning on different datasets).

[ ]:
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import (
    IidPartitioner,
    DirichletPartitioner,
    ShardPartitioner,
)

partitioner_list = []
title_list = ["IidPartitioner", "DirichletPartitioner", "ShardPartitioner"]

## IidPartitioner
fds = FederatedDataset(
    dataset="cifar10",
    partitioners={
        "train": IidPartitioner(num_partitions=10),
    },
)
partitioner_list.append(fds.partitioners["train"])

## DirichletPartitioner
fds = FederatedDataset(
    dataset="cifar10",
    partitioners={
        "train": DirichletPartitioner(
            num_partitions=10,
            partition_by="label",
            alpha=1.0,
            min_partition_size=0,
        ),
    },
)
partitioner_list.append(fds.partitioners["train"])

## ShardPartitioner
fds = FederatedDataset(
    dataset="cifar10",
    partitioners={
        "train": ShardPartitioner(
            num_partitions=10, partition_by="label", num_shards_per_partition=2
        )
    },
)
partitioner_list.append(fds.partitioners["train"])

Now let’s visualize them side by side

[ ]:
from flwr_datasets.visualization import plot_comparison_label_distribution

fig, axes, df_list = plot_comparison_label_distribution(
    partitioner_list=partitioner_list,
    label_name="label",
    subtitle="Comparison of Partitioning Schemes on CIFAR10",
    titles=title_list,
    legend=True,
    verbose_labels=True,
)
_images/tutorial-visualize-label-distribution_27_0.png

Bonus: Natural Id Dataset#

Nothing stops you from using the NaturalIdPartitioner to visualize a dataset with the id in it and does not need the artificial partitioning but has the pre-existing partitions. For that dataset, we use NaturalIdPartitioner. Let’s look at the speech-commands dataset that has speaker_id, and there are quite a few speakers; therefore, we will show only the first 20 partitions. And since we have quite a few different labels, let’s specify legend_kwargs={"ncols": 2} to display them in two columns (we will also shift the legend slightly to the right).

You’ll be using the Google SpeechCommands dataset, which is a speech-based dataset. For this, you’ll need to install the "audio" extension for Flower Datasets. It can be easily done like this:

[ ]:
! pip install -q "flwr-datasets[audio]"

With everything ready, let’s visualize the partitions for a naturally partitioned dataset.

[ ]:
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import NaturalIdPartitioner
from flwr_datasets.visualization import plot_label_distributions


fds = FederatedDataset(
    dataset="google/speech_commands",
    subset="v0.01",
    partitioners={
        "train": NaturalIdPartitioner(
            partition_by="speaker_id",
        ),
    },
)

partitioner = fds.partitioners["train"]

fix, ax, df = plot_label_distributions(
    partitioner=partitioner,
    label_name="label",
    max_num_partitions=20,
    plot_type="bar",
    size_unit="percent",
    partition_id_axis="x",
    legend=True,
    title="Per Partition Labels Distribution",
    verbose_labels=True,
    legend_kwargs={"ncols": 2, "bbox_to_anchor": (1.25, 0.5)},
)
_images/tutorial-visualize-label-distribution_33_0.png

More resources#

If you are looking for more resources, feel free to check:

This was the last tutorial.

Previous tutorials:


Open in Colab