How to contribute a dataset

To make a dataset available in Flower Dataset (flwr-datasets), you need to add the dataset to HuggingFace Hub .

This guide will explain the best practices we found when adding datasets ourselves and point to the HFs guides. To see the datasets added by Flower, visit https://huggingface.co/flwrlabs.

Dataset contribution process

The contribution contains three steps: first, on your development machine transform your dataset into a datasets.Dataset object, the preferred format for datasets in HF Hub; second, upload the dataset to HuggingFace Hub and detail it its readme how can be used with Flower Dataset; third, share your dataset with us and we will add it to the recommended FL dataset list

Creating a dataset locally

You can create a local dataset directly using the datasets library or load it in any custom way and transform it to the datasets.Dataset from other Python objects. To complete this step, we recommend reading our Use with Local Data guide or/and the Create a dataset guide from HF.

Tip

We recommend that you do not upload custom scripts to HuggingFace Hub; instead, create the dataset locally and upload the data, which will speed up the processing time each time the data set is downloaded.

Contribution to the HuggingFace Hub

Each dataset in the HF Hub is a Git repository with a specific structure and readme file, and HuggingFace provides an API to push the dataset and, alternatively, a user interface directly in the website to populate the information in the readme file.

Contributions to the HuggingFace Hub come down to:

  1. creating an HF repository for the dataset.

  2. uploading the dataset.

  3. filling in the information in the readme file.

To complete this step, follow this HF’s guide Share dataset to the Hub.

Note that the push of the dataset is straightforward, and here’s what it could look like:

from datasets import Dataset

# Example dataset
data = {
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c']
}

# Create a Dataset object
dataset = Dataset.from_dict(data)

# Push the dataset to the HuggingFace Hub
dataset.push_to_hub("you-hf-username/your-ds-name")

To make the dataset easily accessible in FL we recommend adding the “Use in FL” section. Here’s an example of how it is done in one of our repos for the cinic10 dataset.

Increasing visibility of the dataset

If you want the dataset listed in our recommended FL dataset list , please send a PR or ping us in Slack #contributions channel.

That’s it! You have successfully contributed a dataset to the HuggingFace Hub and made it available for FL community. Thank you for your contribution!