@hackathon/berlin25-speech

Quickstart

flwr new @hackathon/berlin25-speech


Speech - Flower / PyTorch App (Track 2)

This app demonstrates federated learning for speech recognition using Flower and PyTorch, built around the Google Speech Commands dataset.
It's designed to help participants explore and understand how the dataset and federated setup work.


🚀 Quickstart

1) Install

First, install the app in editable mode so that any changes you make to the app code are immediately reflected in your environment.

pip install -e .

2) Run (Simulation Engine)

You can run the app using Flower's Simulation Runtime, which lets you simulate multiple clients locally.
This is the easiest way to experiment and debug your app before scaling it to real devices.

Run with default configuration:

flwr run .

Tip: Your pyproject.toml file can define more than just dependencies: it can also include hyperparameters (like lr or num-server-rounds) and control which Flower Runtime is used.
By default, this app uses the Simulation Runtime, but you can switch to the Deployment Runtime when needed.
Learn more in the TOML configuration guide.
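
For example, values defined under [tool.flwr.app.config] in pyproject.toml (such as lr or num-server-rounds) can be overridden at launch time instead of editing the file. A minimal sketch, assuming a recent Flower release where flwr run accepts the --run-config flag and that this app exposes those two keys:

flwr run . --run-config "num-server-rounds=5 lr=0.01"

Any key you don't override falls back to the value defined in pyproject.toml.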


🌸 Explore Flower Datasets

The snippet below shows how to load, partition, and visualize the Google Speech Commands dataset using flwr_datasets.
Instead of artificial IID splits, we create natural partitions per speaker and then inspect the label distribution per client.

from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import NaturalIdPartitioner
from flwr_datasets.visualization import plot_label_distributions

# Create a federated dataset with natural (non-IID) partitions by speaker
fds = FederatedDataset(
    dataset="google/speech_commands",
    subset="v0.02",
    partitioners={
        "train": NaturalIdPartitioner(
            partition_by="speaker_id",
        ),
    },
)

# Access the partitioner used for the training split
partitioner = fds.partitioners["train"]

# Plot per-partition label distributions
fig, ax, df = plot_label_distributions(
    partitioner=partitioner,
    label_name="label",
    max_num_partitions=20,
    plot_type="bar",
    size_unit="percent",
    partition_id_axis="x",
    legend=True,
    title="Per Partition Labels Distribution",
    verbose_labels=True,
    legend_kwargs={"ncols": 2, "bbox_to_anchor": (1.25, 0.5)},
)

💡 What this does:

  • Partitions the Speech Commands dataset by speaker_id, creating non-IID client datasets.
  • Uses plot_label_distributions to visualize how labels are distributed across a subset of clients.
  • Helps you understand data heterogeneity, which is a key aspect of realistic federated learning setups.
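
The sections below work on a single client's data, so first load one speaker partition and take a quick look at it. A small sketch (exact field names may vary slightly with the dataset version):

# Inspect how many speaker partitions exist and load the data of one client
print(f"Number of partitions (speakers): {partitioner.num_partitions}")

partition = fds.load_partition(partition_id=0)
print(f"Number of clips for this speaker: {len(partition)}")
print(partition[0].keys())  # e.g. 'file', 'audio', 'label', 'is_unknown', 'speaker_id', 'utterance_id'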

🔊 Visualize Audio and Features

Speech data starts as a 1D waveform, but in practice we transform each audio clip into a 2D time-frequency representation. Below we show how to apply the same preprocessing pipeline used in this app (MFCCs or Spectrograms) using Compose, Lambda, and torchaudio transforms.

View Raw Audio Waveform

import matplotlib.pyplot as plt

# Plot the first clip from the speaker partition loaded above
waveform = partition[0]["audio"]["array"]
plt.figure(figsize=(10, 3))
plt.plot(waveform)
plt.title("Raw Audio Waveform")
plt.xlabel("Time (samples)")
plt.ylabel("Amplitude")
plt.show()

Apply Audio Transforms (MFCC or Spectrogram)

The following preprocessing functions mirror the real preprocessing pipeline used by this app. They:

  1. Resample from 16 kHz → 8 kHz
  2. Pad/trim to exactly 1 second (8000 samples)
  3. Convert into MFCCs or a Spectrogram

Define Preprocessing Functions

import torch
import torch.nn.functional as F
from torchvision.transforms import Compose, Lambda
from torchaudio import transforms

def raw_audio_to_mfcc_transforms():
    ss = 8000             # target sample rate (Hz)
    n_mfcc = 40           # number of MFCC coefficients
    window_width = 40e-3  # analysis window length (s)
    stride = 20e-3        # hop between windows (s)
    n_fft = 400

    return Compose([
        transforms.Resample(16000, ss),                             # resample 16 kHz -> 8 kHz
        Lambda(lambda x: F.pad(x, (0, max(0, ss - x.shape[-1])))),  # pad short clips to 1 s
        Lambda(lambda x: x[..., :ss]),                              # trim long clips to 1 s (8000 samples)
        transforms.MFCC(
            sample_rate=ss,
            n_mfcc=n_mfcc,
            melkwargs={
                "win_length": int(ss * window_width),
                "hop_length": int(ss * stride),
                "n_fft": n_fft,
            },
        ),
    ])


def raw_audio_to_spectogram_transforms():
    n_fft = 400      # FFT size -> n_fft // 2 + 1 = 201 frequency bins
    hop_length = 40  # hop between frames (samples)
    ss = 8000        # target sample rate (Hz)

    return Compose([
        transforms.Resample(16000, ss),                             # resample 16 kHz -> 8 kHz
        Lambda(lambda x: F.pad(x, (0, max(0, ss - x.shape[-1])))),  # pad short clips to 1 s
        Lambda(lambda x: x[..., :ss]),                              # trim long clips to 1 s (8000 samples)
        transforms.Spectrogram(
            n_fft=n_fft,
            win_length=None,
            hop_length=hop_length,
            center=True,
            pad_mode="reflect",
            power=0.1,
        ),
    ])
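
Before wiring these transforms into a DataLoader, it can help to sanity-check one of them on a single clip from the partition loaded earlier. A small sketch; the expected output shape follows from the parameters above:

# Apply the MFCC pipeline to one raw clip and inspect the result
mfcc_transform = raw_audio_to_mfcc_transforms()
clip = torch.tensor(partition[0]["audio"]["array"], dtype=torch.float32)
print(mfcc_transform(clip).shape)  # expected: torch.Size([40, 51]) for a 1-second clip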

Apply Transforms to the Federated Dataset

from torch.utils.data import DataLoader

def prepare_dataset(preprocess_func_to_apply):
    # Build the transform pipeline once, then apply it to every clip in a batch
    func = preprocess_func_to_apply()

    def apply_transforms(batch):
        audio = batch["audio"]
        for aud in audio:
            aud["array"] = func(torch.tensor(aud["array"], dtype=torch.float32))
        return batch

    return apply_transforms

# MFCC preprocessed partition
transformed_partition_mfcc = partition.with_transform(
    prepare_dataset(raw_audio_to_mfcc_transforms)
)
trainloader_mfcc = DataLoader(transformed_partition_mfcc, batch_size=4, shuffle=False)

# Spectrogram preprocessed partition
transformed_partition_spec = partition.with_transform(
    prepare_dataset(raw_audio_to_spectogram_transforms)
)
trainloader_spec = DataLoader(transformed_partition_spec, batch_size=4, shuffle=False)

Visualize MFCCs and Spectrograms

Because both DataLoaders iterate over the same partition (and shuffle=False), each batch contains the same audio samples but transformed differently.

for batch_mfcc, batch_spec in zip(trainloader_mfcc, trainloader_spec):

    print(f"{batch_mfcc['audio']['array'].shape = }")
    print(f"{batch_spec['audio']['array'].shape = }")

    fig, ax = plt.subplots(ncols=2, figsize=(8, 3))

    ax[0].matshow(batch_mfcc["audio"]["array"][0])
    ax[0].set_title("MFCC")

    ax[1].matshow(batch_spec["audio"]["array"][0])
    ax[1].set_title("Spectrogram")

    plt.tight_layout()
    plt.show()
    break

Example output shapes:

batch_mfcc['audio']['array'].shape = torch.Size([4, 40, 51])
batch_spec['audio']['array'].shape = torch.Size([4, 201, 201])

These are the actual shapes produced by the preprocessing pipeline defined above.
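
If you want to verify where these numbers come from: with the default center=True, torchaudio produces samples // hop_length + 1 frames, and a spectrogram has n_fft // 2 + 1 frequency bins. A quick back-of-the-envelope check:

samples = 8000  # 1 second of audio at 8 kHz

# MFCC: 40 coefficients, hop_length = int(8000 * 20e-3) = 160
print(40, samples // 160 + 1)             # 40 51

# Spectrogram: n_fft = 400, hop_length = 40
print(400 // 2 + 1, samples // 40 + 1)    # 201 201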


Explanation

💡 What this does:

  • The MFCC transform converts the waveform into 40×51 cepstral coefficients.
  • The Spectrogram transform produces a 201×201 time-frequency matrix.
  • These representations are commonly used in speech recognition systems.
  • The visualization shows how the same raw audio clip appears under two different feature extractors.
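
To connect this back to training: PyTorch layers such as nn.Conv2d expect a channel dimension, so a batch of MFCC features is typically reshaped before the forward pass. A minimal sketch using the MFCC DataLoader from above (illustrative only; not necessarily how this app's model consumes the data):

# Turn one batch of MFCC features into model-ready tensors
batch = next(iter(trainloader_mfcc))
features = batch["audio"]["array"].unsqueeze(1)  # [4, 40, 51] -> [4, 1, 40, 51], add a channel dim
labels = batch["label"]                          # tensor of class indices, shape [4]
print(features.shape, labels.shape)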

🧩 For more audio examples, see Javier's preprocessing snippets here:
https://gist.github.com/jafermarq/569ecea83f43bc95fea0599db99ceac2


📚 Learn More