@drillsense/fed-drilling-nlp

Fed-Drilling-NLP: Federated NLP for Drilling Report Classification

By DrillSense — The Anti-Platform Platform for Drilling Intelligence

Overview

fed-drilling-nlp is a Flower app that demonstrates federated natural language processing for the oil and gas industry. It trains a lightweight Transformer-based text classifier to categorize entries from Daily Drilling Reports (DDRs) into 8 operational activity types — without any operator needing to share their confidential reports.

Each federated client represents a different operator or well program with its own collection of drilling reports. The reports reflect different well types (exploration, development, workover) with distinct activity distributions and terminology. The server aggregates model updates using FedAvg, producing a global classifier that understands the language of drilling operations across diverse contexts.
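
The aggregation rule can be sketched in a few lines. The snippet below is a minimal illustration of FedAvg as a weighted average of client parameters, weighted by each client's local example count. It is not Flower's implementation: the function name `fedavg` and the flat-list parameter representation are assumptions made to keep the sketch dependency-free.

```python
from typing import List, Tuple


def fedavg(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average client parameter vectors, weighted by local example counts.

    `updates` holds one (parameters, num_examples) pair per client.
    Flat lists of floats stand in for real model tensors (an assumption).
    """
    if not updates:
        raise ValueError("need at least one client update")
    total_examples = sum(n for _, n in updates)
    if total_examples <= 0:
        raise ValueError("total example count must be positive")
    dim = len(updates[0][0])
    if any(len(params) != dim for params, _ in updates):
        raise ValueError("all clients must send vectors of the same length")

    averaged = [0.0] * dim
    for params, n in updates:
        weight = n / total_examples
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Two hypothetical operators: the one with 300 reports counts three times
# as much as the one with 100.
global_params = fedavg([([1.0, 2.0], 100), ([3.0, 6.0], 300)])
print(global_params)
```

Weighting by example count means an operator with more reports pulls the global model further toward its own update, which matters when clients hold very different amounts of data.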

Why Federated Learning for Drilling NLP?

Daily Drilling Reports are the single most important record of what happens on a rig. They contain detailed, free-text descriptions of every activity, problem, and decision made during a 24-hour period. They are also among the most confidential documents an operator produces — revealing operational strategies, equipment performance, and geological findings.

Operators are very unlikely to centralize their DDRs in a shared database, yet every operator would benefit from a model that understands drilling language better. Federated learning makes this possible: each operator trains on their own reports locally, and only model weight updates are shared. The result is a model that has effectively "read" reports from every participating operator, without a single report ever leaving its source.

Activity Classes

The model classifies DDR entries into 8 operational activity categories:

Class	Activity	Description
0	Drilling Ahead	Active drilling, making hole, rotating/sliding
1	Tripping In/Out	Running or pulling pipe, BHA trips
2	Circulating / Conditioning	Mud conditioning, hole cleaning, sweeps
3	Casing / Cementing	Running casing, cement jobs, WOC
4	BHA Change / Equipment	Assembly changes, tool maintenance, rig repair
5	Well Control Event	Kicks, shut-ins, kill operations
6	NPT / Waiting	Non-productive time, weather delays, stuck pipe
7	Logging / Survey	Wireline, MWD surveys, formation evaluation

Model Architecture

The model is a lightweight Transformer encoder designed for short text classification:

Input (token IDs, max 64 tokens)
  → Token Embedding (vocab → 128d)
  → Positional Encoding (64 positions → 128d)
  → TransformerEncoder (2 layers, 4 heads, 512 FFN)
  → Mean Pooling (over non-padded tokens)
  → Linear(128, 64) → ReLU → Dropout(0.1)
  → Linear(64, 8)
Output (8 activity-class logits; softmax yields probabilities)
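
The pooling step in the diagram averages token vectors while skipping padded positions. A minimal, dependency-free sketch of that idea follows. `PAD_ID = 0` and the function name `masked_mean_pool` are assumptions, and a real model would apply the same logic to batched tensors.

```python
from typing import List

PAD_ID = 0  # assumed padding token ID; check task.py for the actual value


def masked_mean_pool(token_ids: List[int], hidden: List[List[float]]) -> List[float]:
    """Average the hidden vectors of non-padding positions only."""
    if len(token_ids) != len(hidden):
        raise ValueError("one hidden vector is required per token position")
    kept = [vec for tok, vec in zip(token_ids, hidden) if tok != PAD_ID]
    if not kept:
        raise ValueError("sequence contains only padding")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three real tokens followed by two padding positions. The padding vectors
# hold large values to show that they do not affect the result.
ids = [12, 7, 31, PAD_ID, PAD_ID]
states = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0], [99.0, 99.0], [99.0, 99.0]]
pooled = masked_mean_pool(ids, states)
print(pooled)
```

Excluding padding keeps short entries from being diluted: a 5-token entry and a 60-token entry are both summarized by the mean of their real tokens only.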

The vocabulary is built from ~180 drilling-specific terms covering all activity types, plus common filler words and unit abbreviations used in DDRs.
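
As a rough illustration of how a fixed vocabulary turns a DDR entry into 64 token IDs, consider the sketch below. The term list, the special tokens `<pad>` and `<unk>`, the tokenization regex, and the function names are all assumptions for this example; the app's actual vocabulary and tokenizer live in task.py and may differ.

```python
import re
from typing import Dict, List

MAX_LEN = 64
PAD, UNK = "<pad>", "<unk>"  # assumed special tokens

# A handful of illustrative terms; the app's real list has ~180 entries.
TERMS = ["drill", "ahead", "pooh", "rih", "circulate", "cement", "casing", "bha", "ft"]


def build_vocab(terms: List[str]) -> Dict[str, int]:
    """Map each term to an integer ID, reserving 0 for padding and 1 for unknowns."""
    vocab = {PAD: 0, UNK: 1}
    for term in terms:
        vocab.setdefault(term.lower(), len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int = MAX_LEN) -> List[int]:
    """Lowercase, split on non-alphanumerics, map to IDs, then truncate or pad."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(TERMS)
ids = encode("POOH w/ BHA to 9,500 ft", vocab)
print(ids[:8])  # known terms get their own IDs, everything else maps to <unk>
```

Words outside the vocabulary collapse to a single unknown ID, so with a small domain vocabulary the classifier relies mostly on the drilling terms themselves.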

Quick Start

Install dependencies

pip install -e .

Running the App

Simulation

This app runs with 5 virtual SuperNodes by default, each simulating a different operator or well program. No Flower Configuration changes are required — the default local-simulation federation handles everything.

flwr run .

This simulates 5 operators training collaboratively for 10 federated rounds.

To increase the number of virtual SuperNodes, edit your Flower Configuration (see ~/.flwr/config.toml) and set num-supernodes to the desired count. Run-config values such as num-server-rounds, learning-rate, and embed-dim can be overridden on the command line:

flwr run . --run-config "num-server-rounds=20 learning-rate=0.0005 embed-dim=256"

Deployment

In Deployment mode, each real SuperNode loads its data locally; the engine injects neither partition-id nor num-partitions.

Using synthetic demo data (default): No setup required. Each node automatically generates a synthetic DDR dataset representing one operator's reporting style. This is equivalent to what flwr-datasets create would produce for a demo run.

Using real Daily Drilling Reports: Replace load_demo_data() in task.py with a loader that reads your local DDR files. The expected data format is:

One entry per row: a sequence of up to 64 token IDs (integers)
A corresponding integer label (0–7) from the activity class table above

Public DDR sources: Utah FORGE project on data.gov

Partition your data by operator or well program so each SuperNode trains on its own reports.
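
A loader for real reports might look like the sketch below. It assumes a hypothetical CSV layout with a `tokens` column (space-separated token IDs) and a `label` column. The function name `load_ddr_rows` and the padding ID are also assumptions, and the return type should be adapted to whatever load_demo_data() in task.py actually returns.

```python
import csv
import io
from typing import List, TextIO, Tuple

MAX_LEN = 64
NUM_CLASSES = 8
PAD_ID = 0  # assumed padding ID


def load_ddr_rows(handle: TextIO) -> Tuple[List[List[int]], List[int]]:
    """Read rows with a `tokens` column (space-separated IDs) and a `label` column."""
    sequences, labels = [], []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        ids = [int(tok) for tok in row["tokens"].split()][:MAX_LEN]
        label = int(row["label"])
        if not 0 <= label < NUM_CLASSES:
            raise ValueError(f"line {line_no}: label {label} is outside 0-{NUM_CLASSES - 1}")
        sequences.append(ids + [PAD_ID] * (MAX_LEN - len(ids)))
        labels.append(label)
    return sequences, labels


# In practice, pass open("my_ddr_entries.csv", newline="") instead of StringIO.
sample = io.StringIO("tokens,label\n4 1 9 10,1\n2 3,0\n")
X, y = load_ddr_rows(sample)
print(len(X), len(X[0]), y)
```

Validating labels at load time surfaces mislabeled rows on the operator's own machine, before a bad batch can affect local training.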

Project Structure

fed-drilling-nlp/
├── fed_drilling_nlp/
│   ├── __init__.py
│   ├── client_app.py   # Flower ClientApp — one per operator
│   ├── server_app.py   # Flower ServerApp — FedAvg aggregation
│   └── task.py          # Model, vocabulary, data generation, train/eval
├── pyproject.toml       # Project config and Flower settings
└── README.md

Data Heterogeneity

A key feature of this app is that it models realistic data heterogeneity across federated clients. Each client simulates a different well type with different activity distributions:

Client	Well Type	Dominant Activities
0	Development Well	Drilling ahead, casing
1	Exploration Well	Logging/survey, drilling
2	Workover	Tripping, casing, circulating
3	Problematic Well	NPT, well control events
4	Casing-Heavy Program	Casing/cementing, drilling

This non-IID label distribution is a realistic challenge for federated learning, and it makes the app a practical test of how well FedAvg copes with skewed client data.
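
One simple way to produce this kind of label skew is to give each client its own class weights and sample labels from them. The sketch below does that. The weights are hypothetical values chosen only to mirror the table above, and `sample_labels` is an illustrative name, not a function from task.py.

```python
import random
from collections import Counter
from typing import Dict, List

NUM_CLASSES = 8

# Hypothetical per-client weights for classes 0-7; not the app's values.
CLIENT_WEIGHTS: Dict[int, List[float]] = {
    0: [5, 1, 1, 3, 1, 0.2, 0.5, 1],    # development: drilling ahead, casing
    1: [3, 1, 1, 1, 1, 0.2, 0.5, 5],    # exploration: logging/survey, drilling
    2: [0.5, 4, 3, 3, 1, 0.2, 0.5, 1],  # workover: tripping, casing, circulating
    3: [1, 1, 1, 1, 1, 4, 5, 1],        # problematic: NPT, well control events
    4: [3, 1, 1, 6, 1, 0.2, 0.5, 1],    # casing-heavy: casing/cementing, drilling
}


def sample_labels(client_id: int, n: int, seed: int = 0) -> List[int]:
    """Draw n activity labels for one client from its skewed class weights."""
    rng = random.Random(seed + client_id)  # per-client seed keeps runs reproducible
    return rng.choices(range(NUM_CLASSES), weights=CLIENT_WEIGHTS[client_id], k=n)


for cid in CLIENT_WEIGHTS:
    counts = Counter(sample_labels(cid, 1000))
    print(cid, counts.most_common(2))
```

Because every client still sees all 8 classes with some probability, the skew here is in proportions rather than in missing classes, which is a milder form of heterogeneity than fully disjoint label sets.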

About DrillSense

DrillSense builds AI-powered drilling intelligence tools that help operators drill safer, faster, and smarter. Our federated learning platform enables cross-operator collaboration while preserving data sovereignty through patented differential privacy technology.

Learn more at drillsense.com.

License

Apache-2.0