What Open Corpora Do for Research — and What GCclean Is Building Toward

In critical care informatics, MIMIC-III changed how an entire field worked. Before it, researchers operated on private hospital records: powerful data, but siloed and unreproducible. After MIMIC, methods could be built, tested, and compared on the same records by any credentialed researcher who signed the data use agreement. A community formed around the dataset.

The same pattern holds in biomedical signal processing, where PhysioBank has anchored a generation of arrhythmia detection and signal analysis work. Open, curated, analysis-ready corpora don’t just make research easier — they define what questions get asked and who can ask them.

Sports performance time-series analysis has lacked that catalyst. That gap is what GCclean is designed to fill.

**What GCclean is**

GCclean is a cleaned, analysis-ready cycling power corpus derived from the GoldenCheetah Open Data archive — one of the largest collections of real-world athlete power files donated to open science. The raw archive is rich, but it requires substantial preparation before it can support population-level analysis. GCclean is that prepared version: a curated corpus built to be both reproducible and reusable.

A data-descriptor manuscript is currently in preparation for submission to *Scientific Data* (Nature Portfolio). It will characterise the corpus contents, the athlete population it represents, and the functional and parametric structure of the power-duration landscape across the full sample. Planned deposit artifacts include the cleaned corpus, the cleaning pipeline, pooled functional principal component bases, per-athlete career-best curves, population percentile tables, and per-athlete parametric profile estimates — all as open, machine-readable files.
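For readers unfamiliar with the term, a career-best curve is usually built from mean-maximal power: the best average power an athlete sustained for each duration, taken over every ride on record. The sketch below illustrates that common definition in Python with NumPy. It is not the GCclean pipeline; the function names are hypothetical, and it assumes each ride is a gap-free power series sampled at 1 Hz.

```python
import numpy as np

def mean_maximal_power(power_w, durations_s):
    """Best average power (W) sustained for each duration within one ride.

    Assumes 1 Hz sampling with no gaps (an assumption of this sketch).
    """
    power = np.asarray(power_w, dtype=float)
    # Prefix sums make every sliding-window sum a single subtraction.
    csum = np.concatenate(([0.0], np.cumsum(power)))
    curve = []
    for d in durations_s:
        if d > power.size:
            curve.append(np.nan)  # ride is shorter than this duration
            continue
        window_sums = csum[d:] - csum[:-d]
        curve.append(window_sums.max() / d)
    return np.array(curve)

def career_best_curve(rides, durations_s):
    """Pointwise maximum of the per-ride curves, skipping rides that are too short."""
    curves = np.vstack([mean_maximal_power(r, durations_s) for r in rides])
    return np.nanmax(curves, axis=0)

# Two toy rides, in watts at 1 Hz.
rides = [[100, 200, 300, 200], [250, 250]]
best = career_best_curve(rides, durations_s=[1, 2, 4])
```

Population percentile tables of the kind listed above could then be read off by stacking one such curve per athlete and taking percentiles at each duration.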

**Step 1 of a larger program**

GCclean doesn’t stand alone. It is Step 1 of the LightBox research cascade — a seven-step program in which each study introduces one analytical tool, validates it on the corpus, and hands it forward as infrastructure for the next.

Downstream steps address pacing structure across the power-duration domain, the effort architecture of full rides via a grammar-constrained segmentation model, a Bayesian fitness tracker derived from how that structure shifts across a season, and durability — how performance degrades under accumulated fatigue — stratified by athlete type at population scale. Each of those papers is designed to cite GCclean for sample characterisation.

The corpus is the foundation. Get the foundation right, and everything built on it is reproducible from the ground up.

**What’s next**

The manuscript is in preparation. Deposit of the corpus and reference artifacts is planned to coincide with submission. Neither is public yet.

What this post is: a signal that the work is in motion, the corpus exists, and the program it anchors is real. If you work in sports science, exercise physiology, or performance analytics — or if you care about open data done carefully — this is worth watching.
