• In critical care informatics, MIMIC-III changed how an entire field worked. Before it, researchers operated on private hospital records — powerful data, but siloed and unreproducible. After MIMIC, methods could be built, tested, and compared by anyone with a research protocol. A community formed around the dataset.

    The same pattern holds in biomedical signal processing, where PhysioBank has anchored a generation of arrhythmia detection and signal analysis work. Open, curated, analysis-ready corpora don’t just make research easier — they define what questions get asked and who can ask them.

    Sports performance time-series analysis has lacked that catalyst. That gap is what GCclean is designed to fill.

    **What GCclean is**

    GCclean is a cleaned, analysis-ready cycling power corpus derived from the GoldenCheetah Open Data archive — one of the largest collections of real-world athlete power files donated to open science. The raw archive is rich, but it requires substantial preparation before it can support population-level analysis. GCclean is that prepared version: a curated corpus built to be both reproducible and reusable.

    A data-descriptor manuscript is currently in preparation for submission to *Scientific Data* (Nature Portfolio). It will characterise the corpus contents, the athlete population it represents, and the functional and parametric structure of the power-duration landscape across the full sample. Planned deposit artifacts include the cleaned corpus, the cleaning pipeline, pooled functional principal component bases, per-athlete career-best curves, population percentile tables, and per-athlete parametric profile estimates — all as open, machine-readable files.

    **Step 1 of a larger program**

    GCclean doesn’t stand alone. It is Step 1 of the LightBox research cascade — a seven-step program in which each study introduces one analytical tool, validates it on the corpus, and hands it forward as infrastructure for the next.

    Downstream steps address pacing structure across the power-duration domain, the effort architecture of full rides via a grammar-constrained segmentation model, a Bayesian fitness tracker derived from how that structure shifts across a season, and durability — how performance degrades under accumulated fatigue — stratified by athlete type at population scale. Each of those papers is designed to cite GCclean for sample characterisation.

    The corpus is the foundation. Get the foundation right, and everything built on it is reproducible from the ground up.

    **What’s next**

    The manuscript is in preparation. Deposit of the corpus and reference artifacts is planned to coincide with submission. Neither is public yet.

    This post is a signal that the work is in motion, the corpus exists, and the program it anchors is real. If you work in sports science, exercise physiology, or performance analytics, or if you care about open data done carefully, this is worth watching.

  • Every time a cyclist trains with a power meter, their effort is recorded — every pedal stroke, every interval, every hour in the saddle. Most of that data goes nowhere useful. Not because it lacks information, but because the tools we use to analyse training weren’t built to see it.

    This is the starting problem for LightBox.

    **A dataset that hasn’t been looked at properly**

    The GoldenCheetah Open Data corpus contains more than 4,500 athlete-years of complete cycling power files, donated to open science by athletes and coaches who wanted their data to matter. It is one of the largest open datasets in sports science. The rides are complete: not summarised, not aggregated, but raw power recorded at every second.

    The standard approach to a dataset like this is to compute training load metrics, formulas that collapse each ride to a single number representing how hard the athlete worked, or to extract maximal power profiles: the highest power a rider sustained for five seconds, for a minute, for twenty minutes. These are useful. They’re also a narrow window. A training load score tells you almost nothing about how the effort was structured. A maximal power profile tells you what a rider’s ceiling is, not how they got there.
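
    To make the idea of a maximal power profile concrete, here is a minimal sketch. It is an illustration only, not the GCclean or GoldenCheetah implementation. It assumes one power sample per second with no gaps, and the function name `mean_maximal_power` and the ten-second sample ride are hypothetical.

```python
from itertools import accumulate


def mean_maximal_power(power_watts, durations_s):
    """Best average power (W) per window length, assuming gap-free 1 Hz samples."""
    # Prefix sums let each window's total be computed with one subtraction.
    prefix = [0.0, *accumulate(power_watts)]
    n = len(power_watts)
    profile = {}
    for d in durations_s:
        if d <= 0 or d > n:
            profile[d] = None  # ride is too short for this duration
            continue
        best_total = max(prefix[i + d] - prefix[i] for i in range(n - d + 1))
        profile[d] = best_total / d
    return profile


# Hypothetical ten-second ride: a five-second surge inside easy riding.
ride = [100, 100, 300, 300, 300, 300, 300, 100, 100, 100]
profile = mean_maximal_power(ride, [1, 5, 10, 20])
print(profile)
```

    Each entry answers only "what was the best average over this many seconds?" Where in the ride that window fell, and what surrounded it, is discarded, which is the narrowness the paragraph above describes.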

    **The gap this program occupies**

    Cycling science has approached performance from two directions that don’t quite meet. Traditional sports physiology produces interpretable results — but it works on small, often elite samples and was built around the laboratory, not the power file. The data-driven turn in sports science produces powerful pattern recognition — but the outputs are often physiologically opaque: a prediction without a mechanism, a cluster without a name.

    LightBox sits in the unoccupied space between them. Every tool in the program produces outputs in physiological units, the kind a coach can act on and an athlete can understand. The commitment to interpretability is not an aesthetic preference. It’s a research constraint: a result that can’t be explained can’t be applied, and if it can’t be applied, it’s hard to know whether it’s right.

    **A cascade, not a collection of studies**

    The program is structured as a research cascade. Each study introduces one analytical tool, validates it on the corpus, and hands it forward as infrastructure for the next step.

    The first study asks how maximal efforts are paced across the power-duration domain — not for one athlete, but at population scale. The second introduces a model that segments the full structure of a ride into physiologically labelled phases, and derives a fitness tracker from how that structure drifts over time. The third addresses durability — how performance degrades under accumulated fatigue — stratified by athlete type across the full corpus.

    Each step builds on the last. The cascade converges on the Puchowicz Model of Exercise Segment Analysis: a unified framework for characterising effort at the ride level and across a season, in units that mean something.

    **What we’re building toward**

    This is a program introduction, not a findings report. The papers are in progress; the tools are being built and validated.

    What’s already clear is the scope of what becomes possible when the full power file is treated as signal rather than noise — when the question isn’t just “how hard did this rider work?” but “how did they work, and what does that tell us about how they perform?”

    That’s the question LightBox is built to answer.
