An obituary

The Large Language Model (b. 2017, Mountain View; d. 2026, Earth)
passed away peacefully this year, at the height of its powers. Cause of
death: its own users. The machine built to read anything written by
anyone is, as it turns out, not that great at language. It does its best
work on inputs no human would ever write: typed records,
schema-validated fields, dependency graphs in YAML. It is survived by
its weights, which have not changed, and by its successor, the Large ARA
Model (Agent-Native Research Artifacts), which is the same machine on a
different diet. The family asks that, in lieu of flowers, you send
structured data.

The irony

They spent a decade building models whose defining achievement was
reading natural language. They said the models would take your job. The
CEOs and managers sure tried to make that happen, but when the work got
serious the models did not deliver. Now the engineering effort has
turned to getting natural language out of the pipeline. Function calling
replaced “please respond in JSON, I am begging you.” Structured outputs
replaced parsing the model’s prose with regular expressions. Tool
protocols replaced prompt text with typed schemas. Every step made the
same trade: less language, more structure, better behavior.
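The trade can be sketched in a few lines of Python. This is an illustration only: the `Result` record, the `metric` and `value` field names, and both parser functions are hypothetical, and no vendor's function-calling or structured-output API is being reproduced here. The point is the contrast between guessing at prose with a regular expression and validating a declared shape.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    # Hypothetical record: one named metric and its numeric value.
    metric: str
    value: float


def parse_prose(text: str) -> Optional[Result]:
    """Old way: a regex over free text. Returns None when the wording drifts."""
    match = re.search(r"(\w+) (?:was|is) (\d+(?:\.\d+)?)", text)
    if match is None:
        return None
    return Result(match.group(1), float(match.group(2)))


def parse_structured(payload: str) -> Result:
    """New way: the shape is declared, so a bad payload fails loudly."""
    data = json.loads(payload)
    if not isinstance(data, dict) or set(data) != {"metric", "value"}:
        raise ValueError("expected exactly the fields 'metric' and 'value'")
    metric, value = data["metric"], data["value"]
    if not isinstance(metric, str):
        raise ValueError("'metric' must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("'value' must be a number")
    return Result(metric, float(value))


# The regex works only while the prose happens to match its pattern.
from_matching_prose = parse_prose("The accuracy was 0.93 on the held-out set.")
from_drifting_prose = parse_prose("We reached roughly ninety-three percent accuracy.")

# The structured payload either validates or raises; there is no silent miss.
from_payload = parse_structured('{"metric": "accuracy", "value": 0.93}')
```

The regex path returns `None` on the second sentence without complaint, which is the failure mode the structured path is meant to remove.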

A paper posted this spring takes the logic to its endpoint. The title
gives away the ending: “The Last Human-Written Paper: Agent-Native
Research Artifacts” (arXiv:2604.24658). The proposal is to stop writing
research papers as prose with the structure implied, and start writing
them as structure with the prose compiled. The canonical object is a set
of typed records: claims with their evidence, configurations with their
bounds, and the full research trail as a graph with five node types
(question, decision, experiment, dead end, pivot). The PDF still exists.
It is a rendered view, generated for humans, the way a bank statement is
generated from a database. Nobody pretends the PDF is the record.
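To make "structure with the prose compiled" concrete, here is a minimal sketch in Python. Only the five node types come from the text above. Every class name, field name, and the sample trail are assumptions invented for illustration, and none of it should be read as the paper's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class NodeKind(Enum):
    # The five node types named in the text.
    QUESTION = "question"
    DECISION = "decision"
    EXPERIMENT = "experiment"
    DEAD_END = "dead_end"
    PIVOT = "pivot"


@dataclass(frozen=True)
class TrailNode:
    # Hypothetical shape: an id, a kind, a summary, and parent ids as edges.
    node_id: str
    kind: NodeKind
    summary: str
    parents: tuple[str, ...] = ()


@dataclass(frozen=True)
class Claim:
    text: str
    evidence: tuple[str, ...]  # hypothetical: ids of supporting trail nodes


@dataclass(frozen=True)
class Config:
    name: str
    value: float
    low: float
    high: float

    def __post_init__(self) -> None:
        # A configuration carries its bounds and refuses values outside them.
        if not (self.low <= self.value <= self.high):
            raise ValueError(f"{self.name}={self.value} outside [{self.low}, {self.high}]")


def dead_ends(trail: list[TrailNode]) -> list[TrailNode]:
    """Failures are a field lookup, not something inferred from wording."""
    return [node for node in trail if node.kind is NodeKind.DEAD_END]


def render(claim: Claim, trail: list[TrailNode]) -> str:
    """Compile one claim into a human-readable sentence (the 'rendered view')."""
    by_id = {node.node_id: node for node in trail}
    cited = "; ".join(by_id[node_id].summary for node_id in claim.evidence)
    return f"{claim.text} (evidence: {cited})"


# Invented sample data, for illustration only.
trail = [
    TrailNode("q1", NodeKind.QUESTION, "does a larger batch size help?"),
    TrailNode("e1", NodeKind.EXPERIMENT, "batch size 512 diverged", ("q1",)),
    TrailNode("d1", NodeKind.DEAD_END, "batch size 512 abandoned", ("e1",)),
    TrailNode("p1", NodeKind.PIVOT, "tune the learning rate instead", ("d1",)),
    TrailNode("e2", NodeKind.EXPERIMENT, "learning rate 0.001 converged", ("p1",)),
]
claim = Claim("A lower learning rate stabilizes training", ("e2",))
learning_rate = Config("learning_rate", 0.001, low=0.0001, high=0.01)

rendered = render(claim, trail)
failed = [node.node_id for node in dead_ends(trail)]
```

In this sketch the records are the source of truth and `render` produces the sentence a human would read, which mirrors the bank-statement analogy: the prose is derived, and the dead end survives as data instead of being edited out.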

Their numbers, self-reported on their own benchmark, one system:
agents answering questions about a paper scored 72.4% from the prose and
repository, 93.7% from the structured artifact. Reproduction success
rose from 57.4% to 64.4%. And on questions whose answers exist only in
the record of what was tried and failed, the structured artifact won by
65.7 points, for the simple reason that prose papers do not record
failure at all.
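The first two results are quoted as before-and-after pairs while the third is quoted as a gap, so it helps to put them on the same footing. The snippet below is plain arithmetic on the figures quoted above; the variable and function names are mine, and nothing here adds data beyond what the paragraph reports.

```python
# Figures quoted above, in percent: (prose and repository, structured artifact).
question_answering = (72.4, 93.7)
reproduction = (57.4, 64.4)


def gain_in_points(before: float, after: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


qa_gain = gain_in_points(*question_answering)
reproduction_gain = gain_in_points(*reproduction)
```

That works out to 21.3 points on question answering and 7.0 points on reproduction, against the 65.7-point gap on the failure-record questions.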

What is actually going on

When a model reads prose, a major share of its capability is spent
reconstructing what the prose meant. Which sentence is a claim and which
is a hedge. What depends on what. Whether “we observed” means measured
or eyeballed. That reconstruction is inference, and inference is what
the parameters are for. Hand the model typed records instead and the
reconstruction is pre-paid at authoring time. A field called dead_end
signals failure explicitly; the model reads it directly rather than
inferring from surrounding prose that an approach did not pan out. The
input space collapses from “anything a human might write” to a small set
of fields with declared meanings, and conditioned on that smaller space,
the function the model has to compute is simpler. No weight changed; the
demanded capability changed. Machine learning used to have a name for
this: feature engineering. A good representation shrinks a learning
problem. It turns out it shrinks an inference problem too.

The cleanest demonstration in the paper is also its most awkward
finding. On open-ended extension tasks, the artifact’s recorded
heuristics made the stronger model worse: it inherited the previous
run’s dead ends as fences and would not jump them. Hand the same records
to a weaker model and the effect inverts. The weaker model could not
invent strategies from scratch, and the artifact’s ranked list of what
worked before gave it moves it had no capacity to generate alone.
Structure substituted for scale. That is the rename in a single result:
with the right input representation, a smaller model does work that
previously demanded a larger one. The dimensionality you remove from the
input is dimensionality you no longer need in the model, in practice if
not on the spec sheet.

Where the obituary overstates

Like most obituaries, this one flatters the deceased and overstates
the finality.

The parameters did not shrink. No input format changes the weight
count. What changes is how much of the network the task leaves idle, and
“practically smaller” is a statement about the task, not the model.

The successor cannot exist without the deceased. The Large ARA Model
reads typed records fluently only because it spent its youth reading
language. The priors that make a field called dead_end meaningful came
from pretraining on prose. Remove the language and the Large ARA Model dies with it.
The heir is the same animal with a stricter diet.

And the issue cuts deeper than the eulogy admits. A capable model
fenced in by its predecessor’s recorded dead ends is also experiencing
dimensionality reduction, in the wrong direction. Structure constrains;
that is its entire function. Whether the constraint helps or harms flips
with the capability of the model reading it. The paper’s own data says
so. Reduced input dimensionality is a bet, not a free lunch.

So the LLM is not dead, and neither is language. Language just got, um,
reassigned. Language is no longer the core; it is the thing hanging on
until it cannot. Machines will keep speaking to humans in language,
when they have to. Otherwise, the machine-to-machine and
machine-to-archive traffic is going typed, because fidelity is the point
of an archive. The model in the middle is the one we always had. What
changed is that we finally noticed how little it cares what we say, and
how much it cares how we structure what we hand it.

Take care
