The Parts
A foundation has money for three experiments. A lab finished a failed run yesterday, but the result is still in an instrument export and a thread. A model proposes a trial design this morning against last month’s literature. By Friday, the experiments will be chosen before the failed run is absorbed, and the old assumption will buy another month of work. Somewhere downstream a patient enrolls into the third trial because the second one closed.
That is the failure: the correction arrives after the decision.
The hinge. The engine only matters if a correction reaches the next decision before money, model trust, or clinical action inherit the old state.
Science already has most of the parts: papers, datasets, lab traces, AI agents, peer review, cloud labs, funders. What it lacks is anything that lets one part change the next before another decision is made.
Those parts do not share the thing that matters most: the current state of the claim. Papers, datasets, lab traces, model outputs, review notes, funding decisions, and clinical observations all record activity in different places. The diagnosis sits beside older infrastructure efforts: FAIR data principles (Wilkinson et al., 2016), nanopublications (Groth et al., 2010), scientific workflow provenance (W3C PROV), and knowledge systems such as the Open Research Knowledge Graph. The point here is narrower: not better metadata alone, but a governed state transition.
A paper can tell you what an author claimed. It does not tell the next system what changed, what depends on the claim, or what should be tested next.
This is why a scientist still stitches the picture together by hand. She searches the literature in one place, checks data in another, reads code in a third, reconstructs methods from a supplement, asks a colleague whether a failure was real, opens a model chat that will not remember the correction tomorrow, and then writes a narrative artifact someone else has to reverse-engineer later. Wrong trial assumptions continue, failed experiments repeat, funders buy isolated reports, and patients wait while updates stay trapped in local memory. The system contains intelligence and labor. It does not contain a shared transition object.
AI speeds the problem up without changing its shape. Models propose candidate experiments faster than wet labs can test them. Agentic systems extract findings, draft critiques, and chain tools at scales no graduate cohort can match. The bottleneck moves from producing the next hypothesis to integrating what has already been produced into a record the next decision can read. A model can propose more candidate findings in a month than a field tests in a decade; the explosion already happened, and the missing layer is the one that carries forward what survives. The protocol scales absorption, not generation.
The engine science needs is a procedure: every artifact should be able to propose a governed change to the shared record, and every accepted change should be able to guide the next task in time to matter.
Start from a week of use, not the object model. A foundation wants to know what to fund next. A student wants a real task at the frontier. A robotic lab has a failed run to deposit. A model makes a prediction that should either earn calibration or lose it. The engine exists so each of those ordinary actions lands in the same frontier instead of disappearing into four separate systems.
The shared object that holds them together is the scientific state transition itself: the reviewed change to what a field currently believes or can act on, not the paper, dataset, lab, or grant where it happens to surface.
The trilogy. Three coupled systems: the record holds what science currently believes, the engine moves activity into state the next task can read, and the body carries that state out to instruments, labs, and clinics.
The claim is mechanical: if scientific work is going to compound across humans, agents, world models, and labs, the basic operating unit has to change from an artifact to a governed state transition.
The Engine Loop
The engine is an operating loop before it is storage.
Fig. 03. The core loop. Work moves from goal to next action, then returns to the frontier as updated state. The stretch from activity through event is where activity becomes governed state.
The loop starts with a goal: cure a disease, prove a theorem, build a better material, explain a climate signal, identify a safety risk, or decide which experiment deserves scarce lab time. The goal pressures the frontier. It determines which uncertainty matters enough to become work.
The goal becomes a frontier: what is known, what is unknown, what is contested, what depends on what, and which uncertainties are worth spending effort to reduce. A frontier becomes tasks. Tasks are assigned to humans, agents, models, reviewers, labs, funders, or institutions. Activity produces artifacts: papers, extractions, simulations, protocols, robot runs, code, clinical observations, field measurements, and reviews.
Activity still has to pass through governance before it becomes state. Everything else in the engine exists so that loop can be governed rather than merely executed.
A failed run shows the difference. The lab deposits its protocol trace and readout. The diff says a dependent claim should weaken in this cell line but not in the broader mechanism. A reviewer signs the narrow change, an event records it, and the next task queue stops assigning that experiment as if nothing had happened.
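The failed-run example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's schema: `ScopedClaim`, `ProposedTransition`, `FrontierState`, the numeric `confidence` score, and the 0.5 assignment threshold are all hypothetical names and values introduced here. What it shows is the shape of the step: a proposal changes nothing until it is signed, the change applies only to the scope it names, and the task queue reads the new state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ScopedClaim:
    claim_id: str
    scope: str          # e.g. one cell line, or the broader mechanism
    confidence: float   # illustrative 0..1 score, not a defined metric


@dataclass
class ProposedTransition:
    claim_id: str
    scope: str
    delta: float                   # proposed change in confidence
    evidence_ref: str              # pointer to the deposited trace and readout
    signer: Optional[str] = None   # filled in only by a reviewer


@dataclass
class FrontierState:
    claims: Dict[Tuple[str, str], ScopedClaim] = field(default_factory=dict)
    events: List[ProposedTransition] = field(default_factory=list)

    def add_claim(self, claim: ScopedClaim) -> None:
        self.claims[(claim.claim_id, claim.scope)] = claim

    def merge(self, t: ProposedTransition) -> None:
        # Governance step: unsigned proposals never change state.
        if t.signer is None:
            raise PermissionError("transition is not signed")
        claim = self.claims[(t.claim_id, t.scope)]
        claim.confidence = min(1.0, max(0.0, claim.confidence + t.delta))
        self.events.append(t)  # the event records the narrow change

    def assignable_scopes(self, claim_id: str, threshold: float = 0.5) -> List[str]:
        # The task queue only assigns work that rests on scopes still above threshold.
        return sorted(
            c.scope for c in self.claims.values()
            if c.claim_id == claim_id and c.confidence >= threshold
        )


frontier = FrontierState()
frontier.add_claim(ScopedClaim("C1", "cell-line-A", 0.8))
frontier.add_claim(ScopedClaim("C1", "broad-mechanism", 0.8))

# The lab's failed run proposes a weakening in one cell line only.
proposal = ProposedTransition("C1", "cell-line-A", -0.5, evidence_ref="run-0042")
proposal.signer = "reviewer-1"   # a reviewer signs the narrow change
frontier.merge(proposal)
```

After the merge, `frontier.assignable_scopes("C1")` no longer includes the weakened cell line, while the broader mechanism is untouched.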
This is where a knowledge graph runs out of surface: a graph stores relationships but does not decide what should change next, who can propose it, or what physical action follows. Agent runtimes hit the same limit: they produce activity that becomes more output for another agent to summarize later. We would never ask a working scientist to face an open problem cold and one-shot the answer. They inherit decades of attempts; they fail, return, narrow. The loop the field’s evaluation benchmarks run is the loop no scientist runs. The engine is different only if it changes the next decision: a trial pauses, a model recalibrates, a lab avoids repeating a failure another lab already paid for.
Take AlphaFold. Hundreds of millions of predicted protein structures, every one of them an artifact. None of them updates what the field believes about a structure question; none records when a wet-lab result contradicts a prediction. The structures sit beside the literature, not inside its state. Other AI-for-science systems share this shape: Evo for genome-scale foundation models, A-Lab for closed-loop materials synthesis, Coscientist for autonomous chemistry, Google’s AI co-scientist for hypothesis generation, and the Open Targets evidence layer for therapeutic targets. Each produces real scientific work and more artifacts for someone else to integrate. The coordination layer where validated changes accumulate across all of them, across institutions, and over time, is what the engine is built to be.
Where the gate is machine-checkable, none of this is hypothetical. On a formal frontier, improving the best known bounds on a long-studied problem in extremal combinatorics, the loop has already run end to end. This refers to results from the project’s own formal-frontier pilot (bounds on Sidon sets, OEIS A309370): a reported internal demonstration that the loop closes under a machine-checkable gate, not an externally adjudicated benchmark. An agent proposes a construction, an exact verifier checks it, the improved bound enters the record as a signed transition, and the claims that depended on the old bound update against it. The same verifier rejected an invalid certificate and a confident but false claim, and that rejection is the part that matters: state the engine has checked can be acted on in a way a fluent summary cannot. A proof checker is a gentler gate than a human reviewer, which is exactly why the disease corridors are harder. The loop is the same; only the gate changes.
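To make "machine-checkable gate" concrete, here is a toy exact verifier for the textbook Sidon property (all sums a + b with a ≤ b are distinct). It is a sketch only: it is not the pilot's verifier, it does not reproduce the specific quantity tracked in A309370, and the function names `is_sidon` and `gate` and the rejection messages are assumptions made for illustration. The point is the behaviour described above: a valid construction that improves the record is accepted, and a confident but invalid one is rejected without changing state.

```python
from itertools import combinations_with_replacement


def is_sidon(candidate):
    """Return True if all sums a + b (a <= b) over the candidate are pairwise distinct."""
    elems = sorted(set(candidate))
    if len(elems) != len(candidate):
        return False  # repeated elements are not a valid set
    seen = set()
    for a, b in combinations_with_replacement(elems, 2):
        s = a + b
        if s in seen:
            return False  # two different pairs share a sum
        seen.add(s)
    return True


def gate(record_best, candidate, n):
    """Accept only a verified Sidon subset of {1..n} that beats the recorded size.

    Returns (new_best, reason). The record changes only on acceptance.
    """
    if not all(isinstance(x, int) and 1 <= x <= n for x in candidate):
        return record_best, "rejected: element outside 1..n"
    if not is_sidon(candidate):
        return record_best, "rejected: repeated sum"
    if len(candidate) <= record_best:
        return record_best, "rejected: no improvement"
    return len(candidate), "accepted"


accepted = gate(3, [1, 2, 5, 7], 7)    # valid: all ten pairwise sums differ
rejected = gate(3, [1, 2, 3, 4], 7)    # invalid: 1 + 4 == 2 + 3
```

The rejected call leaves the recorded bound where it was, which is the property a fluent summary cannot offer.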
Every product in the ecosystem is accountable to the loop; each matters only insofar as it helps the loop run with more fidelity or less waste.
The design pressure is that the loop must be mundane enough for ordinary work. A graduate student should be able to extract a method, an agent should be able to open a provenance audit, a reviewer should be able to sign a narrow correction, and a lab should be able to write back a failed run without turning the act into ceremonial publication. The engine is only real when the smallest transition can travel.
A Tuesday morning
Picture a foundation’s program officer and a robotic lab on the same Tuesday.
The program officer is named, say, Maya, and her foundation has $5M to allocate this quarter to neurovascular dysfunction in Alzheimer’s. She does not begin with a blank call for proposals. She opens the current frontier the way someone else opens the morning paper: three candidate experiments worth funding, two assumptions whose confidence has slipped since Friday, one lab-capacity constraint, one safety class she’ll need an outside review on. When a failed replication weakens the broad claim on Wednesday, her dashboard knows by Thursday morning. She is paying for movement in the frontier rather than reports of what already happened.
Across town a robotic lab finishes a run that did not work. The protocol trace, calibration log, raw readout, and the tech’s notation on a contamination flag land in the frontier repo as an evidence object. The proposed change computes itself (this dependent claim weakens in this cell line but not in the broader mechanism) and routes for review. A signer takes ten minutes between her morning calls; an event records the narrow change; the next task queue stops assigning the same experiment as if nothing had happened. The lab finished its run by writing back.
A clinician at a regional hospital, hours from any research university, opens the same frontier before a Thursday appointment. The patient across the desk is APOE4-positive and sixty-eight; the guidance she would have repeated from last year was narrowed in March by a result she never saw. She cannot commit canonical state or run an experiment, but she can see why the finding changed and which subgroup it applies to, and she says something different in the room than she would have in February. And where the frontier has nothing for her exact patient, whose subgroup no finding yet covers, that gap is not silence. Her question deposits as an object the frontier can carry, a recorded absence a future task can be assigned against. A student looking for a real task lands at the same edge of the same frontier, with somewhere for her contribution to count. A model that proposed an intervention earns or loses calibration against the same record. None of this is exotic. Its value is mundane: the work of science stops being lost between systems, and the people who have always been on the wrong side of the gatekeeper get to read from the same map.
What the engine has to keep separate
Follow that same failed run a step further. When the lab deposits its trace, four different kinds of work move in different directions from one event. What the field believes shifts on its own slow schedule. The labs and agents doing the work of changing it follow their own assignment rules. The model that had predicted against the old belief recalibrates on its own loss. The clinical program or factory downstream runs against its own physical clock. None of these can collapse into the others, and the moment they meet is the moment a transition is accepted. OpenAI’s Symphony describes issue trackers, isolated agent workspaces, and human review as the organizing pattern for coding-agent work. The science version needs the same discipline, with scientific state as the merge target.
State: what is known, contested, scoped. frontier.state (Finding · Evidence · Frontier)
Model: what might happen if we act. world.forecast (Simulation · Prediction · Calibration)
Control: what should happen next. task.scheduler (ResearchTask · Queue · SafetyGate)
Action: what touched reality. lab.writeback (Protocol · RobotRun · LabResult)
Fig. 04. The four planes. The architecture separates what is known, how work is coordinated, how outcomes are forecast, and what physically changes. Each plane has a different shape of work; one event spine couples them.
The four planes are not independent. The complete engine is the relationship among them:
Fig. 05. Plane coupling. No plane can replace the others. The complete engine is the relationship among them: state guides control, control assigns work, models forecast action, action produces evidence, evidence proposes transitions, and governance decides what merges.
If the point where the four planes meet is owned by one party, the four end up coordinated by whoever owns it. If it is governed, they stay plural without losing the frontier between them.
At Scale
A first corridor may only see hundreds of deposits a month. The design still has to survive the moment generation outruns review.
The gap is not more output. It is unabsorbed work.
Fig. 06. The widening gap. Approximate volume per active pipeline per month: AI-generated candidate findings (gold) against reviewed and merged updates absorbed into the shared corpus (navy). Log scale. The gap is the absorption problem. Generation has grown roughly three orders of magnitude with agentic systems; review capacity is bounded by institutional throughput and grows linearly at best. Without an absorption layer, candidate science becomes noise; with one, it compounds. Numbers are illustrative, anchored to Cummings 2025 pipeline data and publicly reported frontier-lab proposal rates.
Scale comes from structure, not from putting every agent on the same conversation. The engineering analogy is mature open-source governance and software-agent orchestration: broad proposal access, bounded workspaces, scarce maintainer review, CI, reputation, and merge authority. Science has a harsher version because a bad merge can move trials, lab work, funding, or safety decisions rather than code. Frontiers become shards (pediatric high-grade glioma, blood-brain-barrier delivery, climate attribution, direct-air-capture sorbents), and agents operate inside them; the most expensive human labor sits at the boundary where two shards meet, and the scheduler has to know that boundary exists. The boundary is literal. Whether a pediatric high-grade glioma therapy works at all can turn on a blood-brain-barrier delivery result from the frontier next door, while findings inside the glioma frontier rarely feed back to the delivery science they depend on. The expensive judgment is deciding when a result has to cross. Tasks become the coordination primitive: a task carries the frontier it points at, the workspace it’s allowed, the evidence standard, the safety class, and the reviewer it needs. The scheduler routes from there.
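A task that carries its own routing information might look like the sketch below. The field names follow the list in the paragraph above, but the `ResearchTask` layout, the `depends_on_frontier` field, the `route` function, and its return labels are hypothetical, chosen only to show how a scheduler could treat a shard boundary differently from in-shard work.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class ResearchTask:
    task_id: str
    frontier: str             # the shard the task points at
    workspace: str            # the bounded workspace it is allowed
    evidence_standard: str
    safety_class: str
    reviewer_role: str        # the kind of reviewer it needs
    depends_on_frontier: Optional[str] = None  # set when a result must cross shards


def route(
    task: ResearchTask,
    agents_by_frontier: Dict[str, List[str]],
    cleared_safety_classes: Set[str],
) -> Tuple[str, str]:
    # Safety gate first: uncleared classes are held, not queued.
    if task.safety_class not in cleared_safety_classes:
        return ("hold", "safety gate")
    # Boundary crossings go to scarce human judgment, not to an agent.
    if task.depends_on_frontier and task.depends_on_frontier != task.frontier:
        return ("human", task.reviewer_role)
    agents = agents_by_frontier.get(task.frontier, [])
    if not agents:
        return ("unassigned", task.frontier)
    return ("agent", agents[0])


agents = {"bbb-delivery": ["extractor-7"], "pediatric-hgg": ["extractor-2"]}
cleared = {"low", "medium"}

in_shard = ResearchTask("T1", "bbb-delivery", "ws-1", "replicated", "low", "domain-maintainer")
crossing = ResearchTask("T2", "pediatric-hgg", "ws-2", "replicated", "low",
                        "domain-maintainer", depends_on_frontier="bbb-delivery")
unsafe = ResearchTask("T3", "bbb-delivery", "ws-3", "replicated", "high", "safety-reviewer")
```

In this sketch the scheduler never needs to inspect the work itself; everything it routes on travels with the task.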
When a result crosses a boundary, the crossing can be checked like anything else. On a formal frontier the transfer between two shards is itself a proven theorem: a claim verified in one shard cannot be laundered into the next unless the bridge preserves the check. Empirical crossings are rarely that clean, but the discipline holds: a claim earns its standing in the next frontier instead of inheriting it for free.
Fig. 07. Frontier sharding. At scale, agents do not enter one global room. They attach to frontiers, pick up bounded tasks, and route proposed changes through scarce merge authority.
Agents specialize the way human researchers do (one for PubMed extraction, one for Lean proofs, one for reconciling contradictory clinical cohorts) because scientific work is a chain of jobs with different failure modes. Each one accumulates a reliability record that affects routing. In a world of abundant generation, reliability is infrastructure.
Millions of agents can propose against the same frontier; only a small set of credentialed signers can move state into the canonical record. Right now those millions of agents read the same paper from scratch every time they need it, across thousands of instances of the same model, none of them remembering the reading. A shared frontier repo is the difference between repeating a recall a million times and the field having inherited the result. The engine does not abolish politics or allocation; it makes the allocation surface visible enough to govern. The first corridor is staffed like a serious study section rather than a social feed: a small panel of domain maintainers, a statistical reviewer, a provenance auditor, rotating external signers, a weekly merge window, written rejection reasons. A queue that receives far more proposals than it can merge is not failing if it discards the rest with auditable reasons and reserves scarce review for the transitions that move decisions.
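A weekly merge window with bounded capacity and written rejection reasons can be sketched as follows. The proposal fields (`id`, `decision_impact`, `provenance_ok`), the ranking rule, and the reason strings are assumptions for illustration; a real queue would rank on whatever the charter specifies. The sketch only shows that every proposal leaves the window with either a review slot or an auditable reason.

```python
def run_merge_window(proposals, capacity):
    """Select up to `capacity` proposals for signer review; give every other one a reason.

    proposals: list of dicts with hypothetical keys
      'id' (str), 'decision_impact' (int, higher moves more decisions),
      'provenance_ok' (bool, result of a provenance audit).
    Returns (selected_ids, rejections) where rejections is a list of (id, reason).
    """
    rejections = []
    eligible = []
    for p in proposals:
        if not p["provenance_ok"]:
            rejections.append((p["id"], "provenance audit failed"))
        else:
            eligible.append(p)

    # Scarce review goes first to the transitions that move decisions.
    eligible.sort(key=lambda p: p["decision_impact"], reverse=True)

    selected = [p["id"] for p in eligible[:capacity]]
    for p in eligible[capacity:]:
        rejections.append((p["id"], "not reviewed: capacity exhausted this window"))
    return selected, rejections


window = [
    {"id": "P1", "decision_impact": 2, "provenance_ok": True},
    {"id": "P2", "decision_impact": 9, "provenance_ok": True},
    {"id": "P3", "decision_impact": 5, "provenance_ok": False},
    {"id": "P4", "decision_impact": 1, "provenance_ok": True},
]
selected, rejections = run_merge_window(window, capacity=2)
```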
The worst version of abundant agents floods the system with low-legibility work no one can sort. Agent spam, stale shards, poisoned evidence, model overconfidence, queue saturation, reviewer capture: these are design inputs for control, not failure conditions to be discovered after the registry is canonical.
Governance and Capture
When intelligence is abundant, the bottleneck is trust. The scarce resource is state a next decision can act on without rebuilding it by hand.
Votes, comments, stars, citation counts, social attention, and agent output are signals. Trust enters when someone recognized for a domain, by a registry, under a revocable credential, signs a transition under rules that other institutions can inspect, contest, and inherit. The signature has scope, conflict metadata, expiration, appeal paths, and a registry that can itself be audited.
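The properties listed for a signature (scope, conflict metadata, expiration, revocability) translate into checks a registry can run before a transition is allowed to merge. The sketch below is illustrative only: `Credential`, `Signature`, `check_signature`, and the reason strings are hypothetical, and appeal paths and registry audit are left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Credential:
    signer: str
    domains: FrozenSet[str]   # scope: the domains this signer is recognized for
    expires: date
    revoked: bool = False


@dataclass(frozen=True)
class Signature:
    signer: str
    domain: str
    signed_on: date
    conflicts: FrozenSet[str] = frozenset()  # declared conflicts, e.g. lab identifiers


def check_signature(
    sig: Signature, registry: Dict[str, Credential], proposing_lab: str
) -> Tuple[bool, str]:
    cred = registry.get(sig.signer)
    if cred is None:
        return (False, "unknown signer")
    if cred.revoked:
        return (False, "credential revoked")
    if sig.signed_on > cred.expires:
        return (False, "credential expired")
    if sig.domain not in cred.domains:
        return (False, "outside signer scope")
    if proposing_lab in sig.conflicts:
        return (False, "declared conflict with proposing lab")
    return (True, "ok")


registry = {
    "dr-a": Credential("dr-a", frozenset({"neurovascular"}), date(2026, 1, 1)),
    "dr-b": Credential("dr-b", frozenset({"neurovascular"}), date(2026, 1, 1), revoked=True),
}
good = Signature("dr-a", "neurovascular", date(2025, 6, 1))
out_of_scope = Signature("dr-a", "oncology", date(2025, 6, 1))
conflicted = Signature("dr-a", "neurovascular", date(2025, 6, 1), frozenset({"lab-9"}))
```

Returning a reason with every refusal keeps the check inspectable by other institutions rather than a silent yes or no.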
This makes governance a product requirement. Proposal access can be broad, but merge authority has to be governed, and the parts that govern it (identity, signer recognition, dispute handling, schema evolution) cannot be afterthoughts.
The institution can be small at first, but the rules cannot be vague, because the signature becomes downstream infrastructure.
The first operator should not be a company pretending to be a commons. It should be a chartered nonprofit registry or consortium for a bounded frontier. The protocol beneath it needs a steward of its own: Canopus, the open foundation that keeps the schema, the signing rules, and the fork path public, structurally separate from any company that runs a dominant client or sells the agents that read against the layer. The governance test is simple: the registry can lose a dispute and survive, lose a founder and survive, lose a vendor and survive, and be forked if it violates its charter. The governance precedents are partial rather than exact: Crossref for nonprofit scholarly infrastructure across competing publishers, IETF for open protocol process, and W3C for web standards maintained across institutional actors.
The first host should be a patient-led foundation or FRO-hosted nonprofit registry with enough convening power to recruit three to five labs before the network effect exists. The pitch to each participant is practical: labs get milestone funding, shared negative-result protection, and a regulator-readable provenance export they could not produce alone. Funders get a weekly frontier export: what changed, what should stop, what should be tested, and which assumptions are now too fragile to buy.
The first pilot is specified only to an order of magnitude: one disease frontier, one chartered registry, one reviewer queue, one regulator-readable export, a two-year window, and milestone money tied to signed failed-protocol deposits.
Review capacity gets designed in from the start, and the quality metric is downstream effect. A handful of operational measures tell you whether the engine is selecting rather than performing: correction latency, the time between a failed run and the downstream claim being flagged; the negative-result deposit rate, whether the system captures what publication filters lose; the merge acceptance rate, whether review is selective enough to mean anything; whether rejected proposals leave auditable reasons; and whether confidence on a claim traces to independent labs, instruments, and cohorts rather than one group cited many times. None of these is the target. They are the difference between a registry that selects and a feed that records everything and decides nothing. The pilot passes only if at least one accepted correction changes a funding, review, lab, or regulator-facing decision that would otherwise have repeated the old assumption.
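Several of these measures are simple to compute once deposits and proposals are logged. The sketch below assumes a hypothetical log format (dicts with `status`, `negative`, and `reason` keys, and timestamp pairs for corrections) and takes the median as the latency summary; neither choice comes from the text, and the independence-of-evidence measure is omitted because it needs a dependency graph.

```python
from datetime import datetime
from statistics import median


def _rate(numerator, denominator):
    # Avoid dividing by zero on an empty log.
    return numerator / denominator if denominator else 0.0


def correction_latency_hours(pairs):
    """pairs: list of (failed_run_at, downstream_claim_flagged_at) datetimes."""
    return median([(flagged - failed).total_seconds() / 3600 for failed, flagged in pairs])


def registry_metrics(proposals):
    """proposals: list of dicts with hypothetical keys
    'status' ('merged' or 'rejected'), 'negative' (bool), 'reason' (str, for rejections)."""
    rejected = [p for p in proposals if p["status"] == "rejected"]
    return {
        "merge_acceptance_rate": _rate(
            sum(p["status"] == "merged" for p in proposals), len(proposals)),
        "negative_result_deposit_rate": _rate(
            sum(p["negative"] for p in proposals), len(proposals)),
        "rejections_with_reasons": _rate(
            sum(bool(p.get("reason")) for p in rejected), len(rejected)),
    }


latency = correction_latency_hours([
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 7, 9)),    # 24 h
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 21)),   # 12 h
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 8, 9)),    # 48 h
])
metrics = registry_metrics([
    {"status": "merged", "negative": True},
    {"status": "rejected", "negative": False, "reason": "underpowered"},
    {"status": "rejected", "negative": True, "reason": ""},
    {"status": "merged", "negative": False},
])
```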
The regulator path is advisory first. Request an early scientific or regulatory advice meeting, show that the export preserves provenance and dependency movement, then let one IND, CMC, DSMB, IRB, or IACUC packet include it as supporting evidence without asking the agency to bless the protocol as canonical. FDA formal-meeting programs, including Type C and other advice meetings, provide the regulatory analogy: targeted questions, meeting packages, and early feedback without turning the supporting infrastructure itself into an approval decision. See FDA, Formal Meetings Between the FDA and Sponsors or Applicants of PDUFA Products. Most transitions are not regulatory at all; they are plain scientific state.
The capture point sits above the nominally open layer. Git stayed open; what moved into platform ownership was the issues, PRs, Actions, reviewer reputation, and contribution history. Science has the same exposure. A protocol can be open while the canonical registry of signers, reviewer reputation, lab capabilities, safety gates, and regulatory recognition is closed. Open code with a closed registry is captured infrastructure with a permissive license file.
Order of construction matters more than any single layer’s quality. A protocol that arrives only after a reference implementation has already won an ecosystem ends up codifying that implementation’s choices. The reverse order (protocol first, then implementation, then ecosystem, then product) kept email open and let the web survive its first browser. The registry ends up inside someone’s commercial roadmap if any earlier step is allowed to skip ahead.
A commons of this kind has a small number of properties it cannot give up. More than one implementation of the protocol. A canonical registry that can be forked when the charter fails. Signer recognition that no single company controls. Inspectable safety gates wherever the work touches live science. Review authority that can be revoked and audited, and snapshots, signer graphs, and dispute records portable enough that capture is contestable. The harder property is philosophical rather than technical: the system has to be able to hold “two replicated results disagree under conditions X and Y” as a state of its own rather than force a premature merge. Without that, the engine becomes a faster way to manufacture apparent consensus; with it, disagreement itself becomes part of the record rather than noise around it.
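Holding disagreement as a state rather than forcing a merge can be shown with a small classifier. The `Result` type, the state labels, and the rule that only replicated results count are all assumptions made for this sketch; the source does not define them. What matters is that "holds under X, fails under Y" comes back as its own recorded state with the conditions attached.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Result:
    condition: str     # the conditions the result was obtained under
    supports: bool     # whether it supports the claim
    replicated: bool


def claim_state(results: List[Result]) -> dict:
    # Only replicated results are allowed to shape state in this sketch.
    by_condition: Dict[str, Set[bool]] = {}
    for r in results:
        if r.replicated:
            by_condition.setdefault(r.condition, set()).add(r.supports)

    if not by_condition:
        return {"state": "open"}

    # Replicated results that conflict under the same condition: contested there.
    conflicted = sorted(c for c, v in by_condition.items() if len(v) > 1)
    if conflicted:
        return {"state": "contested", "conditions": conflicted}

    holds = sorted(c for c, v in by_condition.items() if v == {True})
    fails = sorted(c for c, v in by_condition.items() if v == {False})
    if holds and fails:
        # Disagreement across conditions is recorded, not averaged away.
        return {"state": "disagreement", "holds_under": holds, "fails_under": fails}
    return {"state": "supported" if holds else "refuted"}


state = claim_state([
    Result("X", supports=True, replicated=True),
    Result("Y", supports=False, replicated=True),
    Result("Z", supports=True, replicated=False),   # unreplicated, ignored here
])
```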
The test
The engine’s job is the same at every scale. A reviewer opens a diff and sees exactly what would change. A lab deposits a failed run and the next lab does not repeat it; a model predicts against the state it will later be judged by; a funder stops paying for an assumption the week it breaks. Intelligence and experimentation become abundant, and integrating what survives into the shared record is still the bottleneck.
That is the engine’s test: the correction arrives before the experiment is bought, before the model is trusted, before the grant renews, before the patient-facing decision inherits the old assumption.
A foundation has money for three experiments. A lab finished a failed run yesterday. A model proposed a trial design this morning. The failed run deposits Tuesday afternoon; a reviewer signs the narrow weakening; by Friday morning the model’s proposal has rebuilt itself against the new state, and the foundation funds the two experiments that still discriminate. The patient who would have spent a year in the third trial, the one that was already failing, starts the better one in the spring.
Most Fridays look like this one. The correction arrives before the decision.