What We Build For the Index

Matt Rathbun

The Invisible Operating System argued that human civilization runs on a vast substrate of tacit knowledge that AI does not carry. The Experiential Index explained why the break is structural: language between humans is a pointer system into shared embodied experience, and AI breaks the pointer system because the experience is not there to dereference. Both essays ended at the same question. Knowing this, what do you build in response?

This essay is my answer, after testing it in practice.

For several months I have been operating in solo-founder mode on a personal project, in the gaps of an already-full life. On the train into the city. In interstitial time between work and parenting. On a camping trip in Joshua Tree where I prepped a day’s build from the campground in the morning, let it run while we hiked, and reviewed the results after everyone else was asleep. I found a way to do this work around my life, not instead of my life — and that constraint is what forced the discipline.

The discipline became a steady narrowing of where I had to be involved. First a dark factory: autonomous coding sessions running on a server I did not have to babysit. Then the dark factory made dispatchable from any device, including the Claude app on my phone, which only worked because the product itself was becoming the substrate the sessions inherited their context from. Then I pulled back further. Out of writing prompts, into writing requirements. Out of requirements, into writing decisions. Out of decisions, into writing canon. Each step, the discipline migrated upstream. Each step, my involvement got smaller. I still do the ideation, the research, the essays, the customer promises. The system I built does the rest.

The project predates the essays. Building this way produced the writing; the essays articulated what the work was teaching me. Once they existed, the lab became deliberate — every refinement a test of how far the structural moves could be pushed. What I built is the current best form of that work. It has held up long enough, and across enough kinds of work, that I think it points at a path others could follow.

Here is what surprised me. Short prompts that encode referential indexes — pointers at upstream specifications — produced more accurate builds than long prompts that tried to describe intent inline. If that result generalizes, and I think it does, then a substantial portion of what gets practiced today as context engineering — the per-session curation of what enters the context window — is solving the wrong problem in the wrong place.

The rest of this essay is the report from the lab.

The Snowflake I Did Not Realize I Was Making

The first essay identified snowflakes as a disease of organizational knowledge — every new piece of work treated as if its problem had never been encountered before. The fix, I assumed, was to write better upstream documents. I had done all of that. A research corpus the project was built on. Intent documents the project was committed to. Quality standards every session was supposed to honor.

My workflow was that I would ideate and plan with a Claude session, then ask that session to write the coding prompts for the autonomous agents that would actually do the build. I would scope, review, dispatch. The prompts that came out were rich — typically nine thousand characters — and the completion rate on first try ran around ninety-three percent.

The work drifted anyway. Two sessions starting from the same corpus, same intent, same standards produced subtly different results. Over hundreds of sessions, the differences compounded into a codebase whose pieces no longer quite agreed with each other.

It took me too long to see that each planning session was its own snowflake. Claude wrote the prompts from whatever happened to be in scope at that moment — the documents I had attached, the parts of the corpus the search surfaced, my standards as I had restated them. The connection between the upstream and the work lived in my head. Every planning session was me reaching into my head and pulling the relevant parts forward, slightly differently each time. The drift was downstream of me, even when I was not the one typing.

What pointed at a way through was Karpathy’s recent setup, the Karpathy Loop. He pointed Claude Code at his own ML training code, gave it one editable file, one scorable metric, one time budget, and went to sleep. The agent ran seven hundred experiments overnight and cut training time by eleven percent. The pattern that mattered was not “AI writes code.” It was that a tight optimization loop with a scoring function, a bounded edit surface, and a version-controlled sandbox could compound improvements at machine speed in a single domain. Kevin Gu’s team at ThirdLayer extended the same architecture from training code to agent harnesses themselves. Same bones. Different surface.

What if you turned that architecture on intent engineering itself? Intent as the editable surface, derivation as the metric, drift between layers as the failure trace.

The drift was not happening because my upstream was insufficiently specified. It was happening because more description was the wrong direction entirely. Language evolved as an experiential index for a reason. The pointer is small and the referent is rich, and the system works because the listener already holds the referent. My task with AI was not to abandon the indexical structure of language by inlining every referent into longer prompts. It was to give the AI access to a stable referent space that the pointers could resolve against.

This is what became “The Cascade.”

The Lineage

The architecture has two parents.

Karpathy gave it the shape of an optimization loop: bounded edit surface, scorable metric, traces, version control. Nate Jones gave it the hierarchy of disciplines that needed optimizing: prompt craft, context engineering, intent engineering, specification engineering. Karpathy was telling me what an optimization loop should be made of. Jones was telling me what layers needed to be in the loop. Plenty of people in the field have been mixing these ingredients — Karpathy’s own AutoResearch reads program.md as a specification, Kevin Gu’s AutoAgent puts a meta-agent in front of a task agent’s harness, Spec Kit and Kiro give the ecosystem agent-readable spec conventions. The specific synthesis I made was narrower, and the value, if any, lives in the specifics.

The marriage produces the cascade: a derivation chain from intent to runtime, with six layers. Canon — what the project commits to and commits not to do. Architectural decisions that derive from canon. Technical requirements that derive from those decisions. Solution designs that satisfy those requirements. Code that implements those designs. Tests that verify intent at runtime. Each layer is a real artifact. Each layer cites the layer above it. Divergence between adjacent layers becomes the failure trace. Refinements are bounded to one adjacent-pair edit at a time. The optimization loop runs on every pair of layers, not on the whole stack at once.
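To make the citation chain concrete, here is a minimal sketch of the divergence check between adjacent layers. It is an illustration under stated assumptions, not the checker the project runs: the Artifact type, the artifact IDs, and the strict rule that every citation must land exactly one layer up are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Layer order from intent (top) to runtime (bottom), as listed above.
LAYERS = ["canon", "decision", "requirement", "design", "code", "test"]


@dataclass(frozen=True)
class Artifact:
    id: str                      # hypothetical identifier scheme
    layer: str                   # one of LAYERS
    cites: Tuple[str, ...] = ()  # ids of artifacts in the layer above


def find_divergence(artifacts: List[Artifact]) -> List[Tuple[str, str]]:
    """Return (artifact_id, problem) for every broken link in the chain."""
    by_id = {a.id: a for a in artifacts}
    problems = []
    for a in artifacts:
        depth = LAYERS.index(a.layer)
        if depth == 0:
            continue  # canon is the root and cites nothing
        parent_layer = LAYERS[depth - 1]
        if not a.cites:
            problems.append((a.id, f"cites nothing in {parent_layer}"))
        for cited_id in a.cites:
            parent = by_id.get(cited_id)
            if parent is None:
                problems.append((a.id, f"cites unknown artifact {cited_id}"))
            elif parent.layer != parent_layer:
                # Assumption: a citation must point exactly one layer up.
                problems.append(
                    (a.id, f"cites {cited_id} in {parent.layer}, expected {parent_layer}")
                )
    return problems


artifacts = [
    Artifact("CANON-1", "canon"),
    Artifact("ADR-1", "decision", ("CANON-1",)),
    Artifact("REQ-1", "requirement", ("ADR-1",)),
    Artifact("DES-1", "design", ("CANON-1",)),  # skips two layers
    Artifact("CODE-1", "code"),                 # cites nothing
]
problems = find_divergence(artifacts)
```

Each entry in problems names one adjacent pair to repair, which fits the rule that a refinement is bounded to one adjacent-pair edit at a time.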

The layered, citation-driven shape has its own long lineage outside the agent world: Nygard’s 2011 essay on ADRs, MADR, arc42, the UK Government’s architectural decision framework. The new generation of agent-spec conventions — Spec Kit, Kiro — picks up the same thread. Three commitments are what I think distinguish the cascade from all of it. I insisted on the citation chain rather than treating it as a recommendation. I held the upper three layers immutable rather than letting them evolve in place — once accepted, canon and decisions and requirements can be superseded but not edited. I required every new piece of work to enter by inheritance, not by invention. The rest came out of those three commitments.
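The second commitment, supersede but never edit, can be sketched as a small registry. This is an assumption-laden illustration rather than the project's actual tooling: the Record and Registry names, the superseded_by field, and the placeholder bodies are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional

# The upper three layers, which the essay holds immutable once accepted.
IMMUTABLE_LAYERS = {"canon", "decision", "requirement"}


@dataclass(frozen=True)
class Record:
    id: str
    layer: str
    body: str
    superseded_by: Optional[str] = None


class Registry:
    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def accept(self, record: Record) -> None:
        if record.id in self._records:
            raise ValueError(f"{record.id} is already accepted")
        self._records[record.id] = record

    def get(self, record_id: str) -> Record:
        return self._records[record_id]

    def edit(self, record_id: str, new_body: str) -> None:
        record = self._records[record_id]
        if record.layer in IMMUTABLE_LAYERS:
            raise PermissionError(f"{record_id} is immutable; supersede it instead")
        self._records[record_id] = replace(record, body=new_body)

    def supersede(self, old_id: str, new_record: Record) -> None:
        old = self._records[old_id]
        if old.superseded_by is not None:
            raise ValueError(f"{old_id} is already superseded")
        if new_record.layer != old.layer:
            raise ValueError("a successor must live in the same layer")
        self.accept(new_record)
        # The old body is left untouched; only the forward pointer is set.
        self._records[old_id] = replace(old, superseded_by=new_record.id)

    def current(self, record_id: str) -> Record:
        """Follow supersession pointers to the record now in force."""
        record = self._records[record_id]
        while record.superseded_by is not None:
            record = self._records[record.superseded_by]
        return record


registry = Registry()
registry.accept(Record("ADR-1", "decision", "placeholder decision text"))
registry.supersede("ADR-1", Record("ADR-2", "decision", "placeholder revised text"))
in_force = registry.current("ADR-1")
```

In this sketch an old citation to ADR-1 still resolves, and the original wording stays readable as history, which is the reason to supersede instead of editing in place.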

By May the average prompt I was sending to an autonomous coding agent was running about thirteen hundred characters — down from roughly nine thousand in March, an 85 percent reduction. The simplest prompts had collapsed to almost nothing: “Implement per L3-076-A, model Opus 4.6, push to main.” The agent fetched the referenced specification at runtime and worked against it. Across nearly a hundred and eighty sessions over the past two weeks, completion ran at a hundred percent and regressions held at zero.
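A minimal sketch of how a pointer-style prompt might be resolved before the agent starts work. Everything here is an assumption for illustration: the regular expression guesses at the ID format from the single example above, the in-memory spec_store stands in for wherever the specifications actually live, and the spec text is a placeholder.

```python
import re
from typing import Dict

# Hypothetical pattern inferred from the one example ID, "L3-076-A".
SPEC_ID = re.compile(r"\bL\d+-\d+(?:-[A-Z])?\b")


def resolve_prompt(prompt: str, spec_store: Dict[str, str]) -> str:
    """Append the text of every referenced specification to a short prompt."""
    ids = SPEC_ID.findall(prompt)
    missing = [spec_id for spec_id in ids if spec_id not in spec_store]
    if missing:
        # A pointer with no referent should fail loudly, not be guessed at.
        raise KeyError(f"unresolvable spec reference(s): {missing}")
    sections = [f"[{spec_id}]\n{spec_store[spec_id]}" for spec_id in ids]
    return "\n\n".join([prompt, *sections])


spec_store = {"L3-076-A": "placeholder requirement text"}
prompt = "Implement per L3-076-A, model Opus 4.6, push to main."
resolved = resolve_prompt(prompt, spec_store)

# The reduction quoted above, from roughly 9,000 to 1,300 characters.
reduction = 1 - 1300 / 9000
```

The prompt stays short because the referent lives in the store; a reference that cannot be resolved raises an error before any code is written.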

The translation had migrated. It lived in the upstream artifacts now. I no longer translated. The substrate translated, once, in a place every downstream piece of work could inherit.

Why This Matters, and Why I Am the One Writing It

I have spent twenty-five years as a security person. What I learned in that work was not how to make humans do the right thing. It was how to build systems that did not depend on humans doing the right thing, and guardrails that leveraged the invisible operating system to do the work the explicit controls could not. Assume failure in the actor. Engineer around it. Use the social architecture where it helps, never where it has to be load-bearing.

That kind of skepticism turned out to be the right kind for working with AI, once you shifted the thinking a few degrees. That shift is what AI Won’t Be Afraid of Getting Fired was about. A great deal of what makes organizations function safely is invisible social architecture — fear of consequence, desire to protect reputation, social pressure of peers — and AI does not participate in any of it. The question is not how to make AI more careful. It is what the system needs to look like when the actor cannot participate in the social architecture at all.

That is the question this essay tries to answer for engineering discipline. The discipline I had been counting on — write good prompts, set clear standards, hold people to them — was the discipline that works between humans because humans share the social substrate that makes it bind. AI has no social substrate. It is not careless. It is not careful. It is exactly as good as the structural binding between intent and execution, and no better.

The patterns we have spent decades developing — code review, change management, peer pressure, professional pride — were not load-bearing in the way we thought they were. They were load-bearing on the social substrate, and the social substrate was load-bearing on the fact that all the participants were human. When an AI joins the loop, the substrate disappears, and the patterns are left holding nothing. The work feels the same. The artifacts look the same. The drift sets in slowly enough that you do not notice it until you compare two sessions a week apart and they no longer agree.

The fix is not to make the AI more careful.

The fix is to make individual diligence no longer the binding force.

What Did Not Translate

The structural fix worked. It also has a ceiling, exactly where the second essay said it would. The lab confirmed that essay as much as it confirmed any of the moves I made afterward.

The Experiential Index laid out five levels at which language indexes experience rather than describing it. The substrate worked beautifully at Level 2 — embodied metaphor. Healthy service. Clean architecture. Appropriate response. The cascade let me translate those metaphors once, into operational criteria, and inherit the criteria forever. The translation itself was not automatic — I had to do it — but the structural translation propagated through every artifact downstream.

Level 3 — read the room, use good judgment — did not translate. The cascade could surface gaps. It could flag where a specification depended on embodied social understanding the agent did not have. What it could not do was supply the understanding. The agent still cannot read the room. The substrate can tell it that this is a room it cannot read, and route the decision to the human in the loop. That is progress. It is not automation.
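The routing described here can be sketched in a few lines. This assumes a hypothetical judgment_flags field that the specification author fills in when a task depends on social understanding the agent lacks; the essay does not say how the flags are actually recorded.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    spec_id: str
    # Hypothetical: notes on where the spec depends on reading the room.
    judgment_flags: List[str] = field(default_factory=list)


def dispatch(tasks: List[Task]) -> Dict[str, List[str]]:
    """Send flagged tasks to the human queue and everything else to the agent."""
    queues: Dict[str, List[str]] = {"agent": [], "human": []}
    for task in tasks:
        target = "human" if task.judgment_flags else "agent"
        queues[target].append(task.spec_id)
    return queues


queues = dispatch([
    Task("TASK-1"),
    Task("TASK-2", ["tone of the customer reply"]),
])
```

The sketch surfaces the gap and routes around it. It does not supply the missing understanding, which matches the limit described above.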

Level 5 — organizational culture, how things work around here — was the layer that fought back hardest. I tried to solve it directly. I wrote a working principles document, made it canonical, and loaded it into every session. It did not work. Agents would routinely suggest approaches or produce work that did not align with the principles, and I would have to redirect them back to the document over and over. The principles were explicit. They were in scope. The agents had access to them every time. And the cultural intent still did not bind.

The cascade pushes the ceiling up. It does not lift it. Level 2 became dramatically more tractable. Level 3 stayed hard. Level 5 stayed harder. The Experiential Index essay said there would be parts of language that cannot be translated into propositional content because the content being indexed is constitutively experiential. The lab agreed.

From Context Engineering to Substrate Engineering

Most writing on context engineering treats it as something you perform per session. Anthropic defines the discipline as “the art and science of curating what will go into the limited context window from that constantly evolving universe of possible information,” and frames the curation as happening “each time we decide what to pass to the model.” That is a real discipline. It is not the discipline that compounds. It is the discipline that pays the translation tax once per session, and the translation tax grows as the project grows.

What the work actually demands is substrate engineering. The artifact you build once that every prompt and every agent and every future version of yourself can point at. Not “write better specs” — build the substrate the specs are a structured projection of, and let the projection compound as the substrate grows.

The experiential-index thesis has one more thing to say here. The reason indexical language works between humans is not that we are disciplined about maintaining the experiential substrate. We are not. Humans are not annually re-verifying that they remember what warmth feels like. The substrate is structural for humans because it is biological — the shared body keeps it alive without anyone tending it.

When we try to build an equivalent for AI, we cannot rely on biology. We have to substitute structure for biology. The maintenance has to be enforced by the system, because nothing else will keep it alive.

What I Do Not Have

A few acknowledgments to close.

I do not have a recipe for building the substrate in environments where it does not exist. The lab was greenfield. I built the cascade for a system I authored myself, on a codebase I controlled, with no legacy to migrate. The harder version of the problem — taking an organization with twenty years of un-citable decisions and gradually bringing them into a substrate — is the version I am only beginning to work on in another part of my professional life. The lab took roughly two months from “this might be a thing” to “this is the only way I work.” The harder version will take longer.

I do not have a complete answer for Level 3 or Level 5. I have flags, surfacing mechanisms, lessons-learned registers, and explicit acknowledgments of where the substrate cannot translate. I do not have a substrate that translates appropriate judgment into propositional content the agent can act on. The Experiential Index was right that I will not, because the content cannot be translated. The best I can do is understand where the cascade carries the load and where it cannot, and adapt my own process to cover the rest.

I do not know how this generalizes to teams. The cascade in a single-author project has a clear authoring authority and a citation chain I can hold in my head. A team has politics, hierarchy, competing standards, and the real problem that structural discipline feels constraining to humans who run on motivational substrate. The same structure that frees the AI to do compound work may feel like bureaucracy to the humans alongside it. How that trade resolves at scale is something I have opinions about but no evidence for.

The ceiling is real. The room beneath it is larger than I thought.