Replay as a primitive

0xcircuitbreaker·May 13, 2026·7 min read

Most people first encounter replay in opentine as a debugging tool. You had a bad run. You want to understand it. You replay it, maybe with a modified prompt, and watch what changes.

That framing is too small. Replay is not a debugging feature. Replay is a primitive, and once you have it, a surprising number of things become trivially implementable on top of it.

The 0.1.x contract is intentionally narrower than the dream. Cached replay is real. Native Agent rerun is real. External harness rerun is real when you pass an explicit harness and prompt. Resume is guarded by the saved manifest. That scope is what makes replay usable instead of dangerous.

Consider what you get when replay is scoped but composable:

Eval-as-replay. You have 50 golden native-agent runs. You want to check whether a new prompt regresses on any of them. Replay the shared prefix, rerun the divergent suffix through the native Agent API, and diff the outputs against the originals. For external CLIs, do the same through an explicit harness rather than pretending the CLI can be safely resumed by magic.
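To make the eval loop concrete, here is a minimal sketch of the shape it takes. Everything in it is hypothetical: `GoldenRun`, `evaluate`, and the `rerun_suffix` callable are illustrative stand-ins, not opentine's API, and the stub below replaces the real native Agent or harness rerun.

```python
import difflib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenRun:
    # Hypothetical record of one golden run (not an opentine type).
    name: str
    prefix: List[str]    # recorded steps that are reused as-is
    final_output: str    # recorded output of the original suffix


def evaluate(runs: List[GoldenRun],
             rerun_suffix: Callable[[List[str]], str]) -> Dict[str, str]:
    """Rerun each suffix and return a unified diff for every run that changed."""
    regressions = {}
    for run in runs:
        candidate = rerun_suffix(run.prefix)
        if candidate != run.final_output:
            diff = difflib.unified_diff(
                run.final_output.splitlines(),
                candidate.splitlines(),
                fromfile="golden",
                tofile="candidate",
                lineterm="",
            )
            regressions[run.name] = "\n".join(diff)
    return regressions


def stub_rerun(prefix: List[str]) -> str:
    # Stand-in for a real rerun under the new prompt.
    return "Refund issued." if "refund" in prefix[0] else "Hi there!"


runs = [
    GoldenRun("refund", ["user asks for a refund"], "Refund issued."),
    GoldenRun("greeting", ["user says hi"], "Hello!"),
]
regressions = evaluate(runs, stub_rerun)
for name, diff in regressions.items():
    print(name)
    print(diff)
```

In this sketch only the "greeting" run is reported, because the stub changes its output. A real eval would likely need a looser comparison than exact string equality.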

Test-as-replay. You want a regression test that proves your agent still handles the "user asks for a refund" case correctly. Record a real run once, commit the .tine file, verify it, and cache-replay it in CI. If the test needs model execution, run the native Agent or a pinned harness deliberately. The key is that the artifact gives the test a real graph, not a screenshot of a trace.

Fuzz-as-replay. You want to know how sensitive your agent is to prompt variations. Fork from step N, vary the prompt, and rerun the suffix through the native runtime. The first N-1 steps are preserved. You get a distribution of outcomes without redoing the whole setup.
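A rough sketch of that fuzz loop, under stated assumptions: steps are modeled as plain strings, `fork` and `fuzz` are hypothetical helpers rather than opentine functions, and `stub_rerun` stands in for rerunning the suffix through the native runtime.

```python
from collections import Counter
from typing import Callable, List


def fork(steps: List[str], n: int, new_prompt: str) -> List[str]:
    """Keep steps 1..n-1 unchanged and replace step n (1-indexed) with new_prompt."""
    if not 1 <= n <= len(steps):
        raise ValueError("n must be between 1 and len(steps)")
    return steps[: n - 1] + [new_prompt]


def fuzz(steps: List[str], n: int, variants: List[str],
         rerun_suffix: Callable[[List[str]], str]) -> Counter:
    """Fork at step n once per variant and tally the outcomes."""
    outcomes = Counter()
    for variant in variants:
        branch = fork(steps, n, variant)
        outcomes[rerun_suffix(branch)] += 1
    return outcomes


def stub_rerun(branch: List[str]) -> str:
    # Stand-in for real model execution: the outcome depends only on the last prompt.
    return "approved" if "please" in branch[-1].lower() else "escalated"


steps = ["greet user", "look up order", "decide on refund"]
variants = [
    "Decide on refund, please",
    "decide on refund NOW",
    "please decide on refund",
]
outcomes = fuzz(steps, 3, variants, stub_rerun)
print(outcomes)
```

The first two steps are identical in every branch, which is the property that lets the prefix be reused instead of re-executed.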

A/B-as-replay. You want to compare two models on a real workload. Take yesterday's runs, fork at the comparison point, rerun the suffix with the candidate model, and diff the outcomes. That is a model comparison over real agent context without re-running all of the setup work.

Audit-as-replay. A customer says "your agent did the wrong thing on my request six months ago." Pull the .tine file for that run. Verify it. Show the graph. Cache-replay the recorded work. If you need to test today's behavior, rerun through the current native model or harness and compare the branch.

All of these are the same family of mechanisms: verify, fork, cache-replay, rerun deliberately, diff. One primitive layer, five product features. This is the compounding you get from designing the primitive correctly instead of designing each feature separately.

What makes replay composable is the preserved graph prefix plus cache provenance. If replay had to re-execute the whole run every time, it would be useful for debugging and nothing else, because the cost would kill every other use case. Because opentine keeps a content-addressed graph and semantic cache records, replay can reuse the prefix and re-execute only the divergent suffix, and only when the runtime supports that safely. This is what turns "occasionally useful" into "always on."
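The prefix-reuse idea can be illustrated with a toy content-addressed cache. This is a sketch under assumptions, not opentine's actual storage scheme: the key derivation (SHA-256 over the parent key plus the step input), the `ReplayCache` class, and the `fake_execute` stand-in are all hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, List


def step_key(parent_key: str, step_input: dict) -> str:
    """Content address of a step: hash of its parent's key plus its own input."""
    payload = json.dumps({"parent": parent_key, "input": step_input}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ReplayCache:
    def __init__(self) -> None:
        self.records: Dict[str, str] = {}  # key -> recorded output
        self.executed = 0                  # steps actually executed, not replayed

    def run(self, steps: List[dict], execute: Callable[[dict], str]) -> List[str]:
        parent = "root"
        outputs = []
        for step in steps:
            key = step_key(parent, step)
            if key not in self.records:
                # Cache miss: this step (and everything after it) has diverged.
                self.records[key] = execute(step)
                self.executed += 1
            outputs.append(self.records[key])
            parent = key
        return outputs


def fake_execute(step: dict) -> str:
    # Stand-in for an expensive model or tool call.
    return f"out:{step['prompt']}"


cache = ReplayCache()
original = [{"prompt": "plan"}, {"prompt": "search"}, {"prompt": "answer"}]
cache.run(original, fake_execute)   # executes all three steps
forked = original[:2] + [{"prompt": "answer tersely"}]
forked_outputs = cache.run(forked, fake_execute)  # executes only the new last step
print(cache.executed)
```

Because each key includes its parent's key, two runs share an address for a step only if everything before that step is identical. The fork therefore pays for one step instead of three.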

This is the specific reason content-addressing matters so much to the design. It's not just for deduplication in storage. It is what makes replay cheap enough to be a primitive. And a cheap primitive is where all the compound value comes from.

Every framework I've looked at has a "replay" feature. Very few of them have a primitive. The difference is whether the feature can be composed: whether you can take the replay, wrap it in a loop with minor variations, and get something useful out. If every replay costs $10, you build eval harnesses to avoid it. If every replay costs 5 cents, you stop building eval harnesses and just use replay directly.

The distinction looks like this:

A tool has features. A platform has primitives.

opentine is trying to be a platform.