Agon:
make AI-produced work defend itself
An independent critic cross-examines what an AI produced. The producer must defend or concede every concrete attack. Only the disputes that survive reach a human.
Not a judge. A protocol.
Agon sits between an AI agent and whoever, or whatever, consumes its output. It runs an honest, competent critic against the work, forces the producing agent to defend or fix every concrete attack, and surfaces only what stays contested.
The artifact need not be code. A diff, a research write-up, an outcome analysis, a plan, a high-stakes decision: the same protocol applies. What resolves becomes a pass signal another agent can consume directly; what stays contested is what a person reviews.
Propose, attack, defend, until only the unresolved remains
A proposer and one or more independent critics, each pressing a distinct aspect. Cross-examination runs until the dispute reaches steady state or a round cap. Only what stays unresolved surfaces, ranked by a pure contention score.
The proposer answers the task: a claim, a diff, an argument.
Each critic picks its own aspect (functional logic, security, code quality, performance) and produces concrete attacks: a specific input X yielding output Y that violates Z. Not vibes.
The proposer responds to each attack: it either concedes, in which case a proposer clone applies the fix, or rebuts with a specific counter-claim.
Every attack carries a stable id in an append-only ledger. Only unresolved disputes surface, ranked by how many rounds each survived. No model scores the outcome.
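The propose-attack-defend loop above can be sketched in a few lines. This is a minimal illustration under hypothetical names, not Agon's actual API; `attack` and `defend` stand in for the critic and proposer model calls.

```go
package main

import "fmt"

// Entry is one record in the append-only ledger. An attack keeps one
// stable ID for its whole lifetime; later rounds reference that ID
// rather than rewriting earlier entries.
type Entry struct {
	AttackID string
	Round    int
	Kind     string // "attack", "concede", or "rebut"
	Body     string
}

// runRounds drives cross-examination until the critic has nothing new
// to press (steady state) or the round cap is reached.
func runRounds(maxRounds int, attack func(round int) []Entry, defend func(Entry) Entry) []Entry {
	var ledger []Entry
	for round := 1; round <= maxRounds; round++ {
		attacks := attack(round)
		if len(attacks) == 0 {
			return ledger // steady state: no live disputes
		}
		for _, a := range attacks {
			ledger = append(ledger, a, defend(a)) // append-only, never mutated
		}
	}
	return ledger
}

func main() {
	attack := func(round int) []Entry {
		if round > 2 {
			return nil // critic has nothing left to press
		}
		return []Entry{{AttackID: "A1", Round: round, Kind: "attack", Body: "input X yields Y, violating Z"}}
	}
	defend := func(a Entry) Entry {
		return Entry{AttackID: a.AttackID, Round: a.Round, Kind: "rebut", Body: "specific counter-claim"}
	}
	fmt.Println(len(runRounds(5, attack, defend))) // 4: two rounds, one attack and one defense each
}
```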
The surfacing layer is a pure rule: the contention score is rounds survived plus a bit for whether an attack was re-raised. There is no model in that loop, on purpose. The bounded-judge result from the adversarial-verification literature is the intuition behind this, not the runtime; its honest limits are set out in the academic grounding below.
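The rule is small enough to state as code. A minimal sketch, with illustrative names rather than Agon's actual interface:

```go
package main

import "fmt"

// contentionScore is the pure surfacing rule: the number of rounds an
// attack survived, plus one if it was ever re-raised. No model call
// sits on this path, so the ranking is deterministic and auditable.
func contentionScore(roundsSurvived int, reRaised bool) int {
	score := roundsSurvived
	if reRaised {
		score++
	}
	return score
}

func main() {
	fmt.Println(contentionScore(3, true)) // 4: survived three rounds, re-raised once
}
```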
Four properties the alternatives do not replicate
One honest player suffices
A dishonest proposer must hold a consistent lie across every cross-examination round; an honest critic needs one inconsistency. Because each critic owns a distinct aspect, a weak critic on one aspect does not break coverage on the others; weak aspects get dropped from defaults, not the tool.
Vendor-neutral by construction
The default pairing is cross-family: Claude proposes, Codex critiques. Same model on both sides is the model reviewing itself, and is rejected. No model vendor will ship the neutral layer; the incentive is to sell more of its own tokens.
Channel purity
Critic output reaches the proposer as a verbatim user message, not a skill or template. The proposer defends the way it would against a person pasting a review. Wrapping it distorts the defense.
Auditable by design
Stable attack ids, an append-only ledger, contention-scored headlines by a pure rule with no model judging at the surfacing layer. A security team reads a session like a court transcript.
A machine-readable signal, not just a human headline
Inside an agent loop the contention score is a decision gate: attacks the proposer resolves are a pass the calling agent proceeds on; the contested tail above the threshold is what escalates to a human.
This is the high-stakes-decision use: if the agents resolve every attack, the caller proceeds; if anything stays contested, a person reviews. It is not a new capability: it is what the existing ledger and contention score already provide when the consumer is another agent rather than a person.
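As a sketch, such a gate reduces to a threshold over per-attack scores. The threshold, map shape, and names here are illustrative assumptions, not Agon's actual interface:

```go
package main

import "fmt"

// gate splits a scored session into a pass signal and a contested tail.
// Resolved attacks score zero and pass silently; any attack at or above
// the threshold escalates to a human.
func gate(scores map[string]int, threshold int) (pass bool, escalate []string) {
	for id, s := range scores {
		if s >= threshold {
			escalate = append(escalate, id)
		}
	}
	return len(escalate) == 0, escalate
}

func main() {
	pass, contested := gate(map[string]int{"A1": 0, "A2": 5}, 3)
	fmt.Println(pass, contested) // false [A2]: one attack stays contested, a person reviews
}
```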
Grounded in the adversarial-verification literature, and honest about the limits
Agon productizes the adversarial-verification architecture of Irving, Christiano & Amodei (2018). Its complexity-theoretic intuition, that the adversarial protocol reaches PSPACE under optimal play, strictly above NP, is suggestive, not a claim about LLMs: LLMs are not optimal players.
The closer theoretical fit is Brown-Cohen, Irving & Piliouras (2023), which extends the result to stochastic systems and to honest players with polynomial simulation budgets, both required for it to apply to LLMs at all. Their 2025 Prover-Estimator protocol addresses the obfuscated-arguments attack on the plain protocol.
Irving, Christiano & Amodei (2018). AI Safety via Debate. arXiv:1805.00899
Brown-Cohen, Irving & Piliouras (2023). Scalable AI Safety via Doubly-Efficient Debate. arXiv:2311.14125
Brown-Cohen, Irving & Piliouras (2025). Avoiding Obfuscation with Prover-Estimator Debate. arXiv:2506.13609
Research home. agents-byzantine-tolerance: the open research suite this productizes, probing adversarial verification across compute, depth, stochasticity, leaf format, obfuscation, and query-complexity scaling. github.com/changkun/agents-byzantine-tolerance
Honest framing: the formal soundness results are about the protocol under stated assumptions, not a guarantee about any particular model. Application to real LLMs is empirically motivated and hypothesis-stage. The gating metric is the per-aspect critic-found-bug rate; if a critic does not actually attack, the protocol collapses to the proposer alone. Agon does not prove your code correct, and does not claim to remove the need for trust.
One binary, an optional Stop hook
Vendor-neutral and local-first. Bring your own Claude / Codex; Agon runs the protocol and writes an auditable session to disk.
# one-liner (detects OS/arch, verifies checksum, installs the Stop hook)
curl -fsSL https://raw.githubusercontent.com/latere-ai/debate/main/install.sh | sh

# from source (Go 1.26+)
go install latere.ai/x/debate/cmd/debate@latest
debate install-hook --scope user