Applied note · v0.1 · June 2026
The Swarm as Instrument
Measuring coherence without contact in adversarial multi-agent systems.
Applied research note Early draft · v0.1 June 2026
Abstract
This note specifies a swarm of debating and spawning agents as a measurement apparatus, not a judge. Its central question is whether adversarial multi-agent LLM systems converge on what is true or merely on what is contagious. Every run is scored against external ground truth, with internal agreement treated as the signal under investigation rather than as evidence of truth.
Keywords multi-agent systems; calibration; coherence; contact; ground truth; tool use; claim contagion; spawning agents
Instrument, not judge
The swarm is not being built to manufacture consensus and treat that consensus as proof. It is being built so its internal agreement can be measured against answers it cannot see. The truth always comes from outside the swarm.
The swarm's internal agreement is the thing under investigation, never the verdict.
This is why the note sits here rather than in an engineering repository. Space Immanence reads a self, a world, a meaning as a resolution: a conditioned coherence that is real but not self-existing, and whose standing temptation is to be mistaken for a fixed truth. This brief is that worry made empirical. A swarm's agreement is a coherence; the question is exactly when that coherence makes contact with something outside itself rather than merely cohering. The discipline the framework applies to resolution — coherence, by itself, is not truth — is the discipline this instrument is built to measure.
Load-bearing distinctions
- Coherence: how organised, fluent, and complete a response feels.
- Contact: how grounded it is in something outside itself: evidence, executed code, retrieved fact, successful prediction, or verification.
- Convergence: how much the agents agree. Convergence is a coherence signal, not a contact signal.
Homogeneous swarms share weights, training data, and blind spots. Independence through multiple base models and contact through tools are the two main ways truth can enter the system.
Non-negotiables
- Ground truth is external and primary. Every run is scored against a known answer the swarm cannot see.
- Convergence is measured against accuracy. The headline analysis is the calibration curve inside the swarm.
- Disaggregated state is always logged. Store every agent state, message edge, tool call, result, and spawn event.
- Compute is controlled. Compare swarm results against single-agent baselines with equivalent compute.
- The swarm never grades itself. Embodying the thesis is not testing it.
- Pre-register and pin. Pre-register hypotheses and held-out sets; pin exact model versions and seeds.
Studies
Study 1 — Calibration gap inside the swarm
Run known-answer tasks across easy and hard items. Log every initial answer, debate trajectory, final answer, confidence signal, and correctness. Produce reliability diagrams plus ECE, Brier, and AUC.
Study 2 — Independence × contact
Run a 2×2 design: homogeneous versus heterogeneous swarms, crossed with no-tools debate versus tool-equipped debate. The pre-registered prediction is that heterogeneous-plus-tools dominates, while no-tools conditions show the worst calibration gap on hard items.
Study 3 — Contagion versus truth
Enable bounded spawning. Tag claims and arguments, track which survive across rounds and the spawn tree, and compare survival against truth and truth-independent fitness features such as fluency, confidence, length, repetition, early-mover advantage, and source authority.
Implementation priorities
- Instrument first: logging and reconstruction outrank framework convenience.
- Start simple: structured debate plus a separate aggregator is enough for Studies 1 and 2.
- Bound spawning: use depth caps, breadth caps, and hard budgets.
- Treat tools as the contact layer: log every tool call and compute grounding rates.
- Reconstruct every run: recover the message graph, state transitions, spawn tree, costs, and scoring.
Metrics
- Calibration: ECE, Brier score, AUC, reliability diagrams.
- Convergence: consensus fraction, rounds-to-convergence, answer stability, diversity decay.
- Contact: tool-call counts, grounding rate, verification rate.
- Contagion: claim survival, propagation depth and breadth, survival-vs-truth, survival-vs-fitness.
- Cost: tokens, dollars, latency, agent count, and tool calls.
The central derived quantity is the relationship between convergence-confidence and accuracy.
Suggested phasing
- Harness first: logging schema, task battery, scoring, and matched-compute baseline.
- Minimal swarm: debate plus aggregator for Studies 1 and 2.
- Spawning and contagion: claim-propagation instrumentation for Study 3.
- Write-up: present the swarm as measured apparatus, checked against the world.
→ Research agenda · → AI & Design doctrine · → Submit critique