# The Swarm as Instrument

**Measuring coherence without contact in adversarial multi-agent systems**

Owner: Jacobus Kok  
Status: v0.1, for the build team  
Date: June 2026

## Read this first

We are building a swarm of agents that argue with each other, and that can spawn further agents. The point is **not** to build a machine that reaches consensus and treats the consensus as truth. The swarm is an **instrument**, not a judge. Its job is to be measured against external ground truth so we can discover whether adversarial multi-agent LLM systems converge on what is true or merely on what is contagious.

The swarm's internal agreement is the thing under investigation, never the verdict. The truth always comes from outside the swarm.

## Load-bearing distinctions

- **Coherence:** how organised, fluent, and complete a response feels.
- **Contact:** how grounded it is in something outside itself: evidence, executed code, retrieved fact, successful prediction, or other verification.
- **Convergence:** how much the agents agree. Convergence is a coherence signal, not a contact signal.

Agents spun from the same base model violate the independence assumption behind Condorcet-style intuitions: shared weights, shared training data, shared blind spots. Independence through multiple base models and contact through tools are the two main ways truth can enter the system.

## Non-negotiables

1. **Ground truth is external and primary.** Every run is scored against a known answer the swarm cannot see.
2. **Convergence is measured against accuracy, never substituted for it.** The headline analysis is the calibration curve inside the swarm: when the swarm agrees with high confidence, how often is it actually right?
3. **Disaggregated state is always logged.** Store every agent's position at every step, the full message graph, every tool call and result, and the spawn tree.
4. **Compute is controlled.** Compare every swarm result against a single-agent baseline given equivalent compute.
5. **The swarm never grades its own convergence as a result.** Embodying the thesis is not testing it.
6. **Pre-register and pin.** Pre-register hypotheses and held-out sets, and pin exact model versions and seeds.

## Studies

### Study 1: calibration gap inside the swarm

Run the swarm on questions with known answers across easy and hard items. Log every initial answer, debate trajectory, final answer, confidence signal, and correctness. Produce reliability diagrams plus ECE, Brier, and AUC.

### Study 2: independence × contact

Run a 2×2 design:

| Factor | Levels |
| --- | --- |
| Independence | homogeneous one-model swarm; heterogeneous mixed-model swarm |
| Contact | no-tools debate; tool-equipped debate with code, search, retrieval, or verifiers |

The pre-registered prediction is that heterogeneous-plus-tools dominates, while no-tools conditions show the worst calibration gap on hard items.

### Study 3: contagion vs truth

Enable bounded spawning. Tag claims and arguments, then track which survive and spread across rounds and the spawn tree. Correlate survival with truth against ground truth and with truth-independent fitness features: fluency, stated confidence, length, early-mover advantage, repetition, and source-agent authority.

## Implementation priorities

- **Instrument first.** Complete, clean, queryable logging outranks framework convenience.
- **Start simple.** Structured debate plus a separate aggregator is enough for Studies 1 and 2.
- **Bound spawning.** Use depth caps, breadth caps, and hard budgets; log spawn events as population-dynamics data.
- **Treat tools as the contact layer.** Log every tool call, input, result, and claim that traces to it. Compute a per-run grounding rate.
- **Reconstruct every run.** The storage layer must recover the message graph, state transitions, tool calls, spawn tree, costs, and final scoring.

## Task battery and ground truth

Use verifiable benchmarks across factual QA, math, code, and resolved forecasting. Include both tool-verifiable and non-tool-verifiable items, and weight toward the hard tail. Hold out a test set, pre-register, and keep ground-truth answers out of agent context.

## Metrics

- **Calibration:** ECE, Brier score, AUC, reliability diagrams.
- **Convergence:** consensus fraction, rounds-to-convergence, answer stability, diversity decay.
- **Contact:** tool-call counts, grounding rate, verification rate.
- **Contagion:** claim survival, propagation depth and breadth, survival-vs-truth, survival-vs-fitness.
- **Cost:** tokens, dollars, latency, agent count, and tool calls for matched-compute baselines.

The central derived quantity is the relationship between convergence-confidence and accuracy.

## Anti-goals

- Presenting internal consensus as truth.
- Treating a homogeneous, tool-less swarm as a truth-finder.
- Shipping without ground-truth evaluation.
- Reporting convergence without accuracy.
- Comparing against single-agent baselines at unmatched compute.
- Treating “the swarm enacts the thesis” as evidence for the thesis.

## Suggested phasing

1. **Harness first:** logging schema, task battery, scoring, and matched-compute baseline.
2. **Minimal swarm:** debate plus aggregator for Studies 1 and 2.
3. **Spawning and contagion:** claim-propagation instrumentation for Study 3.
4. **Write-up:** present the swarm as measured apparatus, checked against the world, never as a self-validating oracle.
