Swarm Benchmarking
CT-Organized Multi-Agent Systems vs Single-Instance LLMs on Standard Capability Benchmarks
CT predicts that multi-agent swarms organized with scaffold/binder/loops/walls/editors outperform single LLM instances above a complexity threshold Cmin, with the gap widening on harder problems.
This is not about raw model intelligence but about organizational coherence — the same claim CT makes about biological organisms, startups, and physical systems.
Budget regime theory places the AI industry at the Bth → Bcx transition, where organizational coherence becomes the binding constraint.
Five CT Mechanisms
Each predicts independently measurable capability gains
Separate editor agents with different prompts have less correlated blind spots than self-review. dim(R_total) > dim(R_i).
Metric: Error detection rate on known-buggy solutions.
Multiple independent solution attempts with different strategies cover more of the solution space.
Metric: Success rate on problems with multiple valid approaches.
Explicit review loops (generate → review → revise) catch errors that single-pass generation misses.
Metric: Accuracy improvement from review iterations.
Structured task decomposition with interface contracts reduces error propagation between sub-tasks.
Metric: Cascading error rate across multi-step solutions.
Stable role definitions (CLAUDE.md) and consistent context reduce hallucination and drift.
Metric: Consistency of outputs across repeated runs.
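The five metrics above could be computed along the following lines. This is a minimal sketch, not the scoring code from the full experimental design: every function name and every exact definition here (for example, treating "cascading" as a failed step directly following another failed step, or "consistency" as agreement with the most common output) is an assumption made for illustration.

```python
from __future__ import annotations

from collections import Counter
from typing import Sequence


def error_detection_rate(flagged: Sequence[bool]) -> float:
    """Fraction of known-buggy solutions that a reviewer flagged."""
    return sum(flagged) / len(flagged)


def union_flags(reviewers: Sequence[Sequence[bool]]) -> list[bool]:
    """A bug counts as caught if at least one reviewer flagged it."""
    return [any(column) for column in zip(*reviewers)]


def any_attempt_success_rate(attempts: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one independent attempt.

    attempts[p][k] is True if attempt k solved problem p.
    """
    return sum(any(per_problem) for per_problem in attempts) / len(attempts)


def review_gain(accuracy_by_iteration: Sequence[float]) -> float:
    """Accuracy after the last review iteration minus single-pass accuracy."""
    return accuracy_by_iteration[-1] - accuracy_by_iteration[0]


def cascading_error_rate(runs: Sequence[Sequence[bool]]) -> float:
    """Share of steps following a failed step that also fail.

    runs[r][i] is True if step i of run r was correct (assumed definition).
    """
    followed = cascaded = 0
    for steps in runs:
        for prev_ok, next_ok in zip(steps, steps[1:]):
            if not prev_ok:
                followed += 1
                cascaded += not next_ok
    return cascaded / followed if followed else 0.0


def run_consistency(outputs: Sequence[str]) -> float:
    """Fraction of repeated runs that agree with the most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


# Placeholder inputs, not measured results: two editors with different
# blind spots on four known-buggy solutions.
editor_a = [True, False, True, False]
editor_b = [False, True, True, False]
combined = union_flags([editor_a, editor_b])
```

With these placeholder inputs each editor alone catches half of the bugs, while the combined review catches three of four, which is the set-level analogue of the dim(R_total) > dim(R_i) claim.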
Experimental Design
3 benchmarks, 4 conditions, 50 problems each
Benchmarks: SWE-bench (software engineering), ARC-AGI (abstract reasoning), GAIA (general assistants)
Conditions: (1) Single instance, (2) Naive swarm (parallel without CT), (3) CT-organized swarm, (4) Mixed-model CT swarm
Estimated cost: $1,800 total
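As a rough check on the budget, the arithmetic below assumes "50 problems each" means 50 problems per benchmark, each run under all four conditions. That reading is an assumption, and the result is only an average: swarm conditions presumably cost more per problem than the single-instance condition.

```python
BENCHMARKS = ("SWE-bench", "ARC-AGI", "GAIA")
NUM_CONDITIONS = 4
PROBLEMS_PER_BENCHMARK = 50  # assumed reading of "50 problems each"
TOTAL_BUDGET_USD = 1800

# Every problem is attempted once under every condition.
total_runs = len(BENCHMARKS) * NUM_CONDITIONS * PROBLEMS_PER_BENCHMARK
mean_cost_per_run = TOTAL_BUDGET_USD / total_runs

print(total_runs, mean_cost_per_run)  # 600 3.0
```

Under that reading the budget allows about $3 per problem-condition run on average.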
Below Cmin, single instances win (coordination is pure overhead). Above Cmin, CT-organized swarms win (organization enables capability that raw intelligence cannot reach alone). The crossover is the central empirical question.
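One way to estimate the crossover from benchmark results is sketched below. It assumes problems have been binned by some complexity score and that per-bin accuracy is available for both conditions. The function name, the binning, and the rule "lowest level from which the swarm wins at every higher level" are illustrative assumptions, not the procedure from the full design.

```python
from typing import Optional, Sequence


def estimate_crossover(
    complexity: Sequence[float],
    single_acc: Sequence[float],
    swarm_acc: Sequence[float],
) -> Optional[float]:
    """Return the lowest complexity level from which the swarm beats the
    single instance at that level and every higher one, or None if the
    swarm does not win at the top of the range."""
    crossover = None
    for level, single, swarm in sorted(zip(complexity, single_acc, swarm_acc)):
        if swarm > single:
            if crossover is None:
                crossover = level  # start of a winning streak
        else:
            crossover = None  # streak broken, reset
    return crossover


# Placeholder accuracies for illustration only, not measured results.
levels = [1, 2, 3, 4]
single = [0.90, 0.80, 0.50, 0.20]
swarm = [0.80, 0.75, 0.60, 0.40]
c_min_estimate = estimate_crossover(levels, single, swarm)
```

For the placeholder values the estimate is level 3. With real data the bins are noisy, so a confidence interval over resampled problems would be a sensible addition before treating any single value as Cmin.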
Key Falsifiable Predictions
P1: Crossover Threshold
P2: Widening Gap
P3: CT vs Naive Swarm
Source: CT_RESEARCH_SWARM_BENCHMARKING.md · Full experimental design with 10 falsifiable predictions, cost analysis, and CT mechanism decomposition.