SEED · Planted 2026-04-13

Swarm Benchmarking

CT-Organized Multi-Agent Systems vs Single-Instance LLMs on Standard Capability Benchmarks

Empirical CT·Budget Regime Theory + Organism Elements
T1 · T2 · T6 · T7 · A3 · A5 · A9 · B3 · B4 · B6 · Element-I · Element-II · Element-III · Element-IV · Element-V · Element-VI · Budget Regimes · SEP
THE INSIGHT

CT predicts that multi-agent swarms organized with scaffold/binder/loops/walls/editors outperform single LLM instances above a complexity threshold C_min, with the gap widening on harder problems.

This is not about raw model intelligence but about organizational coherence — the same claim CT makes about biological organisms, startups, and physical systems.

Budget regime theory places the AI industry at the B_th → B_cx transition, where organizational coherence becomes the binding constraint.

Five CT Mechanisms

Each predicts independently measurable capability gains

T1: EDITOR COVERAGE

Separate editor agents with different prompts have less correlated blind spots than self-review. dim(R_total) > dim(R_i).

Metric: Error detection rate on known-buggy solutions.
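As an illustration, the T1 metric can be sketched in a few lines of Python. The bug IDs and per-editor detection sets below are invented toy data, not experimental results; the point is that the union coverage of independently prompted editors exceeds what any single self-reviewer catches.

```python
# Toy sketch of the T1 metric: error-detection rate on a known-buggy
# solution, comparing self-review against a panel of independent editors.
# Bug IDs and per-editor detections are illustrative assumptions.

def detection_rate(detections: list[set], known_bugs: set) -> float:
    """Fraction of known bugs caught by the union of all reviewers."""
    caught = set().union(*detections) & known_bugs
    return len(caught) / len(known_bugs)

known_bugs = {1, 2, 3, 4, 5}
self_review = [{1, 2}]                    # one agent re-reading its own output
editor_panel = [{1, 2}, {3, 4}, {2, 5}]   # three editors, different prompts

print(detection_rate(self_review, known_bugs))   # 0.4
print(detection_rate(editor_panel, known_bugs))  # 1.0
```

The panel wins here because its blind spots are less correlated, which is exactly the dim(R_total) > dim(R_i) claim in miniature.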

T2: MULTI-ROOT EXPLORATION

Multiple independent solution attempts with different strategies cover more of the solution space.

Metric: Success rate on problems with multiple valid approaches.
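If the k solution attempts were fully independent, the combined success rate would follow the standard 1 − (1 − p)^k form; real strategies share a model and prompt biases, so this is an upper bound rather than a prediction. A minimal sketch with illustrative numbers:

```python
def multi_root_success(p_single: float, k: int) -> float:
    """Upper-bound success rate for k fully independent solution attempts,
    each succeeding with probability p_single. Correlated strategies
    will land below this bound."""
    return 1.0 - (1.0 - p_single) ** k

# With a 30% single-attempt success rate, three independent roots
# would reach about 65.7% under the independence assumption.
print(round(multi_root_success(0.30, 3), 3))  # 0.657
```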

T7: LOOP-BASED REVIEW

Explicit review loops (generate → review → revise) catch errors that single-pass generation misses.

Metric: Accuracy improvement from review iterations.
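The loop structure itself can be sketched independently of any model. In this hypothetical skeleton, generate/review/revise stand in for LLM calls and are stubbed with toy functions (a "solution" is a list that should contain no zeros):

```python
# Skeleton of a T7 review loop: generate -> review -> revise until the
# reviewer reports no issues or the iteration budget runs out.

def review_loop(generate, review, revise, max_iters: int = 3):
    draft = generate()
    for _ in range(max_iters):
        issues = review(draft)
        if not issues:
            break
        draft = revise(draft, issues)
    return draft

# Toy stand-ins for the three roles.
result = review_loop(
    generate=lambda: [1, 0, 3],
    review=lambda d: [i for i, x in enumerate(d) if x == 0],
    revise=lambda d, issues: [x if x else 9 for x in d],
)
print(result)  # [1, 9, 3]
```

The metric then compares accuracy at iteration 0 against accuracy after the loop terminates.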

ELEMENT IV: DOMAIN WALL FILTERING

Structured task decomposition with interface contracts reduces error propagation between sub-tasks.

Metric: Cascading error rate across multi-step solutions.
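A domain wall can be sketched as a contract check wrapped around each sub-task, so a local error stops at the wall instead of propagating downstream. The contract and pipeline steps below are toy examples, not part of the proposed harness:

```python
# Sketch of an Element-IV "domain wall": each sub-task's output must pass
# an interface contract before the next sub-task sees it.

def with_wall(step, contract):
    def guarded(payload):
        out = step(payload)
        if not contract(out):
            raise ValueError(f"wall: contract violated by {out!r}")
        return out
    return guarded

# Two sub-tasks sharing the contract "output is a non-negative int".
parse = with_wall(lambda s: int(s), lambda n: n >= 0)
double = with_wall(lambda n: 2 * n, lambda n: n >= 0)

print(double(parse("21")))  # 42

try:
    double(parse("-3"))     # bad value stops at the first wall
except ValueError as e:
    print(e)
```

The cascading-error metric counts how often a defect in step i corrupts the output of step i+1; walls should drive that rate down.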

ELEMENT I: SCAFFOLD STABILITY

Stable role definitions (CLAUDE.md) and consistent context reduce hallucination and drift.

Metric: Consistency of outputs across repeated runs.
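One simple way to score this metric is the share of repeated runs that agree with the modal output (1.0 = no drift). The run outputs below are invented examples:

```python
# Sketch of the Element-I metric: output consistency across repeated runs,
# scored as agreement with the most common output.

from collections import Counter

def consistency(outputs: list[str]) -> float:
    """Fraction of runs matching the modal output (1.0 = perfectly stable)."""
    (_, modal_count), = Counter(outputs).most_common(1)
    return modal_count / len(outputs)

stable_runs = ["A", "A", "A", "A", "A"]
drifting_runs = ["A", "B", "A", "C", "A"]

print(consistency(stable_runs))    # 1.0
print(consistency(drifting_runs))  # 0.6
```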

Experimental Design

3 benchmarks, 4 conditions, 50 problems each

Benchmarks: SWE-bench (software engineering), ARC-AGI (abstract reasoning), GAIA (general assistants)

Conditions: (1) Single instance, (2) Naive swarm (parallel without CT), (3) CT-organized swarm, (4) Mixed-model CT swarm

Estimated cost: $1,800 total
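The stated figures imply an average per-run cost, though in practice the swarm conditions will cost more per problem than single-instance runs. A quick sanity check on the scale:

```python
# Sanity check on the design's scale and budget (figures from the text).
benchmarks, conditions, problems = 3, 4, 50
total_runs = benchmarks * conditions * problems
avg_cost_per_run = 1800 / total_runs  # average only: swarm runs cost more

print(total_runs)        # 600
print(avg_cost_per_run)  # 3.0
```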

THE CROSSOVER POINT

Below C_min, single instances win (coordination is pure overhead). Above C_min, CT-organized swarms win (organization enables capability that raw intelligence cannot reach alone). The crossover is the central empirical question.
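Operationally, C_min would be estimated from per-difficulty accuracy curves as the first difficulty bin where the swarm strictly beats the single instance. The accuracy arrays below are invented for illustration, not experimental data:

```python
# Sketch of crossover estimation: candidate C_min is the first difficulty
# level where acc_swarm > acc_single; None means no crossover (P1 falsified).

def find_crossover(difficulty, acc_single, acc_swarm):
    for d, s, w in zip(difficulty, acc_single, acc_swarm):
        if w > s:
            return d
    return None

difficulty = [1, 2, 3, 4, 5]
acc_single = [0.90, 0.80, 0.60, 0.40, 0.20]  # hypothetical
acc_swarm  = [0.85, 0.78, 0.65, 0.55, 0.45]  # hypothetical

print(find_crossover(difficulty, acc_single, acc_swarm))  # 3
```

The same data also speaks to P2: whether the swarm-minus-single gap grows monotonically past the crossover.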

Key Falsifiable Predictions

AI Benchmarks
P1: Crossover Threshold
There exists a problem complexity threshold C_min above which CT-organized swarms strictly outperform single LLM instances on accuracy.
CONFIRMS IF
CT swarm outperforms single instance above C_min
FALSIFIES IF
No crossover point exists — single instance dominates at all difficulty levels
AI Benchmarks
P2: Widening Gap
The CT swarm's advantage over single instances grows with problem difficulty: the harder the problem, the more organization matters.
CONFIRMS IF
Capability gap grows with problem difficulty
FALSIFIES IF
Gap is constant or shrinks with difficulty
AI Organization
P3: CT vs Naive Swarm
CT-organized swarms (with scaffold/binder/loops/walls/editors) outperform naive swarms (parallel instances without structure) by at least 15% on problems above C_min.
CONFIRMS IF
CT-organized swarm outperforms naive swarm
FALSIFIES IF
Random parallel execution matches CT organization

Source: CT_RESEARCH_SWARM_BENCHMARKING.md · Full experimental design with 10 falsifiable predictions, cost analysis, and CT mechanism decomposition.