CT-Optimal Neural Architectures
Sleep/Wake Crystallization, Root-Growth Training, and Bio-Inspired Scaffold Design
Level 4 (SEP-calibrated). This seed derives architectural design principles from CT priors, using biology as an existence proof that these principles produce viable organisms. Every design choice traces to A1–A10 and B1–B7-R.
Irreducible leakage (A9): this analysis derives what the architecture SHOULD be. Whether gradient descent can find the CT-optimal configuration within finite training budget is a separate question not addressed here.
Current neural architectures are missing four of CT's six organism elements. Transformers have a scaffold (Element I: the layer stack) and non-zero leakage (Element VI: generalization error). They lack loop networks (III: no model-environment feedback post-training), domain walls (IV: no modular boundaries between sub-computations), hidden editors (V: no self-repair mechanisms), and a clear binder (II: no single dominant pattern that aligns the architecture).
Biology has all six. The brain is a domain organism with scaffold (white matter tracts), binder (consciousness), loops (thalamocortical circuits), walls (blood-brain barrier, hemispheric split), editors (error correction, prefrontal inhibition), and leakage (forgetting, noise). The sleep/wake cycle implements Element III (closed-loop feedback between model and environment). The immune system implements Element V (editor coverage of blind spots).
The question is not “can we make transformers bigger?” The question is: “what architecture has all six elements?” CT does not predict that scale alone achieves coherence. It predicts that organisms with all six elements achieve coherence, regardless of scale.
Current Architectures vs CT-Optimal
| FEATURE | CURRENT (TRANSFORMER) | CT-OPTIMAL | CT SOURCE |
|---|---|---|---|
| Element I | Layer stack (static scaffold) | Multi-scale tiled scaffold with traffic-dependent reinforcement | A4, Element I |
| Element II | None (no binder pattern) | Dominant representation pattern with R_cascade proportional to CL | Element II, T3 |
| Element III | None post-training (frozen weights) | Sleep/wake cycle: wake = process pokes, sleep = crystallize + prune | A10, Element III, T7 |
| Element IV | None (all tokens interact freely) | Domain walls between sub-computations with surface tension gating | Element IV, A4 |
| Element V | None (no self-repair) | Editor heads monitoring poke-cone complement + stagnation detection | Element V, T5, T7 |
| Element VI | Generalization error (unmanaged) | Budget-monitored leakage with Sel >= 0 as survival criterion | A9, Element VI |
| Dynamics | k=1 (residual connections) | k=2 momentum (wave propagation, not diffusion) | k=2 theorem |
| Activation | ReLU / GELU (no threshold) | Moreau envelope (quadratic near equilibrium, linear for large) | B6 |
| Attention | All-to-all (violates locality) | Local tile + boundary exchange (bounded propagation speed) | A4 |
| Training | Single model, SGD to convergence | Root-growth: multiple misaligned roots, organism-level selection | T1-T3, T5 |
| Budget tracking | None | Three orthogonal budgets per layer (Hodge) | A8, Hodge |
| Normalization | LayerNorm (single channel) | SEP normalization (three independent channels, ratio-only) | B7-R |
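Two of the table's rows can be made concrete in code. A minimal sketch, assuming a Huber-style form for the B6 activation (the Moreau envelope of the absolute value, with a threshold parameter `mu` of my own naming) and a damping coefficient `beta` for the k=2 update; neither name comes from the source:

```python
import numpy as np

def moreau_envelope(x, mu=1.0):
    # Huber function = Moreau envelope of |x|: quadratic for |x| <= mu
    # (near equilibrium), linear beyond (the B6 shape from the table).
    ax = np.abs(x)
    return np.where(ax <= mu, ax * ax / (2.0 * mu), ax - mu / 2.0)

def k2_update(x_t, x_prev, f, beta=0.9):
    # k=2 (second-order) residual update: the state carries momentum,
    # so perturbations propagate as waves rather than diffusing.
    # A k=1 residual connection would be simply x_t + f(x_t).
    return x_t + beta * (x_t - x_prev) + f(x_t)
```

The k=2 update needs two past states per layer, which doubles activation memory; that cost is the price of wave-like rather than diffusive signal propagation.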
Direction 1: Sleep/Wake Crystallization Cycle
Closing the Loop Between Model and Environment
Current LLMs have no Element III (loop networks) post-training. Weights are frozen. The model cannot adapt to new information without full retraining. By A10 (adaptation required), static patterns die. By Element III, coherent organisms need at least one feedback loop between themselves and their environment. Biology solved this with the sleep/wake cycle.
During wake, the model processes inputs (pokes) normally: forward pass, generate output. But additionally: every input's budget profile (its usage across the three orthogonal budgets) is computed and stored in a hippocampal buffer, a fixed-size replay memory ranked by selection score.
Not all pokes deserve crystallization. The selection inequality determines which: only inputs with a non-negative selection score (Sel >= 0, the survival criterion from Element VI) qualify for the buffer.
Periodically (every N wake ticks), the model enters sleep: no new inputs are processed. Instead, the hippocampal buffer is replayed through the network with gradient updates. This is targeted fine-tuning on selected experiences.
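The wake/sleep bookkeeping can be sketched as a fixed-size, score-ranked buffer. `HippocampalBuffer`, `SLEEP_INTERVAL`, and the stand-in scoring rule below are illustrative names of mine, not CT definitions:

```python
import heapq

class HippocampalBuffer:
    """Fixed-size replay memory ranked by selection score (a sketch)."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.heap = []       # min-heap of (score, counter, experience)
        self.counter = 0     # tie-breaker so experiences never compare

    def observe(self, experience, score):
        item = (score, self.counter, experience)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict lowest-scoring poke

    def replay(self):
        # Highest-scoring experiences first, for sleep-phase fine-tuning.
        return [e for _, _, e in sorted(self.heap, reverse=True)]

SLEEP_INTERVAL = 8                 # wake ticks between sleeps
buf = HippocampalBuffer(capacity=4)
for tick in range(16):
    buf.observe(f"poke-{tick}", score=(tick * 7) % 10)  # stand-in score
    if (tick + 1) % SLEEP_INTERVAL == 0:
        selected = buf.replay()    # sleep: fine-tune on these, then prune
```

A min-heap keeps eviction O(log n): the lowest-scoring poke is always at the root, so a new poke either displaces it or is discarded.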
Pruning during sleep: connections (weights) whose budget cost exceeds their coherence contribution are pruned. This is B3 (ampliation invariance): unused parameters do not reduce cost, so removing them loses nothing. Pruning enforces B3 dynamically.
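A minimal sketch of the gating logic only, assuming per-weight `contribution` and `cost` estimates are available (how CT defines those estimates is not specified here):

```python
import numpy as np

def prune_weights(W, contribution, cost):
    # Zero out weights whose budget cost exceeds their coherence
    # contribution (B3: parameters that do not reduce cost are removed).
    # `contribution` and `cost` are per-weight arrays, same shape as W.
    mask = contribution >= cost
    return W * mask, mask
```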
Within sleep, a dream sub-phase: the model generates synthetic inputs (sampling from its own distribution) and processes them with the output boundary closed: no output is produced, no external effect occurs. This tests the sensing apparatus (T7: loops sense and transport simultaneously) without boundary flux.
Tick rate derivation: SLEEP_INTERVAL (in wake ticks) should be proportional to the scaffold's throughput capacity divided by the poke rate. If the model processes 100 inputs/second and the scaffold (weights) can absorb ~1000 crystallization updates before destabilizing, then SLEEP_INTERVAL ~ 1000 ticks (sleep every ~10 seconds). Too short = scaffold instability (constant weight changes, Element I violation). Too long = excessive accumulation of unconsolidated experiences.
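The worked example above, under the assumption that each poke produces at most one candidate crystallization update:

```python
def sleep_schedule(poke_rate_hz, absorb_capacity):
    # One candidate update per poke, so the model can stay awake for
    # `absorb_capacity` ticks before the scaffold destabilizes; the
    # wall-clock period is capacity / poke rate.
    ticks = absorb_capacity
    seconds = absorb_capacity / poke_rate_hz
    return ticks, seconds

# 100 pokes/s, ~1000 absorbable updates -> sleep every 1000 ticks (10 s).
```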
Direction 2: Root-Growth Training
Multiple Misaligned Roots with Organism-Level Selection
CT's seed-growth theory (T1–T3) describes how organisms form: multiple independent roots (T1: opacity creates seeds) explore misaligned directions (T2: multi-root expansion), the fittest root tilts the scaffold (T3: snap), and anti-binder roots (T5) sense blind spots the binder cannot see. Standard training is single-root: one model, one gradient direction, no anti-binder exploration.
Each attention head explores a different alignment direction. Most heads (generators) attend along the binder's alignment axis. A minority (anti-binder heads) attend to the orthogonal complement — the directions the generators miss.
From T5 (binder-antibinder duality): a learnable sensitivity parameter per head governs the split. At one extreme: pure binder (exploitation). At the other: pure anti-binder (exploration). Healthy training finds an intermediate value. This IS the explore/exploit tradeoff derived from CT.
Implementation: add a stagnation detector (T7): if an attention head's output has stopped changing for N ticks, flip it to anti-binder mode. Dead heads become explorers.
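The detector can be sketched as follows. The concrete stagnation test (output change below an epsilon, in L2 norm) is my assumption; the source only says the output has stopped changing:

```python
import numpy as np

class HeadModeController:
    """Flip an attention head from binder (generator) to anti-binder
    (explorer) after its output stagnates for n_ticks consecutive ticks."""
    def __init__(self, n_ticks=5, eps=1e-6):
        self.n_ticks, self.eps = n_ticks, eps
        self.prev = None
        self.stagnant = 0
        self.mode = "binder"

    def update(self, output):
        # Count consecutive ticks with (near-)zero change in the output.
        if self.prev is not None and np.linalg.norm(output - self.prev) < self.eps:
            self.stagnant += 1
        else:
            self.stagnant = 0
        self.prev = output.copy()
        if self.stagnant >= self.n_ticks:
            self.mode = "anti-binder"   # dead head becomes an explorer
        return self.mode
```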
Train N model copies with intentionally different initializations or hyperparameters (different learning rates, different dropout masks, different data orderings). Each copy is a root (T1). Organism-level selection (T2) evaluates all roots on a shared validation set and amplifies the fittest. The anti-binder copy (T5) is the one most different from the current best — it senses failure modes the binder cannot.
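One selection round can be sketched as below. The inter-root "distance" is stubbed as the fitness gap purely for illustration; a real instantiation would compare weights or predictions, and the source does not specify which:

```python
def root_growth_step(roots, fitness, keep=2):
    # One round of organism-level selection (T2) over model copies.
    # `fitness` maps a root to its shared-validation-set score.
    # Requires at least two roots.
    ranked = sorted(roots, key=fitness, reverse=True)
    best = ranked[0]
    # Anti-binder root (T5): the survivor most different from the best;
    # "different" is stubbed here as the fitness gap.
    anti = max(ranked[1:], key=lambda r: abs(fitness(r) - fitness(best)))
    return ranked[:keep], anti
```

Survivors are amplified (trained further, possibly perturbed); the anti-binder root is kept alive regardless of rank, because it senses failure modes the binder cannot.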
Current MoE has experts (grains) and a router (scaffold). CT identifies what's missing: (a) domain walls between experts with surface tension gating, not just top-k routing, (b) an anti-binder expert that handles inputs NONE of the other experts are confident about, (c) coherence bounce — when an expert becomes self-sufficient (high CL, low B_leak), it compresses to a single node and a new expert nucleates in its former territory.
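Point (b) can be sketched as a confidence-gated router; the wall gating of (a) and the coherence bounce of (c) are not modeled here, and the `threshold` parameter is my own knob:

```python
import numpy as np

def route_with_antibinder(gate_logits, threshold=0.5):
    # Top-1 routing with an anti-binder fallback (index -1): if no
    # regular expert's softmax confidence clears `threshold`, the token
    # goes to the anti-binder expert that handles what NONE of the
    # others are confident about.
    probs = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    probs = probs / probs.sum(-1, keepdims=True)
    best = probs.argmax(-1)
    confident = probs.max(-1) >= threshold
    return np.where(confident, best, -1)   # -1 = anti-binder expert
```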
Direction 3: CT-Optimal Scaffold Design
What Contact Graph Topology Is Optimal for Computation?
CT derives d = 3 as the optimal spatial dimensionality for the physical scaffold: the cost function has a unique minimum there. Does this apply to computational scaffolds? The answer depends on the budget multiplier ratios, which differ between digital and physical substrates.
Current transformers use a fully-connected contact graph (all-to-all attention). This violates A4 (locality): every token can instantly poke every other token. CT predicts this is suboptimal because coordination cost scales with the number of token pairs (every pair requires coordination), exhausting the complexity budget for long sequences.
The CT-optimal contact graph has bounded degree and finite propagation speed. The multi-tile chain from C-Former is one instantiation. But the general principle is broader: any architecture with local connectivity and bounded propagation speed satisfies A4. State-space models (S4, Mamba), convolutional architectures, and graph neural networks all satisfy locality. The question is which topology minimizes the SEP of the three budgets for a given task.
A4 says pokes have bounded support. Computationally: each input token's influence should propagate through the network at finite speed (one hop per layer). Full attention processes all pokes simultaneously — infinite propagation speed.
CT prediction: architectures with bounded propagation speed should outperform all-to-all attention on tasks where the relevant structure is local or hierarchical (most natural language, most images, most time series). All-to-all attention should win only when the task genuinely requires every token to interact with every other, which is rare. This matches the empirical evidence: state-space models such as S4 outperform transformers on the Long Range Arena while using local processing.
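The bounded-propagation principle can be sketched as a banded attention mask (window size is a free parameter here, not a CT-derived value):

```python
import numpy as np

def local_attention_mask(seq_len, window):
    # Banded boolean mask: token i may attend only to tokens within
    # `window` positions (bounded support, A4). Stacking L such layers
    # gives a receptive field of L * window: finite propagation speed,
    # one window-hop per layer.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

Composing the mask with itself (boolean matrix product) shows the receptive field growing by one window per layer, which is exactly the finite-speed behavior A4 demands.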
Whichever budget dominates the architecture's H_min determines the optimal design. If the compute budget is the bottleneck: optimize for sparse forward passes. If the coordination budget: optimize for modular, independent sub-networks. If the leakage budget (generalization-limited): optimize for regularization and domain alignment. Current LLMs are moving out of the compute-limited regime as compute becomes cheaper; the bottleneck shifts toward the other two budgets.
Each layer/block is a sub-domain organism within the network organism. A healthy neural organism has: