CT-Optimal Neural Architectures
Sleep/Wake Crystallization, Root-Growth Training, and Bio-Inspired Scaffold Design
Level 4 (SEP-calibrated). This seed derives architectural design principles from CT priors, using biology as an existence proof that these principles produce viable organisms. Every design choice traces to A1–A10 and B1–B7-R.
Irreducible leakage (A9): this analysis derives what the architecture SHOULD be. Whether gradient descent can find the CT-optimal configuration within finite training budget is a separate question not addressed here.
Current neural architectures are missing four of CT's six organism elements. Transformers have a scaffold (Element I: the layer stack) and non-zero leakage (Element VI: generalization error). They lack loop networks (III: no model-environment feedback post-training), domain walls (IV: no modular boundaries between sub-computations), hidden editors (V: no self-repair mechanisms), and a clear binder (II: no single dominant pattern that aligns the architecture).
Biology has all six. The brain is a domain organism with scaffold (white matter tracts), binder (consciousness), loops (thalamocortical circuits), walls (blood-brain barrier, hemispheric split), editors (error correction, prefrontal inhibition), and leakage (forgetting, noise). The sleep/wake cycle implements Element III (closed-loop feedback between model and environment). The immune system implements Element V (editor coverage of blind spots).
The question is not “can we make transformers bigger?” The question is: “what architecture has all six elements?” CT does not predict that scale alone achieves coherence. It predicts that organisms with all six elements achieve coherence, regardless of scale.
Current Architectures vs CT-Optimal
| FEATURE | CURRENT (TRANSFORMER) | CT-OPTIMAL | CT SOURCE |
|---|---|---|---|
| Element I | Layer stack (static scaffold) | Multi-scale tiled scaffold with traffic-dependent reinforcement | A4, Element I |
| Element II | None (no binder pattern) | Dominant representation pattern with R_cascade proportional to CL | Element II, T3 |
| Element III | None post-training (frozen weights) | Sleep/wake cycle: wake = process pokes, sleep = crystallize + prune | A10, Element III, T7 |
| Element IV | None (all tokens interact freely) | Domain walls between sub-computations with surface tension gating | Element IV, A4 |
| Element V | None (no self-repair) | Editor heads monitoring poke-cone complement + stagnation detection | Element V, T5, T7 |
| Element VI | Generalization error (unmanaged) | Budget-monitored leakage with Sel >= 0 as survival criterion | A9, Element VI |
| Dynamics | k=1 (residual connections) | k=2 momentum (wave propagation, not diffusion) | k=2 theorem |
| Activation | ReLU / GELU (no threshold) | Moreau envelope (quadratic near equilibrium, linear for large) | B6 |
| Attention | All-to-all (violates locality) | Local tile + boundary exchange (bounded propagation speed) | A4 |
| Training | Single model, SGD to convergence | Root-growth: multiple misaligned roots, organism-level selection | T1-T3, T5 |
| Budget tracking | None | Three orthogonal budgets per layer (Hodge) | A8, Hodge |
| Normalization | LayerNorm (single channel) | SEP normalization (three independent channels, ratio-only) | B7-R |
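Two of the table's rows can be made concrete in code. A minimal sketch, assuming a Huber-style form for the B6 activation (the Moreau envelope of the absolute value, with a threshold parameter `mu` of my own naming) and a damping coefficient `beta` for the k=2 update; neither name comes from the source:

```python
import numpy as np

def moreau_envelope(x, mu=1.0):
    # Huber function = Moreau envelope of |x|: quadratic for |x| <= mu
    # (near equilibrium), linear beyond (the B6 shape from the table).
    ax = np.abs(x)
    return np.where(ax <= mu, ax * ax / (2.0 * mu), ax - mu / 2.0)

def k2_update(x_t, x_prev, f, beta=0.9):
    # k=2 (second-order) residual update: the state carries momentum,
    # so perturbations propagate as waves rather than diffusing.
    # A k=1 residual connection would be simply x_t + f(x_t).
    return x_t + beta * (x_t - x_prev) + f(x_t)
```

The k=2 update needs two past states per layer, which doubles activation memory; that cost is the price of wave-like rather than diffusive signal propagation.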
Direction 1: Sleep/Wake Crystallization Cycle
Closing the Loop Between Model and Environment
Current LLMs have no Element III (loop networks) post-training. Weights are frozen. The model cannot adapt to new information without full retraining. By A10 (adaptation required), static patterns die. By Element III, coherent organisms need at least one feedback loop between themselves and their environment. Biology solved this with the sleep/wake cycle.
During wake, the model processes inputs (pokes) normally: forward pass, generate output. But additionally: every input's budget profile (its usage across the three orthogonal budgets) is computed and stored in a hippocampal buffer, a fixed-size replay memory ranked by selection score.
Not all pokes deserve crystallization. The selection inequality determines which: only inputs with a non-negative selection score (Sel >= 0, the survival criterion from Element VI) qualify for the buffer.
Periodically (every N wake ticks), the model enters sleep: no new inputs are processed. Instead, the hippocampal buffer is replayed through the network with gradient updates. This is targeted fine-tuning on selected experiences.
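The wake/sleep bookkeeping can be sketched as a fixed-size, score-ranked buffer. `HippocampalBuffer`, `SLEEP_INTERVAL`, and the stand-in scoring rule below are illustrative names of mine, not CT definitions:

```python
import heapq

class HippocampalBuffer:
    """Fixed-size replay memory ranked by selection score (a sketch)."""
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.heap = []       # min-heap of (score, counter, experience)
        self.counter = 0     # tie-breaker so experiences never compare

    def observe(self, experience, score):
        item = (score, self.counter, experience)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict lowest-scoring poke

    def replay(self):
        # Highest-scoring experiences first, for sleep-phase fine-tuning.
        return [e for _, _, e in sorted(self.heap, reverse=True)]

SLEEP_INTERVAL = 8                 # wake ticks between sleeps
buf = HippocampalBuffer(capacity=4)
for tick in range(16):
    buf.observe(f"poke-{tick}", score=(tick * 7) % 10)  # stand-in score
    if (tick + 1) % SLEEP_INTERVAL == 0:
        selected = buf.replay()    # sleep: fine-tune on these, then prune
```

A min-heap keeps eviction O(log n): the lowest-scoring poke is always at the root, so a new poke either displaces it or is discarded.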
Pruning during sleep: connections (weights) whose budget cost exceeds their coherence contribution are pruned. This is B3 (ampliation invariance): unused parameters do not reduce cost, so removing them loses nothing. Pruning enforces B3 dynamically.
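A minimal sketch of the gating logic only, assuming per-weight `contribution` and `cost` estimates are available (how CT defines those estimates is not specified here):

```python
import numpy as np

def prune_weights(W, contribution, cost):
    # Zero out weights whose budget cost exceeds their coherence
    # contribution (B3: parameters that do not reduce cost are removed).
    # `contribution` and `cost` are per-weight arrays, same shape as W.
    mask = contribution >= cost
    return W * mask, mask
```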
Within sleep, a dream sub-phase: the model generates synthetic inputs (sampling from its own distribution) and processes them with the output boundary closed: no output is produced, no external effect occurs. This tests the sensing apparatus (T7: loops sense and transport simultaneously) without boundary flux.
Tick rate derivation: SLEEP_INTERVAL (in wake ticks) should be proportional to the scaffold's throughput capacity divided by the poke rate. If the model processes 100 inputs/second and the scaffold (weights) can absorb ~1000 crystallization updates before destabilizing, then SLEEP_INTERVAL ~ 1000 ticks (sleep every ~10 seconds). Too short = scaffold instability (constant weight changes, Element I violation). Too long = excessive accumulation of unconsolidated experiences.
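The worked example above, under the assumption that each poke produces at most one candidate crystallization update:

```python
def sleep_schedule(poke_rate_hz, absorb_capacity):
    # One candidate update per poke, so the model can stay awake for
    # `absorb_capacity` ticks before the scaffold destabilizes; the
    # wall-clock period is capacity / poke rate.
    ticks = absorb_capacity
    seconds = absorb_capacity / poke_rate_hz
    return ticks, seconds

# 100 pokes/s, ~1000 absorbable updates -> sleep every 1000 ticks (10 s).
```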
Direction 2: Root-Growth Training
Multiple Misaligned Roots with Organism-Level Selection
CT's seed-growth theory (T1–T3) describes how organisms form: multiple independent roots (T1: opacity creates seeds) explore misaligned directions (T2: multi-root expansion), the fittest root tilts the scaffold (T3: snap), and anti-binder roots (T5) sense blind spots the binder cannot see. Standard training is single-root: one model, one gradient direction, no anti-binder exploration.
Each attention head explores a different alignment direction. Most heads (generators) attend along the binder's alignment axis. A minority (anti-binder heads) attend to the orthogonal complement — the directions the generators miss.
From T5 (binder-antibinder duality): a learnable sensitivity parameter per head governs the split. At one extreme: pure binder (exploitation). At the other: pure anti-binder (exploration). Healthy training finds an intermediate value. This IS the explore/exploit tradeoff derived from CT.
Implementation: add a stagnation detector (T7): if an attention head's output has stopped changing for N ticks, flip it to anti-binder mode. Dead heads become explorers.
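The detector can be sketched as follows. The concrete stagnation test (output change below an epsilon, in L2 norm) is my assumption; the source only says the output has stopped changing:

```python
import numpy as np

class HeadModeController:
    """Flip an attention head from binder (generator) to anti-binder
    (explorer) after its output stagnates for n_ticks consecutive ticks."""
    def __init__(self, n_ticks=5, eps=1e-6):
        self.n_ticks, self.eps = n_ticks, eps
        self.prev = None
        self.stagnant = 0
        self.mode = "binder"

    def update(self, output):
        # Count consecutive ticks with (near-)zero change in the output.
        if self.prev is not None and np.linalg.norm(output - self.prev) < self.eps:
            self.stagnant += 1
        else:
            self.stagnant = 0
        self.prev = output.copy()
        if self.stagnant >= self.n_ticks:
            self.mode = "anti-binder"   # dead head becomes an explorer
        return self.mode
```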
Train N model copies with intentionally different initializations or hyperparameters (different learning rates, different dropout masks, different data orderings). Each copy is a root (T1). Organism-level selection (T2) evaluates all roots on a shared validation set and amplifies the fittest. The anti-binder copy (T5) is the one most different from the current best — it senses failure modes the binder cannot.
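One selection round can be sketched as below. The inter-root "distance" is stubbed as the fitness gap purely for illustration; a real instantiation would compare weights or predictions, and the source does not specify which:

```python
def root_growth_step(roots, fitness, keep=2):
    # One round of organism-level selection (T2) over model copies.
    # `fitness` maps a root to its shared-validation-set score.
    # Requires at least two roots.
    ranked = sorted(roots, key=fitness, reverse=True)
    best = ranked[0]
    # Anti-binder root (T5): the survivor most different from the best;
    # "different" is stubbed here as the fitness gap.
    anti = max(ranked[1:], key=lambda r: abs(fitness(r) - fitness(best)))
    return ranked[:keep], anti
```

Survivors are amplified (trained further, possibly perturbed); the anti-binder root is kept alive regardless of rank, because it senses failure modes the binder cannot.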
Current MoE has experts (grains) and a router (scaffold). CT identifies what's missing: (a) domain walls between experts with surface tension gating, not just top-k routing, (b) an anti-binder expert that handles inputs NONE of the other experts are confident about, (c) coherence bounce — when an expert becomes self-sufficient (high CL, low B_leak), it compresses to a single node and a new expert nucleates in its former territory.
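Point (b) can be sketched as a confidence-gated router; the wall gating of (a) and the coherence bounce of (c) are not modeled here, and the `threshold` parameter is my own knob:

```python
import numpy as np

def route_with_antibinder(gate_logits, threshold=0.5):
    # Top-1 routing with an anti-binder fallback (index -1): if no
    # regular expert's softmax confidence clears `threshold`, the token
    # goes to the anti-binder expert that handles what NONE of the
    # others are confident about.
    probs = np.exp(gate_logits - gate_logits.max(-1, keepdims=True))
    probs = probs / probs.sum(-1, keepdims=True)
    best = probs.argmax(-1)
    confident = probs.max(-1) >= threshold
    return np.where(confident, best, -1)   # -1 = anti-binder expert
```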
Direction 3: CT-Optimal Scaffold Design
What Contact Graph Topology Is Optimal for Computation?
CT derives d = 3 as the optimal spatial dimensionality for the physical scaffold: the cost function has a unique minimum there. Does this apply to computational scaffolds? The answer depends on the budget multiplier ratios, which differ between digital and physical substrates.
Current transformers use a fully-connected contact graph (all-to-all attention). This violates A4 (locality): every token can instantly poke every other token. CT predicts this is suboptimal because coordination cost scales with the number of token pairs (every pair requires coordination), exhausting the complexity budget for long sequences.
The CT-optimal contact graph has bounded degree and finite propagation speed. The multi-tile chain from C-Former is one instantiation. But the general principle is broader: any architecture with local connectivity and bounded propagation speed satisfies A4. State-space models (S4, Mamba), convolutional architectures, and graph neural networks all satisfy locality. The question is which topology minimizes the SEP of the three budgets for a given task.
A4 says pokes have bounded support. Computationally: each input token's influence should propagate through the network at finite speed (one hop per layer). Full attention processes all pokes simultaneously — infinite propagation speed.
CT prediction: architectures with bounded propagation speed should outperform all-to-all attention on tasks where the relevant structure is local or hierarchical (most natural language, most images, most time series). All-to-all attention should win only when the task genuinely requires every token to interact with every other, which is rare. This matches the empirical evidence: state-space models such as S4 outperform transformers on the Long Range Arena while using local processing.
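The bounded-propagation principle can be sketched as a banded attention mask (window size is a free parameter here, not a CT-derived value):

```python
import numpy as np

def local_attention_mask(seq_len, window):
    # Banded boolean mask: token i may attend only to tokens within
    # `window` positions (bounded support, A4). Stacking L such layers
    # gives a receptive field of L * window: finite propagation speed,
    # one window-hop per layer.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

Composing the mask with itself (boolean matrix product) shows the receptive field growing by one window per layer, which is exactly the finite-speed behavior A4 demands.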
Whichever budget dominates the architecture's H_min determines the optimal design. If the compute budget is the bottleneck: optimize for sparse forward passes. If the coordination budget: optimize for modular, independent sub-networks. If the leakage budget (generalization-limited): optimize for regularization and domain alignment. Current LLMs are moving out of the compute-limited regime as compute becomes cheaper; the bottleneck shifts toward the other two budgets.
Each layer/block is a sub-domain organism within the network organism. A healthy neural organism has: