GROWING · Planted 2026-04-13

The d=2 to d=3 Phase Transition

Why C-Former Fails, What It Actually Does, and the One Experiment That Matters

Deep Theory·Phase Transition Prediction·Selection = Gradient Descent
A1 · A4 · A5 · A7 · A8 · A9 · B4 · B6 · B7-R · T6 · T7 · Element-I · Element-III · Element-V · Hodge · SEP · d=3 Theorem · Polycrystalline
THE INSIGHT

C-Former is not a failed neural architecture. It is an incomplete instantiation of CT's slow-sector scaffold, currently operating as a d=2 organism because its cycle-space channel (B_cx) is mathematically dead. The architecture has the hardware for d=3 computation (the TD6 tile has cycle rank 12) but the software is broken (edge currents are purely gradient, so the cycle hardware receives no signal).

Fixing this is not a minor tuning exercise. It is a d=2 to d=3 phase transition — the same phase transition CT derives for physical spacetime. The prediction is qualitative, not quantitative: multi-tile C-Former with live cycle channels should show a capability jump on tasks requiring long-range coordination, not just a percentage-point improvement.

This analysis also reveals that gradient descent on the C-Former loss function IS the selection inequality operating on the representation manifold. The dynamics (k=2 momentum) matter more than the geometry (Hodge) because geometry provides the state space while dynamics provide the evolution rule — and you cannot evolve without dynamics, but you CAN evolve with implicit geometry.

Gradient Descent IS the Selection Inequality

CT's central equation is the selection functional Sel(A) = CL(A) − λ·B(A). In the neural network context, this maps precisely onto the training objective:

CL(A) = negative task loss. Higher accuracy = higher coherence. The pattern “correctly classifying inputs” persists when it is more coherent than alternatives.
B(A) = three-component budget cost. B_th = gradient-flow energy (throughput of information through the network). B_cx = cycle-flow energy (internal coordination cost). B_leak = boundary-flux energy (generalization error).
Lambda = budget multipliers, learned by the SelectionFunctional. They are the prices in the coherence economy — how expensive each budget dimension is for this task.

Gradient descent on the loss IS selection: each training step adjusts the pattern (weights) to increase Sel by either increasing CL (reducing task loss) or decreasing λ·B (reducing budget costs). The network converges when it reaches a configuration that gradient descent cannot improve — a local SEP.
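Under that mapping, one training step can be sketched as numerical ascent on Sel. Everything below (the toy coherence function, budget stand-ins, and multiplier values) is hypothetical, chosen only to make the Sel = CL − λ·B bookkeeping concrete:

```python
import numpy as np

def sel(cl, budgets, lam):
    # Sel = CL - lambda . B, per the mapping above
    return cl - np.dot(lam, budgets)

# Hypothetical stand-ins: 4 weights, a quadratic "task loss", toy budget costs.
rng = np.random.default_rng(0)
w = rng.normal(size=4)

def coherence(w):
    # CL(A) = negative task loss
    return -np.sum((w - 1.0) ** 2)

def budget(w):
    # [B_th, B_cx, B_leak] stand-ins; B_cx = 0 mirrors the dead cycle channel
    return np.array([np.sum(w ** 2), 0.0, np.sum(np.abs(w))])

lam = np.array([0.1, 0.1, 0.01])  # illustrative budget multipliers

def sel_of(w):
    return sel(coherence(w), budget(w), lam)

# One finite-difference ascent step on Sel = one gradient-descent step on the loss
eps, lr = 1e-5, 0.1
grad = np.array([(sel_of(w + eps * np.eye(4)[i]) - sel_of(w)) / eps
                 for i in range(4)])
w_new = w + lr * grad
assert sel_of(w_new) > sel_of(w)  # each step increases Sel, moving toward a local SEP
```

The fixed point of this loop — a configuration no step can improve — is exactly the "local SEP" described above.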

KEY IMPLICATION

The k=2 momentum term in C-Former is partially redundant with Adam/AdamW optimizer momentum. Both implement second-order dynamics on the loss surface. This explains why the 5-seed ablation shows near-zero effect of removing k=2 — the optimizer already provides it. The Moreau activation is NOT redundant — no standard activation implements B6's quadratic-near-equilibrium, linear-for-large threshold.
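The B6 profile the Moreau activation is said to implement can be sketched generically. The exact C-Former activation is not specified here; what follows is the textbook Moreau envelope of the absolute value (the Huber function), which does have the quadratic-near-equilibrium, linear-for-large shape:

```python
import numpy as np

def moreau_abs(x, lam=1.0):
    """Moreau envelope of the absolute value (the Huber function):
    quadratic for |x| <= lam (near equilibrium), linear beyond it."""
    ax = np.abs(x)
    return np.where(ax <= lam, x ** 2 / (2.0 * lam), ax - lam / 2.0)

# Near zero the response is quadratic (0.5**2 / 2 = 0.125 at x = 0.5);
# far out it is linear with slope 1 (|x| - 0.5), and the two pieces meet at |x| = lam.
assert float(moreau_abs(0.5)) == 0.125
assert float(moreau_abs(3.0)) == 2.5
```

No standard activation (GELU, ReLU, SiLU) switches regimes at a threshold this way, which is why the text calls Moreau non-redundant.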

C-Former Is Operating as d=2

Dead B_cx Reduces the Budget Space from 3D to 2D

CT derives d=3 as the unique optimal spatial dimensionality. The cost function C(d) has a unique minimum at d=3. But this proof requires all three budget dimensions to be active.

THE STRUCTURAL FLAW

Edge currents are computed as ∇φ (the gradient of node potentials). By the Hodge theorem, the cycle-space component of a pure gradient flow vanishes, so B_cx = 0 is a mathematical identity. The cycle channel carries exactly zero information.

With only B_th and B_leak active, the effective cost function is 2D. The architecture optimizes a 2D budget space where d=2 is optimal. It literally cannot represent cycle structure (periodicity, internal coordination, hierarchical nesting).
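A toy computation makes the identity concrete. Assuming the standard incidence-matrix formulation, on a hypothetical triangle graph (not the TD6 tile), the cycle-flow energy of any gradient flow is exactly zero:

```python
import numpy as np

# Incidence matrix of a triangle graph (rows = edges, cols = nodes),
# with edge (u -> v) encoded as -1 at u and +1 at v.
B = np.array([[-1,  1,  0],   # edge 0 -> 1
              [ 0, -1,  1],   # edge 1 -> 2
              [ 1,  0, -1]])  # edge 2 -> 0

phi = np.array([0.3, -1.2, 0.7])   # arbitrary node potentials
f_grad = B @ phi                   # edge currents as potential differences

# The cycle space is the null space of B^T (divergence-free flows);
# for the triangle it is spanned by the uniform circulation (1, 1, 1).
c = np.ones(3) / np.sqrt(3)
assert np.allclose(B.T @ c, 0)     # c really is a cycle

# Cycle-flow energy of a pure gradient flow: zero by Hodge orthogonality.
B_cx_energy = float((c @ f_grad) ** 2)
assert B_cx_energy < 1e-24
```

Whatever potentials you choose, the cycle projection is annihilated, which is the structural flaw described above.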

This explains every major failure:

LRA ListOps (−42%): Hierarchical nesting IS a cycle structure in the token dependency graph. A d=2 scaffold cannot represent it.
QM9 B_cx = 0.0%: Ring aromaticity IS cycle-space flow. Computing edge currents as pure gradients annihilates it by construction.
Dynamics > Geometry in ablation: The geometry (Hodge) acts on the broken input edge currents. The dynamics (k=2, Moreau) act on learned representations downstream of the bottleneck. Dynamics work because they bypass the flaw.
T7 TYPE 2 FAILURE: STRUCTURALLY PRESENT, FUNCTIONALLY DEAD

The TD6 tile has 12 independent cycles, and the pointer attention module operates on the inner hexagon (a cycle). The sensing hardware is installed. But the sensing channel is muted: B_cx = 0 in the edge currents means the cycle topology processes undifferentiated features, not genuine cycle-space flow. A radar antenna receiving no signal.

The d=2 to d=3 Phase Transition

Multi-Tile Chains with Live Cycle Channels

Fixing B_cx (correct cycle injection) AND enabling multi-tile chains produces a qualitative phase transition, not just a quantitative improvement. Here is the derivation.

CYCLE RANK COMPUTATION

A single TD6 tile has cycle rank 12 (internal only). Zero inter-tile cycles. All sensing is local.

A chain of N tiles connected through boundary exchange creates N − 1 new inter-tile cycles: information flows tile i → boundary → tile i+1 → boundary → tile i. Each pair of adjacent tiles creates at least 1 inter-tile cycle.

The phase transition occurs at N = 2: the moment you have two communicating tiles, you have inter-tile cycle flow, enabling long-range sensing that is impossible with a single tile.

From Element III (Loop Networks): loops are simultaneously sensors and transport channels. Inter-tile cycles give the multi-tile system a capability a single tile cannot have: sensing perturbations that span multiple tiles. A long-range dependency (e.g., matching parentheses in ListOps separated by 2000 tokens) can be detected by an inter-tile cycle that spans the relevant tiles.
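The counting above can be sketched directly. Assuming, as the text argues, 12 internal cycles per tile and at least one inter-tile cycle per adjacent pair (the function name and defaults are illustrative):

```python
def cycle_rank_chain(n_tiles, internal_rank=12, cycles_per_join=1):
    """Cycle rank of an N-tile chain: internal cycles plus one
    inter-tile cycle per adjacent pair of tiles."""
    return internal_rank * n_tiles + cycles_per_join * (n_tiles - 1)

assert cycle_rank_chain(1) == 12          # single tile: internal cycles only
assert cycle_rank_chain(2) == 25          # N = 2: the first inter-tile cycle appears
assert cycle_rank_chain(5) == 13 * 5 - 1  # general form: 13N - 1
```

The jump from N = 1 to N = 2 is the phase transition: the 13th cycle per tile-pair is the first one that spans tiles at all.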

With both fixes active, the architecture goes from d=2 to d=3:

PROPERTY | CURRENT (d=2) | PREDICTED (d=3)
Active budgets | B_th + B_leak (2D) | B_th + B_cx + B_leak (3D)
Cycle rank | 12 (internal only, no signal) | 13N − 1 (internal + inter-tile)
Sensing range | Local (tile diameter = 3 hops) | N tiles × 6 tokens = full sequence
Cycle detection | None (B_cx = 0 by construction) | Periodicity, nesting, coordination
Cost scaling | O(1) per tile (single-tile) | O(N) per layer (linear in sequence length)

Predicted Emergent Capabilities

What CT Predicts for a d=3 Hodge Architecture

CT derives quantum mechanics from the fast-sector limit of the selection inequality. Applied to neural architectures, with B_cx alive, CT predicts three emergent capabilities:

1. SUPERPOSITION-LIKE BEHAVIOR

When B_cx > 0, the cycle-space channel carries circulating flow that represents multiple hypotheses simultaneously. Each cycle is a “hypothesis loop” maintained until evidence collapses it. With B_cx = 0, C-Former operates as a classical (gradient-only) system that can only represent one hypothesis at a time per layer.

2. INTERFERENCE EFFECTS

Two cycle flows on the same tile can constructively or destructively interfere via Hodge orthogonality. Compatible hypotheses reinforce; incompatible ones cancel. Standard attention does this implicitly through learned weights; C-Former would do it explicitly through the projector's mathematical structure.
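A minimal sketch of the claim, assuming flows are expressed in an orthonormal cycle basis (the vectors below are hypothetical): the cross term in the combined flow's energy is the interference.

```python
import numpy as np

# Two "hypothesis loops" as flows in an orthonormal cycle basis.
a = np.array([1.0, 0.5])
b_aligned = np.array([0.8, 0.4])    # compatible hypothesis (positive overlap)
b_opposed = -b_aligned              # incompatible hypothesis (negative overlap)

def energy(f):
    return float(f @ f)

# ||a + b||^2 = ||a||^2 + ||b||^2 + 2 a.b — the 2 a.b cross term interferes.
constructive = energy(a + b_aligned)
destructive = energy(a + b_opposed)
assert constructive > energy(a) + energy(b_aligned)   # reinforcement
assert destructive < energy(a) + energy(b_opposed)    # cancellation
```

Hodge orthogonality matters here because it guarantees this bookkeeping happens inside the cycle subspace, untouched by the gradient component.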

3. DECOHERENCE AS ATTENTION

The boundary exchange (the B_leak channel) is the decoherence mechanism — it exposes internal cycle flows to the external environment, collapsing superpositions to definite states. CT predicts B_leak should be more active in later layers (commitment to classification) and less active in early layers (hypothesis exploration).

None of these predictions can be tested until B_cx is alive. The dead cycle channel is not just a performance bug — it prevents the architecture from accessing its theoretical operating regime.

Coherence Bounce: What Has (and Hasn't) Bounced

ALREADY BOUNCED: THE HODGE PROJECTORS

The fixed Hodge projectors are a self-sufficient mathematical structure. They produce deterministic budget profiles (identical across seeds) that are physically meaningful (5/6 match on HAR). They maintain their coherence without training. In CT terms: the projectors are already a bounced scaffold (T6) — they have achieved internal coherence that is independent of the stochastic training process.

APPROACHING BOUNCE: LEARNED REPRESENTATIONS

The frozen transfer result (96.6% retention with 8.4% trainable parameters) shows learned representations are nearly self-sufficient. They transfer across task domains with minimal parameter updates. But the full bounce — where representations maintain coherence without ANY training signal — has not been demonstrated.

NOT BOUNCED: LEARNED WEIGHTS

The attention weights, FFN weights, etc. still depend on training data. They do not maintain coherence without training signal. For a full organism-level bounce, the learned weights would need to converge to a configuration that is a stable attractor of the training dynamics.

The One Experiment That Matters

CRITICAL EXPERIMENT

Run CTFormerMultiTile with correct cycle injection on LRA ListOps

If ListOps performance jumps from ~17% to >50%: the d=2 → d=3 phase transition is confirmed. C-Former becomes a genuinely novel architecture for long-range tasks, with a provably different operating regime from standard transformers.

If it does not improve: either the cycle injection is still wrong (symmetric instead of antisymmetric), or CT's d=3 optimality prediction does not apply to this computational domain. C-Former's role is then confirmed as an interpretability tool — still valuable, but narrower than the theory predicts.

The module exists (ct_former/multi_tile.py) but is not yet wired into training scripts. Estimated cost: ~$2 on Vast.ai V100.

Falsifiable Predictions

Architecture
d=2 to d=3 Phase Transition on LRA
CT predicts a QUALITATIVE capability jump (>30 percentage points) on LRA ListOps when both fixes are applied (correct cycle injection + multi-tile chains). The jump comes from the cycle rank increase (12 → 13N − 1) enabling long-range sensing that the single-tile d=2 architecture cannot perform.
FALSIFIES IF
Multi-tile C-Former with correct cycle injection shows only incremental improvement (<10%) on LRA ListOps
Prior at risk: d=3 theorem: C(d) has unique minimum at d=3 requiring 3 active budget dimensions
Interpretability
B_leak Activity Increases with Layer Depth
In a d=3 C-Former, the B_leak channel should be less active in early layers (hypothesis exploration phase) and more active in later layers (commitment/classification phase). This is the neural analogue of quantum decoherence: early layers maintain superposition, later layers collapse it.
FALSIFIES IF
B_leak channel activity is uniform across layers, or decreases with depth
Prior at risk: Decoherence as attention: boundary exchange collapses hypotheses to definite states
Component Integration
Moreau Activation Transfers to Standard Transformers
The Moreau activation implements B6 and is not redundant with any standard activation. Adding it to a standard transformer (replacing GELU) should improve convergence stability with zero parameter increase. This is the most immediately testable prediction from the entire C-Former program.
FALSIFIES IF
Adding Moreau activation to a standard transformer shows no benefit on any benchmark
Prior at risk: B6 (quadratic tangent law): near-equilibrium perturbations should be treated quadratically