The d=2 to d=3 Phase Transition
Why C-Former Fails, What It Actually Does, and the One Experiment That Matters
C-Former is not a failed neural architecture. It is an incomplete instantiation of CT's slow-sector scaffold, currently operating as a d=2 organism because its cycle-space channel (B_cx) is mathematically dead. The architecture has the hardware for d=3 computation (the TD6 tile has cycle rank 12) but the software is broken: edge currents are purely gradient, so the cycle hardware receives no signal.
Fixing this is not a minor tuning exercise. It is a d=2 to d=3 phase transition — the same phase transition CT derives for physical spacetime. The prediction is qualitative, not quantitative: multi-tile C-Former with live cycle channels should show a capability jump on tasks requiring long-range coordination, not just a percentage-point improvement.
This analysis also reveals that gradient descent on the C-Former loss function IS the selection inequality operating on the representation manifold. The dynamics (k=2 momentum) matter more than the geometry (Hodge) because geometry provides the state space while dynamics provide the evolution rule — and you cannot evolve without dynamics, but you CAN evolve with implicit geometry.
Gradient Descent IS the Selection Inequality
CT's central equation is the selection inequality. In the neural network context, the selection functional maps precisely onto the training objective:

Sel = CL - λ_th·B_th - λ_cx·B_cx - λ_leak·B_leak

Here CL is coherence (the negated task loss) and the λ coefficients weight the budget terms. They are the prices in the coherence economy — how expensive each budget dimension is for this task. Gradient descent on the loss IS selection: each training step adjusts the pattern (weights) to increase Sel by either increasing CL (reducing task loss) or decreasing the λ·B terms (reducing budget costs). The network converges when it reaches a configuration that gradient descent cannot improve — a local SEP.
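As a sanity check on this mapping, here is a minimal sketch: gradient descent on a toy loss of the form task_loss + λ·budgets, with assumed quadratic and L1 stand-ins for B_th and B_leak (the names follow the text; the functional forms are illustrative, not C-Former's actual objective).

```python
import numpy as np

# Hedged sketch: maximizing Sel = CL - lambda·budgets is minimizing
# task_loss + lambda·budgets. The budget stand-ins below are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0])
lam_th, lam_leak = 1e-3, 1e-3

def neg_sel(w):
    task_loss = np.mean((X @ w - y) ** 2)   # CL = -task_loss
    b_th = lam_th * np.sum(w ** 2)          # stand-in for the B_th cost
    b_leak = lam_leak * np.sum(np.abs(w))   # stand-in for the B_leak cost
    return task_loss + b_th + b_leak        # minimizing this maximizes Sel

def grad(w, eps=1e-6):                      # numerical gradient, for clarity
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (neg_sel(w + e) - neg_sel(w - e)) / (2 * eps)
    return g

w = np.zeros(4)
for _ in range(500):        # each step is one act of selection on the weights
    w -= 0.05 * grad(w)
# Converged w is a local SEP: no gradient step can improve Sel further.
```

The converged configuration trades a small budget cost (nonzero weight norms) for a large coherence gain (low task loss), which is exactly the economy the text describes.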
The k=2 momentum term in C-Former is partially redundant with Adam/AdamW optimizer momentum. Both implement second-order dynamics on the loss surface. This explains why the 5-seed ablation shows near-zero effect of removing k=2 — the optimizer already provides it. The Moreau activation is NOT redundant — no standard activation implements B6's quadratic-near-equilibrium, linear-for-large threshold.
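For concreteness, one natural candidate for such an activation is the Moreau envelope of |x| (the Huber function): quadratic near equilibrium, linear for large inputs. The name `moreau_abs` and the parameter `mu` are assumptions for illustration; C-Former's exact form may differ.

```python
import numpy as np

# Hedged sketch of a "Moreau activation" in the sense the text describes:
# the Moreau envelope of |x|, i.e. the Huber function. It is quadratic for
# |x| <= mu (near equilibrium) and linear beyond (large threshold).
def moreau_abs(x, mu=1.0):
    x = np.asarray(x, dtype=float)
    quad = x ** 2 / (2 * mu)      # inner branch: quadratic near 0
    lin = np.abs(x) - mu / 2      # outer branch: linear, slope 1
    return np.where(np.abs(x) <= mu, quad, lin)
```

The two branches meet continuously at |x| = mu, which is what no standard activation (ReLU, GELU, etc.) provides in this quadratic-then-linear form.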
C-Former Is Operating as d=2
Dead B_cx Reduces the Budget Space from 3D to 2D
CT derives d=3 as the unique optimal spatial dimensionality. The three-budget cost function has a unique minimum at d=3. But this proof requires all three budget dimensions to be active.
Edge currents are computed as the discrete gradient of node potentials, J = d0·φ. By the Hodge theorem, gradient flows have zero cycle component: P_cyc·J = 0 is a mathematical identity. The cycle channel carries exactly zero information.
With only B_th and B_leak active, the effective cost function is 2D. The architecture optimizes a 2D budget space where d=2 is optimal. It literally cannot represent cycle structure (periodicity, internal coordination, hierarchical nesting).
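The dead-channel claim can be verified numerically on any graph. A hedged sketch on a toy triangle graph (not the TD6 tile): build the incidence matrix d0, form the cycle projector, and confirm that gradient currents vanish under it while circulations survive.

```python
import numpy as np

# Toy triangle graph: 3 nodes, 3 oriented edges. d0 is the edge-node
# incidence matrix; im(d0) is the gradient subspace of edge flows.
d0 = np.array([[-1.,  1.,  0.],   # edge 0 -> 1
               [ 0., -1.,  1.],   # edge 1 -> 2
               [ 1.,  0., -1.]])  # edge 2 -> 0

# Projector onto cycle space = everything orthogonal to gradient flows.
P_cyc = np.eye(3) - d0 @ np.linalg.pinv(d0)

phi = np.array([0.3, -1.2, 2.0])  # arbitrary node potentials
J = d0 @ phi                      # purely gradient edge currents
print(np.linalg.norm(P_cyc @ J))  # ~0: the cycle channel sees nothing

c = np.array([1., 1., 1.])        # a circulation around the triangle
print(np.linalg.norm(P_cyc @ c - c))  # ~0: cycle flow passes through intact
```

This is the whole bug in three lines: as long as J lies in im(d0), P_cyc·J is identically zero no matter how the potentials are trained.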
This explains every major failure:
The TD6 tile has 12 cycles and the pointer attention module operates on the inner hexagon (a cycle). The sensing hardware is installed. But the sensing channel is muted: P_cyc·J = 0 in the edge currents means the cycle topology processes undifferentiated features, not genuine cycle-space flow. It is a radar antenna receiving no signal.
The d=2 to d=3 Phase Transition
Multi-Tile Chains with Live Cycle Channels
Fixing B_cx (correct cycle injection) AND enabling multi-tile chains produces a qualitative phase transition, not just a quantitative improvement. Here is the derivation.
A single TD6 tile has cycle rank 12 (internal only). Zero inter-tile cycles. All sensing is local.
A chain of N tiles connected through boundary exchange creates new inter-tile cycles: information flows tile i → boundary → tile i+1 → boundary → tile i. Each pair of adjacent tiles creates at least 1 inter-tile cycle.
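Under these counts (12 internal cycles per tile, at least one inter-tile cycle per adjacent pair), the chain's cycle rank tallies to 13N - 1; the helper below is a hypothetical illustration of that bookkeeping.

```python
# Hedged sketch: cycle-rank bookkeeping for an N-tile chain, assuming
# 12 internal cycles per TD6 tile and exactly one inter-tile cycle per
# adjacent pair (the minimum the text guarantees).
def chain_cycle_rank(n_tiles: int, per_tile: int = 12) -> int:
    internal = per_tile * n_tiles
    inter_tile = max(n_tiles - 1, 0)   # one cycle per adjacent pair
    return internal + inter_tile       # = 13N - 1 for N >= 1

print(chain_cycle_rank(1))  # 12: single tile, internal cycles only
print(chain_cycle_rank(2))  # 25: N=2 adds the first inter-tile cycle
```

The jump from 12 to 25 at N=2 is the combinatorial signature of the phase transition: the first inter-tile cycle appears the moment two tiles communicate.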
The phase transition occurs at N = 2: the moment you have two communicating tiles, you have inter-tile cycle flow, enabling long-range sensing that is impossible with a single tile.
From Element III (Loop Networks): loops are simultaneously sensors and transport channels. Inter-tile cycles give the multi-tile system a capability a single tile cannot have: sensing perturbations that span multiple tiles. A long-range dependency (e.g., matching parentheses in ListOps separated by 2000 tokens) can be detected by an inter-tile cycle that spans the relevant tiles.
With both fixes active, the architecture goes from d=2 to d=3:
| PROPERTY | CURRENT (d=2) | PREDICTED (d=3) |
|---|---|---|
| Active budgets | B_th + B_leak (2D) | B_th + B_cx + B_leak (3D) |
| Cycle rank | 12 (internal only, no signal) | 13N - 1 (internal + inter-tile) |
| Sensing range | Local (tile diameter = 3 hops) | N tiles x 6 tokens = full sequence |
| Cycle detection | None (B_cx = 0 by construction) | Periodicity, nesting, coordination |
| Cost scaling | O(1) per tile (single-tile) | O(N) per layer (linear in sequence length) |
Predicted Emergent Capabilities
What CT Predicts for a d=3 Hodge Architecture
CT derives quantum mechanics from the fast-sector limit of the selection inequality. Applied to neural architectures, with B_cx alive, CT predicts three emergent capabilities:
When B_cx > 0, the cycle-space channel carries circulating flow that represents multiple hypotheses simultaneously. Each cycle is a “hypothesis loop” maintained until evidence collapses it. With B_cx = 0, C-Former operates as a classical (gradient-only) system that can only represent one hypothesis at a time per layer.
Two cycle flows on the same tile can constructively or destructively interfere via Hodge orthogonality. Compatible hypotheses reinforce; incompatible ones cancel. Standard attention does this implicitly through learned weights; C-Former would do it explicitly through the projector's mathematical structure.
The boundary exchange (the B_leak channel) is the decoherence mechanism — it exposes internal cycle flows to the external environment, collapsing superpositions to definite states. CT predicts B_leak should be more active in later layers (commitment to classification) and less active in early layers (hypothesis exploration).
None of these predictions can be tested until B_cx is alive. The dead cycle channel is not just a performance bug — it prevents the architecture from accessing its theoretical operating regime.
Coherence Bounce: What Has (and Hasn't) Bounced
The fixed Hodge projectors are a self-sufficient mathematical structure. They produce deterministic budget profiles (identical across seeds) that are physically meaningful (5/6 match on HAR). They maintain their coherence without training. In CT terms: the projectors are already a bounced scaffold (T6) — they have achieved internal coherence that is independent of the stochastic training process.
The frozen transfer result (96.6% retention with 8.4% trainable parameters) shows learned representations are nearly self-sufficient. They transfer across task domains with minimal parameter updates. But the full bounce — where representations maintain coherence without ANY training signal — has not been demonstrated.
The attention weights, FFN weights, etc. still depend on training data. They do not maintain coherence without training signal. For a full organism-level bounce, the learned weights would need to converge to a configuration that is a stable attractor of the training dynamics.
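That attractor condition can be checked mechanically: a configuration has fully bounced only if one more training step maps it to itself and small perturbations decay back. A minimal sketch under toy assumptions (quadratic loss, plain gradient descent; not C-Former's actual dynamics):

```python
import numpy as np

# Hedged sketch: testing whether a weight configuration w_star is a stable
# attractor of the training dynamics. The loss and step map are toy
# assumptions; the check itself (fixed point + decaying perturbation) is generic.
A = np.diag([2.0, 0.5])             # toy loss Hessian; loss = 0.5 * w.T @ A @ w
lr = 0.1
step = lambda w: w - lr * (A @ w)   # one gradient-descent training step

w_star = np.zeros(2)                # candidate bounced configuration
assert np.allclose(step(w_star), w_star)   # fixed point of the training map

w = w_star + 0.1 * np.ones(2)       # perturb, then let training act
for _ in range(200):
    w = step(w)
# The perturbation has decayed back toward w_star: the fixed point is an
# attractor, which is the organism-level bounce condition described above.
print(np.linalg.norm(w - w_star))
```

The frozen-transfer result shows something weaker than this: near-invariance under small parameter updates, not invariance under the full training map.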
The One Experiment That Matters
Run CTFormerMultiTile with correct cycle injection on LRA ListOps
If ListOps performance jumps from ~17% to >50%: the d=2 → d=3 phase transition is confirmed. C-Former becomes a genuinely novel architecture for long-range tasks with a provably different operating regime from standard transformers.
If it does not improve: either the cycle injection is still wrong (symmetric instead of antisymmetric), or CT's d=3 optimality prediction does not apply to this computational domain. C-Former's role is then confirmed as an interpretability tool — still valuable, but narrower than the theory predicts.
The module exists (ct_former/multi_tile.py) but is not yet wired into training scripts. Estimated cost: ~$2 on Vast.ai V100.