SEED · Planted 2026-04-14

CT-Optimal Hardware

ASIC Design for Hodge-Decomposition Neural Networks

Hardware from Theory · 8 Research Questions · 14x Headroom
Element I · Hodge · TD6 · A4 · A7 · A8 · B4 · B6 · B7-R · SEP · Polycrystalline · Budget Regimes
LENS SPECIFICATION

Level 3 (includes competing scaffolds). This seed analyzes the mismatch between C-Former's mathematical structure and current GPU architecture. The binder is: the computational scaffold should match the mathematical scaffold (Element I). Competing patterns: GPU transformers, Cerebras wafer-scale, Graphcore IPU mesh, Google TPU systolic arrays.

Irreducible leakage (A9): this analysis derives what the hardware SHOULD be. Whether fabrication cost, yield, and ecosystem support make it viable is a separate question requiring empirical input from semiconductor engineering.

THE INSIGHT

The TD6 tile works and beats standard transformers. But GPUs are not the ideal scaffold for this architecture — C-Former training is 14x slower despite having 40% fewer parameters. The reason: GPUs are optimized for dense matrix multiply (standard transformers). C-Former's operations are sparse, structured, and graph-local. The mismatch IS the 14x slowdown.

The Hodge decomposition is fixed linear algebra on a 13-node, 24-edge graph. The three projectors (gradient, cycle, boundary) are constant matrices derived from the tile topology. They never change during training. Multi-tile chains have regular, predictable communication patterns: each tile only talks to its neighbors. This is the opposite of what GPUs are designed for.
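As a concrete sketch (using a hypothetical edge list, since the actual TD6 topology is not given here), the gradient projector can be built directly from the graph's incidence matrix; its complement contains the cycle and boundary parts:

```python
import numpy as np

# Hypothetical 13-node, 24-edge tile topology (the real TD6 edge list is
# not specified in this text): a 13-cycle plus 11 chords.
ring = [(i, (i + 1) % 13) for i in range(13)]
chords = [(0, 2), (1, 4), (2, 6), (3, 8), (4, 10), (5, 12),
          (6, 9), (7, 11), (8, 12), (0, 5), (1, 7)]
edges = ring + chords                       # 24 edges total

# Node-edge incidence matrix B (13 x 24): B[u, e] = -1, B[v, e] = +1.
B = np.zeros((13, 24))
for e, (u, v) in enumerate(edges):
    B[u, e], B[v, e] = -1.0, 1.0

# Projector onto the gradient subspace im(B^T): P = M @ pinv(M) with M = B^T.
P_grad = B.T @ np.linalg.pinv(B.T)
P_rest = np.eye(24) - P_grad   # complement holding the cycle/boundary parts

# The projectors are constant: idempotent and fixed by topology alone.
assert np.allclose(P_grad @ P_grad, P_grad)
assert np.allclose(P_grad @ P_rest, 0.0)
print(round(np.trace(P_grad)))  # rank = V - 1 = 12 for a connected graph
```

Since the projectors depend only on the edge list, they can be computed once at design time and treated as constants thereafter.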

From Element I (scaffold stability): the computational scaffold should match the mathematical scaffold. When there is a mismatch, you pay B_th for the translation overhead at every operation. The 14x slowdown is the cost of running graph-local sparse operations on hardware designed for dense global operations.

GPU vs CT-Optimal: The Scaffold Mismatch

| Dimension | GPU (current) | CT-optimal ASIC | CT source |
|---|---|---|---|
| Compute unit | CUDA core: scalar multiply-accumulate | TD6 tile unit: 13 nodes, 24 edges, 3 hardwired projectors in registers | Element I, Hodge |
| Memory hierarchy | Global VRAM + shared memory + registers (designed for large tensors) | Tile-local state in registers + boundary exchange buffer with neighbors | A4, B4 |
| Communication | All-to-all via NVLink/PCIe (token-parallel) | Mesh topology: each tile talks only to left/right neighbors (tile-parallel) | A4, Polycrystalline |
| Data flow | Token-parallel: same operation on all tokens simultaneously | Tile-parallel: each tile processes its own Hodge decomposition independently | B4, Element I |
| Projectors | Computed dynamically via attention weights | Hardwired constant matrices derived from tile topology (never change) | Hodge, TD6 |
| Sparsity | Dense matrix multiply (wasted FLOPs on zero entries) | Structured sparsity: only 24 edges in the graph are non-zero | A4, B_th |
| Batch strategy | Large batches to amortize kernel launch overhead | Small batches sufficient: no kernel overhead when operations are hardwired | B_th, B_cx |
| Power profile | High power (300–700 W) for massive parallel FP ops | Low power: sparse structured ops, most silicon idle at any moment | A7, B_th |

The 14x training slowdown despite 40% fewer parameters is the direct consequence of this scaffold mismatch. Every C-Former operation pays a B_th tax for translating between graph-local sparse structure and dense GPU tensor cores.

Research Questions

Q1: What is the ideal compute unit for a TD6 tile?

13 nodes, 24 edges, 3 fixed projectors. The entire tile state fits in registers. The Hodge projectors are constant 24x24 matrices. A single tile's forward pass is: (1) project edge flow onto three subspaces, (2) apply learned mixing weights, (3) reconstruct. This is three matrix-vector multiplies with constant matrices plus a learned linear combination.
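The three-step forward pass can be sketched as follows; the rank-12/8/4 placeholder projectors and the scalar mixing weights are illustrative assumptions, standing in for the real topology-derived TD6 projectors and the learned mixing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the three constant 24x24 Hodge projectors (in the real tile
# these are fixed by the TD6 topology; the ranks here are illustrative).
Q, _ = np.linalg.qr(rng.normal(size=(24, 24)))
P_grad  = Q[:, :12]   @ Q[:, :12].T    # rank-12 projector
P_cycle = Q[:, 12:20] @ Q[:, 12:20].T  # rank-8 projector
P_bound = Q[:, 20:]   @ Q[:, 20:].T    # rank-4 projector

def tile_forward(flow, alpha):
    """One TD6 tile forward pass.

    flow  : (24,) edge-flow vector.
    alpha : (3,) learned mixing weights -- the only trainable state here.
    """
    parts = [P @ flow for P in (P_grad, P_cycle, P_bound)]  # (1) project
    mixed = [a * p for a, p in zip(alpha, parts)]           # (2) mix
    return sum(mixed)                                       # (3) reconstruct

flow = rng.normal(size=24)
out = tile_forward(flow, np.array([1.0, 1.0, 1.0]))
# With unit weights the three projections recombine to the input exactly,
# because the three placeholder projectors sum to the identity.
assert np.allclose(out, flow)
```

Note the trainable surface is tiny: the matrices are constants, so only the mixing weights carry gradients.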

CT derivation: Element I — the scaffold is the TD6 tile. The compute unit should map 1:1 to the mathematical object. One physical tile = one TD6 graph.

Q2: What memory hierarchy matches multi-tile chains?

Multi-tile chains have a specific communication pattern: each tile maintains local state and exchanges boundary information with its immediate neighbors. This is a 1D mesh topology — the exact opposite of all-to-all attention. The memory hierarchy should be: tile-local registers (fast, small) + neighbor exchange buffer (medium) + chain-level aggregation (slow, rare).
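A back-of-envelope check (assuming fp32 and full-width boundary buffers, both assumptions, since the actual boundary interface is not specified) that the hot state really is register-sized:

```python
# Back-of-envelope tile-local memory footprint (fp32 assumed).
FP32 = 4  # bytes per value

edge_flow  = 24 * FP32       # current edge-flow vector
components = 3 * 24 * FP32   # three projected components
boundary   = 2 * 24 * FP32   # assumed full-width left/right exchange buffers

local = edge_flow + components
print(local, "bytes tile-local,", boundary, "bytes exchange")
# A few hundred bytes of hot state per tile -- far below even a single GPU
# SM's register file, and trivially a register bank on custom silicon.
```

The three-tier hierarchy in the text then maps onto: registers for `local`, a small SRAM buffer for `boundary`, and a slow, rarely used path for chain-level aggregation.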

CT derivation: A4 (pokes are local) + B4 (independent components' budgets add). Tiles are locally coupled. No tile needs global state.

Q3: What does the data flow look like?

GPU transformers are token-parallel: the same attention operation runs on all tokens simultaneously. C-Former should be tile-parallel: each tile in the chain runs its own Hodge decomposition independently, then exchanges boundary information. The parallelism axis is tiles, not tokens.
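The tile-parallel pattern can be sketched as a two-phase step; the update function and the 4-wide boundary slice are placeholders for the real per-tile Hodge decomposition and boundary interface:

```python
import numpy as np

rng = np.random.default_rng(1)
N_TILES, WIDTH, BW = 8, 24, 4   # BW: assumed boundary-slice width

state = rng.normal(size=(N_TILES, WIDTH))  # one row of local state per tile

def chain_step(state):
    # Phase 1 -- tile-parallel: every tile updates independently
    # (stand-in for the per-tile Hodge decomposition).
    new = np.tanh(state)
    # Phase 2 -- exchange: each tile reads only its immediate neighbors'
    # boundary slices. A 1D mesh; no all-to-all communication.
    out = new.copy()
    out[1:, :BW]   += 0.5 * new[:-1, -BW:]  # left neighbor -> right tile
    out[:-1, -BW:] += 0.5 * new[1:, :BW]    # right neighbor -> left tile
    return out

state = chain_step(state)
assert state.shape == (N_TILES, WIDTH)
```

The key property is that phase 1 has no cross-tile data dependence at all, so the parallelism axis is the number of tiles, not the token count.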

CT derivation: B4 (local additivity) — independent tiles' budgets add. Tile-parallel execution exploits this directly.

Q4: Can the Hodge projectors be hardwired?

Yes. The three projectors are derived from the tile topology and never change. They are constant matrices: the gradient, cycle, and boundary projectors. On a GPU, these are stored in memory and loaded every forward pass. On custom silicon, they can be baked into the circuit — zero memory access, zero latency. This alone eliminates the dominant B_th cost of the current implementation.
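A roofline-style estimate (assuming fp32 and that the constant projectors are fetched from VRAM on each pass rather than cached, both assumptions) makes the memory-vs-compute tradeoff concrete:

```python
# Roofline-style arithmetic-intensity estimate for one tile forward pass.
FP32 = 4
n = 24  # edge-space dimension

flops = 3 * (2 * n * n)         # three 24x24 matrix-vector multiplies
proj_bytes = 3 * n * n * FP32   # three constant projectors fetched per pass
vec_bytes = 2 * n * FP32        # input and output edge-flow vectors

intensity = flops / (proj_bytes + vec_bytes)
print(round(intensity, 2))  # ~0.49 FLOP/byte: deep in the memory-bound regime
```

GPU roofline ridge points sit at tens of FLOPs per byte, so under these assumptions the projector loads dominate; hardwiring them removes `proj_bytes` entirely.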

CT derivation: Hodge theorem on TD6 — projectors are topological invariants of the graph. They are as constant as the speed of light.

Q5: What throughput improvement over GPU is theoretically achievable?

The 14x training slowdown on GPU suggests at least 14x headroom from scaffold alignment alone. But the improvement could be much larger: hardwired projectors eliminate the memory-bandwidth bottleneck, tile-local computation eliminates communication overhead, and structured sparsity eliminates wasted FLOPs on zero entries. Conservative estimate: 14–50x. Optimistic: 100x+.

CT derivation: SEP — when the scaffold matches the mathematical structure, B_th is minimized. The gap between current B_th and optimal B_th is the improvement headroom.

Q6: What existing ASIC architectures are closest?

Three candidates share structural features with CT-optimal hardware:

Cerebras CS-2: Wafer-scale integration with 850K cores in a mesh. Matches: massive parallelism, mesh topology. Mismatches: cores are generic, not tile-specialized; designed for standard transformers.
Graphcore IPU: 1,472 independent tiles with local SRAM and exchange-based communication. Matches: tile-based architecture, local memory, neighbor exchange. Closest existing match to CT-optimal topology.
Google TPU: Systolic array for matrix multiply. Matches: regular communication pattern. Mismatches: designed for dense operations, not sparse graph-local ops.
CT derivation: Polycrystalline theory — each existing architecture is a neighboring grain. Surface tension (adoption cost) is lowest with the architecture whose scaffold orientation most closely matches TD6.

Q7: Is there an FPGA prototype path?

A single TD6 tile (13 nodes, 24 edges, three 24x24 constant projectors) fits trivially on any modern FPGA. A multi-tile chain of 8–16 tiles fits on a mid-range FPGA board ($500–$2000). This would provide the first empirical measurement of CT-optimal hardware performance without fabrication cost.
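A back-of-envelope sizing (assuming 16-bit fixed point and a fully unrolled tile, both assumptions) suggests why the fit is trivial:

```python
# Back-of-envelope FPGA sizing for one TD6 tile (16-bit fixed point assumed).
BITS = 16
n = 24

rom_bits = 3 * n * n * BITS   # three hardwired projector matrices as ROM
macs = 3 * n * n              # one MAC per projector entry, fully unrolled

print(rom_bits, "ROM bits,", macs, "MACs per tile")
# ~27.6 Kb of ROM and 1728 MACs per tile; a 16-tile chain needs ~27.6k MACs,
# within reach of mid-range FPGA DSP counts once time-multiplexed.
```

Time-multiplexing the MACs over a few cycles per forward pass trades latency for area, which is the usual knob for fitting larger chains on smaller boards.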

CT derivation: A9 (irreducible openness) — theoretical analysis alone cannot predict all performance characteristics. FPGA prototyping provides the empirical poke the analysis needs.

Q8: Implications for edge/mobile deployment

C-Former already achieves competitive results with 40% fewer parameters (2.2M vs 3.8M). On CT-optimal silicon, the combination of fewer parameters + hardwired projectors + tile-local computation could make C-Former viable on edge devices where standard transformers cannot run. The structured sparsity means power consumption scales with actual computation, not silicon area.

CT derivation: A7 (budgets are finite) — edge devices have strict B_th limits. The architecture that minimizes B_th per unit CL wins on constrained hardware.

Falsifiable Predictions

Hardware Prototyping
FPGA TD6 Tile: 10x+ Throughput Per Watt
An FPGA implementation of a single TD6 tile will achieve at least 10x throughput per watt compared to the same tile operations on a GPU. The constant projectors become hardwired logic, eliminating memory bandwidth as the bottleneck.
FALSIFIES IF
FPGA TD6 tile shows less than 3x improvement per watt over equivalent GPU operations
Prior at risk: Element I (scaffold alignment) + B_th minimization
Existing Hardware
Graphcore IPU: 5x+ on C-Former Without Optimization
Graphcore IPU, with its tile-based architecture and local SRAM + exchange-based communication, will outperform GPU on C-Former training by at least 5x without any architecture-specific optimization. The IPU's topology already partially matches the TD6 multi-tile chain structure.
FALSIFIES IF
Graphcore IPU shows less than 2x improvement on C-Former training vs GPU
Prior at risk: Polycrystalline theory (IPU scaffold orientation is closest to TD6)
Performance Analysis
GPU Bottleneck Is Memory Bandwidth, Not Compute
The dominant bottleneck in GPU C-Former training is memory bandwidth for loading Hodge projectors, not compute. The three 24x24 projector matrices are constant but must be fetched from VRAM on every forward pass. On custom silicon, these would be hardwired at zero access cost.
FALSIFIES IF
GPU profiling shows compute-bound, not memory-bound execution for C-Former
Prior at risk: Hodge projectors are constant matrices loaded from memory every forward pass
Custom Silicon
CT-Optimal ASIC: 50x+ Throughput, 10x Lower Power
A purpose-built ASIC with hardwired Hodge projectors, tile-local registers, and mesh interconnect will achieve 50x+ throughput improvement over GPU at 10x lower power for C-Former inference. The improvement comes from eliminating three layers of overhead: memory loads, dense-to-sparse translation, and global communication.
FALSIFIES IF
CT-optimal ASIC shows less than 10x throughput or less than 3x power improvement over GPU
Prior at risk: SEP (scaffold alignment eliminates B_th translation overhead)
Scalability
Linear Scaling to 64+ Tiles on Mesh Hardware
Multi-tile chains on mesh-connected hardware will scale linearly with tile count up to at least 64 tiles. Each tile only exchanges boundary data with its immediate neighbors — no all-to-all communication, no global synchronization barrier. The communication-to-compute ratio stays constant as tiles are added.
FALSIFIES IF
Tile-count scaling becomes sub-linear before 16 tiles on mesh-connected hardware
Prior at risk: A4 (pokes are local) + B4 (independent components' budgets add)
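The constant communication-to-compute claim can be illustrated with a toy cost model (the per-tile compute and exchange costs are arbitrary placeholders):

```python
# Toy per-step cost model: tiles run in parallel, so wall time per step is
# the per-tile cost, not the sum over tiles.
def mesh_step(n):      return 1.0 + 0.1 * 2        # two neighbors: constant
def alltoall_step(n):  return 1.0 + 0.1 * (n - 1)  # global: grows with n

sizes = (8, 16, 32, 64)
mesh_tp = [n / mesh_step(n) for n in sizes]      # throughput: tiles per unit time
full_tp = [n / alltoall_step(n) for n in sizes]

# Mesh throughput is linear in tile count; all-to-all saturates as
# exchange cost comes to dominate the step time.
assert mesh_tp[-1] == 8 * mesh_tp[0]   # 64 tiles = 8x the throughput of 8
```

Under this model the mesh's communication-to-compute ratio is a constant (0.2 here) independent of chain length, which is exactly the linear-scaling condition the prediction asserts.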