Experimental investigation of whether trained neural networks implicitly learn weight matrices that decompose into hypercomplex-like algebraic structures (quaternion, Clifford algebra, Kronecker-factored).
Baseline GPT-2 small perplexity: 29.88 (WikiText-2 test, standard strided evaluation)
| Block Size | Kronecker PPL | SVD PPL | Kron Params | SVD Params |
|---|---|---|---|---|
| 2 | 36,853 | 6,147 | 42.5M | 42.5M |
| 4 | 29,019 | 7,118 | 21.2M | 21.3M |
| 8 | 5,509 | 5,156 | 10.6M | 10.6M |
| 16 | 4,487 | 6,166 | 5.5M | 5.5M |
| 32 | 4,750 | 9,037 | 4.2M | 4.3M |
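For reference, the Kronecker columns above come from the nearest Kronecker-product approximation that kronecker.py attributes to Van Loan & Pitsianis: rearrange the weight matrix so that every Kronecker term becomes a rank-1 term of the rearranged matrix, then truncate its SVD. The sketch below is a minimal NumPy illustration of that construction, not the repo's implementation; the function name and argument layout are mine.

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2, terms=1):
    """Best sum-of-`terms` Kronecker approximation of W (Van Loan & Pitsianis).

    W has shape (m1*m2, n1*n2); each term is A (m1 x n1) kron B (m2 x n2).
    Rearranging W so that Kronecker terms become rank-1 terms reduces the
    problem to a truncated SVD of the rearranged matrix.
    """
    assert W.shape == (m1 * m2, n1 * n2)
    # Block (i1, j1) of W, flattened, becomes row i1*n1 + j1 of the rearrangement.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    W_hat = np.zeros_like(W)
    factors = []
    for k in range(terms):
        A = (np.sqrt(s[k]) * U[:, k]).reshape(m1, n1)
        B = (np.sqrt(s[k]) * Vt[k]).reshape(m2, n2)
        factors.append((A, B))
        W_hat += np.kron(A, B)
    return W_hat, factors

# Example: a random 64x64 matrix approximated by a sum of 3 Kronecker products
# with 16x16 "inner" blocks (A is 4x4, B is 16x16 per term).
W = np.random.randn(64, 64)
W_hat, _ = nearest_kronecker(W, 4, 4, 16, 16, terms=3)
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Because the rearrangement preserves the Frobenius norm, Eckart-Young guarantees this is the best Frobenius-norm approximation by a sum of `terms` Kronecker products.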
All outcomes are Outcome C (both methods catastrophically degrade the model), but the relative ordering is informative:
- At BS=2 and BS=4: SVD wins decisively. Kronecker decomposition is far worse than SVD at matched parameters. This directly contradicts the "complex-number structure" interpretation from Experiment 1: if the model had learned block-2 Kronecker structure, BS=2 surgery should outperform SVD, not lose by 6x.
- At BS=16 and BS=32: Kronecker wins. Kronecker surgery degrades the model less than SVD at matched parameters (4,487 vs 6,166 at BS=16; 4,750 vs 9,037 at BS=32). This suggests genuine medium-block Kronecker structure: the weight matrices are better described by sums of 16x16 or 32x32 Kronecker products than by low-rank approximations.
- At BS=8: roughly tied (5,509 vs 5,156).
The crossover at BS=8 suggests a structural transition: below BS=8, the weights are better captured by rank (SVD wins); above BS=8, they're better captured by block structure (Kronecker wins).
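"Matched parameters" here means the number of Kronecker terms and the SVD rank are chosen so that both approximations store roughly the same number of values. The accounting below is one plausible way to do that matching; the exact rule and term counts in kronecker_surgery.py may differ, and the 22-term example is chosen only because it lands near the rank-37 figure quoted later.

```python
def kron_params(m, n, b, terms):
    # Each Kronecker term stores A ((m//b) x (n//b)) and B (b x b).
    return terms * ((m // b) * (n // b) + b * b)

def svd_params(m, n, rank):
    # A rank-r factorization stores U (m x r) and V (n x r).
    return rank * (m + n)

def matched_rank(m, n, b, terms):
    """SVD rank whose parameter count is closest to `terms` Kronecker terms."""
    return max(1, round(kron_params(m, n, b, terms) / (m + n)))

# Example for a 768x768 GPT-2 projection at block size 16:
print(kron_params(768, 768, 16, terms=22))    # 56,320 parameters
print(matched_rank(768, 768, 16, terms=22))   # rank 37 at a similar budget
```

For this shape each Kronecker term costs 2,560 values while each unit of SVD rank costs 1,536, so one term buys roughly 1.7 ranks.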
| Surgery target | PPL (BS=16) | Ratio vs baseline |
|---|---|---|
| Original model | 29.88 | 1.0x |
| Attention only | 1,973 | 66x |
| MLP only | 8,020 | 268x |
| Early layers (0-5) | 6,446 | 216x |
| Late layers (6-11) | 2,083 | 70x |
Attention layers are more tolerant of Kronecker replacement than MLP layers (66x vs 268x degradation). Late layers are more tolerant than early layers (70x vs 216x). This is the opposite of what we'd naively expect — it suggests late-layer attention has the most Kronecker-compatible structure.
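For concreteness, targeted surgery of this kind amounts to overwriting a chosen subset of weight matrices in place and then re-running the perplexity evaluation. The sketch below shows one way to do it with Hugging Face's GPT-2 (whose Conv1D layers store weights as (in_features, out_features)); it reuses the nearest_kronecker helper sketched earlier, and the function name, target patterns, and term count are illustrative rather than the actual kronecker_surgery.py interface.

```python
import torch
from transformers import GPT2LMHeadModel

def kronecker_surgery(model, targets=("attn.c_attn", "attn.c_proj"), block=16, terms=24):
    """Replace the weights whose names match `targets` with a block-`block`
    Kronecker approximation (illustrative sketch)."""
    for name, module in model.named_modules():
        if not any(t in name for t in targets):
            continue
        if not hasattr(module, "weight") or module.weight.ndim != 2:
            continue
        W = module.weight.detach().cpu().numpy()   # Conv1D: (in_features, out_features)
        m, n = W.shape
        if m % block or n % block:
            continue
        W_hat, _ = nearest_kronecker(W, m // block, n // block, block, block, terms=terms)
        with torch.no_grad():
            module.weight.copy_(torch.from_numpy(W_hat).to(module.weight.dtype))
    return model

model = GPT2LMHeadModel.from_pretrained("gpt2")
model = kronecker_surgery(model, targets=("attn.c_attn", "attn.c_proj"))
# ...then evaluate WikiText-2 perplexity with the usual strided loop.
```

Swapping the target patterns to ("mlp.c_fc", "mlp.c_proj"), or filtering on the layer index in the module name, gives the MLP-only and early/late-layer rows of the table above.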
In the Anthropic-style toy superposition model (20 bottleneck dims, 80 features):
- Block size 2 achieves Kronecker error ratios of 0.007-0.08 vs random baselines
- Ratio is significant at all sparsity levels (S=0.01 to S=1.0)
- Clifford algebra relations are not satisfied (violations ~1.1-1.3)
- 2-fold and 4-fold symmetry scores are significant (p < 5%) in roots-of-unity analysis
However, the GPT-2 surgery results show this toy-model signal doesn't transfer: BS=2 Kronecker surgery is much worse than SVD on the real model.
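The "error ratio" here is the relative Kronecker reconstruction error of the learned weights divided by the same quantity for random matrices of the same shape; values far below 1.0 indicate structure well beyond chance. Below is a self-contained sketch of one such metric; toy_model.py may normalize or choose its random baseline differently.

```python
import numpy as np

def kron_rel_err(M, block=2, terms=1):
    # Relative error of the best sum-of-`terms` Kronecker approximation at the
    # given block size (Van Loan & Pitsianis: the discarded singular values of
    # the rearranged matrix give the residual directly).
    m, n = M.shape
    R = M.reshape(m // block, block, n // block, block).transpose(0, 2, 1, 3)
    s = np.linalg.svd(R.reshape((m // block) * (n // block), block * block),
                      compute_uv=False)
    return np.sqrt(np.sum(s[terms:] ** 2)) / np.linalg.norm(M)

def kron_error_ratio(W, block=2, terms=1, trials=20, seed=0):
    # Ratio of W's Kronecker error to the mean error of random Gaussian matrices
    # of the same shape; values far below 1.0 indicate structure beyond chance.
    rng = np.random.default_rng(seed)
    base = np.mean([kron_rel_err(rng.standard_normal(W.shape), block, terms)
                    for _ in range(trials)])
    return kron_rel_err(W, block, terms) / base
```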
Probing the pretrained GPT-2 weights directly (Experiment 2):
- Attention weights show the best Kronecker ratios at BS=2 (early layers: 0.78-0.87) and BS=64 (the head dimension, later layers)
- MLP layers show minimal structure (ratios 0.97-0.99)
- Clifford violations are near the random baseline (~1.1 vs 1.15)
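A Clifford-relation check at BS=2 asks whether the 2x2 Kronecker factors behave like anticommuting generators (e_i e_j + e_j e_i = 0 for i != j, e_i^2 = +/- I). The violation numbers above come from the repo's own metric; the sketch below shows one plausible way such a score could be defined, purely as an illustration.

```python
import numpy as np

def clifford_violation(factors):
    """Illustrative Clifford-relation score for a set of 2x2 Kronecker factors.

    Clifford generators satisfy e_i e_j + e_j e_i = 0 for i != j and
    e_i^2 = +/- I. Returns the mean Frobenius norm of the deviations after
    scaling each factor to the norm of the identity. This is an assumed
    definition, not necessarily the one used in the experiments above.
    """
    n = factors[0].shape[0]
    I = np.eye(n)
    Bs = [B * np.sqrt(n) / np.linalg.norm(B) for B in factors]
    viols = []
    for i, Bi in enumerate(Bs):
        sq = Bi @ Bi
        sign = np.sign(np.trace(sq)) or 1.0
        viols.append(np.linalg.norm(sq - sign * I))
        for Bj in Bs[i + 1:]:
            viols.append(np.linalg.norm(Bi @ Bj + Bj @ Bi))
    return float(np.mean(viols))

# Sanity check: K = diag(1, -1) and L = antidiag(1, 1) anticommute and square
# to +I, so the violation is ~0.
print(clifford_violation([np.diag([1.0, -1.0]), np.array([[0.0, 1.0], [1.0, 0.0]])]))
```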
In the 2x2 A-matrix classification (decomposing each BS=2 factor as aI + bJ + cK + dL), energy is uniformly distributed across all four basis elements:
- Scaling (a): 25.5%
- Rotation (b): 25.8%
- Diagonal reflection (c): 25.0%
- Anti-diagonal reflection (d): 23.7%
Complex-number energy (a^2 + b^2) = 51.3% vs random baseline 50.1%. No rotation dominance — the BS=2 structure in the Kronecker decomposition is not specifically about complex multiplication.
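The classification above projects each 2x2 factor onto the orthogonal basis {I, J, K, L} named in the bullets, where aI + bJ acts as multiplication by the complex number a + bi, and reports the energy fraction carried by each coefficient. A minimal sketch of that projection follows; classify_2x2.py may differ in normalization details.

```python
import numpy as np

# Orthogonal basis for 2x2 matrices: identity (scaling), 90-degree rotation,
# diagonal reflection, anti-diagonal reflection. aI + bJ represents
# multiplication by the complex number a + b*i.
I = np.eye(2)
J = np.array([[0.0, -1.0], [1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, -1.0]])
L = np.array([[0.0, 1.0], [1.0, 0.0]])

def basis_energy(M):
    """Energy fraction of a 2x2 matrix in each of the I, J, K, L components
    (illustrative version of the aI + bJ + cK + dL classification)."""
    coeffs = np.array([np.sum(M * B) / 2.0 for B in (I, J, K, L)])  # orthogonal projection
    energy = coeffs ** 2
    return energy / energy.sum()

M = np.array([[0.6, -0.8], [0.8, 0.6]])   # a pure rotation
print(basis_energy(M))                     # ~[0.36, 0.64, 0, 0]: all energy in I and J
```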
GPT-2's orthogonal-factor rotation angles are more uniformly distributed than those of random orthogonal matrices (all k-fold z-scores > 2). Training smooths the angular distribution rather than creating roots-of-unity clusters. No per-layer variation was detected.
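The roots-of-unity analysis can be read as a circular statistic: extract rotation angles from the orthogonal factor of each (square) weight matrix and measure how strongly they concentrate at k-th roots of unity, compared with Haar-random orthogonal matrices. The sketch below is one plausible version under that reading; rotation_analysis.py may extract angles or define the z-score differently.

```python
import numpy as np
from scipy.linalg import polar

def haar_orthogonal(n, rng):
    # Haar-distributed random orthogonal matrix (QR with sign correction).
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def rotation_angles(Q):
    # Complex eigenvalues of an orthogonal matrix come in pairs exp(+/- i*theta);
    # keep one angle per conjugate pair (real eigenvalues +/-1 are dropped).
    eig = np.linalg.eigvals(Q)
    return np.angle(eig[eig.imag > 1e-9])

def kfold_score(angles, k):
    # Circular concentration at k-th roots of unity: ~0 for uniform angles,
    # ~1 when angles cluster at multiples of 2*pi/k.
    return np.abs(np.mean(np.exp(1j * k * angles)))

def kfold_zscore(W, k, trials=200, seed=0):
    """z-score of a square weight matrix's k-fold score against Haar-random
    orthogonal matrices of the same size (Q taken from the polar factor W = Q H)."""
    rng = np.random.default_rng(seed)
    Q, _ = polar(W)
    obs = kfold_score(rotation_angles(Q), k)
    null = np.array([kfold_score(rotation_angles(haar_orthogonal(W.shape[0], rng)), k)
                     for _ in range(trials)])
    return (obs - null.mean()) / null.std()

# Example: a random square matrix should give a z-score near 0.
rng = np.random.default_rng(0)
print(kfold_zscore(rng.standard_normal((64, 64)), k=4, trials=100))
```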
The paradox: at BS=16, Kronecker surgery gives worse reconstruction error than SVD by every per-layer metric, yet produces better perplexity (4,487 vs 6,166). We tested four hypotheses:
Hypothesis 1 (errors hidden in unused directions): Kronecker might concentrate its error in directions the network doesn't use. Tested by computing ||xW - xW_approx|| using real WikiText-2 activations.
Result: SVD wins on activation-weighted error even MORE decisively than on Frobenius error (0/48 layers favor Kronecker). The errors are not in unused subspaces.
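Concretely, the activation-weighted test captures the inputs x that actually reach a layer and compares ||xW - xW_hat|| for the two approximations, rather than the layer-agnostic Frobenius error. A sketch using forward hooks on Hugging Face GPT-2 follows; the prompt, layer choice, and helper names are illustrative, and activation_error.py evaluates over WikiText-2 batches rather than a single sentence.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Capture the inputs that actually reach one layer, then compare the error of
# the Kronecker and SVD approximations of that layer's weight on those inputs.
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

layer = model.transformer.h[0].attn.c_attn          # Conv1D: weight is (in, out)
captured = []
hook = layer.register_forward_hook(lambda mod, inp, out: captured.append(inp[0].detach()))

with torch.no_grad():
    ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
    model(ids)
hook.remove()

x = captured[0].reshape(-1, layer.weight.shape[0])   # (tokens, in_features)
W = layer.weight.detach()

def activation_error(W_hat):
    # Relative error measured in the directions the network actually uses.
    return (torch.norm(x @ (W - W_hat)) / torch.norm(x @ W)).item()

# W_hat_kron / W_hat_svd would come from the approximations described above, e.g.
# print(activation_error(W_hat_kron), activation_error(W_hat_svd))
```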
Hypothesis 2 (cross-layer error cancellation): Kronecker errors might cancel across layers while SVD errors compound. Tested via single-layer surgery (replace one layer at a time, evaluate PPL).
Result: Kronecker wins 11/12 single-layer attention surgeries. The advantage exists per-layer, not just globally. But the per-layer effects are massively non-additive (see hybrid surgery below).
Hypothesis 3 (rank preservation): SVD truncation creates rank-deficient matrices; Kronecker preserves full rank.
| Method | Median Condition Number (BS=16) |
|---|---|
| Original weights | 83.8 |
| Kronecker approx | 52.5 |
| SVD approx | 2.9 × 10^16 (numerically singular) |
SVD at BS=16 keeps only ~37 of 768 singular values — the approximated matrices are effectively rank-37. Kronecker preserves all 768 dimensions with effective rank ~768.
Nullspace noise test: Injecting random signal into SVD's nullspace does NOT close the PPL gap (best gap reduction: 4%). The hard spectral cutoff is a contributing factor but not the primary mechanism.
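The conditioning numbers above can be reproduced in spirit by looking at the singular-value spectrum of each approximated matrix: a rank-37 truncation of a 768-dimensional weight has 731 numerically zero singular values, while a sum-of-Kronecker approximation generally keeps the full spectrum nonzero. A small sketch follows; tolerances and definitions here follow NumPy's defaults and may not match spectral_mechanism.py exactly.

```python
import numpy as np

def spectrum_stats(M):
    """Condition number and numerical rank of an approximated weight matrix.
    Tolerance follows NumPy's matrix_rank default; spectral_mechanism.py may
    use different definitions."""
    s = np.linalg.svd(np.asarray(M, dtype=np.float64), compute_uv=False)
    tol = s.max() * max(M.shape) * np.finfo(np.float64).eps
    cond = s.max() / max(s.min(), np.finfo(np.float64).tiny)  # avoid division by zero
    return {"condition_number": float(cond), "numerical_rank": int((s > tol).sum())}

# A rank-37 SVD truncation of a 768x768 matrix reports numerical_rank 37 and an
# effectively infinite condition number; a sum-of-Kronecker approximation of the
# same matrix typically keeps all 768 singular values nonzero.
```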
Hypothesis 4 (error-type consistency), tested via hybrid surgery (the decisive test):
| Configuration | PPL | Description |
|---|---|---|
| All Kronecker | 4,487 | Best — consistent block-structure errors |
| Reverse hybrid | 5,125 | SVD on attention + Kronecker on MLP |
| All SVD | 6,166 | Consistent rank-truncation errors |
| Hybrid | 11,725 | Worst — Kron on attention + SVD on MLP |
The prediction was that hybrid (Kronecker on attention, SVD on MLP) should beat both uniform methods, since per-layer data shows Kronecker winning on attention and SVD winning on MLP. Instead, it's the worst of all four configurations — nearly 3x worse than all-Kronecker.
Per-layer gap analysis (PPL_SVD − PPL_Kron for single-layer surgery):
| Layer type | Gap pattern | Interpretation |
|---|---|---|
| Attention c_attn | +0.3 to +14.0 | Kronecker slightly better |
| Attention c_proj | −0.9 to +1.9 | Mixed, negligible |
| MLP c_fc | −141,886 to −55 | SVD vastly better |
| MLP c_proj | −693 to −0.5 | SVD moderately better |
If per-layer effects were additive, the MLP penalty sum (−146K) would overwhelm the attention benefit (+40), predicting all-SVD should win. Yet all-Kronecker (4,487) beats all-SVD (6,166). The effects are non-additive by over five orders of magnitude.
The real mechanism is cross-layer error coherence:
- Rank preservation: Kronecker matrices maintain full rank (condition ~52) while SVD creates rank-37 approximations (condition ~10^16). Full-rank errors can be partially absorbed by downstream layers; rank-deficient projections permanently destroy information in 731/768 dimensions.
- Error type mixing is catastrophic: Within each transformer block, data flows through attention → residual → MLP → residual. When attention uses Kronecker (block-structured errors) and MLP uses SVD (rank-truncation errors), the error types are incompatible; neither downstream component can compensate for the other's error structure. This creates the worst outcome (11,725 PPL).
- Consistent Kronecker errors are self-compensating: When all layers use Kronecker, the block-structured errors become the "language" of the model. Downstream Kronecker-approximated layers process Kronecker-approximated inputs, and the consistent error structure allows partial compensation. Each layer's errors are somewhat aligned with what the next layer expects.
- Single-layer catastrophes disappear in full surgery: Kronecker on h.0.mlp.c_fc alone causes ~141K PPL damage, but this disappears when all layers are changed simultaneously. The catastrophe comes from the mismatch between one approximated layer and 47 exact layers, not from the Kronecker error itself.
The hypercomplex structure hypothesis receives mixed support:
Evidence for:
- At BS=16 and BS=32, Kronecker surgery preserves model quality better than SVD at matched parameter count. This means the weight matrices have genuine block structure that SVD cannot capture.
- Toy superposition models show strong BS=2 Kronecker decomposability vs random matrices.
- Attention weights consistently show more structure than MLP weights.
Evidence against:
- At BS=2 and BS=4 (the quaternion/complex-number regime), SVD dominates Kronecker. The model did not learn complex-number multiplication.
- Clifford algebra relations are not satisfied at any block size.
- 2x2 A-matrix classification shows no rotation dominance — structure is generic block decomposability, not hypercomplex algebra.
- Rotation angles are uniformly distributed, not clustered at roots of unity.
- All surgery outcomes are Outcome C (catastrophic degradation), meaning neither method preserves the model at these compression ratios.
What the model did learn: Medium-block (16x16, 32x32) Kronecker structure that outperforms rank-based approximation. This is a weaker but real structural claim: GPT-2's weights organize into block patterns that are more Kronecker-like than low-rank. The structure is concentrated in attention rather than MLP, and in late layers rather than early layers.
What the model did not learn: Quaternion multiplication, Clifford algebra, complex rotations, or any specific hypercomplex algebraic structure.
The deeper finding: Kronecker's PPL advantage is not about per-layer reconstruction quality — SVD is better on every per-layer metric. The advantage comes from error type consistency: Kronecker approximation preserves full rank and creates block-structured errors that downstream Kronecker-approximated layers can partially absorb. Mixing approximation methods within the attention→MLP pipeline is catastrophic, worse than either method applied uniformly. This implies that weight matrix structure in transformers is best understood not layer-by-layer but as a system-level property where cross-layer error coherence dominates.
src/
kronecker.py Kronecker decomposition (Van Loan & Pitsianis)
toy_model.py Experiment 1: toy superposition model
analyze_pretrained.py Experiment 2: GPT-2 weight probing
kronecker_surgery.py Experiment 3: weight replacement + perplexity eval
rotation_analysis.py Experiment 4: rotation plane analysis
classify_2x2.py 2x2 A-matrix aI+bJ+cK+dL classification
activation_error.py Mechanism: activation-weighted error + single-layer surgery
spectral_mechanism.py Mechanism: conditioning, nullspace noise, per-layer correlation
results/exp{1,2,3,4}/ Numerical results (.npz) and summaries (.txt)
figures/ All plots
Requirements: torch (ROCm/CUDA), numpy, matplotlib, scipy, einops, transformers, datasets
python src/toy_model.py # Exp 1 (CPU, ~2 min)
python src/rotation_analysis.py # Exp 4 (CPU, ~5 min)
python src/analyze_pretrained.py # Exp 2 (CPU, ~15 min)
python src/classify_2x2.py # 2x2 analysis (CPU, ~2 min)
python src/kronecker_surgery.py # Exp 3 (GPU recommended, ~30 min)
python src/activation_error.py # Mechanism analysis (GPU recommended, ~20 min)
python src/spectral_mechanism.py # Spectral analysis (GPU recommended, ~2 hr)