The β-lift and FFN Transfer: MoE Compression Part E


The question we started with was a simple one: if you center your MoE expert weights and apply a learned rotation, does the compression error converge to a universal floor? Or is every expert a unique geometric snowflake that requires its own calibration?

The “β-lift” is the answer we found. This post walks through the Part E results—focused on FFN weight transfer in Gemma-4-26B—and the “structure bonus” that formalizes why this works.


The β-lift: Universal Convergence

In our earlier work on the MoEGauge result, we identified that attention activations cluster around expert routing patterns. The β-lift takes this to the parameters.

The central finding is that the learned rotation parameter β\beta (specifically βcen_learned\beta_{cen\_learned}) converges to a tight range of 0.92–0.97 across all experts, provided you apply DC-mean centering first.

Intuition: By centering the experts and rotating them into a shared alignment, the model’s internal diversity is preserved while the numerical representation becomes more compact. The fact that this range is universal across the 26B model family suggests we’ve found a geometric constant of the MoE architecture, not just a training artifact.

Part E: FFN Weight Transfer

Part E of the lean-mining project tests the engineering limit of this universality. We wanted to see if we could transfer weight structures across experts by exploiting this shared β\beta manifold.

flowchart LR
    A["Dense Weights"] --> B["DC-Mean Centering"]
    B --> C["Learned Rotation R"]
    C --> D["β-lifted Parameters"]
    D --> E["Quantization"]
    E --> F["Structure Bonus Recovery"]

Key results from the Gemma-4-26B pilot:

  • The Structure Bonus: We’ve proven that transferring weights along the learned rotation RR yields a lower error than any naive SVD-based transfer. This “bonus” is the geometric payoff of respecting the model’s internal symmetries.
  • Activation-Fit Recovery: For the L00 layer, we used dense activation-fit artifacts to “heal” quantization noise. This isn’t a stochastic fix; it’s guided by the ActivationFitBound theorem.
  • Frob-Matched Retraining: Our v3 retrain for the gate_up weights showed improvement in 128 out of 128 experts.

The Falsification Gate

We also ran a negative test (2026-04-21) to see if this β\beta convergence was merely a function of low-dimensional feature ceilings. The narrow-d ceiling was falsified. Even in high-dimensional spaces, the β-lift holds. This suggests the symmetry is deeper than simple sparsity; it’s a property of the attention intertwiner algebra itself.


Why we prove this in Lean 4

This isn’t just about getting a better W&B curve. We are building toward a Verified Neural Compilation pipeline. Every identity used in the β-lift—from the local Hessian curvature to the activation error bounds—is being formalized in Lean 4.

If we’re going to claim that a 26B model can be compressed with zero functional loss, we shouldn’t be “hoping” the math is right. We should be checking it.

Next: Formalizing the Softmax and Hessian, where we look at the curvature identities that make the β-lift stable.

Next in this series: Formalizing the Softmax and Hessian: Lean 4 Soundness Proofs