Classifying the RoPE Commutant: Symmetry in Verified Neural Compilation


The question we started with was a bit of an architectural dare: can you rotate the attention heads of a Llama-style model into a more compressible format without changing a single logit?

The answer depends on the RoPE commutant. This post walks through the algebraic characterization of which symmetries survive RoPE, and which ones are “caged” by the rotation schedule.


The 2D Commutant “Cage”

RoPE (Rotary Positional Embedding) works by rotating 2D pairs of the attention head’s embedding space. If you want to apply a shared orthogonal transformation RR to the model’s weights at zero inference cost, RR must commute with the RoPE operator.

In LeanMining/VerifiedNeuralCompilation/Symmetry.lean, we’ve recently landed a complete characterization for the 2D case.

flowchart TB
    subgraph Cage ["The RoPE Commutant Cage"]
        direction TB
        subgraph Row1 ["Frequency ω₁"]
            A1["[ aI + bJ ]"]
            A2["[ 0 ]"]
            A3["[ 0 ]"]
        end
        subgraph Row2 ["Frequency ω₂"]
            B1["[ 0 ]"]
            B2["[ cI + dJ ]"]
            B3["[ 0 ]"]
        end
        subgraph Row3 ["Frequency ω₃"]
            C1["[ 0 ]"]
            C2["[ 0 ]"]
            C3["[ eI + fJ ]"]
        end
    end

Theorem: commutesWithQuarterTurn_iff A matrix RR commutes with a 9090^\circ RoPE block iff it is of the form aI+bJaI + bJ (a complex scalar in the 2D plane). If RR is orthogonal, then a2+b2=1a^2 + b^2 = 1.

Intuition: The only orthogonal transformations that “survive” a single 2D RoPE block are rotations within that same 2D plane. Everything else is broken by the positional encoding.

Lifting to Llama-like Schedules

Things get more restrictive when you lift this to a full head-dimension with multiple frequencies (like Llama’s RoPE schedule).

We’ve proven two key transport lemmas:

  • intertwiner_zero_of_cos_ne: This is a Schur-style vanishing lemma. It proves that if two RoPE blocks have different rotation frequencies, they admit zero nonzero intertwiners.
  • The Result: Any shared symmetry RR across the whole head must be block-diagonal, with each block corresponding to a frequency class.

Verdict: You cannot rotate “across” frequencies. The RoPE schedule acts as a cage that forces your architectural symmetries into isolated 2x2 frequency-sectors.

Why this matters for “Zero-Cost” Compilation

This isn’t just an abstract algebra exercise. The theorem folded_qkvo_attention_block_invariant_of_commutes proves that if your rotation RR lives in this commutant cage, it can be folded into the weights at compile-time with zero FLOPs at inference.

By formalizing these intertwiner equations in Lean 4, we’ve moved beyond “hoping” that our quantization rotations are safe. We now have a machine-checked boundary: we know exactly where architectural symmetries end and where approximation—and its associated error—must begin.

Next: Multi-Head Foldability, where we lift these single-head proofs to the entire Transformer block.

Next in this series: Multi-Head Foldability: Scaling Verified Compilation