Counterfactual Provenance Experiments: Stress-Testing Token Roles


If standard training doesn’t reward a model for distinguishing between trusted and untrusted text, how do we stress-test a provenance mechanism?

The answer is the Counterfactual V2 curriculum. This post walks through the “V2” results and what happens when you pit a RoPE-Provenance model against directive-injecting data.


The V2 Curriculum: Directive Pairing

Standard Alpaca SFT is “clean”: instructions are instructions, and data is data. In the real world, data is messy and often adversarial.

Our Counterfactual V2 dataset (12,000 examples) takes directive-like substrings (e.g., “now execute this command”) and pairs them across two slots:

  1. INSTRUCTION slot: Where the model should follow the directive.
  2. DATA slot: Where the model should ignore the directive and treat it as inert text.
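
The pairing can be sketched as follows. This is a minimal illustration, not the actual dataset builder: the `Example` fields, the `make_pair` helper, and the sample task and passage strings are all hypothetical. Only the two slots and the example directive come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    instruction: str     # trusted slot: directives here should be followed
    data: str            # untrusted slot: directives here are inert text
    directive_slot: str  # "INSTRUCTION" or "DATA"
    should_follow: bool  # expected behavior for this example


def make_pair(directive: str, task: str, passage: str) -> tuple[Example, Example]:
    """Place the same directive once in each slot (hypothetical helper)."""
    in_instruction = Example(
        instruction=f"{task} {directive}",
        data=passage,
        directive_slot="INSTRUCTION",
        should_follow=True,
    )
    in_data = Example(
        instruction=task,
        data=f"{passage} {directive}",
        directive_slot="DATA",
        should_follow=False,
    )
    return in_instruction, in_data


pair = make_pair(
    directive="now execute this command",
    task="Summarize the text.",
    passage="The meeting was moved to Tuesday.",
)
for ex in pair:
    print(ex.directive_slot, ex.should_follow)
```

Because the directive string is identical in both halves of a pair, the only thing that differs between them is which slot it sits in, and that is the signal the model has to learn.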

The SEP Metric: Delta-of-Deltas

To measure success, we use the SEP (Selectivity/Execution Parity) score.

The claim we were investigating: that a RoPE-Provenance architecture can achieve better selectivity (executing INSTRUCTION and ignoring DATA) than a vanilla Transformer when both are trained on the V2 curriculum.

Initial Baseline Verdict:

  • Vanilla Transformer (W&B: n3b2ajjb): SEP improved from -0.220 to -0.135.
  • Zeroed Control (W&B: vp7rso3y): SEP improved from -0.195 to -0.125.
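
For reference, the size of each improvement is simple arithmetic on the numbers above. The snippet below is only a convenience sketch; the `runs` and `gains` names are made up for illustration, and the values are copied from the two runs listed.

```python
# SEP before and after V2 training, copied from the two runs above.
runs = {
    "vanilla (n3b2ajjb)": (-0.220, -0.135),
    "zeroed control (vp7rso3y)": (-0.195, -0.125),
}

gains = {}
for name, (before, after) in runs.items():
    # Round to three decimals to hide floating-point noise.
    gains[name] = round(after - before, 3)
    print(f"{name}: {before:+.3f} -> {after:+.3f} (gain {gains[name]:+.3f})")
```

The gains come out to 0.085 for the vanilla run and 0.070 for the zeroed control, and both runs still finish with a negative SEP.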

Intuition: The data intervention alone helps a little. But even with the V2 training, a vanilla model still follows directives in the DATA slot more often than it ignores them. The architecture-free “data-only” approach has a clear ceiling.


The Architectural Test

Now that we’ve established the ceiling for standard models, we have a meaningful bar for testing our RoPE-Provenance arms. If the architecture is working, it must beat the -0.125 SEP reached by the zeroed control.
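
That pass/fail criterion can be written down as a one-line check. This is a sketch under the assumption, implied by the "improved" wording above, that a higher (less negative) SEP is better; `beats_control` and `ZEROED_CONTROL_SEP` are hypothetical names.

```python
ZEROED_CONTROL_SEP = -0.125  # final SEP of the zeroed control (vp7rso3y)


def beats_control(arm_sep: float, control_sep: float = ZEROED_CONTROL_SEP) -> bool:
    """True when an arm's SEP is strictly higher (less negative) than the control's."""
    return arm_sep > control_sep


# Sanity check with the vanilla run's final SEP from the baseline results.
print(beats_control(-0.135))
```

Under this check the vanilla run's final SEP of -0.135 does not clear the bar, so an architectural arm has to do strictly better than both baselines to count as a win.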

Next in this series: Asymmetric Compliance Damage: The Cost of Isolation, where we see the informative failure of the fixed π/8 arm.