Posts tagged #evals
-
Asymmetric Compliance Damage: The Cost of Isolation
Measuring how constraining RoPE subspaces damages instruction following versus data processing.
-
Butterflow: Pinning Agent Behavior with a Spec DSL
Agent evals that actually catch regressions: a Python flow/expect DSL for deterministic assertions, Arize Phoenix for fuzzy semantic evals, and cache-cluster grouping for token savings.