Agent evals without framework lock-in

Write flows. Reuse prefixes. Spend fewer tokens.

Butterflow turns human-readable Python specs into deterministic checks for nested agent systems. It normalizes traces across frameworks, then schedules similar flows together so shared prompt prefixes stay cacheable.


cli

butterflow ingest examples/
butterflow plan examples/ --show-cache-clusters
butterflow run examples/ --subset happy

Flow concepts

A flow is the behavioral unit.

Each flow names a user path, declares the input that triggers it, and states deterministic expectations against normalized events. The same spec can run through a framework adapter, a JSONL trace, or the built-in Python harness.

Intent

Human-readable purpose for reports, docs, and request fingerprints.

Input

The user message or test stimulus that starts the agent run.

Expectations

Assertions over selected agents, tool calls, state, artifacts, and final responses.
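The three parts of a flow can be pictured as a small data structure. The sketch below is illustrative only and is not Butterflow's actual API: the `Flow` and `Expectation` classes, the helper functions, and the event fields (`agent`, `type`, `name`, `text`) are all hypothetical names chosen for this example. It shows why the checks are deterministic: each expectation is a pure function of the normalized event list.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical normalized event: a plain dict such as
# {"agent": "billing", "type": "tool_call", "name": "lookup_refund"}.
Event = dict


@dataclass
class Expectation:
    description: str
    check: Callable[[list[Event]], bool]  # pure function of the event list


@dataclass
class Flow:
    intent: str   # human-readable purpose
    input: str    # user message that starts the run
    expectations: list[Expectation] = field(default_factory=list)

    def evaluate(self, events: list[Event]) -> dict[str, bool]:
        # Same events in, same verdicts out: no model call is involved here.
        return {e.description: e.check(events) for e in self.expectations}


def called_tool(agent: str, name: str) -> Expectation:
    def check(events: list[Event]) -> bool:
        return any(
            ev.get("type") == "tool_call"
            and ev.get("agent") == agent
            and ev.get("name") == name
            for ev in events
        )
    return Expectation(f"{agent} calls {name}", check)


def final_response_contains(text: str) -> Expectation:
    def check(events: list[Event]) -> bool:
        finals = [ev for ev in events if ev.get("type") == "final_response"]
        return bool(finals) and text in finals[-1].get("text", "")
    return Expectation(f"final response contains {text!r}", check)


flow = Flow(
    intent="refund lookup",
    input="Where is my refund for order 1042?",
    expectations=[
        called_tool("billing", "lookup_refund"),
        final_response_contains("refund"),
    ],
)

# Events as an adapter or a JSONL trace might supply them (made-up sample).
events = [
    {"agent": "router", "type": "handoff", "to": "billing"},
    {"agent": "billing", "type": "tool_call", "name": "lookup_refund"},
    {"agent": "billing", "type": "final_response", "text": "Your refund was issued."},
]

results = flow.evaluate(events)
```

Because `evaluate` only reads the event list, the same spec can be checked against a live run or a recorded trace without changes.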

Prefix concepts

The cacheable part is the stable prefix.

Butterflow fingerprints every flow and groups flows whose request-shaped text starts the same way. Shared intent and repeated expectation structure become the prefix; the inputs and case-specific assertions that differ between flows are treated as variables. The planner runs cluster members consecutively so provider prompt caches can reuse the shared prefix.
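A minimal sketch of the grouping idea, under stated assumptions: whitespace splitting stands in for a real tokenizer, flows are bucketed by a hash of their first few tokens, and the shared prefix is the longest run of tokens that every member agrees on. The function names, the `head` parameter, and the sample flow texts are hypothetical; Butterflow's own fingerprinting may work differently.

```python
import hashlib
from collections import defaultdict
from itertools import takewhile


def tokens(text: str) -> list[str]:
    # Assumption: whitespace split as a stand-in for a provider tokenizer.
    return text.split()


def shared_prefix(seqs: list[list[str]]) -> list[str]:
    # Walk the sequences column by column and stop at the first disagreement.
    cols = takewhile(lambda col: len(set(col)) == 1, zip(*seqs))
    return [col[0] for col in cols]


def cluster_by_prefix(flows: dict[str, str], head: int = 4) -> list[dict]:
    # Bucket flows by a fingerprint of their first `head` tokens.
    buckets: dict[str, list[str]] = defaultdict(list)
    for name, text in flows.items():
        lead = " ".join(tokens(text)[:head])
        buckets[hashlib.sha256(lead.encode()).hexdigest()].append(name)

    clusters = []
    for names in buckets.values():
        if len(names) < 2:
            continue  # a single flow has nothing to share
        prefix = shared_prefix([tokens(flows[n]) for n in names])
        clusters.append({"flows": names, "shared_prefix_tokens": len(prefix)})
    return clusters


def plan_order(flows: dict[str, str], clusters: list[dict]) -> list[str]:
    # Cluster members run back to back; unclustered flows run afterwards.
    clustered = [n for c in clusters for n in c["flows"]]
    return clustered + [n for n in flows if n not in clustered]


flows = {
    "refund lookup": "You are a billing agent. Check tools and final response. Input: where is my refund",
    "greeting": "You are a greeter agent. Check final response. Input: hello",
    "issue refund": "You are a billing agent. Check tools and final response. Input: refund order 1042",
}

clusters = cluster_by_prefix(flows)
order = plan_order(flows, clusters)
```

In this sample the two billing flows agree up to and including `Input:`, so they form one cluster and are moved next to each other in the run order, while the greeting flow runs on its own.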

cache cluster

cluster-4
  shared_prefix_tokens: 46
  flows:
    token plan: refund lookup
    token plan: issue refund
    token plan: partial credit
  cache_variables:
    input, expectations
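As a rough worked example of what a cluster like this could save: if the first flow in the cluster warms the cache and every later flow hits it, the shared prefix is read from cache once per remaining flow. The helper below is a hypothetical back-of-the-envelope calculation, not a Butterflow function, and it gives a best-case upper bound; real hit rates depend on the provider's caching rules, such as minimum prefix length and cache lifetime.

```python
def cacheable_prefix_tokens(shared_prefix_tokens: int, flow_count: int) -> int:
    """Best-case count of prefix tokens that could be served from cache."""
    if flow_count < 1:
        return 0
    # The first flow pays for the prefix; each later flow may reuse it.
    return shared_prefix_tokens * (flow_count - 1)


# Values copied from the cluster-4 sample above.
cluster_4 = {
    "shared_prefix_tokens": 46,
    "flows": [
        "token plan: refund lookup",
        "token plan: issue refund",
        "token plan: partial credit",
    ],
}

best_case = cacheable_prefix_tokens(
    cluster_4["shared_prefix_tokens"], len(cluster_4["flows"])
)
# 46 * (3 - 1) = 92 prefix tokens in the best case
```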

Docs

Build the suite from first flow to CI gate.

Quickstart

Install, write a first flow, dry-run, and plan.

Spec Authoring

The Python flow API and expectation model.

Token Savings

Prefix detection, clusters, and scheduling.

Event Model

The normalized schema adapters emit.

Adapter Compatibility

Framework coverage and capability levels.

CI Setup

Dry-run gates and token-aware scheduling in GitHub Actions.