Agent evals without framework lock-in
Write flows. Reuse prefixes. Spend fewer tokens.
Butterflow turns human-readable Python specs into deterministic checks for nested agent systems. It normalizes traces across frameworks, then schedules similar flows together so shared prompt prefixes stay cacheable.
butterflow ingest examples/
butterflow plan examples/ --show-cache-clusters
butterflow run examples/ --subset happy
Flow concepts
A flow is the behavioral unit.
Each flow names a user path, declares the input that triggers it, and states deterministic expectations against normalized events. The same spec can run through a framework adapter, a JSONL trace, or the built-in Python harness.
Intent
Human-readable purpose for reports, docs, and request fingerprints.
Input
The user message or test stimulus that starts the agent run.
Expectations
Assertions over selected agents, tool calls, state, artifacts, and final responses.
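To make the three parts concrete, here is a minimal sketch of a flow written as plain Python. It does not use Butterflow's actual API: the `Flow` class, the `called_tool` and `final_response_contains` helpers, and the event fields (`type`, `agent`, `name`, `text`) are all hypothetical names chosen for illustration, and the normalized event shape is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical shapes for illustration only; not Butterflow's real API.
Event = dict  # one normalized trace event
Expectation = Callable[[list[Event]], bool]


@dataclass
class Flow:
    intent: str  # human-readable purpose
    input: str   # user message that starts the agent run
    expectations: list[Expectation] = field(default_factory=list)

    def check(self, events: list[Event]) -> bool:
        """Return True only if every expectation holds for the trace."""
        return all(expect(events) for expect in self.expectations)


def called_tool(agent: str, name: str) -> Expectation:
    """Expect at least one call to tool `name` made by `agent`."""
    def expect(events: list[Event]) -> bool:
        return any(
            e.get("type") == "tool_call"
            and e.get("agent") == agent
            and e.get("name") == name
            for e in events
        )
    return expect


def final_response_contains(text: str) -> Expectation:
    """Expect the last final response in the trace to contain `text`."""
    def expect(events: list[Event]) -> bool:
        finals = [e for e in events if e.get("type") == "final_response"]
        return bool(finals) and text in finals[-1].get("text", "")
    return expect


refund_lookup = Flow(
    intent="refund lookup",
    input="Where is my refund for order 1001?",
    expectations=[
        called_tool("billing", "lookup_refund"),
        final_response_contains("refund"),
    ],
)

# A hand-written normalized trace standing in for adapter or JSONL output.
trace = [
    {"type": "tool_call", "agent": "billing", "name": "lookup_refund"},
    {"type": "final_response", "agent": "root", "text": "Your refund was issued on Monday."},
]
passed = refund_lookup.check(trace)
```

Because the expectations are pure functions over a list of events, the same flow can be checked against any source that yields events in this shape, which is the property the adapter, JSONL, and harness paths rely on.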
Prefix concepts
The cacheable part is the stable prefix.
Butterflow fingerprints every flow and groups flows whose request-shaped text starts the same way. Shared intent and repeated expectation structure become the prefix; changing inputs and case-specific assertions are variables. The planner runs cluster members consecutively so provider prompt caches can reuse the shared prefix.
cluster-4
  shared_prefix_tokens: 46
  flows:
    token plan: refund lookup
    token plan: issue refund
    token plan: partial credit
  cache_variables: input, expectations
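The grouping idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Butterflow's planner: `render`, `shared_prefix_len`, and `plan` are hypothetical names, whitespace-separated words stand in for provider tokens, and the greedy sort-then-group strategy is one simple way to cluster by leading text.

```python
from itertools import takewhile


def render(flow: dict) -> list[str]:
    """Request-shaped text: stable intent first, then the changing input."""
    return f"intent: {flow['intent']} input: {flow['input']}".split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that two token lists have in common."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


def plan(flows: list[dict], min_shared: int = 3) -> list[dict]:
    """Sort flows by rendered tokens, then group neighbours sharing a prefix.

    Sorting puts flows with a common prefix next to each other, so one pass
    over the sorted list is enough for this greedy grouping.
    """
    clusters: list[dict] = []
    for flow in sorted(flows, key=render):
        tokens = render(flow)
        if clusters:
            shared = shared_prefix_len(clusters[-1]["prefix"], tokens)
            if shared >= min_shared:
                # Shrink the cluster prefix to what all members still share.
                clusters[-1]["prefix"] = clusters[-1]["prefix"][:shared]
                clusters[-1]["flows"].append(flow["name"])
                continue
        # A new cluster starts with the whole text of its first member.
        clusters.append({"prefix": tokens, "flows": [flow["name"]]})
    return clusters


flows = [
    {"name": "refund lookup", "intent": "handle a refund request", "input": "where is my refund"},
    {"name": "password reset", "intent": "reset a password", "input": "I forgot my password"},
    {"name": "issue refund", "intent": "handle a refund request", "input": "refund order 1001"},
]
clusters = plan(flows)
# Cluster members run consecutively so a provider cache can reuse the prefix.
run_order = [name for cluster in clusters for name in cluster["flows"]]
```

Here the two refund flows land in one cluster and run back to back, while the password flow forms its own cluster. A real planner would count tokens with the provider's tokenizer rather than by splitting on whitespace.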