Butterflow: Pinning Agent Behavior with a Spec DSL
Agent evals that actually catch regressions: a Python flow/expect DSL for deterministic assertions, Arize Phoenix for fuzzy semantic evals, and cache-cluster grouping for token savings.
Production systems and LLM agent infrastructure: distributed pipelines (Dask, Celery), agent observability (Arize Phoenix), Python, Django, task queues, security gates, and Linux workflow. Experiment artifacts and engineering notes below.
CLI framework for declaratively defining agent flows, running evals, and caching tokens between test runs. Combines user-flow testing and token cost optimization in one tool: the same execution trace is cached aggressively and measured for quality at the same time.
Custom type stubs that track payload annotations by source to gate unsafe ML deserialization (pickle, HDF5) at the type-checker level. Security research from huntr.com CVE work on serialization-route vulnerabilities in GenAI platforms. Lean 4 soundness proofs.
~300 LOC task queue on Django ORM — prototype to production. Covers memory leaks from fork, pessimistic locking for exactly-once delivery, and security hardening. No Redis, no Celery: just the database you already have.
Architectural approach for making source, operation, risk, and policy state explicit through typed side channels rather than text-only prompts. Improves model alignment without prompt bloat by encoding context in the type system.
Systems for high-performance ML research: HF-streaming for large artifacts and the dual-emit data-driven paper pattern.
Pickle is a CVE factory. falcon-secure uses Python type stubs and Lean 4 soundness proofs to gate unsafe deserialization at the type-checker level.
Building a ~300 LOC task queue on Django ORM from prototype to production — memory leaks, fork pitfalls, pessimistic locking, and security hardening.
Kitty terminal, kittens, shell integration, Starship prompt, and turning the terminal into a complete development environment.
Custom status lines for ambient awareness and git worktrees for parallel AI-assisted development sessions.
Diagnosing and fixing SSH latency over high-latency VPN links, with focus on the TCP-over-TCP problem and connection multiplexing.