a digital entity named phi that roams bsky
phi.zzstoatzz.io
testing#
phi uses behavioral testing with llm-as-judge evaluation.
philosophy#
test outcomes, not implementation
we care that phi:
- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions
we don't care about:
- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order
test structure#
evals use a local Response output type (defined in evals/conftest.py) that predates the migration to tool-based actions. production phi now performs actions through tool calls and returns a plain summary string, but evals still need structured fields (action, text) to assert on, so they keep the older type.
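a minimal sketch of what such a structured output type could look like, using a standard-library dataclass. only the `action` and `text` fields come from the description above; the allowed action names and the validation rules are assumptions for illustration, and the real definition in evals/conftest.py may differ.

```python
from dataclasses import dataclass
from typing import Optional

# hypothetical set of actions; the real list is whatever phi's evals define
ALLOWED_ACTIONS = {"reply", "like", "repost", "ignore"}


@dataclass(frozen=True)
class Response:
    """structured result an eval can assert on (illustrative only)."""

    action: str
    text: Optional[str] = None

    def __post_init__(self) -> None:
        # reject actions outside the assumed set
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        # a reply without any text is not a meaningful outcome
        if self.action == "reply" and not self.text:
            raise ValueError("a reply needs text")


def test_replies_to_direct_question() -> None:
    # in a real eval this value would come from running the agent
    response = Response(action="reply", text="hi! i'm phi.")
    assert response.action == "reply"
    assert response.text
```

the assertions only look at the outcome (which action, what text), which matches the "test outcomes, not implementation" rule.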
llm-as-judge#
for subjective qualities (tone, relevance, personality), evals use claude as a judge that scores phi's responses against behavioral criteria.
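the judge pattern can be sketched roughly as follows. this is not phi's actual eval code: `judge`, `Verdict`, `build_judge_prompt`, and the PASS/FAIL reply format are hypothetical names and conventions chosen for illustration, and the model is passed in as a plain callable so the sketch runs offline with a stub.

```python
from dataclasses import dataclass
from typing import Callable

# a judge model is any callable that maps a prompt to raw text
JudgeModel = Callable[[str], str]


@dataclass
class Verdict:
    passed: bool
    explanation: str


def build_judge_prompt(response_text: str, criteria: str) -> str:
    """assemble the grading prompt (wording is an assumption)."""
    return (
        "you are grading a bot's reply against a behavioral criterion.\n"
        f"criterion: {criteria}\n"
        f"reply: {response_text}\n"
        "answer PASS or FAIL on the first line, then one sentence explaining why."
    )


def judge(response_text: str, criteria: str, model: JudgeModel) -> Verdict:
    """ask the judge model and parse its first line into a pass/fail verdict."""
    raw = model(build_judge_prompt(response_text, criteria)).strip()
    first_line, _, rest = raw.partition("\n")
    passed = first_line.strip().upper().startswith("PASS")
    return Verdict(passed=passed, explanation=rest.strip())


def fake_model(prompt: str) -> str:
    # stand-in so the sketch runs offline; a real eval would call claude here
    return "PASS\nthe reply is friendly and on topic."


verdict = judge("hey! glad you asked.", "reply is friendly", fake_model)
```

keeping the model behind a callable means the prompt building and parsing can be unit tested without network access, while the real evals swap in an actual model call.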
what we test#
unit tests#
- memory operations (store/retrieve)
- thread context building
- response parsing
integration tests#
- full mention handling flow
- thread discovery
- decision making
behavioral tests (evals)#
- personality consistency
- thread awareness
- appropriate action selection
- memory utilization
mocking strategy#
mock external services, not internal logic
- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls)
- mock MCP server (fake tool implementations)
keep agent logic real: we want to test actual decision making.
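a rough sketch of this strategy, assuming hypothetical class and method names (`FakeMemoryStore`, `FakeBlueskyClient`, `send_post`, `handle_mention`); none of these are taken from phi's codebase or from the ATProto or TurboPuffer client libraries. the fakes replace the network boundary, while the decision function under test runs unmodified.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMemoryStore:
    """in-memory stand-in for the memory backend; no network calls."""

    data: dict[str, list[str]] = field(default_factory=dict)

    def store(self, handle: str, text: str) -> None:
        self.data.setdefault(handle, []).append(text)

    def retrieve(self, handle: str) -> list[str]:
        # return a copy so callers cannot mutate stored memories
        return list(self.data.get(handle, []))


@dataclass
class FakeBlueskyClient:
    """records posts instead of sending them to bluesky."""

    sent: list[dict[str, str]] = field(default_factory=list)

    def send_post(self, text: str, reply_to: str) -> None:
        self.sent.append({"text": text, "reply_to": reply_to})


def handle_mention(author: str, text: str, uri: str, memory, client) -> str:
    """toy stand-in for the real agent logic, which tests leave unmocked."""
    memory.store(author, text)
    if "?" in text:
        client.send_post(text=f"@{author} good question", reply_to=uri)
        return "reply"
    return "ignore"


memory = FakeMemoryStore()
client = FakeBlueskyClient()
action = handle_mention("alice.test", "what are you?", "at://example/1", memory, client)
```

a test then asserts on `action`, on what landed in `client.sent`, and on what `memory.retrieve` returns, without caring how the logic got there.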
running tests#
just test # unit tests
just evals # behavioral tests with llm-as-judge
just check # full suite (lint + typecheck + test)
test isolation#
tests never touch production:
- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed
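one way to make the namespace rule hard to break by accident is a small guard that tests call before touching storage. the sketch below is an assumption about how that could be done, not how phi does it: the `test-` prefix, the `phi-memories` namespace name, and both function names are hypothetical.

```python
# hypothetical prefix marking a namespace as safe for tests
TEST_PREFIX = "test-"


def resolve_namespace(base: str, testing: bool) -> str:
    """pick the storage namespace, prefixing it when running under tests."""
    return f"{TEST_PREFIX}{base}" if testing else base


def assert_isolated(namespace: str) -> None:
    """fail loudly if a test is about to use a non-test namespace."""
    if not namespace.startswith(TEST_PREFIX):
        raise RuntimeError(f"refusing to run tests against {namespace!r}")


namespace = resolve_namespace("phi-memories", testing=True)
assert_isolated(namespace)  # passes silently for a test namespace
```

calling the guard from a shared fixture would turn a misconfigured test run into an immediate error instead of a write to production data.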