a digital entity named phi that roams bsky phi.zzstoatzz.io
# testing

phi uses behavioral testing with llm-as-judge evaluation.

## philosophy

**test outcomes, not implementation**

we care that phi:
- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions

we don't care:
- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order

## test structure

evals use a local `Response` output type (in `evals/conftest.py`) that predates the tool-based migration. production phi uses tool calls for actions and returns a plain summary string, but evals still want structured assertions on action/text.

## llm-as-judge

for subjective qualities (tone, relevance, personality), evals use claude as a judge to evaluate phi's responses against behavioral criteria.

## what we test

### unit tests
- memory operations (store/retrieve)
- thread context building
- response parsing

### integration tests
- full mention handling flow
- thread discovery
- decision making

### behavioral tests (evals)
- personality consistency
- thread awareness
- appropriate action selection
- memory utilization

## mocking strategy

**mock external services, not internal logic**

- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls)
- mock MCP server (fake tool implementations)

**keep agent logic real** - we want to test actual decision making.

## running tests

```bash
just test   # unit tests
just evals  # behavioral tests with llm-as-judge
just check  # full suite (lint + typecheck + test)
```

## test isolation

tests never touch production:
- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed
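
as an illustration of the "in-memory dict instead of network calls" idea and the separate test namespace, here is a minimal sketch of what such a stand-in could look like. everything in it is an assumption made for illustration: `FakeMemoryStore`, its `store`/`retrieve` methods, and the `test-phi` namespace are hypothetical names, not phi's actual code and not the TurboPuffer client API. it also swaps vector similarity for a plain substring match so results stay deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMemoryStore:
    """hypothetical in-memory stand-in for a memory backend (not a real client API)."""

    # namespace -> list of stored memory strings, in insertion order
    _data: dict[str, list[str]] = field(default_factory=dict)

    def store(self, namespace: str, text: str) -> None:
        """append a memory to the given namespace, creating it if needed."""
        self._data.setdefault(namespace, []).append(text)

    def retrieve(self, namespace: str, query: str) -> list[str]:
        """return memories in the namespace that contain the query.

        uses a case-insensitive substring match instead of vector search,
        so the same inputs always give the same results.
        """
        q = query.lower()
        return [t for t in self._data.get(namespace, []) if q in t.lower()]


# usage: tests write to their own namespace and never see other namespaces
memory = FakeMemoryStore()
memory.store("test-phi", "alice likes jazz")
memory.store("test-phi", "bob asked about atproto")

hits = memory.retrieve("test-phi", "jazz")
other = memory.retrieve("some-other-namespace", "jazz")
```

a fake like this keeps the agent's own decision making real while removing the network dependency. the trade-off is that substring matching will not reproduce the ranking behavior of a real vector store, so retrieval quality itself would still need to be checked elsewhere.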