# testing

phi uses behavioral testing with llm-as-judge evaluation.

## philosophy

**test outcomes, not implementation**

we care that phi:
- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions

we don't care about:
- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order

## test structure

evals use a local `Response` output type (defined in `evals/conftest.py`) that predates the migration to tool calls. production phi performs actions through tool calls and returns a plain summary string, but the evals still need structured assertions on action and text, so they keep the older type.

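to make the idea of structured assertions concrete, here is a minimal sketch of what such an output type could look like. the field names (`action`, `text`, `reason`), the action vocabulary, and the `assert_replied` helper are assumptions for illustration only; the real definition lives in `evals/conftest.py` and may differ.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical action vocabulary; the real set is defined by the project.
Action = Literal["reply", "like", "repost", "ignore"]


@dataclass
class Response:
    """Hypothetical structured eval output: one action plus optional text."""

    action: Action
    text: Optional[str] = None
    reason: Optional[str] = None


def assert_replied(response: Response, max_chars: int = 300) -> None:
    """Outcome-level check: phi chose to reply and the text fits the assumed limit."""
    assert response.action == "reply", f"expected a reply, got {response.action}"
    assert response.text, "a reply must carry text"
    assert len(response.text) <= max_chars, "reply exceeds the assumed length limit"


# An eval asserts on the outcome, not on how the agent produced it.
example = Response(action="reply", text="hi! happy to chat about that.")
assert_replied(example)
```

the point of the sketch is that the assertion reads only the final action and text, which matches the "test outcomes, not implementation" rule above.
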
## llm-as-judge

for subjective qualities (tone, relevance, personality), evals use claude as a judge to evaluate phi's responses against behavioral criteria.

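a rough sketch of the judge pattern follows. it is not phi's actual eval code: `Verdict`, `build_judge_prompt`, `judge_response`, and the PASS/FAIL prompt format are all hypothetical. the model call is passed in as a plain callable so the sketch runs offline; in a real eval that callable would wrap a request to claude.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    explanation: str


def build_judge_prompt(response_text: str, criteria: str) -> str:
    """Ask the judge for PASS or FAIL on the first line, then a short reason."""
    return (
        "You are grading a social media bot's reply.\n"
        f"Criteria: {criteria}\n"
        f"Reply: {response_text}\n"
        "Answer PASS or FAIL on the first line, then one sentence explaining why."
    )


def judge_response(
    response_text: str,
    criteria: str,
    ask_model: Callable[[str], str],  # hypothetical hook for the judge model
) -> Verdict:
    raw = ask_model(build_judge_prompt(response_text, criteria)).strip()
    first_line, _, rest = raw.partition("\n")
    passed = first_line.strip().upper().startswith("PASS")
    return Verdict(passed=passed, explanation=rest.strip())


def fake_judge(prompt: str) -> str:
    """Stand-in for the real judge model, used only to keep this sketch runnable."""
    return "PASS\nThe reply is friendly and on topic."


verdict = judge_response("hi! happy to chat.", "friendly and relevant", fake_judge)
assert verdict.passed
```

anything other than a leading PASS counts as a failure, so a malformed judge answer fails the eval instead of silently passing it.
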
## what we test

### unit tests
- memory operations (store/retrieve)
- thread context building
- response parsing

### integration tests
- full mention handling flow
- thread discovery
- decision making

### behavioral tests (evals)
- personality consistency
- thread awareness
- appropriate action selection
- memory utilization

## mocking strategy

**mock external services, not internal logic**

- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls)
- mock MCP server (fake tool implementations)

**keep agent logic real** - we want to test actual decision making.

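the first two mocks above could be sketched roughly as follows. the class names and method signatures (`store`, `retrieve`, `send_post`) are hypothetical and are not meant to mirror the real TurboPuffer or ATProto client interfaces; the sketch only shows the shape of "in-memory dict instead of network calls" and "record instead of post".

```python
from typing import Dict, List


class InMemoryMemory:
    """Hypothetical stand-in for the TurboPuffer-backed memory: a dict, no network."""

    def __init__(self) -> None:
        self._items: Dict[str, List[str]] = {}

    def store(self, handle: str, text: str) -> None:
        self._items.setdefault(handle, []).append(text)

    def retrieve(self, handle: str, limit: int = 5) -> List[str]:
        # Most recent entries first; this fake does no similarity search.
        return list(reversed(self._items.get(handle, [])))[:limit]


class RecordingClient:
    """Hypothetical stand-in for the ATProto client: records posts instead of sending."""

    def __init__(self) -> None:
        self.posts: List[str] = []

    def send_post(self, text: str) -> None:
        self.posts.append(text)


# The agent logic under test would receive these fakes in place of the real services.
memory = InMemoryMemory()
client = RecordingClient()
memory.store("alice.example", "likes birds")
client.send_post("hello alice")
assert client.posts == ["hello alice"]
```

because the fakes only replace the service boundary, a test can still exercise the agent's real decision making and then inspect `client.posts` to see what it would have posted.
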
## running tests

```bash
just test   # unit tests
just evals  # behavioral tests with llm-as-judge
just check  # full suite (lint + typecheck + test)
```

## test isolation

tests never touch production:
- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed

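two of the isolation rules above can be illustrated with a small sketch. the namespace names, the `memory_namespace` helper, and the canned replies are all made up for this example; phi's actual configuration may handle this differently.

```python
import hashlib

# Hypothetical namespace names; the real ones come from the project's config.
PROD_NAMESPACE = "phi-memory"
TEST_NAMESPACE = "phi-memory-test"


def memory_namespace(testing: bool) -> str:
    """Route tests to their own namespace so they never read or write production data."""
    return TEST_NAMESPACE if testing else PROD_NAMESPACE


CANNED_REPLIES = ["noted.", "interesting, tell me more.", "thanks for sharing!"]


def deterministic_reply(prompt: str) -> str:
    """Pick a canned reply from a stable hash of the prompt: no randomness, no network."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return CANNED_REPLIES[digest[0] % len(CANNED_REPLIES)]


assert memory_namespace(testing=True) != memory_namespace(testing=False)
assert deterministic_reply("hello") == deterministic_reply("hello")
```

hashing the prompt keeps the mock's output stable across runs and machines, which is what makes assertions on it repeatable.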