# Phi Evaluations

Behavioral tests for phi using LLM-as-judge evaluation.

## Structure

Inspired by [prefect-mcp-server evals](https://github.com/PrefectHQ/prefect-mcp-server/tree/main/evals).

```
evals/
├── conftest.py                  # Test fixtures, evaluator, and ToolCallSpy
├── test_basic_responses.py      # Basic response behavior
├── test_feed_creation.py        # Graze feed tool usage
├── test_feed_consumption.py     # Feed reading, following, and owner-gating
└── test_memory_integration.py   # Episodic memory tests
```

## Running Evals

```bash
# Run all evals (tests will skip if API keys are missing)
uv run pytest evals/ -v

# Run specific eval
uv run pytest evals/test_basic_responses.py::test_phi_responds_to_philosophical_question -v

# Run only basic response tests
uv run pytest evals/test_basic_responses.py -v

# Run only memory tests
uv run pytest evals/test_memory_integration.py -v

# Run only feed creation tests
uv run pytest evals/test_feed_creation.py -v
```

## Environment Variables

Tests will **skip gracefully** if required API keys are missing.

**Required for all evals:**
- `ANTHROPIC_API_KEY` - For phi agent and LLM evaluator

**Required for memory evals only:**
- `TURBOPUFFER_API_KEY` - For episodic memory storage
- `OPENAI_API_KEY` - For embeddings

**Required for ATProto MCP tools (used by agent):**
- `BLUESKY_HANDLE` - Bot's Bluesky handle
- `BLUESKY_PASSWORD` - Bot's app password
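
The skip behavior can be pictured roughly as follows. This is a minimal sketch, not the code in `conftest.py`: the `REQUIRED_KEYS` grouping and the `missing_keys` helper are hypothetical names introduced for illustration, and the real fixtures may detect missing keys differently.

```python
import os

# Hypothetical grouping of the variables listed above; the real conftest.py may differ.
REQUIRED_KEYS = {
    "basic": ["ANTHROPIC_API_KEY"],
    "memory": ["ANTHROPIC_API_KEY", "TURBOPUFFER_API_KEY", "OPENAI_API_KEY"],
}


def missing_keys(group, env=None):
    """Return the variables required for `group` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS[group] if not env.get(name)]
```

Inside a test or fixture, a non-empty result from `missing_keys("memory")` could be passed to `pytest.skip(...)` so the test is reported as skipped instead of failed.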

## Evaluation Approach

Each eval:
1. **Sets up a scenario** - Simulates a mention/interaction
2. **Runs phi agent** - Gets structured response
3. **Makes assertions** - Checks basic structure
4. **LLM evaluation** - Uses Claude Opus to judge quality

**Important:** The `phi_agent` fixture is session-scoped, so all tests share one agent instance. Combined with session persistence (tokens are saved to a `.session` file), this avoids hitting Bluesky's login rate limit (10 logins per 24 hours per IP). The session is reused across test runs until the tokens expire (about 2 months).
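
To illustrate why persistence keeps the login count low, here is a minimal sketch of a load-or-login helper. It assumes the `.session` file holds JSON, which the text above does not confirm; `load_or_login` and the `login` callable are hypothetical names, and token expiry is not handled.

```python
import json
from pathlib import Path


def load_or_login(session_path, login):
    """Reuse saved tokens when present; otherwise log in once and save them."""
    path = Path(session_path)
    if path.exists():
        try:
            return json.loads(path.read_text())
        except json.JSONDecodeError:
            pass  # Unreadable file: fall through to a fresh login.
    tokens = login()  # Hypothetical callable; each call counts against the login limit.
    path.write_text(json.dumps(tokens))
    return tokens
```

With a helper like this, only the first run (or a run after the file is deleted) performs a real login; later runs read the saved tokens.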

Example:
```python
import pytest

# PhiAgent comes from the project's own package (import path not shown here).


@pytest.mark.asyncio
async def test_phi_responds_to_philosophical_question(evaluate_response):
    agent = PhiAgent()

    response = await agent.process_mention(
        mention_text="what do you think consciousness is?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Structural check
    assert response.action == "reply"

    # Quality evaluation
    await evaluate_response(
        evaluation_prompt="Does the response engage thoughtfully?",
        agent_response=response.text,
    )
```
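
For readers new to LLM-as-judge, the quality step can be sketched as below. This is an illustration under assumptions, not the `evaluate_response` fixture itself: `judge_response`, `JUDGE_TEMPLATE`, and the injected `ask_model` coroutine are hypothetical, and the PASS/FAIL verdict format is one possible convention.

```python
import asyncio

# Hypothetical prompt template; the real evaluator's wording is not shown in this README.
JUDGE_TEMPLATE = (
    "You are grading an AI agent's reply.\n"
    "Criteria: {criteria}\n"
    "Reply: {reply}\n"
    "Answer PASS or FAIL on the first line, then explain."
)


async def judge_response(evaluation_prompt, agent_response, ask_model):
    """Ask a judge model for a verdict and raise AssertionError unless it passes."""
    prompt = JUDGE_TEMPLATE.format(criteria=evaluation_prompt, reply=agent_response)
    verdict = await ask_model(prompt)  # ask_model: any coroutine mapping prompt -> text
    lines = verdict.strip().splitlines()
    first_line = lines[0].strip().upper() if lines else ""
    if not first_line.startswith("PASS"):
        raise AssertionError(f"Judge rejected response: {verdict}")
```

Passing the model call in as a parameter keeps the sketch independent of any particular client library, and raising `AssertionError` means pytest reports a rejected response as an ordinary test failure.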

## What We Test

### Basic Responses
- ✅ Philosophical engagement
- ✅ Spam detection
- ✅ Thread context awareness
- ✅ Character limit compliance
- ✅ Casual interactions

### Memory Integration
- ✅ Episodic memory retrieval
- ✅ Conversation storage
- ✅ User-specific context

### Feed Creation (graze)
- ✅ Creates feed from natural language description
- ✅ Manifest uses valid graze DSL operators
- ✅ Handles complex/ambiguous descriptions (e.g. "rust programming, not the game")
- ✅ Lists feeds when asked (calls `list_feeds`, not `create_feed`)
- ✅ No tool calls for informational questions about feeds

### Feed Consumption & Following
- ✅ Reads timeline when asked
- ✅ Reads specific custom feed by name (via `list_feeds` → `read_feed`)
- ✅ Owner can ask phi to follow users
- ✅ Non-owner follow requests are denied
- ✅ Non-owner feed creation requests are denied
- ✅ Empty timeline suggests following accounts
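
Checks such as "calls `list_feeds`, not `create_feed`" depend on recording which tools the agent invoked. The sketch below shows the general spy pattern only; `RecordingSpy` is a hypothetical stand-in, and the actual `ToolCallSpy` in `conftest.py` may expose a different interface.

```python
class RecordingSpy:
    """Record tool invocations and return canned results instead of running real tools."""

    def __init__(self, canned_results=None):
        self.calls = []  # (tool_name, kwargs) tuples in call order
        self.canned_results = canned_results or {}

    def call(self, tool_name, **kwargs):
        """Log the call and return the canned result for this tool, if any."""
        self.calls.append((tool_name, kwargs))
        return self.canned_results.get(tool_name)

    def was_called(self, tool_name):
        """Report whether the named tool was invoked at least once."""
        return any(name == tool_name for name, _ in self.calls)
```

A test built on this pattern would assert `spy.was_called("list_feeds")` and `not spy.was_called("create_feed")` after the agent handles a "what feeds do you have?" mention.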

## Adding New Evals

1. Create test file: `evals/test_<category>.py`
2. Use fixtures from `conftest.py`
3. Write scenario-based tests
4. Use `evaluate_response` for quality checks

Example:
```python
@pytest.mark.asyncio
async def test_new_behavior(temp_memory, personality, evaluate_response):
    agent = PhiAgent()

    response = await agent.process_mention(...)

    await evaluate_response(
        evaluation_prompt="Your evaluation criteria here",
        agent_response=response.text,
    )
```

## CI Integration

These evals are designed to run in CI with graceful degradation:
- Tests skip automatically when required API keys are missing
- Basic response tests require only `ANTHROPIC_API_KEY` and Bluesky credentials
- Memory tests require `TURBOPUFFER_API_KEY` and `OPENAI_API_KEY`
- Feed creation tests require only `ANTHROPIC_API_KEY` (tools are mocked via `ToolCallSpy`)
- Feed consumption tests require only `ANTHROPIC_API_KEY` (tools are mocked via `ToolCallSpy`)
- Basic and memory tests need no mocking: they run against the real MCP server and episodic memory

This lets phi's behavior be validated in environments where only some credentials are available.