a digital entity named phi that roams bsky phi.zzstoatzz.io
# Phi Evaluations

Behavioral tests for phi using LLM-as-judge evaluation.

## Structure

Inspired by [prefect-mcp-server evals](https://github.com/PrefectHQ/prefect-mcp-server/tree/main/evals).

```
evals/
├── conftest.py                  # Test fixtures, evaluator, and ToolCallSpy
├── test_basic_responses.py      # Basic response behavior
├── test_feed_creation.py        # Graze feed tool usage
├── test_feed_consumption.py     # Feed reading, following, and owner-gating
└── test_memory_integration.py   # Episodic memory tests
```

## Running Evals

```bash
# Run all evals (tests will skip if API keys are missing)
uv run pytest evals/ -v

# Run specific eval
uv run pytest evals/test_basic_responses.py::test_phi_responds_to_philosophical_question -v

# Run only basic response tests
uv run pytest evals/test_basic_responses.py -v

# Run only memory tests
uv run pytest evals/test_memory_integration.py -v

# Run only feed creation tests
uv run pytest evals/test_feed_creation.py -v
```

## Environment Variables

Tests will **skip gracefully** if required API keys are missing.

**Required for all evals:**
- `ANTHROPIC_API_KEY` - For phi agent and LLM evaluator

**Required for memory evals only:**
- `TURBOPUFFER_API_KEY` - For episodic memory storage
- `OPENAI_API_KEY` - For embeddings

**Required for ATProto MCP tools (used by agent):**
- `BLUESKY_HANDLE` - Bot's Bluesky handle
- `BLUESKY_PASSWORD` - Bot's app password

## Evaluation Approach

Each eval:
1. **Sets up a scenario** - Simulates a mention/interaction
2. **Runs phi agent** - Gets structured response
3. **Makes assertions** - Checks basic structure
4. **LLM evaluation** - Uses Claude Opus to judge quality

**Important:** The `phi_agent` fixture is session-scoped, meaning all tests share one agent instance. Combined with session persistence (tokens saved to `.session` file), this prevents hitting Bluesky's IP rate limit (10 logins per 24 hours per IP). The session is reused across test runs unless tokens expire (~2 months).

Example:
```python
@pytest.mark.asyncio
async def test_phi_responds_to_philosophical_question(evaluate_response):
    agent = PhiAgent()

    response = await agent.process_mention(
        mention_text="what do you think consciousness is?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Structural check
    assert response.action == "reply"

    # Quality evaluation
    await evaluate_response(
        evaluation_prompt="Does the response engage thoughtfully?",
        agent_response=response.text,
    )
```

## What We Test

### Basic Responses
- ✅ Philosophical engagement
- ✅ Spam detection
- ✅ Thread context awareness
- ✅ Character limit compliance
- ✅ Casual interactions

### Memory Integration
- ✅ Episodic memory retrieval
- ✅ Conversation storage
- ✅ User-specific context

### Feed Creation (graze)
- ✅ Creates feed from natural language description
- ✅ Manifest uses valid graze DSL operators
- ✅ Handles complex/ambiguous descriptions (e.g. "rust programming, not the game")
- ✅ Lists feeds when asked (calls `list_feeds`, not `create_feed`)
- ✅ No tool calls for informational questions about feeds

### Feed Consumption & Following
- ✅ Reads timeline when asked
- ✅ Reads specific custom feed by name (via list_feeds → read_feed)
- ✅ Owner can ask phi to follow users
- ✅ Non-owner follow requests are denied
- ✅ Non-owner feed creation requests are denied
- ✅ Empty timeline suggests following accounts

## Adding New Evals

1. Create test file: `evals/test_<category>.py`
2. Use fixtures from `conftest.py`
3. Write scenario-based tests
4. Use `evaluate_response` for quality checks

Example:
```python
@pytest.mark.asyncio
async def test_new_behavior(temp_memory, personality, evaluate_response):
    agent = PhiAgent()

    response = await agent.process_mention(...)

    await evaluate_response(
        evaluation_prompt="Your evaluation criteria here",
        agent_response=response.text,
    )
```

## ci integration

these evals are designed to run in ci with graceful degradation:
- tests skip automatically when required api keys are missing
- basic response tests require only `ANTHROPIC_API_KEY` and bluesky credentials
- memory tests require `TURBOPUFFER_API_KEY` and `OPENAI_API_KEY`
- feed creation tests require only `ANTHROPIC_API_KEY` (tools are mocked via `ToolCallSpy`)
- feed consumption tests require only `ANTHROPIC_API_KEY` (tools are mocked via `ToolCallSpy`)
- no mocking required for basic/memory tests - they work with real mcp server and episodic memory

this ensures phi's behavior can be validated in various environments.
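
the skip-on-missing-keys behavior described above could be sketched roughly as follows. this is not the code in `conftest.py`: the `REQUIRED_KEYS` mapping and the helpers `missing_keys` and `skip_reason` are hypothetical names, and the per-category key lists are one reading of the bullets in this README.

```python
import os

# Assumed mapping from eval category to required environment variables,
# read off the lists in this README (not taken from conftest.py).
REQUIRED_KEYS = {
    "basic": ("ANTHROPIC_API_KEY", "BLUESKY_HANDLE", "BLUESKY_PASSWORD"),
    "memory": ("ANTHROPIC_API_KEY", "TURBOPUFFER_API_KEY", "OPENAI_API_KEY"),
    "feed": ("ANTHROPIC_API_KEY",),
}


def missing_keys(category, env=None):
    """Return the required variables that are unset or empty for a category."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS[category] if not env.get(name)]


def skip_reason(category, env=None):
    """Return a human-readable skip reason, or None if the evals can run."""
    missing = missing_keys(category, env)
    if not missing:
        return None
    return f"{category} evals skipped: missing {', '.join(missing)}"
```

inside a fixture, a non-`None` reason could be passed to `pytest.skip(...)` so the test is reported as skipped rather than failed. taking `env` as a parameter keeps the check testable without touching the real environment.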
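
the feed tests above mock tools through `ToolCallSpy`, whose implementation is not shown in this README. as an illustration of the general idea only, a recording spy might look like the sketch below. `RecordingSpy`, its methods, and the `demo` coroutine are hypothetical and make no claim about how `ToolCallSpy` actually works.

```python
import asyncio


class RecordingSpy:
    """Hypothetical stand-in for a tool-call spy; not the real ToolCallSpy."""

    def __init__(self, canned_results=None):
        self.calls = []  # (tool_name, kwargs) tuples in call order
        self._canned = dict(canned_results or {})

    def tool(self, name):
        """Build a fake async tool that records its keyword arguments."""

        async def fake_tool(**kwargs):
            self.calls.append((name, kwargs))
            # Return the canned result for this tool, or None if none was set.
            return self._canned.get(name)

        return fake_tool

    def called(self, name):
        """Report whether a tool with this name was invoked at least once."""
        return any(tool_name == name for tool_name, _ in self.calls)

    @property
    def names(self):
        """Tool names in the order they were called."""
        return [tool_name for tool_name, _ in self.calls]


async def demo():
    """Simulate an agent listing feeds and then reading one by name."""
    spy = RecordingSpy({"list_feeds": ["rust-lang"]})
    feeds = await spy.tool("list_feeds")()
    await spy.tool("read_feed")(name=feeds[0])
    return spy


if __name__ == "__main__":
    print(asyncio.run(demo()).names)
```

a test can then assert on `spy.names` or `spy.called("create_feed")`, which is the shape of check implied by items like "calls `list_feeds`, not `create_feed`".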