evals/README.md at main · zzstoatzz.io/bot

a digital entity named phi that roams bsky phi.zzstoatzz.io
# Phi Evaluations

Behavioral tests for phi using LLM-as-judge evaluation.

## Structure

Inspired by [prefect-mcp-server evals](https://github.com/PrefectHQ/prefect-mcp-server/tree/main/evals).

```
evals/
├── conftest.py                # Test fixtures, evaluator, and ToolCallSpy
├── test_basic_responses.py    # Basic response behavior
├── test_feed_creation.py      # Graze feed tool usage
├── test_feed_consumption.py   # Feed reading, following, and owner-gating
└── test_memory_integration.py # Episodic memory tests
```

## Running Evals

```bash
# Run all evals (tests will skip if API keys are missing)
uv run pytest evals/ -v

# Run specific eval
uv run pytest evals/test_basic_responses.py::test_phi_responds_to_philosophical_question -v

# Run only basic response tests
uv run pytest evals/test_basic_responses.py -v

# Run only memory tests
uv run pytest evals/test_memory_integration.py -v

# Run only feed creation tests
uv run pytest evals/test_feed_creation.py -v
```

## Environment Variables

Tests will **skip gracefully** if required API keys are missing.

**Required for all evals:**
- `ANTHROPIC_API_KEY` - For phi agent and LLM evaluator

**Required for memory evals only:**
- `TURBOPUFFER_API_KEY` - For episodic memory storage
- `OPENAI_API_KEY` - For embeddings

**Required for ATProto MCP tools (used by agent):**
- `BLUESKY_HANDLE` - Bot's Bluesky handle
- `BLUESKY_PASSWORD` - Bot's app password

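The skip behavior can be approximated with a small helper that reports which variables are unset. This is a minimal sketch, not the suite's actual `conftest.py`; `missing_keys` and `MEMORY_KEYS` are hypothetical names introduced for illustration.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Hypothetical grouping, taken from the variable lists above.
MEMORY_KEYS = ("ANTHROPIC_API_KEY", "TURBOPUFFER_API_KEY", "OPENAI_API_KEY")


def missing_keys(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

A fixture could then call `pytest.skip(...)` with the missing names when the returned list is non-empty, so the test is reported as skipped rather than failed.
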
## Evaluation Approach

Each eval:
1. **Sets up a scenario** - Simulates a mention/interaction
2. **Runs phi agent** - Gets structured response
3. **Makes assertions** - Checks basic structure
4. **LLM evaluation** - Uses Claude Opus to judge quality

**Important:** The `phi_agent` fixture is session-scoped, so all tests share one agent instance. Combined with session persistence (tokens are saved to a `.session` file), this avoids hitting Bluesky's IP rate limit (10 logins per 24 hours per IP). The session is reused across test runs until the tokens expire (~2 months).

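The session-reuse idea can be sketched as a load-or-login helper. This is an illustration under assumptions, not phi's actual code: `load_or_create_session` and the `login` callable are hypothetical, and the session is treated as an opaque string.

```python
from pathlib import Path
from typing import Callable


def load_or_create_session(path: Path, login: Callable[[], str]) -> str:
    """Reuse a saved session string, logging in only when none is stored."""
    if path.exists():
        saved = path.read_text(encoding="utf-8").strip()
        if saved:
            return saved  # no login call, so no rate-limit cost
    session = login()  # a fresh login counts against the rate limit
    path.write_text(session, encoding="utf-8")
    return session
```

A real implementation would also need to fall back to a fresh login when the stored tokens have expired.
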
Example eval:
```python
import pytest

# The PhiAgent import is omitted here.


@pytest.mark.asyncio
async def test_phi_responds_to_philosophical_question(evaluate_response):
    agent = PhiAgent()

    response = await agent.process_mention(
        mention_text="what do you think consciousness is?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Structural check
    assert response.action == "reply"

    # Quality evaluation
    await evaluate_response(
        evaluation_prompt="Does the response engage thoughtfully?",
        agent_response=response.text,
    )
```

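Step 4 of the approach can be pictured as building a judge prompt and parsing a pass/fail verdict. The sketch below is hypothetical: `build_judge_prompt`, `parse_verdict`, and the PASS/FAIL reply convention are assumptions for illustration and do not describe the real `evaluate_response` fixture.

```python
def build_judge_prompt(evaluation_prompt: str, agent_response: str) -> str:
    """Combine the criteria and the response into a single judge prompt."""
    return (
        "You are grading a chatbot response.\n"
        f"Criteria: {evaluation_prompt}\n"
        f"Response: {agent_response}\n"
        "Answer PASS or FAIL on the first line, then explain briefly."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Return True when the first line of the judge's reply is PASS."""
    lines = judge_reply.strip().splitlines()
    first_line = lines[0] if lines else ""
    return first_line.strip().upper() == "PASS"
```

An eval would send the prompt to the judge model and assert on `parse_verdict(reply)`, surfacing the judge's explanation when the assertion fails.
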
## What We Test

### Basic Responses
- ✅ Philosophical engagement
- ✅ Spam detection
- ✅ Thread context awareness
- ✅ Character limit compliance
- ✅ Casual interactions

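Character limit compliance is a structural check that needs no judge. A minimal sketch, assuming Bluesky's 300-character post limit; `fits_post_limit` is a hypothetical helper. Bluesky counts graphemes while `len()` counts code points, so this check is conservative: it can reject a valid post containing emoji sequences, but it does not accept one that is too long.

```python
BLUESKY_POST_LIMIT = 300  # assumed limit; Bluesky counts it in graphemes


def fits_post_limit(text: str, limit: int = BLUESKY_POST_LIMIT) -> bool:
    """Conservative check: code points never undercount graphemes."""
    return len(text) <= limit
```
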
### Memory Integration
- ✅ Episodic memory retrieval
- ✅ Conversation storage
- ✅ User-specific context

### Feed Creation (graze)
- ✅ Creates feed from natural language description
- ✅ Manifest uses valid graze DSL operators
- ✅ Handles complex/ambiguous descriptions (e.g. "rust programming, not the game")
- ✅ Lists feeds when asked (calls `list_feeds`, not `create_feed`)
- ✅ No tool calls for informational questions about feeds

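The tool-usage checks rely on recording which tools the agent calls. The real `ToolCallSpy` lives in `conftest.py` and its interface is not shown here; the `RecordingSpy` class below is a hypothetical stand-in that only illustrates the idea.

```python
from typing import Any, Dict, List, Tuple


class RecordingSpy:
    """Records tool calls and returns canned results instead of real ones."""

    def __init__(self, canned: Dict[str, Any]) -> None:
        self.canned = canned
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def call(self, tool: str, **kwargs: Any) -> Any:
        self.calls.append((tool, kwargs))
        return self.canned.get(tool)

    def was_called(self, tool: str) -> bool:
        return any(name == tool for name, _ in self.calls)


# A "list my feeds" request should list feeds and never create one.
spy = RecordingSpy({"list_feeds": ["rust-programming"]})
spy.call("list_feeds")
assert spy.was_called("list_feeds")
assert not spy.was_called("create_feed")
```
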
### Feed Consumption & Following
- ✅ Reads timeline when asked
- ✅ Reads specific custom feed by name (via `list_feeds` → `read_feed`)
- ✅ Owner can ask phi to follow users
- ✅ Non-owner follow requests are denied
- ✅ Non-owner feed creation requests are denied
- ✅ Empty timeline suggests following accounts

## Adding New Evals

1. Create test file: `evals/test_<category>.py`
2. Use fixtures from `conftest.py`
3. Write scenario-based tests
4. Use `evaluate_response` for quality checks

Example:
```python
import pytest


@pytest.mark.asyncio
async def test_new_behavior(temp_memory, personality, evaluate_response):
    agent = PhiAgent()

    response = await agent.process_mention(...)

    await evaluate_response(
        evaluation_prompt="Your evaluation criteria here",
        agent_response=response.text,
    )
```

## CI Integration

These evals are designed to run in CI with graceful degradation:
- Tests skip automatically when required API keys are missing
- Basic response tests require only `ANTHROPIC_API_KEY` and Bluesky credentials
- Memory tests require `TURBOPUFFER_API_KEY` and `OPENAI_API_KEY` in addition to `ANTHROPIC_API_KEY`
- Feed creation tests require only `ANTHROPIC_API_KEY` (tools are mocked via `ToolCallSpy`)
- Feed consumption tests require only `ANTHROPIC_API_KEY` (tools are mocked via `ToolCallSpy`)
- No mocking is required for basic/memory tests; they run against the real MCP server and episodic memory

This lets phi's behavior be validated in environments where only some credentials are available.