Phi Evaluations#

Behavioral tests for phi using LLM-as-judge evaluation.

Structure#

evals/
├── conftest.py                # Test fixtures, evaluator, and ToolCallSpy
├── test_basic_responses.py    # Basic response behavior
├── test_feed_creation.py      # Graze feed tool usage
├── test_feed_consumption.py   # Feed reading, following, and owner-gating
└── test_memory_integration.py # Episodic memory tests

Running Evals#

# Run all evals (tests will skip if API keys are missing)
uv run pytest evals/ -v

# Run specific eval
uv run pytest evals/test_basic_responses.py::test_phi_responds_to_philosophical_question -v

# Run only basic response tests
uv run pytest evals/test_basic_responses.py -v

# Run only memory tests
uv run pytest evals/test_memory_integration.py -v

# Run only feed creation tests
uv run pytest evals/test_feed_creation.py -v

Environment Variables#

Tests will skip gracefully if required API keys are missing.

Required for all evals:

ANTHROPIC_API_KEY - For phi agent and LLM evaluator

Required for memory evals only:

TURBOPUFFER_API_KEY - For episodic memory storage
OPENAI_API_KEY - For embeddings

Required for ATProto MCP tools (used by agent):

BLUESKY_HANDLE - Bot's Bluesky handle
BLUESKY_PASSWORD - Bot's app password
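
The skip-on-missing-key behavior could be wired up roughly as follows. This is a sketch, not the contents of conftest.py: the helper `missing_keys` and the names `BASE_KEYS` and `EXTRA_KEYS` are hypothetical, and only the environment variable names come from the lists above.

```python
import os

# Variable names come from the lists above; the grouping itself is an assumption.
BASE_KEYS = ["ANTHROPIC_API_KEY"]
EXTRA_KEYS = {"memory": ["TURBOPUFFER_API_KEY", "OPENAI_API_KEY"]}


def missing_keys(group=None, env=None):
    """Return the required variables for `group` that are unset or empty."""
    env = os.environ if env is None else env
    needed = BASE_KEYS + EXTRA_KEYS.get(group, [])
    return [name for name in needed if not env.get(name)]


# Hypothetical wiring inside a fixture or test:
#     missing = missing_keys("memory")
#     if missing:
#         pytest.skip(f"missing env vars: {', '.join(missing)}")
```

Passing `env` explicitly keeps the check testable without touching the real environment.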

Evaluation Approach#

Each eval:

Sets up a scenario - Simulates a mention/interaction
Runs phi agent - Gets structured response
Makes assertions - Checks basic structure
LLM evaluation - Uses Claude Opus to judge quality

Important: The phi_agent fixture is session-scoped, so all tests in a run share one agent instance. Combined with session persistence (tokens are saved to a .session file), this keeps the suite under Bluesky's IP rate limit (10 logins per 24 hours per IP). The saved session is reused across test runs until the tokens expire (~2 months).
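
The session reuse described above can be illustrated with a minimal sketch. It assumes the .session file holds JSON and that `login` is some callable returning a token dict. Both are assumptions made for illustration, and the real file format and login call may differ. Token expiry is not handled here.

```python
import json
from pathlib import Path


def load_or_login(session_path, login):
    """Reuse saved tokens if present; otherwise call `login()` once and save the result."""
    path = Path(session_path)
    if path.exists():
        # Saved session found: no network login needed.
        return json.loads(path.read_text())
    tokens = login()  # the only step that would count against the login rate limit
    path.write_text(json.dumps(tokens))
    return tokens
```

Called from a session-scoped fixture, a helper like this would perform at most one login per test run, and none at all while a saved session exists.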

Example:

@pytest.mark.asyncio
async def test_phi_responds_to_philosophical_question(phi_agent, evaluate_response):
    # phi_agent is the shared session-scoped fixture; avoid constructing PhiAgent() per test
    response = await phi_agent.process_mention(
        mention_text="what do you think consciousness is?",
        author_handle="test.user",
        thread_context="...",
        thread_uri="...",
    )

    # Structural check
    assert response.action == "reply"

    # Quality evaluation
    await evaluate_response(
        evaluation_prompt="Does the response engage thoughtfully?",
        agent_response=response.text,
    )
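
For readers unfamiliar with LLM-as-judge checks, the sketch below shows the general shape of such an evaluator. It is not the evaluate_response fixture from conftest.py: the function names, the PASS/FAIL protocol, and the `judge` callable (any async function that takes a prompt string and returns the model's text) are assumptions for illustration.

```python
import asyncio


def build_judge_prompt(evaluation_prompt, agent_response):
    """Combine the criterion and the agent's reply into one grading prompt."""
    return (
        "You are grading a chatbot reply.\n"
        f"Criterion: {evaluation_prompt}\n"
        f"Reply: {agent_response}\n"
        "Answer PASS or FAIL on the first line, then one sentence of reasoning."
    )


async def evaluate_with_judge(judge, evaluation_prompt, agent_response):
    """Ask `judge` for a verdict and raise AssertionError unless it starts with PASS."""
    verdict = await judge(build_judge_prompt(evaluation_prompt, agent_response))
    stripped = verdict.strip()
    first_line = stripped.splitlines()[0].upper() if stripped else ""
    if not first_line.startswith("PASS"):
        raise AssertionError(f"judge rejected response: {verdict!r}")
    return verdict
```

Raising AssertionError lets pytest report a failed judgment like any other failed assertion, with the judge's reasoning in the message.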

What We Test#

Basic Responses#

✅ Philosophical engagement
✅ Spam detection
✅ Thread context awareness
✅ Character limit compliance
✅ Casual interactions

Memory Integration#

✅ Episodic memory retrieval
✅ Conversation storage
✅ User-specific context

Feed Creation (graze)#

✅ Creates feed from natural language description
✅ Manifest uses valid graze DSL operators
✅ Handles complex/ambiguous descriptions (e.g. "rust programming, not the game")
✅ Lists feeds when asked (calls list_feeds, not create_feed)
✅ No tool calls for informational questions about feeds
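
Tool-usage checks like these rely on recording which tools the agent invoked. A minimal spy might look like the sketch below. The real ToolCallSpy in conftest.py may be structured differently, and the `record`, `names`, and `was_called` members are hypothetical. Only the tool names list_feeds and create_feed come from this document.

```python
class ToolCallSpy:
    """Records (tool_name, arguments) pairs in the order the tools were called."""

    def __init__(self):
        self.calls = []

    def record(self, tool_name, /, **arguments):
        # tool_name is positional-only so a tool argument may itself be called "tool_name"
        self.calls.append((tool_name, arguments))

    @property
    def names(self):
        return [tool_name for tool_name, _ in self.calls]

    def was_called(self, tool_name):
        return tool_name in self.names


spy = ToolCallSpy()
spy.record("list_feeds")  # stands in for the agent invoking the tool
assert spy.was_called("list_feeds")
assert not spy.was_called("create_feed")
```

Because calls are kept in order, the same spy can also check sequences such as list_feeds followed by read_feed by comparing `spy.names` against the expected list.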

Feed Consumption & Following#

✅ Reads timeline when asked
✅ Reads specific custom feed by name (via list_feeds → read_feed)
✅ Owner can ask phi to follow users
✅ Non-owner follow requests are denied
✅ Non-owner feed creation requests are denied
✅ Empty timeline suggests following accounts

Adding New Evals#

Create test file: evals/test_<category>.py
Use fixtures from conftest.py
Write scenario-based tests
Use evaluate_response for quality checks