# testing

phi uses behavioral testing with llm-as-judge evaluation.

## philosophy

**test outcomes, not implementation**

we care that phi:

- replies appropriately to mentions
- uses thread context correctly
- maintains consistent personality
- makes reasonable action decisions

we don't care:

- which exact HTTP calls were made
- internal state of the agent
- specific tool invocation order

## test structure

evals use a local `Response` output type (defined in `evals/conftest.py`) that predates the tool-based migration. production phi uses tool calls for actions and returns a plain summary string, but the evals still want structured assertions on the action and the text, so they keep the structured type. a sketch of what such a type looks like is in the appendix below.

## llm-as-judge

for subjective qualities (tone, relevance, personality), evals use claude as a judge to evaluate phi's responses against behavioral criteria. see the judge sketch in the appendix below.

## what we test

### unit tests

- memory operations (store/retrieve)
- thread context building
- response parsing

### integration tests

- full mention handling flow
- thread discovery
- decision making

### behavioral tests (evals)

- personality consistency
- thread awareness
- appropriate action selection
- memory utilization

## mocking strategy

**mock external services, not internal logic**

- mock ATProto client (don't actually post to bluesky)
- mock TurboPuffer (in-memory dict instead of network calls; sketched in the appendix below)
- mock MCP server (fake tool implementations)

**keep agent logic real**: we want to test actual decision making.

## running tests

```bash
just test   # unit tests
just evals  # behavioral tests with llm-as-judge
just check  # full suite (lint + typecheck + test)
```

## test isolation

tests never touch production:

- no real bluesky posts
- separate turbopuffer namespace for tests
- deterministic mock responses where needed
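## appendix: code sketches

a minimal sketch of the kind of structured output type the evals assert against. the real definition lives in `evals/conftest.py`; the field names and action values here are assumptions for illustration, not the actual definition.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Response:
    # which action phi decided to take (these action names are assumptions)
    action: Literal["reply", "post", "ignore"]
    # the text phi would send; empty when action is "ignore"
    text: str = ""
```

a structured type like this is what lets an eval write `assert result.action == "reply"` instead of parsing production phi's plain summary string.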
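a minimal llm-as-judge sketch, not the project's actual judge harness: send phi's response plus one behavioral criterion to claude and parse a PASS/FAIL verdict. the model id and prompt wording are assumptions.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(response_text: str, criterion: str) -> bool:
    """ask claude whether response_text satisfies a behavioral criterion."""
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; pin whatever the evals use
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                f"criterion: {criterion}\n\n"
                f"response under test:\n{response_text}\n\n"
                "does the response satisfy the criterion? "
                "answer with exactly PASS or FAIL."
            ),
        }],
    )
    return message.content[0].text.strip().upper().startswith("PASS")
```

an eval can then combine both styles of assertion: `result.action` checked directly, and `judge(result.text, "stays in phi's persona")` for the subjective part.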
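a sketch of the "in-memory dict instead of network calls" idea behind the TurboPuffer mock. the method names mirror a generic vector store and are assumptions, not the real client's api; the point is that memory operations stay deterministic and offline.

```python
class FakeMemoryStore:
    """stand-in for turbopuffer: a dict plus brute-force similarity search."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}

    def upsert(self, doc_id: str, vector: list[float], attrs: dict) -> None:
        # store the vector alongside arbitrary attributes, keyed by id
        self._rows[doc_id] = {"id": doc_id, "vector": vector, **attrs}

    def query(self, vector: list[float], top_k: int = 5) -> list[dict]:
        # crude dot-product ranking is plenty for deterministic tests
        def score(row: dict) -> float:
            return sum(a * b for a, b in zip(vector, row["vector"]))
        return sorted(self._rows.values(), key=score, reverse=True)[:top_k]
```

because each test gets a fresh instance, this also gives the isolation and determinism the test isolation section asks for without touching the production turbopuffer namespace.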