a digital entity named phi that roams bsky phi.zzstoatzz.io

testing#

phi uses behavioral testing with llm-as-judge evaluation.

philosophy#

test outcomes, not implementation

we care that phi:

  • replies appropriately to mentions
  • uses thread context correctly
  • maintains consistent personality
  • makes reasonable action decisions

we don't care about:

  • which exact HTTP calls were made
  • internal state of the agent
  • specific tool invocation order

test structure#

evals use a local Response output type (defined in evals/conftest.py) that predates the migration to tool-based actions. in production, phi performs actions through tool calls and returns a plain summary string; the evals keep the structured type so they can still make assertions on action and text.
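
a minimal sketch of what such a structured type and assertion could look like. the field names (action, text, reason), the allowed action values, and the 300-character limit are assumptions for illustration, not the actual definition in evals/conftest.py:

```python
from dataclasses import dataclass
from typing import Literal, Optional


# hypothetical stand-in for the eval-only output type; the real one
# lives in evals/conftest.py and may use different fields
@dataclass
class Response:
    action: Literal["reply", "like", "repost", "ignore"]
    text: Optional[str] = None
    reason: Optional[str] = None


def assert_replied(response: Response, max_len: int = 300) -> None:
    """structured assertion on the outcome, not on how it was produced."""
    assert response.action == "reply", f"expected a reply, got {response.action}"
    assert response.text, "a reply must carry text"
    assert len(response.text) <= max_len, "reply exceeds the assumed length limit"


# passes silently: the action is a reply and the text is short
assert_replied(Response(action="reply", text="hey, good question!"))
```

the assertion only inspects the final action and text, which matches the "test outcomes, not implementation" rule above.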

llm-as-judge#

for subjective qualities (tone, relevance, personality), evals use claude as a judge to evaluate phi's responses against behavioral criteria.
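
the judge pattern can be sketched roughly as follows. everything here is hypothetical (the Verdict type, the prompt wording, the ask_model callable); a real eval would pass in a function that calls claude, while this sketch uses a deterministic stub so it runs offline:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    explanation: str


def build_judge_prompt(criteria: str, response_text: str) -> str:
    # prompt wording is illustrative, not phi's actual judge prompt
    return (
        "you are evaluating a bot's reply against a behavioral criterion.\n"
        f"criterion: {criteria}\n"
        f"reply: {response_text}\n"
        "answer PASS or FAIL on the first line, then a one-line explanation."
    )


def judge(criteria: str, response_text: str, ask_model: Callable[[str], str]) -> Verdict:
    """ask a model to grade the reply; ask_model is any prompt -> text callable."""
    raw = ask_model(build_judge_prompt(criteria, response_text)).strip()
    first_line, _, rest = raw.partition("\n")
    passed = first_line.strip().upper().startswith("PASS")
    return Verdict(passed=passed, explanation=rest.strip())


def fake_model(prompt: str) -> str:
    # deterministic stub standing in for the real model call
    return "PASS\nthe reply is friendly and on topic."


verdict = judge("reply is friendly and relevant", "thanks for asking!", fake_model)
assert verdict.passed
```

keeping the model call behind a plain callable means the prompt-building and verdict-parsing code can be unit tested without network access.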

what we test#

unit tests#

  • memory operations (store/retrieve)
  • thread context building
  • response parsing

integration tests#

  • full mention handling flow
  • thread discovery
  • decision making

behavioral tests (evals)#

  • personality consistency
  • thread awareness
  • appropriate action selection
  • memory utilization

mocking strategy#

mock external services, not internal logic

  • mock ATProto client (don't actually post to bluesky)
  • mock TurboPuffer (in-memory dict instead of network calls)
  • mock MCP server (fake tool implementations)

keep agent logic real - we want to test actual decision making.
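
as a rough illustration of the first two mocks, assuming hypothetical interfaces (the class and method names below are made up for this sketch and are not taken from phi's code):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeBlueskyClient:
    """records posts in a list instead of sending them to bluesky."""
    posts: list = field(default_factory=list)

    def send_post(self, text: str, reply_to: Optional[str] = None) -> dict:
        self.posts.append({"text": text, "reply_to": reply_to})
        return {"uri": f"at://fake.test/post/{len(self.posts)}"}


class InMemoryMemory:
    """dict-backed stand-in for the turbopuffer-backed memory store."""

    def __init__(self) -> None:
        self._by_user = {}

    def store(self, handle: str, text: str) -> None:
        self._by_user.setdefault(handle, []).append(text)

    def retrieve(self, handle: str, limit: int = 5) -> list:
        # return the most recent entries, oldest first
        return self._by_user.get(handle, [])[-limit:]


client = FakeBlueskyClient()
memory = InMemoryMemory()
memory.store("alice.test", "likes birds")
client.send_post("hello!", reply_to="at://fake.test/post/0")

assert memory.retrieve("alice.test") == ["likes birds"]
assert client.posts[0]["text"] == "hello!"
```

the fakes replace only the network boundary; whatever decides what to store or post would still be the real agent code.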

running tests#

just test        # unit tests
just evals       # behavioral tests with llm-as-judge
just check       # full suite (lint + typecheck + test)

test isolation#

tests never touch production:

  • no real bluesky posts
  • separate turbopuffer namespace for tests
  • deterministic mock responses where needed
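
one way the namespace separation could be enforced, as a sketch only. the environment variable names (PHI_ENV, PHI_NAMESPACE) and the "-test" suffix are assumptions for illustration and may not match phi's actual configuration:

```python
import os
from typing import Mapping, Optional


def memory_namespace(env: Optional[Mapping[str, str]] = None) -> str:
    """pick the memory namespace, suffixing it when running under tests."""
    env = os.environ if env is None else env
    base = env.get("PHI_NAMESPACE", "phi")  # hypothetical variable name
    if env.get("PHI_ENV") == "test":  # hypothetical variable name
        return f"{base}-test"
    return base


assert memory_namespace({"PHI_ENV": "test"}) == "phi-test"
assert memory_namespace({}) == "phi"
```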