feat: add architecture eval runner — end-to-end CRUD verification
Automated test harness that bootstraps the todo-app from scratch,
starts the server, runs 10 HTTP tests (create, list, get, update,
delete, validation, 404), and scores pass rate.
Score: 10/10 (100%) — reproducible across clean bootstraps.
This is the autoresearch eval function for architecture prompt tuning.