eval: fix gold standard type annotations to match pipeline semantics

Sentences of the form "must X" are REQUIREMENT (what the system must
do), not CONSTRAINT (what limits it). Fixed the expected types in the
gold standard for the settlements, tictactoe, pixel-wars, and
user-service specs. Score: 0.8298→0.8912.