feat: add LLM experiment infrastructure

Wire canonicalizer-llm.ts to use CONFIG for all LLM parameters (model,
temperature, system prompts, batch size, self-consistency k). Add
eval-runner-llm.ts harness and program-llm.md agent instructions.

LLM normalizer baseline: 0.8599, below the rule-based 0.9635. Recall
drops from 100% to 71% because LLM rewrites break substring matching.