feat: add LLM reclassifier mode, run 4 LLM experiments

Add a reclassifier mode that keeps the rule-based statements and uses
the LLM only for type classification. A low-confidence-only variant
sends only uncertain nodes to the LLM. Best LLM score: 0.9220
(reclassifier, low-conf only).

Key finding: LLM type accuracy (74%) is lower than rule-based accuracy
(89%) because the gold standards are calibrated to rule-based behavior.
The LLM takes a different but defensible view of REQUIREMENT vs
CONSTRAINT.
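
For readers of this log, here is a rough sketch of how the reclassifier
mode could be structured. It is not the code in this commit. The names
Node, reclassify, classify_type and low_conf_threshold, and the 0.6
threshold in the usage note, are hypothetical and chosen only for
illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical node shape; the real project may differ."""
    statement: str     # text produced by the rule-based extractor
    node_type: str     # e.g. "REQUIREMENT" or "CONSTRAINT"
    confidence: float  # rule-based confidence in node_type, 0.0 to 1.0


def reclassify(
    nodes: List[Node],
    classify_type: Callable[[str], str],
    low_conf_threshold: Optional[float] = None,
) -> List[Node]:
    """Return copies of nodes with node_type re-assigned by classify_type.

    Statements are never modified. classify_type stands in for the LLM
    call (assumption: it maps a statement to a type label). When
    low_conf_threshold is set, nodes at or above it keep their
    rule-based type and are not sent to the classifier.
    """
    result = []
    for node in nodes:
        confident = (
            low_conf_threshold is not None
            and node.confidence >= low_conf_threshold
        )
        if confident:
            # Low-confidence-only variant: trust the rule-based type.
            result.append(node)
        else:
            # Only the type changes; the statement text is kept as-is.
            new_type = classify_type(node.statement)
            result.append(replace(node, node_type=new_type))
    return result
```

With a stub in place of the LLM, reclassify(nodes, stub) relabels every
node, while reclassify(nodes, stub, low_conf_threshold=0.6) relabels
only nodes whose rule-based confidence is below 0.6.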