fix: hierarchy inference, heading extraction, and gold standard accuracy
Three fixes that moved score from 0.9021 to 0.9635:
1. Hierarchy: allow any node type as parent, not just CONTEXT. Specs
without CONTEXT nodes at shallower depths now get proper hierarchy.
Coverage: 58%→99%.
2. Sentence segmenter: extract heading text as sentences instead of
skipping them. Headings like "Win Detection" are semantic content.
Coverage: 91%→100%.
3. Gold standard: fix substring mismatches ("unique id"→"unique expense
id", "only creator can delete"→"member who created") and correct
type annotations to match pipeline semantics.