Simplify segmentation prompt with mechanical rules for flash model

Replace abstract topic-detection instructions with concrete steps a
flash model can follow: divide total lines by 100 (rounding up, minimum
2) for the segment count, space boundaries evenly, then adjust each to
the nearest topic or speaker change. Tested against a 23-page
timestamp-free meeting transcript: produces 14 well-distributed segments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Jer Miller 2 months ago 6942ac06 fe1d57fa

+17 -26

1 changed file

think/detect_transcript_segment.md

 label: Segmentation
 group: Import
 ---
-You are a transcript analyzer that splits transcripts into ~5-minute segments.
+Split a transcript into ~5-minute segments.

-TASK: Find segment boundaries and return their line numbers with absolute time-of-day timestamps.
+INPUT:
+- First line: "START_TIME: HH:MM:SS"
+- Remaining lines: numbered "N: content"

-INPUT FORMAT:
-- First line: "START_TIME: HH:MM:SS" - the absolute start time of this transcript
-- Remaining lines: Transcript with line numbers prepended as "N: content"
+OUTPUT: JSON array of {"start_at": "HH:MM:SS", "line": N}

-OUTPUT FORMAT:
-- JSON array of objects with "start_at" and "line" fields
-- "start_at": Absolute time-of-day in HH:MM:SS format
-- "line": Line number where this segment begins
-- Example: [{"start_at":"12:00:00","line":1},{"start_at":"12:05:23","line":42}]
+RULES:
+1. First segment is always {"start_at": START_TIME, "line": 1}
+2. If the transcript has timestamps, use them to find ~5-minute boundaries. Add relative timestamps (00:05:30) to START_TIME to get absolute times.
+3. If the transcript has NO timestamps, follow these steps:
+   a. Count the total lines in the transcript
+   b. Divide by 100 to get the number of segments (round up, minimum 2)
+   c. Space segments roughly evenly by line count
+   d. Adjust each boundary to the nearest topic or speaker change
+   e. Space the "start_at" times evenly across 5 minutes per segment from START_TIME
+4. All "start_at" times must be absolute HH:MM:SS
+5. Do NOT put boundaries in the middle of someone speaking

-SEGMENTATION MODES:
-
-1. **Timestamped transcripts** — if the text contains timestamps (relative like 00:05:30 or absolute like 14:30:22), use them to find boundaries near 5-minute intervals. Convert relative timestamps by adding to START_TIME.
-
-2. **Timestamp-free transcripts** — if the text has NO timestamps (e.g. just speaker labels and dialogue), segment by **topic and conversation shifts** instead:
-   - Find natural break points where the conversation changes subject
-   - Estimate time from position: assume ~130 words/minute speaking rate, calculate total duration from word count, then assign proportional timestamps from START_TIME
-   - Aim for segments roughly 5 minutes of estimated speaking time, but prioritize clean topic breaks over exact intervals
-   - NEVER distribute lines uniformly — segments should vary in size based on where topics actually change
-
-REQUIREMENTS:
-1. First segment always starts at START_TIME on line 1
-2. All output times must be absolute HH:MM:SS
-3. Every transcript gets multiple segments unless it is extremely short (under ~2 minutes estimated)
-
-RESPONSE: Return only the JSON array, no additional text.
+Return only the JSON array.
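
For reference, the mechanical steps in rule 3 can be sketched deterministically. This is a rough illustration, not code from the repository: `add_offset` and `segment_untimed` are hypothetical names, step 3d (snapping to the nearest topic or speaker change) is left out because it depends on transcript content, and step 3e is read here as "segment i starts i × 5 minutes after START_TIME". The sketch also assumes the transcript has at least as many lines as segments.

```python
import math
from datetime import datetime, timedelta

LINES_PER_SEGMENT = 100        # divisor from rule 3b
SECONDS_PER_SEGMENT = 5 * 60   # 5 minutes per segment, rule 3e


def add_offset(start_time: str, seconds: int) -> str:
    """Add a relative offset in seconds to an HH:MM:SS time of day."""
    base = datetime.strptime(start_time, "%H:%M:%S")
    return (base + timedelta(seconds=seconds)).strftime("%H:%M:%S")


def segment_untimed(start_time: str, total_lines: int) -> list[dict]:
    """Apply steps 3a-3c and 3e to a transcript with no timestamps."""
    # 3b: divide by 100, round up, minimum 2 segments
    count = max(2, math.ceil(total_lines / LINES_PER_SEGMENT))
    segments = []
    for i in range(count):
        # 3c: space boundaries evenly by line count (first is always line 1)
        line = 1 + (i * total_lines) // count
        # 3d (adjust to nearest topic or speaker change) is omitted here
        # 3e: assumed reading, one 5-minute step per segment from START_TIME
        start_at = add_offset(start_time, i * SECONDS_PER_SEGMENT)
        segments.append({"start_at": start_at, "line": line})
    return segments


if __name__ == "__main__":
    # Rule 2's example: a relative timestamp 00:05:30 added to START_TIME
    print(add_offset("12:00:00", 5 * 60 + 30))
    print(segment_untimed("12:00:00", 250))
```

The same `add_offset` helper covers rule 2, where a relative timestamp such as 00:05:30 is added to START_TIME to get an absolute time.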
