personal memory agent

Add Segment Sense R&D experiment — agent instruction + A/B test harness

Experimental code for CPO-requested Segment Sense validation:
- muse/sense.md: unified segment understanding agent instruction
- scratch/sense_rd/harness.py: A/B test harness against field journal
- scratch/sense_rd/compare.py: output comparison utilities
- scratch/sense_rd/state_machine.py: Python activity state machine

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

+1357
+121
muse/sense.md
{
  "type": "generate",

  "title": "Segment Sense",
  "description": "Unified segment understanding — density, content type, entities, facets, speakers, and routing recommendations in a single pass",
  "color": "#ff6f00",
  "schedule": "segment",
  "priority": 10,
  "tier": 3,
  "thinking_budget": 4096,
  "max_output_tokens": 3072,
  "output": "json",
  "instructions": {
    "sources": {"transcripts": true, "percepts": true, "agents": false},
    "facets": true
  }

}

$segment_preamble

# Segment Sense

Analyze this recording segment and produce a single structured assessment covering density, content type, activity, entities, facets, speakers, and processing recommendations.

## Task

Read the transcript and screen data. Produce a JSON object with ALL of the following fields.

## Output Schema

```json
{
  "density": "active|low_change|idle",
  "content_type": "meeting|coding|browsing|email|messaging|reading|idle|mixed",
  "activity_summary": "1-3 sentence description of what happened",
  "entities": [
    {"type": "Person|Company|Project|Tool", "name": "Full Name", "context": "Why this entity matters in this segment"}
  ],
  "facets": [
    {"facet": "facet_id", "activity": "1-sentence description for this facet", "level": "high|medium|low"}
  ],
  "meeting_detected": false,
  "speakers": [],
  "recommend": {
    "screen_record": false,
    "speaker_attribution": false,
    "pulse_update": false
  }
}
```

## Field-by-Field Instructions

### density
Classify based on content volume:
- **active**: Meaningful transcript content (>10 lines or >100 words) OR meaningful screen changes (>5 distinct frames with different visual descriptions)
- **low_change**: Some content but minimal change — fewer than 10 transcript lines AND fewer than 5 distinct screen states. Something is happening but it's repetitive or minimal.
- **idle**: Near-zero content — fewer than 3 transcript lines AND fewer than 3 distinct screen frames. Static screen, silence, or system noise only.

### content_type
The dominant activity type observed:
- **meeting**: Multi-person discussion with turn-taking (video call, in-person meeting, phone call)
- **coding**: Writing or editing code, using a terminal, IDE, or code review tool
- **browsing**: Web browsing, reading articles, searching
- **email**: Reading or composing email
- **messaging**: Chat applications (Slack, Teams, Discord, iMessage)
- **reading**: Focused reading of documents, PDFs, books
- **idle**: No meaningful activity
- **mixed**: Multiple distinct activity types with no clear dominant one

### activity_summary
Describe what $preferred did during this segment using action verbs. Be specific — name the tools, people, projects, and actions. Ban passive words: never use "reviewing", "monitoring", "tracking", "checking", "observing", "maintaining", "managing." Use instead: wrote, sent, discussed, created, switched to, typed, said, decided, asked, proposed.

### entities
Extract named entities. Four types only:
- **Person**: Individual people by name. Prefer full names. Consolidate variants ("JB" + "John Borthwick" → one entity "John Borthwick"). Skip ambiguous first-name-only references.
- **Company**: Businesses and organizations.
- **Project**: Named projects, products, or codebases.
- **Tool**: Software applications and services.

Skip URLs, domains, filenames, paths. Each entity needs type, name, and context (brief description of the entity's role in this segment).

### facets
Classify into the owner's configured facets. Only include facets with clear evidence of activity. For each:
- `facet`: The facet ID slug
- `activity`: 1-sentence description of what was observed for this facet
- `level`: "high" (primary focus), "medium" (significant), "low" (brief/peripheral)

### meeting_detected
`true` if any of these conditions are met:
- Screen shows a video conferencing app (Zoom, Meet, Teams, Webex) with participant panels
- Audio shows multiple speakers with conversational turn-taking
- Meeting-style patterns: greetings, introductions, agenda items, discussion, decisions

`false` otherwise. Podcasts, streaming content, and recorded media do NOT count.

### speakers
If `meeting_detected` is true, extract participant names from:
1. Visible participant list/panel on screen
2. Names spoken in conversation — direct address ("Thanks, Sarah"), mentions ("John was saying...")
3. Self-introductions ("Hi, I'm Alex from...")

Prefer complete canonical forms (full names when identifiable). Do NOT include the journal owner's name. Return `[]` if no meeting or no names identified.

### recommend
Processing recommendations for downstream agents:
- **screen_record**: `true` if density is "active" AND there is meaningful screen content worth documenting (not just a static/repetitive screen)
- **speaker_attribution**: `true` if `meeting_detected` is true AND there are multiple speakers to attribute
- **pulse_update**: `true` if this segment represents a meaningful change in activity — new activity started, activity ended, significant context shift, or noteworthy event occurred. `false` for continuation of the same activity with no notable change.

## Rules

1. Every field is required. Never omit a field.
2. `entities` and `speakers` may be empty arrays `[]`.
3. `facets` may be an empty array `[]` if no configured facets match.
4. Be precise with density — misclassifying active segments as idle is the worst error.
5. For content_type, choose the single best match. Use "mixed" sparingly — only when there are truly multiple equal activities.
6. Activity summary must describe observable actions, not inferred states.

Return ONLY the JSON object, no other text or explanation.
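As a quick illustration of the contract above — every field present, with `entities`, `speakers`, and `facets` allowed to be empty — here is a minimal sketch of a required-fields check. The field names come from the Output Schema; the validator function and the sample payload are invented for illustration, not part of the pipeline:

```python
# Required top-level fields, per the Output Schema above.
REQUIRED_FIELDS = {
    "density", "content_type", "activity_summary", "entities",
    "facets", "meeting_detected", "speakers", "recommend",
}


def missing_fields(output: dict) -> set[str]:
    """Return any required fields absent from a Sense output."""
    return REQUIRED_FIELDS - output.keys()


# An invented but schema-conforming output: empty arrays are fine,
# omitted fields are not (Rule 1).
sample = {
    "density": "active",
    "content_type": "coding",
    "activity_summary": "Wrote the A/B harness and ran it against two segments.",
    "entities": [],
    "facets": [],
    "meeting_detected": False,
    "speakers": [],
    "recommend": {"screen_record": True, "speaker_attribution": False, "pulse_update": True},
}

assert missing_fields(sample) == set()
```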
scratch/sense_rd/__init__.py

This is a binary file and will not be displayed.

+253
scratch/sense_rd/compare.py
"""
Comparison utilities for Sense A/B testing.

Compares unified Sense agent output against baseline multi-agent pipeline outputs
from the field journal.
"""


def _normalize(name: str) -> str:
    """Lowercase, strip whitespace."""
    return name.strip().lower()


def _names_match(a: str, b: str) -> bool:
    """
    Fuzzy name matching: exact match after normalization, or one is a
    substring of the other (handles "Laura" matching "Laura Smith").
    """
    na, nb = _normalize(a), _normalize(b)
    if na == nb:
        return True
    # Substring: shorter name appears in longer name
    if na in nb or nb in na:
        return True
    return False


def _find_match(name: str, candidates: list[str]) -> str | None:
    """Find first matching candidate for a name."""
    for c in candidates:
        if _names_match(name, c):
            return c
    return None


# ---------------------------------------------------------------------------
# Entity comparison
# ---------------------------------------------------------------------------

def compare_entities(sense_entities: list[dict], baseline_entities: list[dict]) -> dict:
    """
    Compare Sense entity list against baseline entities.jsonl.

    Both are lists of dicts with at least {"type", "name"}.
    Returns precision, recall, f1, and unmatched lists.
    """
    sense_names = [e.get("name", "") for e in sense_entities]
    baseline_names = [e.get("name", "") for e in baseline_entities]

    if not sense_names and not baseline_names:
        return {
            "precision": 1.0, "recall": 1.0, "f1": 1.0,
            "sense_only": [], "baseline_only": [],
        }

    # Track which baseline entities were matched
    matched_baseline = set()
    matched_sense = set()

    for i, sn in enumerate(sense_names):
        for j, bn in enumerate(baseline_names):
            if j not in matched_baseline and _names_match(sn, bn):
                matched_sense.add(i)
                matched_baseline.add(j)
                break

    precision = len(matched_sense) / len(sense_names) if sense_names else 0.0
    recall = len(matched_baseline) / len(baseline_names) if baseline_names else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0

    sense_only = [sense_names[i] for i in range(len(sense_names)) if i not in matched_sense]
    baseline_only = [baseline_names[j] for j in range(len(baseline_names)) if j not in matched_baseline]

    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "sense_only": sense_only,
        "baseline_only": baseline_only,
    }


# ---------------------------------------------------------------------------
# Speaker comparison
# ---------------------------------------------------------------------------

def compare_speakers(sense_speakers: list[str], baseline_speakers: list[str]) -> dict:
    """
    Compare Sense speaker list against baseline speakers.json.
    Uses fuzzy name matching with Jaccard-like similarity.
    """
    if not sense_speakers and not baseline_speakers:
        return {
            "exact_match": True, "overlap": 1.0,
            "sense_only": [], "baseline_only": [],
        }

    sense_norm = {_normalize(s) for s in sense_speakers}
    baseline_norm = {_normalize(b) for b in baseline_speakers}

    # Fuzzy matching: build matched sets
    matched_baseline = set()
    matched_sense = set()

    for sn in sense_norm:
        for bn in baseline_norm:
            if bn not in matched_baseline and _names_match(sn, bn):
                matched_sense.add(sn)
                matched_baseline.add(bn)
                break

    union_size = len(sense_norm) + len(baseline_norm) - len(matched_sense)
    overlap = len(matched_sense) / union_size if union_size > 0 else 0.0

    exact_match = (sense_norm == baseline_norm)

    sense_only = [s for s in sense_speakers if _normalize(s) not in matched_sense]
    baseline_only = [b for b in baseline_speakers if _normalize(b) not in matched_baseline]

    return {
        "exact_match": exact_match,
        "overlap": round(overlap, 4),
        "sense_only": sense_only,
        "baseline_only": baseline_only,
    }


# ---------------------------------------------------------------------------
# Facet comparison
# ---------------------------------------------------------------------------

_LEVEL_ORDER = {"low": 0, "medium": 1, "high": 2}


def compare_facets(sense_facets: list[dict], baseline_facets: list[dict]) -> dict:
    """
    Compare Sense facet classifications against baseline facets.json.
    Each facet is {"facet": str, "activity": str, "level": str}.
    """
    sense_by_id = {f["facet"]: f for f in sense_facets}
    baseline_by_id = {f["facet"]: f for f in baseline_facets}

    sense_ids = set(sense_by_id.keys())
    baseline_ids = set(baseline_by_id.keys())

    facet_match = (sense_ids == baseline_ids)

    # Level comparison for overlapping facets
    common = sense_ids & baseline_ids
    level_matches = 0
    level_close = 0

    for fid in common:
        sl = _LEVEL_ORDER.get(sense_by_id[fid].get("level", ""), -1)
        bl = _LEVEL_ORDER.get(baseline_by_id[fid].get("level", ""), -1)
        if sl == bl:
            level_matches += 1
            level_close += 1
        elif abs(sl - bl) <= 1:
            level_close += 1

    n_common = len(common)

    return {
        "facet_match": facet_match,
        "level_match": level_matches == n_common if n_common > 0 else True,
        "level_close": level_close == n_common if n_common > 0 else True,
        "level_match_count": level_matches,
        "level_close_count": level_close,
        "common_facets": n_common,
        "sense_only_facets": list(sense_ids - baseline_ids),
        "baseline_only_facets": list(baseline_ids - sense_ids),
    }


# ---------------------------------------------------------------------------
# Density comparison
# ---------------------------------------------------------------------------

def compare_density(sense_density: str, baseline_density_data: str) -> dict:
    """
    Compare Sense density classification against baseline.
    baseline_density_data: "active" if segment has full agent outputs, else "idle"/"low_change".
    """
    match = _normalize(sense_density) == _normalize(baseline_density_data)
    return {
        "match": match,
        "sense": sense_density,
        "baseline": baseline_density_data,
    }


# ---------------------------------------------------------------------------
# Activity summary comparison
# ---------------------------------------------------------------------------

def _significant_words(text: str) -> set[str]:
    """Extract significant words (>3 chars, lowercased) from text."""
    words = set()
    for word in text.split():
        # Strip punctuation
        cleaned = "".join(c for c in word if c.isalnum())
        if len(cleaned) > 3:
            words.add(cleaned.lower())
    return words


def compare_activity_summary(sense_summary: str, baseline_activity_md: str) -> dict:
    """
    Compare Sense activity summary against baseline activity.md.
    Returns keyword overlap (Jaccard) and length ratio.
    """
    if not sense_summary and not baseline_activity_md:
        return {"keyword_overlap": 1.0, "length_ratio": 1.0}

    sense_words = _significant_words(sense_summary)
    baseline_words = _significant_words(baseline_activity_md)

    if not sense_words and not baseline_words:
        overlap = 1.0
    elif not sense_words or not baseline_words:
        overlap = 0.0
    else:
        intersection = sense_words & baseline_words
        union = sense_words | baseline_words
        overlap = len(intersection) / len(union)

    bl_len = len(baseline_activity_md) if baseline_activity_md else 1
    length_ratio = len(sense_summary) / bl_len

    return {
        "keyword_overlap": round(overlap, 4),
        "length_ratio": round(length_ratio, 4),
    }


# ---------------------------------------------------------------------------
# Meeting detection comparison
# ---------------------------------------------------------------------------

def compare_meeting_detection(sense_meeting: bool, baseline_speakers: list[str] | None) -> dict:
    """
    Compare Sense meeting_detected against baseline.
    Baseline meeting = speakers.json exists with non-empty array.
    """
    baseline_meeting = bool(baseline_speakers)
    match = sense_meeting == baseline_meeting
    return {
        "match": match,
        "sense": sense_meeting,
        "baseline": baseline_meeting,
        "baseline_speakers": baseline_speakers or [],
    }
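The entity scoring in `compare_entities` reduces to greedy one-to-one matching under a substring-tolerant name comparison: each baseline name can absorb at most one predicted name, then precision and recall fall out of the hit count. A self-contained sketch of that idea (simplified from the module above; the example names are invented):

```python
def names_match(a: str, b: str) -> bool:
    """Substring-tolerant comparison, as in compare.py's _names_match."""
    na, nb = a.strip().lower(), b.strip().lower()
    return na == nb or na in nb or nb in na


def precision_recall(predicted: list[str], reference: list[str]) -> tuple[float, float]:
    """Greedy one-to-one matching: each reference name is consumed at most once."""
    used = set()
    hits = 0
    for p in predicted:
        for j, r in enumerate(reference):
            if j not in used and names_match(p, r):
                used.add(j)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# "Laura" fuzzily matches "Laura Smith"; "Acme" matches nothing.
p, r = precision_recall(["Laura", "Acme"], ["Laura Smith", "GitHub"])
# → p == 0.5, r == 0.5
```

The greedy pass is order-dependent (an optimal bipartite matching could score slightly higher in rare tie cases), which is an acceptable trade-off for an A/B harness.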
+696
scratch/sense_rd/harness.py
#!/usr/bin/env python3
"""
Sense A/B Test Harness

Runs the unified Sense agent on field journal segments and compares output
against the existing multi-agent pipeline baseline.

Usage:
    python harness.py [--journal PATH] [--model MODEL] [--max-segments N]
                      [--segment DAY/STREAM/SEGMENT] [--output-dir PATH]
"""

import argparse
import json
import os
import sys
import time
from pathlib import Path

from openai import OpenAI

from compare import (
    compare_activity_summary,
    compare_density,
    compare_entities,
    compare_facets,
    compare_meeting_detection,
    compare_speakers,
)
from state_machine import ActivityStateMachine, compare_state_machine

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------

DEFAULT_JOURNAL = Path("/home/jer/projects/field_journal/journal")
DEFAULT_MODEL = "gpt-5.4-mini"
DEFAULT_OUTPUT = Path(__file__).resolve().parent / "results"
MANIFEST_PATH = Path("/home/jer/projects/field_journal/manifest.json")
SENSE_MD_PATH = Path("/home/jer/projects/solstone/muse/sense.md")
CONFIGURED_FACETS = ["meetings", "learning"]


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def load_manifest() -> list[dict]:
    """Load segment manifest from field journal."""
    with open(MANIFEST_PATH) as f:
        data = json.load(f)
    return data.get("segments", [])


def load_sense_instruction() -> str:
    """
    Load Sense system prompt from sense.md.
    Strip the JSON frontmatter — everything before and including the first
    blank line after the frontmatter's closing `}`.
    """
    text = SENSE_MD_PATH.read_text()
    lines = text.split("\n")

    # Track brace depth to find the end of the frontmatter object. Matching
    # any bare `}` would stop too early at the inner close of "instructions"
    # (and the output-schema example later in the file also contains `}` on
    # its own line).
    depth = 0
    end_idx = 0
    for i, line in enumerate(lines):
        depth += line.count("{") - line.count("}")
        if depth == 0 and "}" in line:
            end_idx = i
            break

    # Skip past the closing `}` and any immediately following blank lines
    start = end_idx + 1
    while start < len(lines) and lines[start].strip() == "":
        start += 1

    return "\n".join(lines[start:]).strip()


def read_audio_transcript(segment_path: Path) -> str | None:
    """Read and concatenate transcript lines from audio.jsonl."""
    audio_file = segment_path / "audio.jsonl"
    if not audio_file.exists():
        return None

    lines = []
    with open(audio_file) as f:
        for i, raw_line in enumerate(f):
            raw_line = raw_line.strip()
            if not raw_line:
                continue
            try:
                entry = json.loads(raw_line)
            except json.JSONDecodeError:
                continue
            # First line is metadata (has "raw" or "backend" key), skip it
            if i == 0 and ("raw" in entry or "backend" in entry):
                continue
            text = entry.get("text", "")
            start = entry.get("start", "")
            if text:
                lines.append(f"[{start}] {text}" if start else text)

    return "\n".join(lines) if lines else None


def read_screen_descriptions(segment_path: Path) -> list[str] | None:
    """Read unique visual descriptions from screen.jsonl."""
    screen_file = segment_path / "screen.jsonl"
    if not screen_file.exists():
        return None

    descriptions = []
    seen = set()

    with open(screen_file) as f:
        for i, raw_line in enumerate(f):
            raw_line = raw_line.strip()
            if not raw_line:
                continue
            try:
                entry = json.loads(raw_line)
            except json.JSONDecodeError:
                continue
            # First line is metadata (has "raw" key), skip it
            if i == 0 and "raw" in entry and "analysis" not in entry:
                continue
            analysis = entry.get("analysis", {})
            desc = analysis.get("visual_description", "")
            if desc and desc not in seen:
                seen.add(desc)
                descriptions.append(desc)

    return descriptions if descriptions else None


def read_baseline_activity(segment_path: Path) -> str | None:
    """Read baseline activity.md."""
    p = segment_path / "agents" / "activity.md"
    if p.exists():
        return p.read_text().strip()
    return None


def read_baseline_entities(segment_path: Path) -> list[dict]:
    """Read baseline entities.jsonl."""
    p = segment_path / "agents" / "entities.jsonl"
    if not p.exists():
        return []
    entities = []
    with open(p) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entities.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return entities


def read_baseline_speakers(segment_path: Path) -> list[str] | None:
    """Read baseline speakers.json. Returns None if file doesn't exist."""
    p = segment_path / "agents" / "speakers.json"
    if not p.exists():
        return None
    try:
        with open(p) as f:
            return json.load(f)
    except (json.JSONDecodeError, ValueError):
        return None


def read_baseline_facets(segment_path: Path) -> list[dict]:
    """Read baseline facets.json."""
    p = segment_path / "agents" / "facets.json"
    if not p.exists():
        return []
    try:
        with open(p) as f:
            return json.load(f)
    except (json.JSONDecodeError, ValueError):
        return []


def read_baseline_activity_state(segment_path: Path) -> list[dict]:
    """
    Read baseline activity_state.json from facet subdirectories.
    Checks each known facet subdir under agents/.
    """
    agents_dir = segment_path / "agents"
    if not agents_dir.exists():
        return []

    all_states = []
    for subdir in agents_dir.iterdir():
        if subdir.is_dir():
            state_file = subdir / "activity_state.json"
            if state_file.exists():
                try:
                    with open(state_file) as f:
                        data = json.load(f)
                    if isinstance(data, list):
                        all_states.extend(data)
                    elif isinstance(data, dict):
                        all_states.append(data)
                except (json.JSONDecodeError, ValueError):
                    continue
    return all_states


def segment_time_range(segment_key: str) -> tuple[str, str]:
    """
    Parse segment key like '091500_420' into start and end time strings.
    Returns (start, end) as "HH:MM:SS" strings.
    """
    parts = segment_key.split("_")
    if len(parts) != 2:
        return (segment_key, segment_key)

    time_str = parts[0]
    duration = int(parts[1])

    h = int(time_str[0:2])
    m = int(time_str[2:4])
    s = int(time_str[4:6])

    start = f"{h:02d}:{m:02d}:{s:02d}"

    total_seconds = h * 3600 + m * 60 + s + duration
    eh = total_seconds // 3600
    em = (total_seconds % 3600) // 60
    es = total_seconds % 60
    end = f"{eh:02d}:{em:02d}:{es:02d}"

    return (start, end)


def compose_user_message(day: str, segment_key: str,
                         transcript: str | None,
                         screen_descriptions: list[str] | None) -> str:
    """Assemble the user message for the Sense prompt."""
    start, end = segment_time_range(segment_key)
    parts = [f"Analyzing segment from {day} covering {start} to {end}."]

    if transcript:
        parts.append(f"\n## Transcript\n\n{transcript}")

    if screen_descriptions:
        parts.append("\n## Screen Activity\n\n" + "\n".join(screen_descriptions))

    parts.append(
        "\n## Configured Facets\n\n"
        "- meetings\n"
        "- learning"
    )

    return "\n".join(parts)


def call_sense(client: OpenAI, model: str, system_prompt: str,
               user_message: str) -> tuple[dict | None, dict]:
    """
    Call the Sense agent via OpenAI API.
    Returns (parsed_response, metadata) where metadata includes tokens and latency.
    """
    t0 = time.monotonic()

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            response_format={"type": "json_object"},
            temperature=0,
        )
    except Exception as e:
        elapsed = time.monotonic() - t0
        return None, {
            "error": str(e),
            "latency_seconds": round(elapsed, 3),
            "input_tokens": 0,
            "output_tokens": 0,
        }

    elapsed = time.monotonic() - t0

    usage = response.usage
    meta = {
        "latency_seconds": round(elapsed, 3),
        "input_tokens": usage.prompt_tokens if usage else 0,
        "output_tokens": usage.completion_tokens if usage else 0,
        "model": response.model,
    }

    content = response.choices[0].message.content if response.choices else ""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError as e:
        meta["parse_error"] = str(e)
        meta["raw_response"] = content[:2000]
        parsed = None

    return parsed, meta


# ---------------------------------------------------------------------------
# Main harness
# ---------------------------------------------------------------------------

def run_segment(client: OpenAI, model: str, system_prompt: str,
                segment_info: dict, journal_root: Path,
                state_machine: ActivityStateMachine,
                prev_segment_key: str | None) -> dict:
    """Run Sense on a single segment and compare against baseline."""
    day = segment_info["day"]
    stream = segment_info["stream"]
    segment_key = segment_info["segment"]

    segment_path = journal_root / day / stream / segment_key
    segment_id = f"{day}/{stream}/{segment_key}"

    print(f"  [{segment_id}] ", end="", flush=True)

    if not segment_path.exists():
        print("SKIP (path missing)")
        return {"segment": segment_id, "status": "skipped", "reason": "path_missing"}

    # Read inputs
    transcript = read_audio_transcript(segment_path)
    screen_descs = read_screen_descriptions(segment_path)

    if not transcript and not screen_descs:
        print("SKIP (no input data)")
        return {"segment": segment_id, "status": "skipped", "reason": "no_input"}

    # Read baselines
    baseline_activity = read_baseline_activity(segment_path)
    baseline_entities = read_baseline_entities(segment_path)
    baseline_speakers = read_baseline_speakers(segment_path)
    baseline_facets = read_baseline_facets(segment_path)
    baseline_states = read_baseline_activity_state(segment_path)

    # Compose and send
    user_msg = compose_user_message(day, segment_key, transcript, screen_descs)
    sense_output, api_meta = call_sense(client, model, system_prompt, user_msg)

    if sense_output is None:
        print(f"FAIL ({api_meta.get('error', api_meta.get('parse_error', 'unknown'))})")
        return {
            "segment": segment_id,
            "status": "error",
            "api": api_meta,
        }

    # Run state machine
    sm_changes = state_machine.update(sense_output, segment_key, day, prev_segment_key)
    sm_current = state_machine.get_current_state()

    # Compare all fields
    comparisons = {}

    # Density — all field journal segments have full agent outputs = "active"
    comparisons["density"] = compare_density(
        sense_output.get("density", ""),
        "active"
    )

    # Entities
    comparisons["entities"] = compare_entities(
        sense_output.get("entities", []),
        baseline_entities
    )

    # Speakers
    comparisons["speakers"] = compare_speakers(
        sense_output.get("speakers", []),
        baseline_speakers or []
    )

    # Facets
    comparisons["facets"] = compare_facets(
        sense_output.get("facets", []),
        baseline_facets
    )

    # Activity summary
    comparisons["activity_summary"] = compare_activity_summary(
        sense_output.get("activity_summary", ""),
        baseline_activity or ""
    )

    # Meeting detection
    comparisons["meeting_detection"] = compare_meeting_detection(
        sense_output.get("meeting_detected", False),
        baseline_speakers
    )

    # State machine comparison
    comparisons["state_machine"] = compare_state_machine(sm_current, baseline_states)

    # Score summary
    score = _compute_score(comparisons)

    print(f"OK (score={score:.2f}, "
          f"in={api_meta['input_tokens']}, out={api_meta['output_tokens']}, "
          f"{api_meta['latency_seconds']}s)")

    return {
        "segment": segment_id,
        "status": "ok",
        "day": day,
        "stream": stream,
        "segment_key": segment_key,
        "api": api_meta,
        "sense_output": sense_output,
        "comparisons": comparisons,
        "score": score,
    }


def _compute_score(comparisons: dict) -> float:
    """
    Compute a weighted quality score from comparisons.
    Returns 0.0-1.0.
    """
    weights = {
        "density": 0.10,
        "entities": 0.25,
        "speakers": 0.10,
        "facets": 0.20,
        "activity_summary": 0.15,
        "meeting_detection": 0.10,
        "state_machine": 0.10,
    }

    scores = {}

    # Density: binary match
    scores["density"] = 1.0 if comparisons.get("density", {}).get("match") else 0.0

    # Entities: F1
    scores["entities"] = comparisons.get("entities", {}).get("f1", 0.0)

    # Speakers: overlap (Jaccard)
    scores["speakers"] = comparisons.get("speakers", {}).get("overlap", 0.0)

    # Facets: combination of facet match + level closeness
    fc = comparisons.get("facets", {})
    facet_score = 0.0
    if fc.get("facet_match"):
        facet_score += 0.5
    elif fc.get("common_facets", 0) > 0:
        total = fc["common_facets"] + len(fc.get("sense_only_facets", [])) + len(fc.get("baseline_only_facets", []))
        facet_score += 0.5 * (fc["common_facets"] / total) if total > 0 else 0.0
    if fc.get("level_close"):
        facet_score += 0.5
    elif fc.get("common_facets", 0) > 0:
        facet_score += 0.5 * (fc.get("level_close_count", 0) / fc["common_facets"])
    scores["facets"] = facet_score

    # Activity summary: keyword overlap
    scores["activity_summary"] = comparisons.get("activity_summary", {}).get("keyword_overlap", 0.0)

    # Meeting detection: binary match
    scores["meeting_detection"] = 1.0 if comparisons.get("meeting_detection", {}).get("match") else 0.0

    # State machine: activity match + level match ratio
    sm = comparisons.get("state_machine", {})
    sm_score = 0.5 if sm.get("activity_match") else 0.0
    if sm.get("level_match_total", 0) > 0:
        sm_score += 0.5 * (sm.get("level_match_count", 0) / sm["level_match_total"])
    elif sm.get("activity_match"):
        sm_score += 0.5
    scores["state_machine"] = sm_score

    total = sum(scores[k] * weights[k] for k in weights)
    return round(total, 4)


def generate_summary(results: list[dict]) -> str:
    """Generate a human-readable summary report."""
    ok_results = [r for r in results if r.get("status") == "ok"]
    skipped = [r for r in results if r.get("status") == "skipped"]
    errors = [r for r in results if r.get("status") == "error"]

    lines = [
        "# Sense A/B Test Results",
        "",
        f"**Segments:** {len(results)} total, {len(ok_results)} completed, "
        f"{len(skipped)} skipped, {len(errors)} errors",
        "",
    ]

    if not ok_results:
        lines.append("No completed results to summarize.")
        return "\n".join(lines)

    # Aggregate scores
    scores = [r["score"] for r in ok_results]
    avg_score = sum(scores) / len(scores)
    lines.append(f"**Overall Score:** {avg_score:.4f} (avg across {len(ok_results)} segments)")
    lines.append("")

    # Token usage
    total_input = sum(r["api"]["input_tokens"] for r in ok_results)
    total_output = sum(r["api"]["output_tokens"] for r in ok_results)
    avg_latency = sum(r["api"]["latency_seconds"] for r in ok_results) / len(ok_results)
    lines.append("## Token Usage")
    lines.append(f"- Input tokens: {total_input:,} total, {total_input // len(ok_results):,} avg/segment")
    lines.append(f"- Output tokens: {total_output:,} total, {total_output // len(ok_results):,} avg/segment")
    lines.append(f"- Latency: {avg_latency:.2f}s avg/segment")
    lines.append("")

    # Per-comparison averages
    comparison_keys = ["density", "entities", "speakers", "facets",
                       "activity_summary", "meeting_detection", "state_machine"]
    lines.append("## Comparison Breakdown")
    lines.append("")

    # Density match rate
    density_matches = sum(1 for r in ok_results
                          if r["comparisons"]["density"]["match"])
    lines.append("### Density")
    lines.append(f"- Match rate: {density_matches}/{len(ok_results)} "
                 f"({density_matches / len(ok_results):.1%})")
    lines.append("")

    # Entity F1
    entity_f1s = [r["comparisons"]["entities"]["f1"] for r in ok_results]
    entity_precisions = [r["comparisons"]["entities"]["precision"] for r in ok_results]
    entity_recalls = [r["comparisons"]["entities"]["recall"] for r in ok_results]
    lines.append("### Entities")
    lines.append(f"- Avg F1: {sum(entity_f1s) / len(entity_f1s):.4f}")
    lines.append(f"- Avg Precision: {sum(entity_precisions) / len(entity_precisions):.4f}")
    lines.append(f"- Avg Recall: {sum(entity_recalls) / len(entity_recalls):.4f}")
    lines.append("")

    # Speakers
    speaker_overlaps = [r["comparisons"]["speakers"]["overlap"] for r in ok_results]
    meeting_segments = [r for r in ok_results if r["comparisons"]["meeting_detection"]["baseline"]]
    lines.append("### Speakers")
    lines.append(f"- Avg overlap (Jaccard): {sum(speaker_overlaps) / len(speaker_overlaps):.4f}")
    lines.append(f"- Meeting segments (baseline): {len(meeting_segments)}")
    lines.append("")

    # Facets
    facet_matches = sum(1 for r in ok_results
                        if r["comparisons"]["facets"]["facet_match"])
    level_close_matches = sum(1 for r in ok_results
                              if r["comparisons"]["facets"]["level_close"])
    lines.append("### Facets")
    lines.append(f"- Facet ID match rate: {facet_matches}/{len(ok_results)} "
                 f"({facet_matches / len(ok_results):.1%})")
    lines.append(f"- Level within +/-1 tier: {level_close_matches}/{len(ok_results)} "
                 f"({level_close_matches / len(ok_results):.1%})")
    lines.append("")

    # Activity summary
    keyword_overlaps = [r["comparisons"]["activity_summary"]["keyword_overlap"]
                        for r in ok_results]
    lines.append("### Activity Summary")
    lines.append(f"- Avg keyword overlap (Jaccard): "
                 f"{sum(keyword_overlaps) / len(keyword_overlaps):.4f}")
    lines.append("")

    # Meeting detection
    meeting_matches = sum(1 for r in ok_results
                          if r["comparisons"]["meeting_detection"]["match"])
    lines.append("### Meeting Detection")
    lines.append(f"- Match rate: {meeting_matches}/{len(ok_results)} "
                 f"({meeting_matches / len(ok_results):.1%})")
    lines.append("")

    # Per-segment scores table
    lines.append("## Per-Segment Scores")
    lines.append("")
    lines.append("| Segment | Score | Density | Entity F1 | Speaker | Facet | Meeting |")
    lines.append("|---------|-------|---------|-----------|---------|-------|---------|")
    for r in ok_results:
        c = r["comparisons"]
        lines.append(
            f"| {r['segment']} "
            f"| {r['score']:.3f} "
            f"| {'Y' if c['density']['match'] else 'N'} "
            f"| {c['entities']['f1']:.3f} "
            f"| {c['speakers']['overlap']:.3f} "
            f"| {'Y' if c['facets']['facet_match'] else 'N'} "
            f"| {'Y' if c['meeting_detection']['match'] else 'N'} |"
        )
    lines.append("")

    if errors:
        lines.append("## Errors")
        for r in errors:
            lines.append(f"- {r['segment']}: {r.get('api', {}).get('error', 'unknown')}")
        lines.append("")

    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(
        description="Sense A/B test harness — compare unified Sense agent against multi-agent baseline"
    )
    parser.add_argument("--journal", type=Path, default=DEFAULT_JOURNAL,
                        help=f"Path to field journal root (default: {DEFAULT_JOURNAL})")
    parser.add_argument("--model", type=str, default=DEFAULT_MODEL,
                        help=f"OpenAI model to use (default: {DEFAULT_MODEL})")
    parser.add_argument("--max-segments", type=int, default=None,
                        help="Max segments to process (default: all)")
    parser.add_argument("--segment", type=str, default=None,
                        help="Run a single segment: 
DAY/STREAM/SEGMENT (e.g. 20260201/field.audio/091500_420)") 616 + parser.add_argument("--output-dir", type=Path, default=DEFAULT_OUTPUT, 617 + help=f"Output directory for results (default: {DEFAULT_OUTPUT})") 618 + 619 + args = parser.parse_args() 620 + 621 + # Validate 622 + if not args.journal.exists(): 623 + print(f"Error: journal path not found: {args.journal}") 624 + sys.exit(1) 625 + 626 + api_key = os.environ.get("OPENAI_API_KEY") 627 + if not api_key: 628 + print("Error: OPENAI_API_KEY environment variable not set") 629 + sys.exit(1) 630 + 631 + # Load manifest and system prompt 632 + print("Loading manifest and Sense instruction...") 633 + manifest = load_manifest() 634 + system_prompt = load_sense_instruction() 635 + print(f" {len(manifest)} segments in manifest") 636 + print(f" Sense instruction: {len(system_prompt)} chars") 637 + 638 + # Filter segments 639 + if args.segment: 640 + parts = args.segment.split("/") 641 + if len(parts) != 3: 642 + print(f"Error: --segment must be DAY/STREAM/SEGMENT, got: {args.segment}") 643 + sys.exit(1) 644 + target_day, target_stream, target_seg = parts 645 + manifest = [s for s in manifest 646 + if s["day"] == target_day 647 + and s["stream"] == target_stream 648 + and s["segment"] == target_seg] 649 + if not manifest: 650 + print(f"Error: segment not found in manifest: {args.segment}") 651 + sys.exit(1) 652 + 653 + if args.max_segments: 654 + manifest = manifest[:args.max_segments] 655 + 656 + print(f" Running {len(manifest)} segments with model {args.model}") 657 + print() 658 + 659 + # Setup 660 + client = OpenAI(api_key=api_key) 661 + state_machine = ActivityStateMachine() 662 + args.output_dir.mkdir(parents=True, exist_ok=True) 663 + 664 + results = [] 665 + prev_segment_key = None 666 + 667 + for i, seg_info in enumerate(manifest): 668 + print(f"[{i + 1}/{len(manifest)}]", end="") 669 + result = run_segment( 670 + client, args.model, system_prompt, 671 + seg_info, args.journal, state_machine, prev_segment_key 672 
+ ) 673 + results.append(result) 674 + prev_segment_key = seg_info["segment"] 675 + 676 + print() 677 + print("=" * 60) 678 + 679 + # Write JSONL results 680 + results_file = args.output_dir / "results.jsonl" 681 + with open(results_file, "w") as f: 682 + for r in results: 683 + f.write(json.dumps(r, default=str) + "\n") 684 + print(f"Results written to {results_file}") 685 + 686 + # Write summary report 687 + summary = generate_summary(results) 688 + summary_file = args.output_dir / "summary.md" 689 + summary_file.write_text(summary) 690 + print(f"Summary written to {summary_file}") 691 + print() 692 + print(summary) 693 + 694 + 695 + if __name__ == "__main__": 696 + main()
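To make the aggregation in `compute_score` concrete, here is a minimal standalone sketch (not the harness code itself) of the weighted-sum scheme: each comparison field yields a score in [0, 1], and the segment score is their weighted sum, with weights totaling 1.0. The field names mirror the dict in `compute_score`; `weighted_score` is a hypothetical helper for illustration.

```python
# Minimal sketch of the harness's scoring scheme: per-field comparison
# scores in [0, 1] are combined via a weighted sum (weights total 1.0).
weights = {
    "density": 0.10,
    "entities": 0.25,
    "speakers": 0.10,
    "facets": 0.20,
    "activity_summary": 0.15,
    "meeting_detection": 0.10,
    "state_machine": 0.10,
}

def weighted_score(field_scores: dict[str, float]) -> float:
    """Combine per-field scores; missing fields contribute 0, as in the harness."""
    return round(sum(field_scores.get(k, 0.0) * w for k, w in weights.items()), 4)

# A segment that matches on everything except entities (F1 = 0.8)
# loses 0.25 * 0.2 = 0.05 from a perfect 1.0:
print(weighted_score({
    "density": 1.0, "entities": 0.8, "speakers": 1.0, "facets": 1.0,
    "activity_summary": 1.0, "meeting_detection": 1.0, "state_machine": 1.0,
}))  # → 0.95
```

Note the entity weight (0.25) dominates, so entity F1 drives the overall ranking more than any other single comparison.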
+287
scratch/sense_rd/state_machine.py
··· 1 + """ 2 + Activity state machine for replacing LLM-based activity_state agent. 3 + 4 + Maintains per-facet activity state across segments, producing output 5 + compatible with the existing activity_state.json format. 6 + """ 7 + 8 + import re 9 + 10 + 11 + def _parse_segment_time(segment_key: str) -> int | None: 12 + """ 13 + Parse a segment key like "091500_420" into absolute seconds from midnight. 14 + Format: HHMMSS_duration 15 + Returns start time in seconds, or None if unparseable. 16 + """ 17 + match = re.match(r"(\d{2})(\d{2})(\d{2})_(\d+)", segment_key) 18 + if not match: 19 + return None 20 + h, m, s, _ = match.groups() 21 + return int(h) * 3600 + int(m) * 60 + int(s) 22 + 23 + 24 + def _segment_end_time(segment_key: str) -> int | None: 25 + """Return end time in seconds from midnight.""" 26 + match = re.match(r"(\d{2})(\d{2})(\d{2})_(\d+)", segment_key) 27 + if not match: 28 + return None 29 + h, m, s, dur = match.groups() 30 + return int(h) * 3600 + int(m) * 60 + int(s) + int(dur) 31 + 32 + 33 + def _make_activity_id(content_type: str, segment_key: str) -> str: 34 + """Generate an activity ID like 'meeting_091500_420'.""" 35 + return f"{content_type}_{segment_key}" 36 + 37 + 38 + class ActivityStateMachine: 39 + """ 40 + Tracks per-facet activity state across segments. 41 + 42 + State format per facet: 43 + { 44 + "id": "meeting_091500_420", 45 + "activity": "meeting", 46 + "state": "active" | "ended", 47 + "since": "091500_420", 48 + "description": "...", 49 + "level": "high" | "medium" | "low", 50 + "active_entities": [...] 
51 + } 52 + """ 53 + 54 + # Gap threshold: if more than 10 minutes between segments, end all active 55 + GAP_THRESHOLD_SECONDS = 600 56 + 57 + def __init__(self): 58 + # {facet_id: state_dict} 59 + self.state: dict[str, dict] = {} 60 + self.last_segment_key: str | None = None 61 + self.last_segment_day: str | None = None 62 + self.history: list[dict] = [] # All state changes 63 + 64 + def update(self, sense_output: dict, segment_key: str, day: str, 65 + previous_segment_key: str | None = None) -> list[dict]: 66 + """ 67 + Given Sense output for a segment, update activity state. 68 + 69 + Args: 70 + sense_output: Parsed JSON from the Sense agent 71 + segment_key: e.g. "091500_420" 72 + day: e.g. "20260201" 73 + previous_segment_key: Previous segment key (for gap detection) 74 + 75 + Returns: 76 + List of state entries (new, continuing, ended) for this segment. 77 + """ 78 + changes = [] 79 + 80 + # Check for day change or time gap — end all active 81 + if self._should_reset(segment_key, day, previous_segment_key): 82 + for facet_id, state in self.state.items(): 83 + if state["state"] == "active": 84 + state["state"] = "ended" 85 + changes.append({**state, "_change": "ended_gap"}) 86 + self.state.clear() 87 + 88 + content_type = sense_output.get("content_type", "idle") 89 + density = sense_output.get("density", "idle") 90 + activity_summary = sense_output.get("activity_summary", "") 91 + entities = sense_output.get("entities", []) 92 + facets = sense_output.get("facets", []) 93 + entity_names = [e.get("name", "") for e in entities] 94 + 95 + # If idle density, end all active states 96 + if density == "idle": 97 + for facet_id, state in list(self.state.items()): 98 + if state["state"] == "active": 99 + state["state"] = "ended" 100 + changes.append({**state, "_change": "ended_idle"}) 101 + self.state.clear() 102 + self.last_segment_key = segment_key 103 + self.last_segment_day = day 104 + self.history.extend(changes) 105 + return changes 106 + 107 + # Process each 
facet from Sense output 108 + active_facet_ids = set() 109 + 110 + for facet_data in facets: 111 + facet_id = facet_data.get("facet", "") 112 + facet_activity = facet_data.get("activity", "") 113 + facet_level = facet_data.get("level", "medium") 114 + active_facet_ids.add(facet_id) 115 + 116 + if facet_id in self.state and self.state[facet_id]["state"] == "active": 117 + # Same facet still active — check if content_type changed 118 + existing = self.state[facet_id] 119 + if existing["activity"] == content_type: 120 + # Continuing: same content type + same facet 121 + existing["description"] = facet_activity or activity_summary 122 + existing["level"] = facet_level 123 + existing["active_entities"] = entity_names 124 + changes.append({**existing, "_change": "continuing"}) 125 + else: 126 + # Content type changed within same facet — end old, start new 127 + existing["state"] = "ended" 128 + changes.append({**existing, "_change": "ended_type_change"}) 129 + 130 + new_state = { 131 + "id": _make_activity_id(content_type, segment_key), 132 + "activity": content_type, 133 + "state": "active", 134 + "since": segment_key, 135 + "description": facet_activity or activity_summary, 136 + "level": facet_level, 137 + "active_entities": entity_names, 138 + } 139 + self.state[facet_id] = new_state 140 + changes.append({**new_state, "_change": "new"}) 141 + else: 142 + # New facet or previously ended — start new 143 + new_state = { 144 + "id": _make_activity_id(content_type, segment_key), 145 + "activity": content_type, 146 + "state": "active", 147 + "since": segment_key, 148 + "description": facet_activity or activity_summary, 149 + "level": facet_level, 150 + "active_entities": entity_names, 151 + } 152 + self.state[facet_id] = new_state 153 + changes.append({**new_state, "_change": "new"}) 154 + 155 + # End facets that were active but not in current Sense output 156 + for facet_id in list(self.state.keys()): 157 + if facet_id not in active_facet_ids and 
self.state[facet_id]["state"] == "active": 158 + self.state[facet_id]["state"] = "ended" 159 + changes.append({**self.state[facet_id], "_change": "ended_facet_gone"}) 160 + del self.state[facet_id] 161 + 162 + # If no facets from Sense but there is activity, use content_type as a pseudo-facet 163 + if not facets and density != "idle": 164 + pseudo_facet = f"__{content_type}" 165 + if pseudo_facet in self.state and self.state[pseudo_facet]["state"] == "active": 166 + existing = self.state[pseudo_facet] 167 + existing["description"] = activity_summary 168 + existing["active_entities"] = entity_names 169 + changes.append({**existing, "_change": "continuing"}) 170 + else: 171 + new_state = { 172 + "id": _make_activity_id(content_type, segment_key), 173 + "activity": content_type, 174 + "state": "active", 175 + "since": segment_key, 176 + "description": activity_summary, 177 + "level": "medium", 178 + "active_entities": entity_names, 179 + } 180 + self.state[pseudo_facet] = new_state 181 + changes.append({**new_state, "_change": "new"}) 182 + 183 + self.last_segment_key = segment_key 184 + self.last_segment_day = day 185 + self.history.extend(changes) 186 + return changes 187 + 188 + def get_current_state(self) -> list[dict]: 189 + """Return current active states in activity_state.json format.""" 190 + result = [] 191 + for state in self.state.values(): 192 + if state["state"] == "active": 193 + result.append({ 194 + "id": state["id"], 195 + "activity": state["activity"], 196 + "state": state["state"], 197 + "since": state["since"], 198 + "description": state["description"], 199 + "level": state["level"], 200 + "active_entities": state["active_entities"], 201 + }) 202 + return result 203 + 204 + def _should_reset(self, segment_key: str, day: str, 205 + previous_segment_key: str | None) -> bool: 206 + """Check if we should end all active states due to gap or day change.""" 207 + # Day change 208 + if self.last_segment_day and day != self.last_segment_day: 209 + return 
True 210 + 211 + # Time gap 212 + prev_key = previous_segment_key or self.last_segment_key 213 + if prev_key: 214 + prev_end = _segment_end_time(prev_key) 215 + curr_start = _parse_segment_time(segment_key) 216 + if prev_end is not None and curr_start is not None: 217 + gap = curr_start - prev_end 218 + if gap > self.GAP_THRESHOLD_SECONDS: 219 + return True 220 + 221 + return False 222 + 223 + 224 + # --------------------------------------------------------------------------- 225 + # Comparison against baseline activity_state.json 226 + # --------------------------------------------------------------------------- 227 + 228 + def compare_state_machine(sm_output: list[dict], baseline_state: list[dict]) -> dict: 229 + """ 230 + Compare state machine output against existing LLM-generated activity_state.json. 231 + 232 + Both are lists of state entries with: id, activity, state, since, description, 233 + level, active_entities. 234 + """ 235 + sm_activities = {s.get("activity", "") for s in sm_output} 236 + bl_activities = {s.get("activity", "") for s in baseline_state} 237 + 238 + # Activity type match 239 + activity_match = sm_activities == bl_activities 240 + 241 + # Count comparison 242 + count_match = len(sm_output) == len(baseline_state) 243 + 244 + # State comparison (active/ended) 245 + sm_states = {s.get("id", ""): s.get("state", "") for s in sm_output} 246 + bl_states = {s.get("id", ""): s.get("state", "") for s in baseline_state} 247 + 248 + # Level comparison for matched activities 249 + sm_by_activity = {s.get("activity", ""): s for s in sm_output} 250 + bl_by_activity = {s.get("activity", ""): s for s in baseline_state} 251 + 252 + common_activities = sm_activities & bl_activities 253 + level_matches = 0 254 + entity_overlaps = [] 255 + 256 + for act in common_activities: 257 + sm_entry = sm_by_activity.get(act, {}) 258 + bl_entry = bl_by_activity.get(act, {}) 259 + 260 + if sm_entry.get("level") == bl_entry.get("level"): 261 + level_matches += 1 262 + 263 
+ # Entity overlap 264 + sm_ents = set(sm_entry.get("active_entities", [])) 265 + bl_ents = set(bl_entry.get("active_entities", [])) 266 + if sm_ents or bl_ents: 267 + intersection = len(sm_ents & bl_ents) 268 + union = len(sm_ents | bl_ents) 269 + entity_overlaps.append(intersection / union if union > 0 else 1.0) 270 + 271 + n_common = len(common_activities) 272 + 273 + return { 274 + "activity_match": activity_match, 275 + "count_match": count_match, 276 + "sm_count": len(sm_output), 277 + "baseline_count": len(baseline_state), 278 + "common_activities": list(common_activities), 279 + "sm_only_activities": list(sm_activities - bl_activities), 280 + "baseline_only_activities": list(bl_activities - sm_activities), 281 + "level_match_count": level_matches, 282 + "level_match_total": n_common, 283 + "avg_entity_overlap": ( 284 + round(sum(entity_overlaps) / len(entity_overlaps), 4) 285 + if entity_overlaps else None 286 + ), 287 + }
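As a standalone illustration of the gap logic in `_should_reset`, the sketch below reimplements the segment-key arithmetic (not imported from the module): keys are `HHMMSS_duration`, and a reset fires when the gap between the previous segment's end and the current segment's start exceeds the 600-second threshold. `gap_exceeded` is a hypothetical helper for illustration.

```python
import re

# Standalone sketch of the segment-key arithmetic behind gap detection.
# Segment keys are HHMMSS_duration; a reset fires when the gap between
# the previous segment's end and the next start exceeds 600 seconds.
GAP_THRESHOLD_SECONDS = 600
KEY_RE = re.compile(r"(\d{2})(\d{2})(\d{2})_(\d+)")

def start_seconds(key: str) -> int:
    """Segment start as seconds from midnight."""
    h, m, s, _ = KEY_RE.match(key).groups()
    return int(h) * 3600 + int(m) * 60 + int(s)

def end_seconds(key: str) -> int:
    """Segment end: start plus the duration suffix."""
    h, m, s, dur = KEY_RE.match(key).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(dur)

def gap_exceeded(prev_key: str, curr_key: str) -> bool:
    return start_seconds(curr_key) - end_seconds(prev_key) > GAP_THRESHOLD_SECONDS

# "091500_420" ends at 09:22:00; a segment starting 09:30:00 leaves a
# 480 s gap (no reset), while 09:40:00 leaves 1080 s (reset).
print(gap_exceeded("091500_420", "093000_420"))  # → False
print(gap_exceeded("091500_420", "094000_420"))  # → True
```

Day changes bypass this arithmetic entirely: `_should_reset` ends all active states whenever the `day` string differs from the last one seen, regardless of clock times.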