experiments in a post-browser web
fix(izui): promote transient session on app focus, fix startup role

Two fixes for ESC incorrectly closing workspace windows:

1. Startup feature opener now passes role:'workspace' — previously
fell through to 'utility' which always closes on ESC.

2. setAppFocused(true) promotes transient→active sessions — when app
gains focus during a transient session, the user is now actively
working in Peek and ESC should not close workspace windows.

+697 -1
+2 -1
app/index.js
@@ -519,9 +519,10 @@
   if (startupUrl === settingsAddress) {
     await openSettingsWindow(p);
   } else {
-    // Open the configured startup feature
+    // Open the configured startup feature as a workspace
     await windowManager.createWindow(startupUrl, {
       key: startupUrl,
+      role: 'workspace',
       transparent: true,
       height: p.height || 600,
       width: p.width || 800
+8
backend/electron/izui-state.ts
@@ -105,6 +105,14 @@
   setAppFocused(focused: boolean): void {
     this.appFocused = focused;
     DEBUG && console.log('[izui] App focus changed:', focused);
+
+    // One-way ratchet: when app gains focus during a transient session,
+    // promote to active. The user is now working in Peek.
+    if (focused && this.session && this.session.entryMode === 'transient') {
+      this.session.entryMode = 'active';
+      this.state = 'active';
+      DEBUG && console.log('[izui] Promoted session from transient to active (app gained focus)');
+    }
   }
 
   /**
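
The promotion in `setAppFocused` can be exercised in isolation. The following is a minimal sketch, not the code in `izui-state.ts`: the class name `IzuiStateSketch`, the `startSession` helper, the `'idle'` initial state, and the `shouldEscClose` policy are all assumptions made for illustration. Only `appFocused`, `session.entryMode`, and `state` come from the diff, and the ESC policy is inferred from the commit message.

```javascript
// Hypothetical stand-in for the IZUI state object; only the
// setAppFocused body mirrors the diff above.
class IzuiStateSketch {
  constructor() {
    this.appFocused = false;
    this.state = 'idle'; // assumed initial state
    this.session = null;
  }

  // Hypothetical helper: begin a session in the given entry mode.
  startSession(entryMode) {
    this.session = { entryMode };
    this.state = entryMode;
  }

  setAppFocused(focused) {
    this.appFocused = focused;
    // One-way ratchet: promote on focus gain, never demote on focus loss.
    if (focused && this.session && this.session.entryMode === 'transient') {
      this.session.entryMode = 'active';
      this.state = 'active';
    }
  }
}

// Assumed ESC policy: 'utility' windows always close on ESC;
// 'workspace' windows close only while the session is still transient.
function shouldEscClose(role, izui) {
  if (role === 'utility') return true;
  return izui.session !== null && izui.session.entryMode === 'transient';
}
```

Under these assumptions, a workspace window opened during a transient session stops closing on ESC once the app gains focus, and it stays that way after focus is lost again.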
+687
notes/research-entity-recognition.md
# Entity Recognition System - Implementation Plan

Research and design document for the Entity-centrism / NER indices feature described in `peek-todo.md`.

## 1. Vision Recap

Peek's entity-centrism lifts the apex of understanding from publishers into the user agent. Instead of treating the web as a set of disconnected domains, Peek identifies and indexes real-world entities -- people, places, events, organizations, numbers -- across every page you visit. Over time, Peek builds a personal knowledge graph that reflects your world, scored by frecency, with bidirectional links between entities and their source URLs.

Core insight from `peek-todo.md`:
> "Many significant entities in our lives are singletons. A person, a place, an event. The web makes the publisher the apex of understanding. Browsers do not acknowledge entities across pages."

## 2. Existing Systems to Build On

### 2.1 Data Model

The codebase already has the primitives needed:

- **`items` table** (in `backend/electron/datastore.ts`): Unified content storage with types `url`, `text`, `tagset`, `image`, `series`, `feed`. The `type` CHECK constraint will need a new value: `entity`.
- **`item_tags` junction table**: Many-to-many linking items to tags. Entity types can be modeled as tags initially (e.g., `entity:person`, `entity:place`).
- **`item_events` table**: Append-only time-series data. Entity observations (sightings on pages) map naturally to events on an entity item.
- **`metadata` JSON column** on items: Extensible storage for entity-specific attributes (homepage, email, coordinates, aliases).
- **`tags` with frecency**: The scoring model already exists for tags. Entities will adopt the same frecency calculation from `calculateFrecency()` in the datastore.

### 2.2 Content Access

- **`trackWindowLoad()`** in `backend/electron/datastore.ts`: Called on every `did-navigate` event and new window open. This is the natural hook point for triggering entity extraction on page loads.
- **Scripts extension** (`extensions/scripts/`): Has `ScriptExecutor` with pattern matching and code execution against page DOM. Entity extraction scripts can run through this pipeline.
- **Feeds extension** (`extensions/feeds/`): Demonstrates the pattern of using `item_events` for per-item data streams. Entity observations follow the same model.
- **Context API** (`app/context/`): Per-window context with persistence. Could store active entity context per window.

### 2.3 Extension Patterns

- Extensions register commands via `api.commands.register()`.
- Extensions communicate via pubsub: `api.publish()` / `api.subscribe()`.
- Extensions access data via `api.datastore.*` (addItem, queryItems, tagItem, addItemEvent, etc.).
- Extensions open UIs via `api.window.open()` with `peek://ext/{id}/` URLs.
- All extension backgrounds follow the `init()`/`uninit()` lifecycle pattern with `cmd:ready` subscription for command registration.

### 2.4 Sync Architecture

- `schema/v1.json` defines canonical sync columns. Entity tables must decide sync vs. local-only.
- Items sync via `syncId`/`syncedAt`. Entity items should sync; observations may be local-only (high volume).
- `_sync` metadata pattern inside `metadata` JSON for cross-device provenance.

## 3. Entity Types

### 3.1 Phase 1: Structured/Extractable Entities (regex + microformats)

These have well-defined patterns and can be extracted reliably:

| Entity Type | Detection Method | Example | Stored Attributes |
|-------------|-----------------|---------|-------------------|
| **URL/Link** | Already exists | `https://example.com` | domain, title, path |
| **Email** | Regex `[^\s@]+@[^\s@]+\.[^\s@]+` | `alice@example.com` | address, domain, name context |
| **Phone number** | Regex + libphonenumber patterns | `+1-555-0123` | number, country, type (mobile/work), context |
| **Date/Time** | Regex + chrono-node style parsing | `March 15, 2026`, `next Tuesday` | ISO date, original text, context |
| **Physical address** | Regex + structured data | `123 Main St, Portland, OR` | street, city, region, postal, country |
| **Tracking number** | Regex (carrier patterns) | `1Z999AA10123456784` | carrier, number, context |
| **Price/Currency** | Regex `\$[\d,.]+`, `EUR [\d,.]+` | `$199.99` | amount, currency, context |
| **Code/Reference** | Regex (flight, booking, order) | `UA 1234`, `ORDER-789` | type, number, context |

### 3.2 Phase 2: Named Entity Recognition (heuristic + ML)

These require NLP/statistical methods:

| Entity Type | Detection Method | Wikidata Mapping |
|-------------|-----------------|-----------------|
| **Person** | Capitalization heuristics, microformats `h-card`, LD+JSON Person | Q5 (human) |
| **Organization** | Capitalization + suffixes (Inc, LLC, Ltd), microformats, LD+JSON | Q43229 (organization) |
| **Place/Location** | Gazetteer matching, microformats `h-adr`, LD+JSON Place | Q17334923 (location) |
| **Event** | Microformats `h-event`, LD+JSON Event, date+venue co-occurrence | Q1656682 (event) |
| **Product** | LD+JSON Product, schema.org markup | Q2424752 (product) |
| **Creative Work** | LD+JSON, og:type=article/book/music | Q17537576 (creative work) |

### 3.3 Phase 3: ML NER Models

- Experiment with small offline models (e.g., ONNX-exported spaCy/Transformers models).
- Train micro-models from accumulated entity data.
- Run in Web Workers or background processes to avoid blocking UI.

## 4. Architecture

### 4.1 New Extension: `extensions/entities/`

Entity recognition is implemented as a built-in extension, following the established extension pattern:

```
extensions/entities/
  manifest.json          # Extension metadata
  background.html        # Entry point
  background.js          # Core logic: lifecycle, commands, pubsub
  extractors/
    regex.js             # Regex-based extractors (email, phone, date, etc.)
    microformats.js      # Microformat parser (h-card, h-event, h-adr)
    structured-data.js   # JSON-LD / schema.org extraction
    readability.js       # Mozilla Readability for content extraction
  entity-store.js        # Entity CRUD via api.datastore
  entity-matcher.js      # Dedup / approximate matching / confidence scoring
  wikidata.js            # Wikidata validation and triangulation
  home.html              # Entity browser UI
  home.js                # Entity browser logic
  entity-detail.html     # Single entity detail view
  entity-detail.js       # Detail view logic
```

### 4.2 Storage Model

#### Option A: Entities as Items (Recommended)

Add `entity` to the items type CHECK constraint. Each entity is an `items` row:

```sql
-- Entity item
items: {
  id: 'entity_17123...',
  type: 'entity',
  content: 'Tortoise',              -- canonical name / primary label
  metadata: JSON.stringify({
    entityType: 'organization',     -- person, place, org, event, product, etc.
    subtype: 'band',                -- optional finer classification
    aliases: ['Tortoise (band)', 'Tortoise Chicago'],
    wikidata: 'Q1400189',           -- Wikidata QID if matched
    attributes: {                   -- type-specific structured data
      homepage: 'https://...',
      genres: ['post-rock'],
      location: 'Chicago, IL'
    },
    confidence: 0.85,               -- overall confidence score
    mergedFrom: [],                 -- IDs of entities merged into this one
    _sync: { createdBy: 'device-uuid', ... }
  }),
  frecencyScore: 42,                -- based on observation frequency + recency
  createdAt: ...,
  updatedAt: ...
}
```

Entity observations (sightings) use `item_events`:

```sql
-- Entity observation: "entity X was seen on page Y"
item_events: {
  id: 'evt_17123...',
  itemId: 'entity_17123...',        -- FK to entity item
  content: 'https://pitchfork.com/reviews/tortoise-tnt',  -- source URL
  value: 0.92,                      -- extraction confidence for this observation
  occurredAt: 1707500000000,        -- when the page was loaded
  metadata: JSON.stringify({
    extractedText: 'Tortoise released their album...',
    extractor: 'regex',             -- which extractor found it
    context: 'article body',        -- where on the page
    pageTitle: 'Tortoise - TNT Review',
    itemId: 'item_abc...'           -- FK to the URL item if it exists
  })
}
```

This approach:
- Reuses existing items/item_events infrastructure completely.
- Entity frecency uses the same `calculateFrecency()` function.
- Tags work naturally (tag entities like any item: `entity:person`, `music`, `chicago`).
- Groups work (create a group of entities).
- Sync works via the existing items sync pipeline.
- Search works against items content/metadata.
- Commands work via existing `api.datastore.queryItems({ type: 'entity' })`.
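
The two record shapes shown for Option A can be assembled with small helpers. This is a minimal sketch under stated assumptions: `buildEntityItem` and `buildObservation` are hypothetical names, the `entity_<timestamp>` ID scheme is only inferred from the example IDs, and the actual `api.datastore` write calls are deliberately left out because their signatures are not specified here.

```javascript
// Hypothetical helper: build the items row for a new entity.
// Field names follow the Option A example; the ID scheme is an assumption.
function buildEntityItem({ name, entityType, aliases = [], wikidata = null, confidence }, now = Date.now()) {
  return {
    id: `entity_${now}`,
    type: 'entity',
    content: name, // canonical name / primary label
    metadata: JSON.stringify({ entityType, aliases, wikidata, confidence, mergedFrom: [] }),
    createdAt: now,
    updatedAt: now
  };
}

// Hypothetical helper: build the item_events row for one sighting.
function buildObservation(entityId, { url, confidence, extractor, pageTitle }, now = Date.now()) {
  return {
    itemId: entityId,  // FK to the entity item
    content: url,      // source URL
    value: confidence, // extraction confidence for this observation
    occurredAt: now,
    metadata: JSON.stringify({ extractor, pageTitle })
  };
}
```

Keeping the builders free of I/O makes the record shapes easy to check before wiring them to the datastore.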

#### Schema Migration

```sql
-- Migration: allow 'entity' type in items CHECK constraint
-- Same pattern as migrateItemTypes() in datastore.ts
-- Recreate items table with updated CHECK:
-- type IN ('url', 'text', 'tagset', 'image', 'series', 'feed', 'entity')
```

#### New Indexes for Entity Queries

```sql
-- Fast lookup by entity type within entities
CREATE INDEX IF NOT EXISTS idx_items_entity_type
  ON items((json_extract(metadata, '$.entityType')))
  WHERE type = 'entity';

-- Fast Wikidata QID lookup
CREATE INDEX IF NOT EXISTS idx_items_wikidata
  ON items((json_extract(metadata, '$.wikidata')))
  WHERE type = 'entity' AND json_extract(metadata, '$.wikidata') IS NOT NULL;

-- item_events by source URL (find all entities on a page)
CREATE INDEX IF NOT EXISTS idx_item_events_content
  ON item_events(content)
  WHERE content LIKE 'http%';
```

### 4.3 Extraction Pipeline

```
Page Load (trackWindowLoad / did-navigate)
  |
  v
[1] URL recorded in items (existing behavior)
  |
  v
[2] Entity extraction triggered (async, non-blocking)
  |
  +---> Regex extractors (emails, phones, dates, prices, tracking#s)
  |
  +---> Structured data extractors (JSON-LD, microformats, meta tags)
  |
  +---> (Phase 2) NER heuristics (capitalization, context)
  |
  +---> (Phase 3) ML models (Web Worker)
  |
  v
[3] Raw extractions collected
  |
  v
[4] Entity matching / dedup
  |  - Normalize names (lowercase, trim, alias matching)
  |  - Check existing entities by content + entityType
  |  - Confidence threshold filtering (discard below the storage threshold, see 5.2)
  |
  v
[5] Entity items created or updated
  |  - New entity -> addItem('entity', {...})
  |  - Existing entity -> addItemEvent(entityId, {observation})
  |  - Update frecencyScore based on new observation
  |
  v
[6] Publish events
     - entities:extracted { url, entities: [...] }
     - entities:updated { entityId, observation }
```

### 4.4 When Extraction Happens

| Trigger | What Runs | Priority |
|---------|-----------|----------|
| **Page load (did-navigate)** | Full extraction pipeline | Primary - this is the main trigger |
| **Item save (note/url via cmd)** | Regex extractors only | Secondary - lightweight scan of saved content |
| **Background re-process** | All extractors on unprocessed URLs | Batch - fills gaps, retries failures |
| **Manual "extract entities" command** | Full pipeline on current page | On-demand - user initiated |
| **Feed entry ingestion** | Regex extractors on feed content | Feed event - triggered by feeds extension |

For page loads, extraction should be deferred so that it does not block page rendering:

```javascript
// In entities/background.js
const REEXTRACT_COOLDOWN_MS = 60 * 60 * 1000; // 1 hour (see reextractCooldownMinutes in 11.5)
const extractionCache = new Map();            // cache key -> timestamp of last extraction

api.subscribe('page:loaded', async (msg) => {
  // Debounce: don't re-extract if we processed this URL recently
  const key = `extracted:${msg.url}`;
  const lastExtracted = extractionCache.get(key);
  if (lastExtracted && Date.now() - lastExtracted < REEXTRACT_COOLDOWN_MS) {
    return;
  }

  // Run extraction after short delay (let page finish rendering)
  setTimeout(async () => {
    try {
      const entities = await extractEntities(msg.url, msg.pageContent);
      await storeEntities(entities, msg.url);
      extractionCache.set(key, Date.now());
    } catch (err) {
      // A failed extraction must not surface as an unhandled rejection
      console.error('[entities] extraction failed:', msg.url, err);
    }
  }, 2000);
}, api.scopes.GLOBAL);
```

### 4.5 Content Access Strategy

Entity extraction needs access to page content. Three approaches, in order of preference:

1. **Content script injection** (via `webContents.executeJavaScript`): Extract text content, structured data, and microformats from the page DOM. The `trackWindowLoad` hook in `ipc.ts` already has access to the `BrowserWindow` and its `webContents`. Add a content extraction step after visit recording.

2. **Readability extraction**: Use Mozilla Readability (already mentioned in `peek-todo.md`) to extract clean article text. This gives better content for NER than raw DOM text.

3. **Metadata only**: For a minimal first pass, extract only from the page's `<meta>` tags, `<title>`, JSON-LD, and microformats. This requires minimal DOM access and yields high-quality structured entities.

Recommended first implementation: approach 3 (metadata only), then layer in approaches 1 and 2 as extraction quality needs grow.

## 5. Entity Matching and Deduplication

### 5.1 Approximate Equivalency

The core challenge: "Tortoise" on pitchfork.com and "Tortoise (band)" on wikipedia.org should resolve to the same entity.

Matching strategy:

```javascript
async function matchEntity(candidate, entityType) {
  // 1. Exact content match (normalized)
  const normalized = normalize(candidate.name); // lowercase, trim, remove diacritics
  const exact = await queryItems({
    type: 'entity',
    contentMatch: normalized
  });
  if (exact.length === 1) return { match: exact[0], confidence: 1.0 };

  // 2. Alias match (candidates limited to entities of the same type)
  const allEntitiesOfType = await queryItems({ type: 'entity', metadata: { entityType } });
  for (const existing of allEntitiesOfType) {
    const aliases = JSON.parse(existing.metadata).aliases || [];
    if (aliases.some(a => normalize(a) === normalized)) {
      return { match: existing, confidence: 0.95 };
    }
  }

  // 3. Wikidata match (if candidate has QID)
  if (candidate.wikidataId) {
    const wdMatch = await queryByWikidata(candidate.wikidataId);
    if (wdMatch) return { match: wdMatch, confidence: 0.99 };
  }

  // 4. Fuzzy match (edit distance, token overlap)
  const fuzzy = findFuzzyMatches(normalized, entityType, 0.8);
  if (fuzzy.length === 1) return { match: fuzzy[0], confidence: fuzzy[0].score };

  // 5. No match - new entity
  return { match: null, confidence: 0 };
}
```

### 5.2 Confidence Scoring

Each entity extraction gets a confidence score:

- **1.0**: Structured data (JSON-LD Person with explicit name)
- **0.95**: Microformat (h-card with p-name)
- **0.9**: Email/phone regex (unambiguous patterns)
- **0.8**: Date/time regex (common formats)
- **0.7**: Wikidata-validated named entity
- **0.5**: Capitalized multi-word phrase in article text
- **0.3**: Single capitalized word (high false positive rate)

Threshold for storage: **0.5** (configurable in extension settings).
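
The scoring list and storage threshold in 5.2 reduce to a lookup plus a filter. A minimal sketch, assuming hypothetical source keys (`'json-ld'`, `'regex-contact'`, and so on) and hypothetical function names; the numeric values are copied from the list, unknown sources score 0, and the threshold is treated as inclusive, which the text does not state either way.

```javascript
// Confidence per extraction source; values from the 5.2 list, keys are assumed names.
const SOURCE_CONFIDENCE = {
  'json-ld': 1.0,
  'microformat': 0.95,
  'regex-contact': 0.9,      // email / phone
  'regex-date': 0.8,
  'wikidata-validated': 0.7,
  'capitalized-phrase': 0.5,
  'capitalized-word': 0.3
};

const DEFAULT_THRESHOLD = 0.5; // storage threshold from 5.2

// Attach a confidence score; unknown sources get 0 and are always filtered out.
function scoreExtraction(extraction) {
  return { ...extraction, confidence: SOURCE_CONFIDENCE[extraction.source] ?? 0 };
}

// Keep only extractions at or above the threshold (inclusive by assumption).
function filterByConfidence(extractions, threshold = DEFAULT_THRESHOLD) {
  return extractions.map(scoreExtraction).filter(e => e.confidence >= threshold);
}
```

With the default threshold, a capitalized multi-word phrase (0.5) is stored while a single capitalized word (0.3) is dropped.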

### 5.3 Entity Merging

When two entities are later determined to be the same:

```javascript
async function mergeEntities(keepId, mergeId) {
  const keep = await getItem(keepId);
  const merge = await getItem(mergeId);

  // Combine aliases
  const keepMeta = JSON.parse(keep.metadata);
  const mergeMeta = JSON.parse(merge.metadata);
  keepMeta.aliases = [...new Set([
    ...(keepMeta.aliases || []),
    merge.content, // add merged entity's name as alias
    ...(mergeMeta.aliases || [])
  ])];
  keepMeta.mergedFrom = [...(keepMeta.mergedFrom || []), mergeId];

  // Move observations
  const observations = await queryItemEvents({ itemId: mergeId });
  for (const obs of observations.data) {
    await addItemEvent(keepId, {
      content: obs.content,
      value: obs.value,
      occurredAt: obs.occurredAt,
      metadata: obs.metadata
    });
    await deleteItemEvent(obs.id);
  }

  // Update entity
  await updateItem(keepId, { metadata: JSON.stringify(keepMeta) });
  await deleteItem(mergeId); // soft delete
}
```

## 6. Integration Points

### 6.1 Command Palette

Register commands in `entities/background.js`:

| Command | Description | Behavior |
|---------|-------------|----------|
| `entities` | Open entity browser | Opens `peek://ext/entities/home.html` |
| `entity <name>` | Search entities by name | Fuzzy search, show matches |
| `entity:extract` | Extract entities from current page | Run pipeline on focused window |
| `entity:people` | Browse person entities | Filter entity browser to type=person |
| `entity:places` | Browse place entities | Filter entity browser to type=place |
| `entity:merge` | Merge two entities | Interactive merge workflow |

### 6.2 Tags Integration

Entity types map to tags:
- Every entity item gets tagged with `entity:{type}` (e.g., `entity:person`, `entity:place`).
- Users can add additional tags to entities like any item.
- Tag-based filtering in the tags extension works out of the box: click `entity:person` to see all person entities.

### 6.3 Groups Integration

- Create smart groups based on entity relationships: "All pages mentioning Alice" = group with query on entity observations.
- Entities found on pages in a group could surface in group metadata.

### 6.4 Search Integration

Entities participate in the existing search infrastructure:
- `queryItems({ type: 'entity', search: 'tortoise' })` for direct entity search.
- When searching URLs/notes, also show entities found on matching pages.
- Entity names as search suggestions in cmd.

### 6.5 Page Info / Metadata Overlay

When viewing a page, show detected entities in the page info overlay:
- List of entities found on the current page.
- Each entity links to its detail view.
- "Also seen on" links showing other pages where the entity appears.
- This ties into the "Page info/metadata/action widgets" todo.

### 6.6 Feeds Integration

When feeds pull in new entries, run entity extraction on entry content:
- Extract entities from RSS item descriptions.
- Link entity observations to the feed entry's URL.
- Over time, feeds become a rich source of entity observations.

### 6.7 Sync

- Entity items sync like any other item (via `syncId`/`syncedAt`).
- Entity observations (`item_events`) are local-only by default (high volume). Observations can be opted in to sync per entity via `metadata.sync: true`.
- Wikidata QIDs provide stable cross-device entity identity.

## 7. Wikidata Integration

### 7.1 Validation

After extracting a named entity, check Wikidata for validation:

```javascript
async function validateWithWikidata(entityName, entityType) {
  // Use Wikidata search API
  const url = `https://www.wikidata.org/w/api.php?action=wbsearchentities&search=${encodeURIComponent(entityName)}&language=en&format=json&limit=5`;
  const response = await fetch(url);
  if (!response.ok) return [];
  const data = await response.json();

  // Entity type -> Wikidata instance-of (P31) class.
  // Not applied yet: wbsearchentities results carry no claims, so filtering
  // by entityType needs a follow-up lookup of each candidate's P31 values.
  const typeMap = {
    person: 'Q5',
    organization: 'Q43229',
    place: 'Q17334923',
    event: 'Q1656682'
  };

  // Return candidate matches with QIDs, each scored against the extracted name
  return (data.search || []).map(result => ({
    qid: result.id,
    label: result.label,
    description: result.description,
    confidence: calculateWikidataConfidence(result, entityName)
  }));
}
```

### 7.2 Triangulation

Use Wikidata relationships to infer entity connections:
- "Tortoise" (Q1400189) -> has member -> "John McEntire" (Q3182157)
- If both entities are observed, create inferred relationship.
- Store inferred links in entity metadata: `relationships: [{ entityId, type, wikidataProperty, confidence }]`.

### 7.3 Enrichment

Pull attributes from Wikidata to fill entity details:
- Person: birth date, occupation, image
- Place: coordinates, population, country
- Organization: founding date, headquarters, website

Cache Wikidata responses locally. Re-validate periodically.

## 8. Extension API

### 8.1 Pubsub Topics

```javascript
// Published by entities extension
'entities:extracted'    // { url, entities: [{ id, name, type, confidence }] }
'entities:created'      // { entityId, name, type }
'entities:updated'      // { entityId, observation }
'entities:merged'       // { keepId, mergedId }

// Subscribed by entities extension
'entities:extract'      // { url, content? } - request extraction
'entities:search'       // { query, type? } - search entities
'entities:get-for-url'  // { url } - get entities found on a URL
```

### 8.2 Datastore Queries

Other extensions query entities using existing `api.datastore` methods:

```javascript
// Get all person entities
const people = await api.datastore.queryItems({ type: 'entity', metadata: { entityType: 'person' } });

// Get entities observed on a specific URL
const observations = await api.datastore.queryItemEvents({
  content: 'https://example.com/page'
});
const entityIds = observations.data.map(o => o.itemId);

// Get entity with observations
const entity = await api.datastore.getItem(entityId);
const history = await api.datastore.queryItemEvents({ itemId: entityId, limit: 50 });
```

## 9. UI Design

### 9.1 Entity Browser (`home.html`)

A card grid (using `peek-grid` and `peek-card` components) showing all entities:

- **Filter bar**: Type filter (All / People / Places / Events / Organizations / Numbers)
- **Search input**: Fuzzy search across entity names and aliases
- **Sort**: By frecency (default), alphabetical, recently seen, most observed
- **Cards**: Entity name, type icon, observation count, last seen date, confidence indicator
- **Click**: Opens entity detail view

### 9.2 Entity Detail View (`entity-detail.html`)

- **Header**: Entity name, type badge, Wikidata link, confidence score
- **Aliases**: List of known aliases, editable
- **Attributes**: Type-specific structured data (email, phone, address, etc.)
- **Source URLs**: List of pages where entity was observed, sorted by recency
  - Each with visit timestamp, extraction confidence, extracted text snippet
- **Related entities**: Other entities frequently co-occurring on the same pages
- **Tags**: Standard tag editing (shared with existing tag editing UI)
- **Actions**: Merge, delete, re-extract, open Wikidata

### 9.3 Page Entity Overlay

When viewing a web page, the page info overlay shows:
- Count of entities detected on this page
- List of entity cards (inline, compact)
- Click entity to open detail view

### 9.4 Cmd Entity Suggestions

When typing in cmd, entity names appear as suggestions alongside URLs and commands. Selecting an entity opens its detail view or offers actions (open source URLs, view related, etc.).

## 10. Implementation Phases

### Phase 1: Foundation (Core Infrastructure)

1. **Schema migration**: Add `entity` to items type CHECK constraint.
2. **`extensions/entities/` skeleton**: Manifest, background script, init/uninit lifecycle.
3. **Regex extractors**: Email, phone, date/time, tracking numbers, prices.
4. **Entity store**: CRUD operations using `api.datastore` item and item_event APIs.
5. **Basic matching**: Exact name + type dedup.
6. **Extraction trigger**: Hook into `page:loaded` pubsub or `trackWindowLoad`.
7. **Commands**: `entities`, `entity:extract`.
8. **Basic browser UI**: Card grid listing all entities, filter by type.

**Deliverable**: Structured data (emails, phones, dates) automatically extracted from visited pages, viewable in entity browser.

### Phase 2: Structured Data + Wikidata

1. **Structured data extractors**: JSON-LD, microformats (h-card, h-event, h-adr), Open Graph.
2. **Readability integration**: Clean content extraction for better entity context.
3. **Wikidata validation**: API integration for entity validation/enrichment.
4. **Confidence scoring**: Multi-signal confidence calculation.
5. **Entity detail view**: Full detail page with observations, attributes, related entities.
6. **Page entity overlay**: Show entities in page info.
7. **Search integration**: Entity names as cmd suggestions.

**Deliverable**: Named entities (people, places, orgs) extracted from structured data, validated against Wikidata, browsable with full detail views.

### Phase 3: NER + Relationships

1. **Heuristic NER**: Capitalization analysis, context patterns, gazetteer matching.
2. **Approximate matching**: Fuzzy dedup, alias resolution.
3. **Entity merging**: UI and API for merging duplicate entities.
4. **Relationship inference**: Co-occurrence analysis, Wikidata relationship mining.
5. **Smart groups**: Auto-generated groups from entity clusters.
6. **Feeds integration**: Entity extraction from feed entries.

**Deliverable**: Statistical entity extraction from unstructured text, entity relationship graph, cross-page entity linking.

### Phase 4: ML + Polish

1. **ML NER models**: ONNX models in Web Workers for offline NER.
2. **Micro-model training**: Train specialized models from accumulated entity data.
3. **Expiration policies**: Archiving low-confidence, stale entities.
4. **Import/export**: Entity data portability.
5. **Sync optimization**: Selective entity sync, conflict resolution.
6. **Performance tuning**: Extraction throttling, index optimization, cache strategies.

**Deliverable**: ML-powered entity extraction, self-improving models, production-ready performance.

## 11. Technical Considerations

### 11.1 Performance

- Entity extraction must never block page rendering. All extraction runs asynchronously after page load.
- Debounce re-extraction: if a URL was processed within the last hour, skip (configurable).
- Batch entity creation: collect all extractions from a page, then write to datastore in a single transaction.
- SQLite JSON functions (`json_extract`) are fast but should not be in hot paths. Denormalize entity type into a tag for fast filtering.
- Consider a background re-processing queue for pages visited before the extension was installed.

### 11.2 Content Access Challenges

- **Webview content**: Pages loaded in webviews (`<webview>` tags in `app/page/page.js`) require `executeJavaScript` calls into the guest page. This is available via Electron's webview API.
- **Cross-origin restrictions**: Content scripts can read DOM of the loaded page, but not iframes from different origins. Accept this limitation.
- **SPAs**: Single-page apps change content without navigation. The `did-navigate` hook misses these. Consider `did-navigate-in-page` for hash changes, but full SPA content observation requires MutationObserver-based approaches (Phase 3+).

### 11.3 Privacy

- Entity observations stay local by default (sync opt-in per entity); entity items follow the existing items sync pipeline (see 6.7).
- No telemetry or external API calls except Wikidata (user-visible, rate-limited).
- Entity extraction can be globally toggled off in extension settings.
- Individual entity types can be toggled (e.g., disable person detection).
- Private/incognito windows should skip entity extraction.

### 11.4 Storage Volume

Conservative estimates:
- Average page yields 5-10 entity observations.
- 100 pages/day = 500-1000 observations/day.
- Each observation is ~200 bytes in item_events.
- 365 days = ~37-73 MB of observations (500-1000 observations/day x 200 bytes x 365). Manageable for SQLite.
- Unique entities grow slowly (Zipf distribution): ~50-100 new entities/day initially, declining.

### 11.5 Extension Settings Schema

```json
{
  "type": "object",
  "properties": {
    "enabled": { "type": "boolean", "default": true },
    "extractOnPageLoad": { "type": "boolean", "default": true },
    "confidenceThreshold": { "type": "number", "default": 0.5, "minimum": 0.1, "maximum": 1.0 },
    "reextractCooldownMinutes": { "type": "integer", "default": 60 },
    "enabledTypes": {
      "type": "object",
      "properties": {
        "email": { "type": "boolean", "default": true },
        "phone": { "type": "boolean", "default": true },
        "date": { "type": "boolean", "default": true },
        "address": { "type": "boolean", "default": true },
        "person": { "type": "boolean", "default": true },
        "organization": { "type": "boolean", "default": true },
        "place": { "type": "boolean", "default": true },
        "event": { "type": "boolean", "default": true },
        "trackingNumber": { "type": "boolean", "default": true },
        "price": { "type": "boolean", "default": true }
      }
    },
    "wikidataValidation": { "type": "boolean", "default": true },
    "skipInternalPages": { "type": "boolean", "default": true }
  }
}
```

## 12. Dependencies

### External Libraries (Evaluate)

| Library | Purpose | Size | Notes |
|---------|---------|------|-------|
| [microformats-parser](https://github.com/microformats/microformats-parser) | Parse microformats2 | ~15 KB | Mentioned in peek-todo.md |
| [@mozilla/readability](https://github.com/mozilla/readability) | Clean text extraction | ~30 KB | Mentioned in peek-todo.md |
| [chrono-node](https://github.com/wanasit/chrono) | Natural language date parsing | ~150 KB | For date/time entity extraction |
| [libphonenumber-js](https://github.com/catamphetamine/libphonenumber-js) | Phone number parsing | ~90 KB (min) | For phone number extraction |

### No External Dependencies Needed For

- Email extraction (regex, already in `app/components/schema.js` FORMATS)
- URL extraction (already exists in multiple places)
- JSON-LD extraction (DOM query + JSON.parse)
- Open Graph extraction (DOM meta tag queries)
- Basic NER heuristics (custom code)
- Wikidata API (fetch + JSON)

## 13. Open Questions

1. **Entity type as tag vs. metadata field**: Using `metadata.entityType` keeps entity classification in structured data. Using tags (e.g., `entity:person`) makes it filterable through existing tag UI. Recommendation: both -- store in metadata AND auto-tag.

2. **Entity identity across devices**: Wikidata QIDs provide stable identity. For entities without Wikidata matches (e.g., "my friend Alice"), identity relies on name + type + device sync. Potential for duplicates across devices -- merge UI is essential.

3. **Table extraction**: The todo mentions "extract a table as CSV." This is a chaining/connector concern more than an entity concern. Tables could be extracted by the lists/csv command pipeline, with entities extracted from table cell contents.

4. **Calendar integration**: The todo mentions "layer outside of web page, and in between pages (eg event page -> event -> any calendar page)." This is entity-triggered action: detect event entity -> offer "add to calendar" action. Implementation depends on calendar connector (Phase 3+).

5. **Observation granularity**: Should we record one observation per entity per page load, or one per occurrence on the page? Recommendation: one per page load (observation = "entity X was seen on page Y at time T"), with occurrence count in observation metadata.

6. **Background reprocessing**: Should the extension reprocess historical URLs? Yes, but as a low-priority background task, processing a few URLs per minute to avoid resource contention. Gate behind user opt-in.
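
The recommendation in question 5 (one observation per entity per page load, with an occurrence count in the observation metadata) can be sketched as a pure function over raw extractions. This is an illustration under assumptions, not a settled design: `collapseOccurrences` is a hypothetical name, the grouping key (entity type plus trimmed, lowercased name) stands in for the fuller matching in 5.1, and keeping the highest confidence among occurrences is one possible choice.

```javascript
// Collapse per-occurrence extractions into one observation per entity for a page load.
// Hypothetical helper; the output shape follows the item_events example in 4.2.
function collapseOccurrences(rawExtractions, url, occurredAt) {
  const byEntity = new Map();
  for (const ex of rawExtractions) {
    // Assumed grouping key: type + normalized name
    const key = `${ex.entityType}:${ex.name.trim().toLowerCase()}`;
    const existing = byEntity.get(key);
    if (existing) {
      existing.occurrenceCount += 1;
      // Assumed policy: keep the strongest signal seen on the page
      existing.confidence = Math.max(existing.confidence, ex.confidence);
    } else {
      byEntity.set(key, {
        name: ex.name.trim(),
        entityType: ex.entityType,
        confidence: ex.confidence,
        occurrenceCount: 1
      });
    }
  }
  return [...byEntity.values()].map(e => ({
    content: url,        // source URL
    value: e.confidence, // extraction confidence for this observation
    occurredAt,
    metadata: JSON.stringify({
      name: e.name,
      entityType: e.entityType,
      occurrenceCount: e.occurrenceCount
    })
  }));
}
```

Recording a count instead of one row per occurrence keeps `item_events` growth proportional to page loads rather than to page length.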