experiments in a post-browser web

test(entities): add comprehensive test data for entity recognition

+2347 -58
+16
DEVELOPMENT.md
···
 - Transient state propagates to child windows (e.g., cmd → overlay)
 - Use `api.izui.isTransient()` to query current session state

+### Page View Canvas Architecture
+
+**DO NOT REPLACE this architecture.** It was accidentally destroyed on Feb 8 2025 (commit `nnpvvvvk`) when a coordinate-mapping bug in slides was "fixed" by removing the entire canvas model. The real fix was to exclude slides from the canvas path via `useCanvas`. This section documents the restored architecture.
+
+Web pages opened as content or child-content use a **fullscreen transparent canvas**:
+
+1. **Backend** (`ipc.ts`): A `useCanvas` flag determines whether a web page gets the canvas treatment. Canvas pages (`useCanvas = true`) are content/child-content web pages that are not modals, overlays, or quick-views. Non-canvas web pages (slides, modals) load their URL directly in a positioned BrowserWindow.
+
+2. **Fullscreen transparent BrowserWindow**: The canvas window is sized to cover the entire display work area at position (0,0) via `screen.getDisplayNearestPoint()`. Background color is `#00000000` (fully transparent). This creates an invisible surface for positioning UI elements.
+
+3. **JS positioning** (`page.js`): All elements (`<webview>`, navbar, trigger zone, resize handle, mode indicator) are `position: absolute` siblings with NO CSS top/left/right/bottom. JS `updatePositions()` sets inline styles based on a `bounds` object `{x, y, width, height}` parsed from URL params.
+
+4. **Custom drag/resize**: Dragging the navbar updates `bounds.x`/`bounds.y`. Resizing via the corner handle updates `bounds.width`/`bounds.height`. Both call `updatePositions()`. No `-webkit-app-region: drag` — the navbar uses a custom mousedown handler.
+
+5. **Navbar**: Sits ABOVE the webview with an 8px gap. Shown by Cmd+L or hovering the trigger zone. Has rounded corners and grab cursor.
+
 ### Data Storage

 **Settings Storage (localStorage)**:
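Reviewer note: the `useCanvas` rule in item 1 of the new section is stated only in prose, and the `ipc.ts` code is not part of this diff. A minimal sketch of that rule as a predicate follows. The function name and the descriptor fields (`url`, `role`, `modal`, `overlay`, `quickView`) are hypothetical, chosen for illustration rather than taken from the backend.

```javascript
// Hypothetical sketch of the useCanvas decision described in DEVELOPMENT.md.
// The descriptor shape below is an assumption; ipc.ts may model this differently.
function shouldUseCanvas(win) {
  // Only real web pages get the canvas treatment.
  const isWebPage = /^https?:\/\//.test(win.url);
  // Content and child-content windows qualify; slides and other roles do not.
  const isContentRole = win.role === 'content' || win.role === 'child-content';
  // Modals, overlays, and quick-views load directly in a positioned window.
  const isExcluded = Boolean(win.modal || win.overlay || win.quickView);
  return isWebPage && isContentRole && !isExcluded;
}
```

Keeping the decision in one predicate like this would make the slide exclusion explicit, which is the case the section says was mishandled before.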
+7 -21
app/page/index.html
···
   }

   /*
-   * Webview fills the entire window.
-   * When the navbar is visible, the webview shifts down via .navbar-active
-   * to avoid the Electron <webview> layer rendering over the navbar.
+   * Webview — positioned by JS via updatePositions() on the transparent canvas.
+   * No explicit top/left/right/bottom — JS sets these based on bounds.
    */
   webview {
     position: absolute;
-    top: 0; left: 0; right: 0; bottom: 0;
     border: none;
     border-radius: 10px;
     overflow: hidden;
     -webkit-mask-image: -webkit-radial-gradient(white, white);
-    transition: top 0.15s ease;
-  }
-
-  /* Push webview down when navbar is active so it doesn't cover the navbar */
-  webview.navbar-active {
-    top: 36px;
   }

   /*
-   * Navbar — full-width bar at the top of the window.
-   *
-   * Shown via Cmd+L or hovering near the top of the window.
-   * Same width as the page, no floating bubble effect.
-   * Draggable via -webkit-app-region: drag.
+   * Navbar — positioned by JS via updatePositions() ABOVE the webview with a gap.
+   * Dragging is handled by custom JS (moves bounds), not -webkit-app-region.
    */
   .navbar {
     position: absolute;
-    top: 0; left: 0; right: 0;
     height: 36px;
     display: none;
     align-items: center;
···
     font-family: var(--theme-font-sans, system-ui, -apple-system, BlinkMacSystemFont, sans-serif);
     font-size: 15px;
     border: none;
-    -webkit-app-region: drag;
+    border-radius: 10px;
+    cursor: grab;
     user-select: none;
     -webkit-user-select: none;
     z-index: 100;
···
   }

   .nav-btn {
-    -webkit-app-region: no-drag;
     background: none;
     border: none;
     color: var(--theme-text-secondary, #bbb);
···
   }

   .url-text {
-    -webkit-app-region: no-drag;
     flex: 1;
     overflow: hidden;
     text-overflow: ellipsis;
···
     background: var(--theme-bg, rgba(128, 128, 128, 0.18));
   }

-  /* Trigger zone at top of window — hover here to reveal the navbar */
+  /* Trigger zone — positioned by JS via updatePositions() above the navbar area */
   .trigger-zone {
     position: absolute;
-    top: 0; left: 0; right: 0;
     height: 50px;
     z-index: 50;
   }
+127 -26
app/page/page.js
···
 /**
- * peek://page - Container for web content
+ * peek://page - Fullscreen transparent canvas container for web content
  *
- * Content-sized BrowserWindow with:
- * - Webview filling the entire window
- * - Full-width navbar at the top (Cmd+L or hover near top to show, Escape/click-outside to dismiss)
- * - When shown, the webview shifts down so the Electron <webview> layer doesn't cover the navbar
- * - Drag via -webkit-app-region: drag on the navbar
- * - Resize via IPC to resize the BrowserWindow
+ * Architecture: A fullscreen transparent BrowserWindow covers the entire display.
+ * All UI elements (webview, navbar, trigger zone, resize handle, mode indicator)
+ * are position:absolute siblings, positioned by JS via updatePositions() using
+ * a `bounds` object. The navbar sits ABOVE the webview with an 8px gap.
+ *
+ * - Custom drag: mousedown on navbar moves bounds
+ * - Custom resize: mousedown on resize handle changes bounds size
+ * - Hover trigger zone above navbar reveals/hides navbar
+ * - Cmd+L shows navbar with URL focus
  */

 import api from '../api.js';

 const DEBUG = true;

-// Parse URL parameters
+// --- Constants ---
+const NAVBAR_HEIGHT = 36;
+const NAVBAR_GAP = 8;
+const MIN_WIDTH = 200;
+const MIN_HEIGHT = 150;
+
+// --- Parse URL parameters ---
 const params = new URLSearchParams(window.location.search);
 const targetUrl = params.get('url');
+const initialX = parseInt(params.get('x')) || 100;
+const initialY = parseInt(params.get('y')) || 100;
+const initialWidth = parseInt(params.get('width')) || 800;
+const initialHeight = parseInt(params.get('height')) || 600;

 if (!targetUrl) {
   console.error('[page] No URL provided');
···
   throw new Error('No URL provided to peek://page');
 }

-DEBUG && console.log('[page] Loading:', targetUrl);
+DEBUG && console.log('[page] Loading:', targetUrl, 'at', initialX, initialY, initialWidth, initialHeight);

-// DOM elements
+// --- Bounds state ---
+// All positioning is derived from this single object
+let bounds = {
+  x: initialX,
+  y: initialY,
+  width: initialWidth,
+  height: initialHeight,
+};
+
+// --- DOM elements ---
 const navbar = document.getElementById('navbar');
 const triggerZone = document.getElementById('trigger-zone');
 const webview = document.getElementById('content');
···
 const urlText = document.getElementById('url-text');
 const modeIndicator = document.getElementById('mode-indicator');

-// Set up webview partition for session isolation and load the target URL
+// --- Position all elements based on bounds ---
+
+function updatePositions() {
+  const { x, y, width, height } = bounds;
+
+  // Webview — the main content area
+  webview.style.left = `${x}px`;
+  webview.style.top = `${y}px`;
+  webview.style.width = `${width}px`;
+  webview.style.height = `${height}px`;
+
+  // Navbar — above the webview with a gap
+  const navbarTop = Math.max(0, y - NAVBAR_GAP - NAVBAR_HEIGHT);
+  navbar.style.left = `${x}px`;
+  navbar.style.top = `${navbarTop}px`;
+  navbar.style.width = `${width}px`;
+
+  // Trigger zone — covers the area above the webview where hover reveals the navbar
+  const triggerTop = Math.max(0, navbarTop - 14);
+  triggerZone.style.left = `${x}px`;
+  triggerZone.style.top = `${triggerTop}px`;
+  triggerZone.style.width = `${width}px`;
+
+  // Resize handle — bottom-right corner of the webview
+  resizeHandle.style.left = `${x + width - 16}px`;
+  resizeHandle.style.top = `${y + height - 16}px`;
+
+  // Mode indicator — top-right corner of the webview
+  if (modeIndicator) {
+    modeIndicator.style.left = `${x + width - 120}px`;
+    modeIndicator.style.top = `${y + 8}px`;
+  }
+}
+
+// Initial positioning
+updatePositions();
+
+// --- Set up webview partition and load URL ---
+
 async function initWebview() {
   try {
     // Get the partition string for the current profile
···
 // Start initialization
 initWebview();

-// --- Resize via IPC ---
+// --- Custom drag (navbar) ---
+
+let isDragging = false;
+let dragStartX = 0;
+let dragStartY = 0;
+let dragStartBoundsX = 0;
+let dragStartBoundsY = 0;
+
+navbar.addEventListener('mousedown', (e) => {
+  // Don't start drag on buttons or URL text
+  if (e.target.closest('.nav-btn') || e.target.closest('.url-text')) return;
+  isDragging = true;
+  dragStartX = e.screenX;
+  dragStartY = e.screenY;
+  dragStartBoundsX = bounds.x;
+  dragStartBoundsY = bounds.y;
+  navbar.style.cursor = 'grabbing';
+  e.preventDefault();
+});
+
+// --- Custom resize (resize handle) ---

 let isResizing = false;
+let resizeStartX = 0;
+let resizeStartY = 0;
+let resizeStartWidth = 0;
+let resizeStartHeight = 0;

 resizeHandle.addEventListener('mousedown', (e) => {
   isResizing = true;
+  resizeStartX = e.screenX;
+  resizeStartY = e.screenY;
+  resizeStartWidth = bounds.width;
+  resizeStartHeight = bounds.height;
   e.preventDefault();
   e.stopPropagation();
 });
+
+// --- Shared mousemove/mouseup for drag and resize ---

 document.addEventListener('mousemove', (e) => {
-  if (isResizing) {
-    const newWidth = Math.max(200, e.screenX - window.screenX);
-    const newHeight = Math.max(150, e.screenY - window.screenY);
-    api.invoke('window-set-bounds', { width: newWidth, height: newHeight });
+  if (isDragging) {
+    const dx = e.screenX - dragStartX;
+    const dy = e.screenY - dragStartY;
+    bounds.x = dragStartBoundsX + dx;
+    bounds.y = dragStartBoundsY + dy;
+    updatePositions();
+  } else if (isResizing) {
+    const dx = e.screenX - resizeStartX;
+    const dy = e.screenY - resizeStartY;
+    bounds.width = Math.max(MIN_WIDTH, resizeStartWidth + dx);
+    bounds.height = Math.max(MIN_HEIGHT, resizeStartHeight + dy);
+    updatePositions();
   }
 });

 document.addEventListener('mouseup', () => {
-  isResizing = false;
+  if (isDragging) {
+    isDragging = false;
+    navbar.style.cursor = 'grab';
+  }
+  if (isResizing) {
+    isResizing = false;
+  }
 });

 // --- State display ---
···
 }

 // --- Show / Hide navbar ---
-// The full-width navbar is shown by:
+// The navbar is shown by:
 // 1. Cmd+L (published from main process via before-input-event on guest/host webContents)
-// 2. Hovering near the top of the window (trigger zone)
+// 2. Hovering near the top of the webview (trigger zone)
 // It is hidden by clicking outside, pressing Escape, or moving mouse away (hover mode).
-//
-// When visible, the <webview> element shifts down (via .navbar-active class) so the
-// Electron composited webview layer doesn't paint over the navbar.

 let hideTimer = null;
 let showSource = null; // 'hover' or 'shortcut' — determines dismiss behavior
···
   }
   const wasHidden = !navbar.classList.contains('visible');
   navbar.classList.add('visible');
-  webview.classList.add('navbar-active');
   if (opts?.source) showSource = opts.source;
   if (wasHidden) {
     updateState();
···
 }

 function hide() {
+  // Don't hide while dragging
+  if (isDragging) return;
   if (hideTimer) {
     clearTimeout(hideTimer);
     hideTimer = null;
   }
   navbar.classList.remove('visible');
-  webview.classList.remove('navbar-active');
   window.getSelection().removeAllRanges();
   showSource = null;
   DEBUG && console.log('[page] Navbar hidden');
···
 });

 // --- Hover trigger zone ---
-// Mouse entering the thin strip at the top of the window shows the navbar.
+// Mouse entering the area above the webview shows the navbar.
 // Moving away from both the trigger zone AND the navbar hides it (with a small delay).

 function scheduleHide() {
···
 // Initialize mode context
 initModeContext();

-DEBUG && console.log('[page] Canvas container initialized for:', targetUrl);
+DEBUG && console.log('[page] Canvas container initialized for:', targetUrl);
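Reviewer note: the geometry in `updatePositions()` and the drag/resize handlers above can be separated from the DOM, which makes it easier to check by hand or in a unit test. The sketch below restates the same arithmetic as pure functions. The constants are copied from the diff, while the function names `layoutFor`, `dragBounds`, and `resizeBounds` are hypothetical and not part of `page.js`.

```javascript
// Constants as declared in page.js in this commit.
const NAVBAR_HEIGHT = 36;
const NAVBAR_GAP = 8;
const MIN_WIDTH = 200;
const MIN_HEIGHT = 150;

// Hypothetical pure version of updatePositions(): returns pixel offsets
// instead of writing inline styles.
function layoutFor(bounds) {
  const { x, y, width, height } = bounds;
  // Navbar sits above the webview with a gap, clamped to the top of the canvas.
  const navbarTop = Math.max(0, y - NAVBAR_GAP - NAVBAR_HEIGHT);
  // Trigger zone starts 14px above the navbar, also clamped.
  const triggerTop = Math.max(0, navbarTop - 14);
  return {
    webview: { left: x, top: y, width, height },
    navbar: { left: x, top: navbarTop, width },
    triggerZone: { left: x, top: triggerTop, width },
    resizeHandle: { left: x + width - 16, top: y + height - 16 },
  };
}

// Hypothetical helper: bounds after dragging by (dx, dy) from the drag start.
function dragBounds(start, dx, dy) {
  return { ...start, x: start.x + dx, y: start.y + dy };
}

// Hypothetical helper: bounds after resizing by (dx, dy), with minimum size enforced.
function resizeBounds(start, dx, dy) {
  return {
    ...start,
    width: Math.max(MIN_WIDTH, start.width + dx),
    height: Math.max(MIN_HEIGHT, start.height + dy),
  };
}

const start = { x: 100, y: 100, width: 800, height: 600 };
console.log(layoutFor(start).navbar.top);          // 56
console.log(dragBounds(start, 20, -30));           // x: 120, y: 70
console.log(resizeBounds(start, -700, -500));      // width: 200, height: 150
```

Written this way, one edge case stands out: when `bounds.y` is less than 44 (gap plus navbar height), the navbar is clamped to `top: 0` and can overlap the webview instead of sitting above it. Whether that matters depends on how far up the drag handler allows the page to move, which this diff does not constrain.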
+2154
backend/electron/entities.test.ts
··· 1074 1074 assert.strictEqual(normalizeName('Mary-Jane'), 'mary-jane'); 1075 1075 }); 1076 1076 }); 1077 + 1078 + // ═══════════════════════════════════════════════════════════════════════ 1079 + // COMPREHENSIVE ADDITIONAL TESTS 1080 + // ═══════════════════════════════════════════════════════════════════════ 1081 + 1082 + // ─── Regex Extractors: Realistic & Edge Cases ────────────────────── 1083 + 1084 + describe('Regex Extractors - Comprehensive', () => { 1085 + 1086 + // ── Emails: Realistic web content ──────────────────────────────── 1087 + 1088 + describe('emails in realistic web page content', () => { 1089 + it('should extract emails embedded in long paragraphs', () => { 1090 + const text = `Welcome to our company! We are a leading provider of cloud-based solutions 1091 + for enterprise customers. If you have any questions about our products, pricing, 1092 + or partnership opportunities, please don't hesitate to reach out to our team at 1093 + partnerships@cloudwidgets.io. Our customer support team is also available 24/7 1094 + at support@cloudwidgets.io to help you with any technical issues. 
For press 1095 + inquiries, contact media@cloudwidgets.io.`; 1096 + const result = extractRegexEntities(text, 'https://cloudwidgets.io'); 1097 + const emails = result.filter(e => e.entityType === 'email'); 1098 + assert.strictEqual(emails.length, 3); 1099 + const names = emails.map(e => e.name).sort(); 1100 + assert.deepStrictEqual(names, [ 1101 + 'media@cloudwidgets.io', 1102 + 'partnerships@cloudwidgets.io', 1103 + 'support@cloudwidgets.io' 1104 + ]); 1105 + }); 1106 + 1107 + it('should extract emails from contact page text', () => { 1108 + const text = `Contact Us 1109 + General Inquiries: info@megacorp.com 1110 + Sales Team: sales@megacorp.com 1111 + Technical Support: techsupport@megacorp.com 1112 + Human Resources: hr@megacorp.com 1113 + Press & Media: press@megacorp.com`; 1114 + const result = extractRegexEntities(text, 'https://megacorp.com/contact'); 1115 + const emails = result.filter(e => e.entityType === 'email'); 1116 + assert.strictEqual(emails.length, 5); 1117 + }); 1118 + 1119 + it('should extract emails from footer text', () => { 1120 + const text = `© 2026 Acme Industries. All rights reserved. 1121 + 123 Industrial Blvd, Suite 400, San Francisco, CA 94105. 
1122 + Email: legal@acmeindustries.com | Phone: (415) 555-0199`; 1123 + const result = extractRegexEntities(text, 'https://acmeindustries.com'); 1124 + const emails = result.filter(e => e.entityType === 'email'); 1125 + assert.strictEqual(emails.length, 1); 1126 + assert.strictEqual(emails[0].name, 'legal@acmeindustries.com'); 1127 + }); 1128 + 1129 + it('should extract emails with unusual but valid TLDs', () => { 1130 + const text = `curator@museum-of-art.museum 1131 + agent@luxury-travel.travel 1132 + studio@portrait.photography`; 1133 + const result = extractRegexEntities(text, 'https://example.com'); 1134 + const emails = result.filter(e => e.entityType === 'email'); 1135 + assert.strictEqual(emails.length, 3); 1136 + assert.ok(emails.some(e => e.name === 'curator@museum-of-art.museum')); 1137 + assert.ok(emails.some(e => e.name === 'agent@luxury-travel.travel')); 1138 + assert.ok(emails.some(e => e.name === 'studio@portrait.photography')); 1139 + }); 1140 + 1141 + it('should extract emails with numbers in local part', () => { 1142 + const text = 'Contact user123@domain.com or test456@company.org'; 1143 + const result = extractRegexEntities(text, 'https://example.com'); 1144 + const emails = result.filter(e => e.entityType === 'email'); 1145 + assert.strictEqual(emails.length, 2); 1146 + assert.ok(emails.some(e => e.name === 'user123@domain.com')); 1147 + assert.ok(emails.some(e => e.name === 'test456@company.org')); 1148 + }); 1149 + 1150 + it('should extract emails near punctuation marks', () => { 1151 + const text = 'Send to (email@test.com) or email2@test.com.'; 1152 + const result = extractRegexEntities(text, 'https://test.com'); 1153 + const emails = result.filter(e => e.entityType === 'email'); 1154 + // The regex matches emails, but trailing period may or may not be included 1155 + assert.ok(emails.length >= 1); 1156 + assert.ok(emails.some(e => e.name === 'email@test.com')); 1157 + }); 1158 + 1159 + it('should extract emails in mailto link text', () => 
{ 1160 + const text = 'Click here to email us: mailto:hello@startup.co for more details.'; 1161 + const result = extractRegexEntities(text, 'https://startup.co'); 1162 + const emails = result.filter(e => e.entityType === 'email'); 1163 + assert.ok(emails.length >= 1); 1164 + assert.ok(emails.some(e => e.name === 'hello@startup.co')); 1165 + }); 1166 + 1167 + it('should extract multiple emails in a dense single paragraph', () => { 1168 + const text = 'Our team: alice@co.com, bob@co.com, charlie@co.com, diana@co.com, eve@co.com, frank@co.com, grace@co.com and henry@co.com.'; 1169 + const result = extractRegexEntities(text, 'https://co.com'); 1170 + const emails = result.filter(e => e.entityType === 'email'); 1171 + assert.strictEqual(emails.length, 8); 1172 + }); 1173 + 1174 + it('should handle emails with underscores and percent signs', () => { 1175 + const text = 'Contact first_last@example.com or special%user@domain.net'; 1176 + const result = extractRegexEntities(text, 'https://example.com'); 1177 + const emails = result.filter(e => e.entityType === 'email'); 1178 + assert.strictEqual(emails.length, 2); 1179 + }); 1180 + 1181 + it('should not match CSS-like selectors as emails', () => { 1182 + const text = 'function getName() { return div.class@media; }'; 1183 + const result = extractRegexEntities(text, 'https://example.com'); 1184 + const emails = result.filter(e => e.entityType === 'email'); 1185 + // The regex may or may not match this -- depends on pattern 1186 + // The key thing: even if it matches, it won't be a valid email 1187 + // since @media is followed by ; not .tld 1188 + // Actually the regex requires .[a-zA-Z]{2,} after @, so @media; won't match 1189 + assert.strictEqual(emails.length, 0); 1190 + }); 1191 + 1192 + it('should extract email with two-letter country TLD', () => { 1193 + const text = 'Contact us at admin@example.de for German support'; 1194 + const result = extractRegexEntities(text, 'https://example.de'); 1195 + const emails = 
result.filter(e => e.entityType === 'email'); 1196 + assert.strictEqual(emails.length, 1); 1197 + assert.strictEqual(emails[0].name, 'admin@example.de'); 1198 + assert.strictEqual(emails[0].attributes.domain, 'example.de'); 1199 + }); 1200 + 1201 + it('should handle email in angle brackets', () => { 1202 + const text = 'From: "John Smith" <john.smith@company.com>'; 1203 + const result = extractRegexEntities(text, 'https://company.com'); 1204 + const emails = result.filter(e => e.entityType === 'email'); 1205 + assert.strictEqual(emails.length, 1); 1206 + assert.strictEqual(emails[0].name, 'john.smith@company.com'); 1207 + }); 1208 + }); 1209 + 1210 + // ── Phones: International & edge cases ─────────────────────────── 1211 + 1212 + describe('phone numbers - international formats', () => { 1213 + it('should extract UK phone number with +44', () => { 1214 + const text = 'London office: +44 20 7946 0958'; 1215 + const result = extractRegexEntities(text, 'https://example.co.uk'); 1216 + const phones = result.filter(e => e.entityType === 'phone'); 1217 + assert.ok(phones.length >= 1, 'Should find UK phone number'); 1218 + }); 1219 + 1220 + it('should extract French phone number with +33', () => { 1221 + const text = 'Bureau de Paris: +33 1 42 68 53 00'; 1222 + const result = extractRegexEntities(text, 'https://example.fr'); 1223 + const phones = result.filter(e => e.entityType === 'phone'); 1224 + assert.ok(phones.length >= 1, 'Should find French phone number'); 1225 + }); 1226 + 1227 + it('should extract German phone number with +49', () => { 1228 + const text = 'Berlin Büro: +49 30 1234 5678'; 1229 + const result = extractRegexEntities(text, 'https://example.de'); 1230 + const phones = result.filter(e => e.entityType === 'phone'); 1231 + assert.ok(phones.length >= 1, 'Should find German phone number'); 1232 + }); 1233 + 1234 + it('should extract Australian phone number with +61', () => { 1235 + const text = 'Sydney office: +61 2 9876 5432'; 1236 + const result = 
extractRegexEntities(text, 'https://example.com.au'); 1237 + const phones = result.filter(e => e.entityType === 'phone'); 1238 + assert.ok(phones.length >= 1, 'Should find Australian phone number'); 1239 + }); 1240 + 1241 + it('should extract US phone with dots', () => { 1242 + const text = 'Call 555.123.4567 for info'; 1243 + const result = extractRegexEntities(text, 'https://example.com'); 1244 + const phones = result.filter(e => e.entityType === 'phone'); 1245 + assert.ok(phones.length >= 1, 'Should find dot-separated phone'); 1246 + }); 1247 + 1248 + it('should extract US phone with spaces', () => { 1249 + const text = 'Call 555 123 4567 today'; 1250 + const result = extractRegexEntities(text, 'https://example.com'); 1251 + const phones = result.filter(e => e.entityType === 'phone'); 1252 + assert.ok(phones.length >= 1, 'Should find space-separated phone'); 1253 + }); 1254 + 1255 + it('should extract toll-free 1-800 numbers', () => { 1256 + const text = 'Call 1-800-555-1234 for support'; 1257 + const result = extractRegexEntities(text, 'https://example.com'); 1258 + const phones = result.filter(e => e.entityType === 'phone'); 1259 + assert.ok(phones.length >= 1, 'Should find 1-800 number'); 1260 + }); 1261 + 1262 + it('should extract 888 toll-free numbers', () => { 1263 + const text = 'For billing: 888-555-1234'; 1264 + const result = extractRegexEntities(text, 'https://example.com'); 1265 + const phones = result.filter(e => e.entityType === 'phone'); 1266 + assert.ok(phones.length >= 1, 'Should find 888 number'); 1267 + }); 1268 + 1269 + it('should extract phone numbers embedded in a contact block', () => { 1270 + const text = `Contact Information: 1271 + Main: (212) 555-0100 1272 + Fax: (212) 555-0101 1273 + Toll Free: 1-800-555-0199`; 1274 + const result = extractRegexEntities(text, 'https://example.com'); 1275 + const phones = result.filter(e => e.entityType === 'phone'); 1276 + assert.ok(phones.length >= 2, 'Should find multiple phone numbers'); 1277 + }); 
1278 + 1279 + it('should not match version numbers as phones', () => { 1280 + const text = 'Using Node.js v18.12.1 and Python 3.11.2'; 1281 + const result = extractRegexEntities(text, 'https://example.com'); 1282 + const phones = result.filter(e => e.entityType === 'phone'); 1283 + assert.strictEqual(phones.length, 0, 'Version numbers should not be phone numbers'); 1284 + }); 1285 + 1286 + it('should not match IP addresses as phones', () => { 1287 + const text = 'Server running at 192.168.1.100 on port 8080'; 1288 + const result = extractRegexEntities(text, 'https://example.com'); 1289 + const phones = result.filter(e => e.entityType === 'phone'); 1290 + // IP 192.168.1.100 - the regex might pick up parts of it 1291 + // but if it does, the digit count check should filter short ones 1292 + // Let's just assert no false positive that looks like the full IP 1293 + for (const phone of phones) { 1294 + assert.ok(!phone.name.includes('192.168'), 'Should not match IP address'); 1295 + } 1296 + }); 1297 + 1298 + it('should not match 4-digit years as phones', () => { 1299 + const text = 'From 2019 to 2025, the company grew rapidly.'; 1300 + const result = extractRegexEntities(text, 'https://example.com'); 1301 + const phones = result.filter(e => e.entityType === 'phone'); 1302 + assert.strictEqual(phones.length, 0, 'Years should not match as phones'); 1303 + }); 1304 + 1305 + it('should extract phone number next to an address', () => { 1306 + const text = '123 Main St, Anytown, USA 90210. 
Phone: (310) 555-8900'; 1307 + const result = extractRegexEntities(text, 'https://example.com'); 1308 + const phones = result.filter(e => e.entityType === 'phone'); 1309 + assert.ok(phones.length >= 1); 1310 + }); 1311 + 1312 + it('should extract +1 prefix US phone', () => { 1313 + const text = 'Dial +1 (415) 555-2671 for west coast office'; 1314 + const result = extractRegexEntities(text, 'https://example.com'); 1315 + const phones = result.filter(e => e.entityType === 'phone'); 1316 + assert.ok(phones.length >= 1); 1317 + }); 1318 + }); 1319 + 1320 + // ── Dates: All months, different formats, edge cases ───────────── 1321 + 1322 + describe('dates - all months and formats', () => { 1323 + it('should extract all 12 full month names with dates', () => { 1324 + const months = [ 1325 + 'January 15, 2026', 'February 20, 2026', 'March 5, 2026', 1326 + 'April 10, 2026', 'May 25, 2026', 'June 3, 2026', 1327 + 'July 4, 2026', 'August 31, 2026', 'September 1, 2026', 1328 + 'October 12, 2026', 'November 28, 2026', 'December 25, 2026' 1329 + ]; 1330 + for (const month of months) { 1331 + const result = extractRegexEntities(`Event on ${month}`, 'https://example.com'); 1332 + const dates = result.filter(e => e.entityType === 'date'); 1333 + assert.ok(dates.length >= 1, `Should extract date from: ${month}`); 1334 + } 1335 + }); 1336 + 1337 + it('should extract all 12 abbreviated month names with dates', () => { 1338 + const months = [ 1339 + 'Jan 15, 2026', 'Feb 20, 2026', 'Mar 5, 2026', 1340 + 'Apr 10, 2026', 'May 25, 2026', 'Jun 3, 2026', 1341 + 'Jul 4, 2026', 'Aug 31, 2026', 'Sep 1, 2026', 1342 + 'Oct 12, 2026', 'Nov 28, 2026', 'Dec 25, 2026' 1343 + ]; 1344 + for (const month of months) { 1345 + const result = extractRegexEntities(`Posted ${month}`, 'https://example.com'); 1346 + const dates = result.filter(e => e.entityType === 'date'); 1347 + assert.ok(dates.length >= 1, `Should extract date from: ${month}`); 1348 + } 1349 + }); 1350 + 1351 + it('should extract dates with 
"Posted on" prefix', () => { 1352 + const text = 'Posted on March 1, 2026 by the admin team.'; 1353 + const result = extractRegexEntities(text, 'https://example.com'); 1354 + const dates = result.filter(e => e.entityType === 'date'); 1355 + assert.ok(dates.length >= 1); 1356 + assert.ok(dates.some(d => d.name.includes('March'))); 1357 + }); 1358 + 1359 + it('should extract dates with "Last updated:" prefix', () => { 1360 + const text = 'Last updated: 2026-01-15'; 1361 + const result = extractRegexEntities(text, 'https://example.com'); 1362 + const dates = result.filter(e => e.entityType === 'date'); 1363 + assert.ok(dates.length >= 1); 1364 + assert.ok(dates.some(d => d.name === '2026-01-15')); 1365 + }); 1366 + 1367 + it('should extract leap year date', () => { 1368 + const text = 'Born on 2024-02-29 which was a leap year'; 1369 + const result = extractRegexEntities(text, 'https://example.com'); 1370 + const dates = result.filter(e => e.entityType === 'date'); 1371 + assert.ok(dates.length >= 1); 1372 + assert.ok(dates.some(d => d.name === '2024-02-29')); 1373 + }); 1374 + 1375 + it('should extract EU format dates for multiple months', () => { 1376 + const text = '10 February 2026 was the deadline. We extended to 15 March 2026.'; 1377 + const result = extractRegexEntities(text, 'https://example.com'); 1378 + const dates = result.filter(e => e.entityType === 'date'); 1379 + assert.ok(dates.length >= 2); 1380 + }); 1381 + 1382 + it('should extract slash-formatted dates', () => { 1383 + const text = 'Invoice date: 01/15/2026. 
Due date: 02/15/2026.';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.ok(dates.length >= 2);
+    });
+
+    it('should not match invalid month 13 in ISO format', () => {
+      const text = 'Reference: 2026-13-01';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.strictEqual(dates.length, 0);
+    });
+
+    it('should not match invalid day 32 in ISO format', () => {
+      const text = 'Reference: 2026-01-32';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.strictEqual(dates.length, 0);
+    });
+
+    it('should extract multiple dates from an event listing', () => {
+      const text = `Upcoming Events:
+        Spring Gala - March 20, 2026
+        Summer Picnic - June 15, 2026
+        Fall Festival - September 22, 2026
+        Holiday Party - December 18, 2026`;
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.ok(dates.length >= 4, `Expected at least 4 dates, got ${dates.length}`);
+    });
+
+    it('should extract dates with comma after day', () => {
+      const text = 'Conference starts August 15, 2026';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.ok(dates.length >= 1);
+    });
+
+    it('should extract dates without comma after day', () => {
+      const text = 'Conference starts August 15 2026';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.ok(dates.length >= 1);
+    });
+
+    it('should extract ISO dates from metadata-like text', () => {
+      const text = 'createdAt: 2026-06-01, updatedAt: 2026-06-15, publishedAt: 2026-07-01';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.strictEqual(dates.length, 3);
+    });
+  });
+
+  // ── Prices: Various currencies and formats ───────────────────────
+
+  describe('prices - currencies and formats', () => {
+    it('should extract USD symbol prices', () => {
+      const text = 'Starting at $99 per month';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract large prices with thousands separators', () => {
+      const text = 'Property listed at $1,299,999.99';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract prices without decimals', () => {
+      const text = 'Only $50 for early birds';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract $0.00 price (free)', () => {
+      const text = 'Special offer: $0.00 shipping on orders over $35.00';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract USD suffix prices', () => {
+      const text = 'Total: 199.99 USD';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract EUR prefix prices', () => {
+      const text = 'Price: EUR 250.00';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract EUR suffix prices', () => {
+      const text = 'Total cost: 49.99 EUR including tax';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract GBP prefix prices', () => {
+      const text = 'UK price: GBP 150.00';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract CAD prices', () => {
+      const text = 'Canadian price: CAD 89.99';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract AUD prices', () => {
+      const text = 'Australian price: AUD 120.00';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract JPY prices', () => {
+      const text = 'Japanese price: JPY 15000';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract CHF prices', () => {
+      const text = 'Swiss price: CHF 200.00';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract multiple prices from a product listing', () => {
+      const text = `Product Catalog:
+        Widget A - $19.99
+        Widget B - $29.99
+        Widget C - $49.99
+        Bundle Deal - $79.99`;
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 4);
+    });
+
+    it('should extract price from "Starting at" context', () => {
+      const text = 'Starting at $99 per user per month';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract price with 5-digit amount', () => {
+      const text = 'Annual subscription: $12,000.00';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(prices.length >= 1);
+    });
+  });
+
+  // ── Tracking numbers: Valid and invalid ──────────────────────────
+
+  describe('tracking numbers - various carriers', () => {
+    it('should extract valid FedEx tracking number (12 digits)', () => {
+      const text = 'Your FedEx tracking: 123456789012';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      assert.ok(tracking.some(t => t.attributes.carrier === 'FedEx'), 'Should find FedEx tracking');
+    });
+
+    it('should extract valid FedEx tracking number (15 digits)', () => {
+      const text = 'FedEx Express: 123456789012345';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      assert.ok(tracking.some(t => t.attributes.carrier === 'FedEx'), 'Should find FedEx tracking');
+    });
+
+    it('should extract valid FedEx tracking number (20 digits)', () => {
+      const text = 'FedEx Ground: 12345678901234567890';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      assert.ok(tracking.some(t => t.attributes.carrier === 'FedEx'), 'Should find FedEx tracking');
+    });
+
+    it('should extract USPS tracking starting with 92', () => {
+      const text = 'USPS tracking: 9261290100130435082657';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      assert.ok(tracking.some(t => t.attributes.carrier === 'USPS'), 'Should find USPS tracking');
+    });
+
+    it('should extract USPS tracking starting with 93', () => {
+      const text = 'Your package: 9361289878091234567890';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      assert.ok(tracking.some(t => t.attributes.carrier === 'USPS'), 'Should find USPS 93xx tracking');
+    });
+
+    it('should extract UPS tracking with mixed alphanumeric', () => {
+      const text = 'UPS tracking: 1Z12345E0205271688';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      assert.ok(tracking.some(t => t.attributes.carrier === 'UPS'));
+    });
+
+    it('should extract multiple tracking numbers from shipping confirmation', () => {
+      const text = `Your orders have shipped!
+        Order #1001: 1ZABC12345678901AB (UPS)
+        Order #1002: 1ZXYZ98765432109CD (UPS)`;
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number' && e.attributes.carrier === 'UPS');
+      assert.ok(tracking.length >= 2, 'Should find both UPS tracking numbers');
+    });
+
+    it('should not match too-short numbers as tracking', () => {
+      const text = 'Order ID: 12345';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const tracking = result.filter(e => e.entityType === 'tracking_number');
+      // 5 digits is too short for any tracking pattern
+      assert.strictEqual(tracking.length, 0, 'Short numbers should not be tracking');
+    });
+  });
+
+  // ── Mixed entity extraction from realistic pages ─────────────────
+
+  describe('mixed entities from realistic page content', () => {
+    it('should extract entities from an e-commerce receipt', () => {
+      const text = `Thank you for your order!
+        Order placed on January 28, 2026
+        Shipping to: 123 Oak Ave, Portland, OR 97201
+        Contact: orders@shopify-store.com
+        Phone: (503) 555-0142
+
+        Items:
+        - Blue Widget x2: $24.99
+        - Red Gadget x1: $49.99
+        Shipping: $5.99
+        Total: $105.96
+
+        Tracking: 1Z999AA10123456784`;
+      const result = extractRegexEntities(text, 'https://shopify-store.com/receipt');
+      const types = new Set(result.map(e => e.entityType));
+      assert.ok(types.has('email'), 'Should find email');
+      assert.ok(types.has('phone'), 'Should find phone');
+      assert.ok(types.has('date'), 'Should find date');
+      assert.ok(types.has('price'), 'Should find prices');
+      assert.ok(types.has('tracking_number'), 'Should find tracking number');
+    });
+
+    it('should extract entities from a job posting', () => {
+      const text = `Software Engineer - Remote
+        Posted: March 15, 2026
+        Salary: $120,000 - $160,000 per year
+        Apply: careers@techstartup.io
+        Questions? Call (415) 555-0200`;
+      const result = extractRegexEntities(text, 'https://techstartup.io/jobs');
+      const emails = result.filter(e => e.entityType === 'email');
+      const phones = result.filter(e => e.entityType === 'phone');
+      const dates = result.filter(e => e.entityType === 'date');
+      const prices = result.filter(e => e.entityType === 'price');
+      assert.ok(emails.length >= 1);
+      assert.ok(phones.length >= 1);
+      assert.ok(dates.length >= 1);
+      assert.ok(prices.length >= 1);
+    });
+
+    it('should extract entities from a news article footer', () => {
+      const text = `Published: 2026-02-10. Updated: 2026-02-11.
+        Contact the newsroom: tips@dailynews.com
+        Subscription: $9.99/month or $99.99/year
+        Customer service: 1-800-555-NEWS (1-800-555-6397)`;
+      const result = extractRegexEntities(text, 'https://dailynews.com');
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.ok(dates.length >= 2);
+      const emails = result.filter(e => e.entityType === 'email');
+      assert.ok(emails.length >= 1);
+    });
+
+    it('should handle very long text without crashing', () => {
+      const longText = 'This is a very long paragraph. '.repeat(1000) +
+        'Contact us at test@example.com or call (555) 123-4567. ' +
+        'Price: $99.99. Date: 2026-06-15.';
+      const result = extractRegexEntities(longText, 'https://example.com');
+      assert.ok(Array.isArray(result));
+      assert.ok(result.length >= 1, 'Should still find entities in long text');
+    });
+
+    it('should extract from text with special unicode characters', () => {
+      const text = 'Contact support@example.com for help with your €500 order placed on 2026-03-15';
+      const result = extractRegexEntities(text, 'https://example.com');
+      const emails = result.filter(e => e.entityType === 'email');
+      assert.strictEqual(emails.length, 1);
+      // € is not in the PRICE_RE pattern, so no price match for €500
+      const dates = result.filter(e => e.entityType === 'date');
+      assert.ok(dates.length >= 1);
+    });
+  });
+});
+
+// ─── Microformats Extractor: Comprehensive ──────────────────────────
+
+describe('Microformats Extractor - Comprehensive', () => {
+
+  // ── Realistic h-card scenarios ───────────────────────────────────
+
+  describe('realistic h-card profiles', () => {
+    it('should extract a complete professional profile', () => {
+      const html = `
+        <div class="h-card">
+          <img class="u-photo" src="https://example.com/photo.jpg" alt="">
+          <h1 class="p-name">Dr. Sarah Mitchell</h1>
+          <span class="p-job-title">Chief Technology Officer</span>
+          <span class="p-org">TechVentures Inc.</span>
+          <a class="u-email" href="mailto:sarah.mitchell@techventures.com">sarah.mitchell@techventures.com</a>
+          <a class="u-url" href="https://sarahmitchell.dev">sarahmitchell.dev</a>
+          <span class="p-tel">+1-650-555-0198</span>
+          <p class="p-note">20+ years of experience in distributed systems and cloud architecture.</p>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://techventures.com/team');
+      assert.strictEqual(result.length, 1);
+      const person = result[0];
+      assert.strictEqual(person.name, 'Dr. Sarah Mitchell');
+      assert.strictEqual(person.entityType, 'person');
+      assert.strictEqual(person.confidence, 0.95);
+      assert.strictEqual(person.attributes.email, 'sarah.mitchell@techventures.com');
+      assert.strictEqual(person.attributes.homepage, 'https://sarahmitchell.dev');
+      assert.strictEqual(person.attributes.phone, '+1-650-555-0198');
+      assert.strictEqual(person.attributes.role, 'Chief Technology Officer');
+      assert.strictEqual(person.attributes.organization, 'TechVentures Inc.');
+      assert.strictEqual(person.attributes.photo, 'https://example.com/photo.jpg');
+    });
+
+    it('should extract a minimal h-card with just a name', () => {
+      const html = `<div class="h-card"><span class="p-name">Bob</span></div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Bob');
+      assert.strictEqual(result[0].entityType, 'person');
+      // Attributes should be mostly empty
+      assert.ok(!result[0].attributes.email);
+      assert.ok(!result[0].attributes.homepage);
+    });
+
+    it('should extract team page with 5+ h-cards', () => {
+      const people = ['Alice Johnson', 'Bob Williams', 'Carol Davis', 'David Brown', 'Emily Wilson', 'Frank Garcia'];
+      const cards = people.map(name =>
+        `<div class="h-card"><span class="p-name">${name}</span><span class="p-job-title">Engineer</span></div>`
+      ).join('\n');
+      const html = `<div class="team">${cards}</div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com/team');
+      assert.strictEqual(result.length, 6);
+      for (const person of result) {
+        assert.strictEqual(person.entityType, 'person');
+        assert.strictEqual(person.attributes.role, 'Engineer');
+      }
+      const names = result.map(e => e.name).sort();
+      assert.deepStrictEqual(names, people.sort());
+    });
+
+    it('should handle h-card with p-given-name and p-family-name', () => {
+      const html = `
+        <div class="h-card">
+          <span class="p-given-name">Maria</span>
+          <span class="p-family-name">Rodriguez</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      // The parser does getText('.p-name, .fn') first, which will be empty,
+      // then falls back to given+family concatenation
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Maria Rodriguez');
+    });
+
+    it('should correctly identify organization h-card', () => {
+      const html = `
+        <div class="h-card">
+          <span class="p-name">Mozilla Foundation</span>
+          <span class="p-org">Mozilla Foundation</span>
+          <a class="u-url" href="https://mozilla.org">mozilla.org</a>
+          <a class="u-email" href="mailto:info@mozilla.org">info@mozilla.org</a>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://mozilla.org');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].entityType, 'organization');
+      assert.strictEqual(result[0].name, 'Mozilla Foundation');
+      assert.strictEqual(result[0].attributes.email, 'info@mozilla.org');
+    });
+
+    it('should distinguish person with org from org h-card', () => {
+      const html = `
+        <div class="h-card">
+          <span class="p-name">John Smith</span>
+          <span class="p-org">Google</span>
+        </div>
+        <div class="h-card">
+          <span class="p-name">Google</span>
+          <span class="p-org">Google</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 2);
+      const john = result.find(e => e.name === 'John Smith');
+      const google = result.find(e => e.name === 'Google');
+      assert.ok(john);
+      assert.strictEqual(john!.entityType, 'person');
+      assert.strictEqual(john!.attributes.organization, 'Google');
+      assert.ok(google);
+      assert.strictEqual(google!.entityType, 'organization');
+    });
+
+    it('should handle legacy vcard with multiple properties', () => {
+      const html = `
+        <div class="vcard">
+          <span class="fn">Jane Legacy</span>
+          <a class="email" href="mailto:jane@legacy.com">jane@legacy.com</a>
+          <a class="url" href="https://legacy.com">legacy.com</a>
+          <span class="tel">555-0100</span>
+          <span class="title">Manager</span>
+          <span class="org">Legacy Corp</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://legacy.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Jane Legacy');
+      assert.strictEqual(result[0].attributes.email, 'jane@legacy.com');
+      assert.strictEqual(result[0].attributes.role, 'Manager');
+      assert.strictEqual(result[0].attributes.organization, 'Legacy Corp');
+    });
+
+    it('should handle h-card with very long name gracefully', () => {
+      const longName = 'Professor ' + 'A'.repeat(200) + ' Smith';
+      const html = `<div class="h-card"><span class="p-name">${longName}</span></div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, longName);
+    });
+
+    it('should skip h-card with empty name', () => {
+      const html = `<div class="h-card"><span class="p-name"> </span></div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 0);
+    });
+
+    it('should handle h-card without explicit p-name but with fn', () => {
+      const html = `<div class="h-card"><span class="fn">Fallback Name</span></div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Fallback Name');
+    });
+  });
+
+  // ── Realistic h-event scenarios ──────────────────────────────────
+
+  describe('realistic h-event scenarios', () => {
+    it('should extract a multi-day conference with all details', () => {
+      const html = `
+        <div class="h-event">
+          <h2 class="p-name">PyCon US 2026</h2>
+          <time class="dt-start" datetime="2026-05-15">May 15, 2026</time>
+          <time class="dt-end" datetime="2026-05-23">May 23, 2026</time>
+          <span class="p-location">Pittsburgh Convention Center, Pittsburgh, PA</span>
+          <p class="p-description">The largest annual gathering for the Python community, featuring talks, tutorials, sprints, and an expo hall.</p>
+          <a class="u-url" href="https://us.pycon.org/2026/">Event Website</a>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://us.pycon.org/2026/');
+      assert.strictEqual(result.length, 1);
+      const event = result[0];
+      assert.strictEqual(event.name, 'PyCon US 2026');
+      assert.strictEqual(event.entityType, 'event');
+      assert.strictEqual(event.attributes.startDate, 'May 15, 2026');
+      assert.strictEqual(event.attributes.endDate, 'May 23, 2026');
+      assert.strictEqual(event.attributes.location, 'Pittsburgh Convention Center, Pittsburgh, PA');
+      assert.ok(event.attributes.description.includes('Python community'));
+      assert.strictEqual(event.attributes.url, 'https://us.pycon.org/2026/');
+    });
+
+    it('should extract event with only start date (no end date)', () => {
+      const html = `
+        <div class="h-event">
+          <span class="p-name">Monthly Meetup</span>
+          <time class="dt-start" datetime="2026-03-10T18:30:00">March 10, 6:30 PM</time>
+          <span class="p-location">Community Center</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Monthly Meetup');
+      assert.ok(result[0].attributes.startDate);
+      assert.ok(!result[0].attributes.endDate);
+    });
+
+    it('should truncate very long event description to 500 characters', () => {
+      const longDesc = 'This is a detailed description of the event. '.repeat(50);
+      assert.ok(longDesc.length > 500);
+      const html = `
+        <div class="h-event">
+          <span class="p-name">Verbose Event</span>
+          <p class="p-description">${longDesc}</p>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].attributes.description.length, 500);
+    });
+
+    it('should extract event with legacy vevent class', () => {
+      const html = `
+        <div class="vevent">
+          <span class="summary">Legacy Conference</span>
+          <span class="dtstart">2026-09-01</span>
+          <span class="location">Hotel Grand</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Legacy Conference');
+      assert.strictEqual(result[0].entityType, 'event');
+    });
+
+    it('should extract multiple events from a page', () => {
+      const html = `
+        <div class="h-event">
+          <span class="p-name">Morning Workshop</span>
+          <time class="dt-start" datetime="2026-04-01T09:00">9 AM</time>
+        </div>
+        <div class="h-event">
+          <span class="p-name">Afternoon Session</span>
+          <time class="dt-start" datetime="2026-04-01T14:00">2 PM</time>
+        </div>
+        <div class="h-event">
+          <span class="p-name">Evening Reception</span>
+          <time class="dt-start" datetime="2026-04-01T18:00">6 PM</time>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 3);
+      const names = result.map(e => e.name);
+      assert.ok(names.includes('Morning Workshop'));
+      assert.ok(names.includes('Afternoon Session'));
+      assert.ok(names.includes('Evening Reception'));
+    });
+
+    it('should skip event with name of just 1 character', () => {
+      const html = `
+        <div class="h-event">
+          <span class="p-name">A</span>
+          <time class="dt-start">2026-05-01</time>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 0);
+    });
+
+    it('should extract event with no location or description', () => {
+      const html = `
+        <div class="h-event">
+          <span class="p-name">Simple Event</span>
+          <time class="dt-start" datetime="2026-08-20">August 20</time>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Simple Event');
+      assert.ok(!result[0].attributes.location);
+      assert.ok(!result[0].attributes.description);
+    });
+  });
+
+  // ── Realistic h-adr scenarios ────────────────────────────────────
+
+  describe('realistic h-adr address scenarios', () => {
+    it('should extract full US address', () => {
+      const html = `
+        <div class="h-adr">
+          <span class="p-street-address">1600 Amphitheatre Parkway</span>
+          <span class="p-locality">Mountain View</span>
+          <span class="p-region">CA</span>
+          <span class="p-postal-code">94043</span>
+          <span class="p-country-name">United States</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].entityType, 'place');
+      assert.strictEqual(result[0].name, 'Mountain View, CA, United States');
+      assert.strictEqual(result[0].attributes.streetAddress, '1600 Amphitheatre Parkway');
+      assert.strictEqual(result[0].attributes.postalCode, '94043');
+    });
+
+    it('should extract UK-style address', () => {
+      const html = `
+        <div class="h-adr">
+          <span class="p-street-address">221B Baker Street</span>
+          <span class="p-locality">London</span>
+          <span class="p-postal-code">NW1 6XE</span>
+          <span class="p-country-name">United Kingdom</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.co.uk');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'London, United Kingdom');
+      assert.strictEqual(result[0].attributes.streetAddress, '221B Baker Street');
+      assert.strictEqual(result[0].attributes.postalCode, 'NW1 6XE');
+    });
+
+    it('should extract German address', () => {
+      const html = `
+        <div class="h-adr">
+          <span class="p-street-address">Friedrichstraße 123</span>
+          <span class="p-locality">Berlin</span>
+          <span class="p-postal-code">10117</span>
+          <span class="p-country-name">Germany</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.de');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Berlin, Germany');
+    });
+
+    it('should extract address with only city and country', () => {
+      const html = `
+        <div class="h-adr">
+          <span class="p-locality">Tokyo</span>
+          <span class="p-country-name">Japan</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.jp');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Tokyo, Japan');
+      assert.ok(!result[0].attributes.streetAddress);
+    });
+
+    it('should extract multiple addresses from one page', () => {
+      const html = `
+        <h3>Our Offices</h3>
+        <div class="h-adr">
+          <span class="p-locality">New York</span>
+          <span class="p-region">NY</span>
+          <span class="p-country-name">US</span>
+        </div>
+        <div class="h-adr">
+          <span class="p-locality">London</span>
+          <span class="p-country-name">UK</span>
+        </div>
+        <div class="h-adr">
+          <span class="p-locality">Singapore</span>
+          <span class="p-country-name">Singapore</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      const places = result.filter(e => e.entityType === 'place');
+      assert.strictEqual(places.length, 3);
+      assert.ok(places.some(p => p.name.includes('New York')));
+      assert.ok(places.some(p => p.name.includes('London')));
+      assert.ok(places.some(p => p.name.includes('Singapore')));
+    });
+
+    it('should use legacy adr class', () => {
+      const html = `
+        <div class="adr">
+          <span class="locality">Portland</span>
+          <span class="region">OR</span>
+          <span class="country-name">US</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].entityType, 'place');
+      assert.strictEqual(result[0].name, 'Portland, OR, US');
+    });
+
+    it('should handle address with only country', () => {
+      const html = `
+        <div class="h-adr">
+          <span class="p-country-name">Australia</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      // Name would be "Australia" which is >= 2 chars
+      assert.strictEqual(result.length, 1);
+      assert.strictEqual(result[0].name, 'Australia');
+    });
+  });
+
+  // ── Mixed microformat pages ──────────────────────────────────────
+
+  describe('pages with mixed microformat types', () => {
+    it('should extract persons, events, and addresses from a conference page', () => {
+      const html = `
+        <div class="h-card">
+          <span class="p-name">Keynote Speaker</span>
+          <span class="p-job-title">Distinguished Engineer</span>
+          <span class="p-org">Big Tech Co</span>
+        </div>
+        <div class="h-event">
+          <span class="p-name">Annual Developer Conference 2026</span>
+          <time class="dt-start" datetime="2026-10-15">Oct 15</time>
+          <time class="dt-end" datetime="2026-10-17">Oct 17</time>
+          <span class="p-location">Moscone Center, San Francisco</span>
+        </div>
+        <div class="h-adr">
+          <span class="p-street-address">747 Howard Street</span>
+          <span class="p-locality">San Francisco</span>
+          <span class="p-region">CA</span>
+          <span class="p-postal-code">94103</span>
+          <span class="p-country-name">US</span>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://devconf.example.com');
+      assert.strictEqual(result.length, 3);
+      const types = result.map(e => e.entityType).sort();
+      assert.deepStrictEqual(types, ['event', 'person', 'place']);
+    });
+
+    it('should handle 10 h-cards on a team page', () => {
+      const names = [
+        'Alice Anderson', 'Bob Baker', 'Carol Chen', 'David Davis',
+        'Emily Evans', 'Frank Foster', 'Grace Green', 'Henry Hill',
+        'Irene Irving', 'Jack Johnson'
+      ];
+      const cards = names.map(name =>
+        `<div class="h-card"><span class="p-name">${name}</span></div>`
+      ).join('\n');
+      const result = extractMicroformatEntities(cards, 'https://example.com');
+      assert.strictEqual(result.length, 10);
+      for (const entity of result) {
+        assert.strictEqual(entity.entityType, 'person');
+        assert.strictEqual(entity.confidence, 0.95);
+      }
+    });
+
+    it('should handle HTML with no microformat classes', () => {
+      const html = `
+        <div class="card">
+          <h2>John Smith</h2>
+          <p>Software Engineer at Acme Corp</p>
+          <p>john@acme.com</p>
+        </div>`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      assert.strictEqual(result.length, 0);
+    });
+
+    it('should handle malformed HTML gracefully', () => {
+      const html = `<div class="h-card"><span class="p-name">Unclosed Name`;
+      const result = extractMicroformatEntities(html, 'https://example.com');
+      // The DOMParser should still parse it, just with implied closing tags
+      assert.ok(Array.isArray(result));
+    });
+  });
+});
+
+// ─── Structured Data Extractor: Comprehensive ──────────────────────
+
+describe('Structured Data Extractor - Comprehensive', () => {
+
+  // ── JSON-LD: Realistic page scenarios ─────────────────────────────
+
+  describe('JSON-LD - realistic blog post page', () => {
+    it('should extract Person author and Organization publisher from article JSON-LD', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@context": "https://schema.org",
+          "@type": "Article",
+          "headline": "Understanding Machine Learning Basics",
+          "author": {
+            "@type": "Person",
+            "name": "Dr. Emily Zhang",
+            "email": "emily@airesearch.org",
+            "jobTitle": "Research Scientist",
+            "worksFor": {
+              "@type": "Organization",
+              "name": "AI Research Lab"
+            }
+          },
+          "publisher": {
+            "@type": "Organization",
+            "name": "Tech Blog Weekly",
+            "url": "https://techblogweekly.com",
+            "logo": {
+              "url": "https://techblogweekly.com/logo.png"
+            }
+          },
+          "datePublished": "2026-01-15",
+          "dateModified": "2026-02-01"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://techblogweekly.com/ml-basics');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      // Article itself should be extracted as creative_work
+      const article = jsonLd.find(e => e.entityType === 'creative_work');
+      assert.ok(article, 'Should find article as creative_work');
+      assert.strictEqual(article!.name, 'Understanding Machine Learning Basics');
+    });
+
+    it('should extract entities from an e-commerce product page', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@context": "https://schema.org",
+          "@type": "Product",
+          "name": "Premium Wireless Headphones",
+          "brand": { "name": "AudioMax" },
+          "description": "High-fidelity wireless headphones with active noise cancellation",
+          "offers": {
+            "@type": "Offer",
+            "price": "299.99",
+            "priceCurrency": "USD"
+          }
+        }
+        </script>
+        <script type="application/ld+json">
+        {
+          "@type": "Organization",
+          "name": "ElectroShop",
+          "url": "https://electroshop.com",
+          "telephone": "+1-888-555-0100"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://electroshop.com/headphones');
+      const product = result.find(e => e.entityType === 'product');
+      assert.ok(product, 'Should find product');
+      assert.strictEqual(product!.name, 'Premium Wireless Headphones');
+      assert.strictEqual(product!.attributes.brand, 'AudioMax');
+      assert.strictEqual(product!.attributes.price, 'USD299.99');
+
+      const org = result.find(e => e.entityType === 'organization');
+      assert.ok(org, 'Should find organization');
+      assert.strictEqual(org!.name, 'ElectroShop');
+    });
+
+    it('should extract LocalBusiness as place type (due to typeMap override)', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@context": "https://schema.org",
+          "@type": "LocalBusiness",
+          "name": "Joe's Coffee Shop",
+          "telephone": "+1-555-0142",
+          "address": {
+            "@type": "PostalAddress",
+            "streetAddress": "456 Oak Avenue",
+            "addressLocality": "Portland",
+            "addressRegion": "OR",
+            "postalCode": "97201"
+          }
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://joescoffee.com');
+      const entity = result.find(e => e.extractor === 'json-ld');
+      assert.ok(entity);
+      // LocalBusiness is mapped to 'place' (last duplicate key wins in typeMap)
+      assert.strictEqual(entity!.entityType, 'place');
+      assert.strictEqual(entity!.name, "Joe's Coffee Shop");
+    });
+
+    it('should extract Restaurant as organization', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Restaurant",
+          "name": "Bella Italia",
+          "telephone": "+1-555-0200",
+          "address": "789 Pasta Lane, Rome Town"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://bellaitalia.com');
+      const org = result.find(e => e.entityType === 'organization');
+      assert.ok(org, 'Should find restaurant as organization');
+      assert.strictEqual(org!.name, 'Bella Italia');
+      assert.strictEqual(org!.schemaType, 'Restaurant');
+    });
+
+    it('should skip WebSite type (not in typeMap)', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "WebSite",
+          "name": "Example Website",
+          "url": "https://example.com"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 0, 'WebSite type should be skipped');
+    });
+
+    it('should skip BreadcrumbList type', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "BreadcrumbList",
+          "name": "Navigation Breadcrumbs",
+          "itemListElement": [
+            { "@type": "ListItem", "position": 1, "name": "Home" }
+          ]
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 0, 'BreadcrumbList should be skipped');
+    });
+
+    it('should skip Recipe type (not in typeMap)', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Recipe",
+          "name": "Chocolate Chip Cookies",
+          "author": { "@type": "Person", "name": "Chef Julia" },
+          "cookTime": "PT25M"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://recipes.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 0, 'Recipe type should be skipped');
+    });
+
+    it('should extract Event with performer and location object', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Event",
+          "name": "Summer Rock Festival",
+          "startDate": "2026-07-20T16:00:00",
+          "endDate": "2026-07-22T23:00:00",
+          "location": {
+            "@type": "Place",
+            "name": "Riverside Amphitheatre",
+            "address": {
+              "@type": "PostalAddress",
+              "addressLocality": "Austin",
+              "addressRegion": "TX"
+            }
+          },
+          "performer": {
+            "@type": "MusicGroup",
+            "name": "The Rolling Tones"
+          },
+          "description": "Three days of live music featuring top artists from around the world."
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://summerfest.com');
+      const event = result.find(e => e.entityType === 'event');
+      assert.ok(event);
+      assert.strictEqual(event!.name, 'Summer Rock Festival');
+      assert.strictEqual(event!.attributes.startDate, '2026-07-20T16:00:00');
+      assert.strictEqual(event!.attributes.endDate, '2026-07-22T23:00:00');
+      assert.strictEqual(event!.attributes.location, 'Riverside Amphitheatre');
+      assert.strictEqual(event!.attributes.performer, 'The Rolling Tones');
+    });
+
+    it('should extract MusicEvent as event type', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "MusicEvent", "name": "Classical Evening", "startDate": "2026-12-01" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const event = result.find(e => e.entityType === 'event');
+      assert.ok(event);
+      assert.strictEqual(event!.schemaType, 'MusicEvent');
+    });
+
+    it('should extract SportsEvent as event type', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "SportsEvent", "name": "Championship Finals", "startDate": "2026-11-15" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const event = result.find(e => e.entityType === 'event');
+      assert.ok(event);
+      assert.strictEqual(event!.schemaType, 'SportsEvent');
+    });
+
+    it('should extract Place from JSON-LD', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Place",
+          "name": "Golden Gate Park",
+          "address": "San Francisco, CA",
+          "geo": { "latitude": 37.7694, "longitude":
            -122.4862 }
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const place = result.find(e => e.entityType === 'place');
+      assert.ok(place);
+      assert.strictEqual(place!.name, 'Golden Gate Park');
+      assert.strictEqual(place!.attributes.address, 'San Francisco, CA');
+      assert.strictEqual(place!.attributes.latitude, 37.7694);
+      assert.strictEqual(place!.attributes.longitude, -122.4862);
+    });
+
+    it('should extract NewsArticle as creative_work', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "NewsArticle", "headline": "Breaking: Major Discovery", "datePublished": "2026-02-10" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://news.com');
+      const article = result.find(e => e.entityType === 'creative_work');
+      assert.ok(article);
+      assert.strictEqual(article!.name, 'Breaking: Major Discovery');
+      assert.strictEqual(article!.schemaType, 'NewsArticle');
+    });
+
+    it('should extract BlogPosting as creative_work', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "BlogPosting", "headline": "My Thoughts on AI", "datePublished": "2026-01-20" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://blog.com');
+      const post = result.find(e => e.entityType === 'creative_work');
+      assert.ok(post);
+      assert.strictEqual(post!.name, 'My Thoughts on AI');
+    });
+
+    it('should extract Book as creative_work', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Book", "name": "The Great Novel", "author": { "@type": "Person", "name": "Famous Author" } }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://books.com');
+      const book = result.find(e => e.entityType === 'creative_work');
+      assert.ok(book);
+      assert.strictEqual(book!.name, 'The Great Novel');
+    });
+
+    it('should extract Movie as creative_work', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Movie", "name": "Galactic Adventures", "description": "A space epic" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://movies.com');
+      const movie = result.find(e => e.entityType === 'creative_work');
+      assert.ok(movie);
+      assert.strictEqual(movie!.name, 'Galactic Adventures');
+    });
+
+    it('should extract MusicRecording as creative_work', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "MusicRecording", "name": "Summer Vibes", "description": "A chill track" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://music.com');
+      const track = result.find(e => e.entityType === 'creative_work');
+      assert.ok(track);
+      assert.strictEqual(track!.name, 'Summer Vibes');
+    });
+
+    it('should extract MusicAlbum as creative_work', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "MusicAlbum", "name": "Greatest Hits Collection" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://music.com');
+      const album = result.find(e => e.entityType === 'creative_work');
+      assert.ok(album);
+      assert.strictEqual(album!.name, 'Greatest Hits Collection');
+    });
+
+    it('should handle sameAs as array', () => {
+      const html = `<html><head>
+        <script
type="application/ld+json">
+        {
+          "@type": "Person",
+          "name": "John Developer",
+          "sameAs": [
+            "https://twitter.com/johndev",
+            "https://github.com/johndev",
+            "https://linkedin.com/in/johndev"
+          ]
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const person = result.find(e => e.entityType === 'person');
+      assert.ok(person);
+      assert.ok(Array.isArray(person!.attributes.sameAs));
+      assert.strictEqual(person!.attributes.sameAs.length, 3);
+    });
+
+    it('should handle sameAs as single string', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Person",
+          "name": "Jane Developer",
+          "sameAs": "https://twitter.com/janedev"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const person = result.find(e => e.entityType === 'person');
+      assert.ok(person);
+      assert.ok(Array.isArray(person!.attributes.sameAs));
+      assert.strictEqual(person!.attributes.sameAs[0], 'https://twitter.com/janedev');
+    });
+
+    it('should handle image as string URL', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Person",
+          "name": "Photo Person",
+          "image": "https://example.com/photo.jpg"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const person = result.find(e => e.entityType === 'person');
+      assert.ok(person);
+      assert.strictEqual(person!.attributes.image, 'https://example.com/photo.jpg');
+    });
+
+    it('should handle image as object with url', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Person",
+          "name": "Image Object Person",
+          "image": { "url": "https://example.com/headshot.jpg" }
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const person = result.find(e => e.entityType === 'person');
+      assert.ok(person);
+      assert.strictEqual(person!.attributes.image, 'https://example.com/headshot.jpg');
+    });
+
+    it('should handle Organization with logo as string', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Organization",
+          "name": "Logo Corp",
+          "logo": "https://logocorp.com/logo.png"
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://logocorp.com');
+      const org = result.find(e => e.entityType === 'organization');
+      assert.ok(org);
+      assert.strictEqual(org!.attributes.logo, 'https://logocorp.com/logo.png');
+    });
+
+    it('should handle Organization with logo as object', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Organization",
+          "name": "Object Logo Corp",
+          "logo": { "url": "https://corp.com/logo.svg" }
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://corp.com');
+      const org = result.find(e => e.entityType === 'organization');
+      assert.ok(org);
+      assert.strictEqual(org!.attributes.logo, 'https://corp.com/logo.svg');
+    });
+
+    it('should handle Product with brand as string', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Product",
+          "name": "Ultra Widget",
+          "brand": "BrandName",
+          "offers": { "price": "49.99", "priceCurrency": "USD" }
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const product = result.find(e => e.entityType === 'product');
+      assert.ok(product);
+      assert.strictEqual(product!.attributes.brand, 'BrandName');
+      assert.strictEqual(product!.attributes.price, 'USD49.99');
+    });
+
+    it('should handle Product with offers array (use first)', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Product",
+          "name": "Multi-Offer Widget",
+          "offers": [
+            { "price": "29.99", "priceCurrency": "USD" },
+            { "price": "24.99", "priceCurrency": "USD" }
+          ]
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const product = result.find(e => e.entityType === 'product');
+      assert.ok(product);
+      assert.strictEqual(product!.attributes.price, 'USD29.99');
+    });
+  });
+
+  describe('JSON-LD - @graph with multiple entity types', () => {
+    it('should extract all entities from a complex @graph', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@context": "https://schema.org",
+          "@graph": [
+            { "@type": "Organization", "name": "News Daily" },
+            { "@type": "Person", "name": "Alice Reporter", "jobTitle": "Senior Editor" },
+            { "@type": "Article", "headline": "Economy Update 2026" },
+            { "@type": "WebSite", "name": "WebsiteIgnored" },
+            { "@type": "BreadcrumbList", "name": "BreadcrumbIgnored" }
+          ]
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://newsdaily.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 3, 'Should find org, person, and article;
skip WebSite and BreadcrumbList');
+      assert.ok(jsonLd.some(e => e.name === 'News Daily' && e.entityType === 'organization'));
+      assert.ok(jsonLd.some(e => e.name === 'Alice Reporter' && e.entityType === 'person'));
+      assert.ok(jsonLd.some(e => e.name === 'Economy Update 2026' && e.entityType === 'creative_work'));
+    });
+
+    it('should handle JSON-LD array at top level', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        [
+          { "@type": "Person", "name": "Person In Array" },
+          { "@type": "Organization", "name": "Org In Array" }
+        ]
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 2);
+    });
+  });
+
+  describe('JSON-LD - edge cases and malformed data', () => {
+    it('should handle JSON-LD with empty name', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Person", "name": "" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 0);
+    });
+
+    it('should use headline as fallback when name is missing', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Article", "headline": "Article Without Name Field" }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const article = result.find(e => e.entityType === 'creative_work');
+      assert.ok(article);
+      assert.strictEqual(article!.name, 'Article Without Name Field');
+    });
+
+    it('should handle deeply
nested JSON-LD with address in address', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@type": "Organization",
+          "name": "Nested Address Org",
+          "address": {
+            "@type": "PostalAddress",
+            "streetAddress": "100 Main St",
+            "addressLocality": "Springfield",
+            "addressRegion": "IL",
+            "postalCode": "62701",
+            "addressCountry": "US"
+          }
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const org = result.find(e => e.entityType === 'organization');
+      assert.ok(org);
+      assert.strictEqual(org!.attributes.address, '100 Main St, Springfield, IL, 62701, US');
+    });
+
+    it('should handle multiple JSON-LD script tags with mixed valid/invalid', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">{ invalid json }</script>
+        <script type="application/ld+json">
+        { "@type": "Person", "name": "Valid Person" }
+        </script>
+        <script type="application/ld+json">null</script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.ok(jsonLd.length >= 1);
+      assert.ok(jsonLd.some(e => e.name === 'Valid Person'));
+    });
+
+    it('should handle JSON-LD with number as name (not string)', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Organization", "name": 12345 }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 0, 'Non-string names should be skipped');
+    });
+
+    it('should handle JSON-LD
with whitespace-only name', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Person", "name": " " }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      assert.strictEqual(jsonLd.length, 0, 'Whitespace-only names should be skipped');
+    });
+
+    it('should set confidence 1.0 for all JSON-LD entities', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        {
+          "@graph": [
+            { "@type": "Person", "name": "Person A" },
+            { "@type": "Organization", "name": "Org B" },
+            { "@type": "Event", "name": "Event C" }
+          ]
+        }
+        </script>
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      for (const entity of jsonLd) {
+        assert.strictEqual(entity.confidence, 1.0, `${entity.name} should have confidence 1.0`);
+      }
+    });
+  });
+
+  // ── Open Graph: Realistic scenarios ───────────────────────────────
+
+  describe('Open Graph - realistic scenarios', () => {
+    it('should extract profile with first and last name', () => {
+      const html = `<html><head>
+        <meta property="og:type" content="profile">
+        <meta property="og:title" content="Maria Santos">
+        <meta property="og:description" content="Full-stack developer and open source contributor">
+        <meta property="og:image" content="https://example.com/maria.jpg">
+        <meta property="og:site_name" content="DevProfiles">
+        <meta property="profile:first_name" content="Maria">
+        <meta property="profile:last_name" content="Santos">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html,
'https://devprofiles.com/maria');
+      const person = result.find(e => e.extractor === 'opengraph' && e.entityType === 'person');
+      assert.ok(person);
+      assert.strictEqual(person!.name, 'Maria Santos');
+      assert.strictEqual(person!.attributes.image, 'https://example.com/maria.jpg');
+      assert.strictEqual(person!.attributes.source, 'DevProfiles');
+    });
+
+    it('should extract business with og:type=business.business', () => {
+      const html = `<html><head>
+        <meta property="og:type" content="business.business">
+        <meta property="og:title" content="Sunrise Bakery">
+        <meta property="og:description" content="Artisan bread and pastries since 1985">
+        <meta property="og:image" content="https://sunrise.com/storefront.jpg">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://sunrise.com');
+      const org = result.find(e => e.extractor === 'opengraph' && e.entityType === 'organization');
+      assert.ok(org);
+      assert.strictEqual(org!.name, 'Sunrise Bakery');
+      assert.strictEqual(org!.confidence, 0.85);
+    });
+
+    it('should extract article author as person (non-URL)', () => {
+      const html = `<html><head>
+        <meta property="og:type" content="article">
+        <meta property="og:title" content="How to Build a Startup">
+        <meta property="article:author" content="Sarah Founder">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://blog.com');
+      const author = result.find(e => e.extractor === 'opengraph' && e.entityType === 'person');
+      assert.ok(author);
+      assert.strictEqual(author!.name, 'Sarah Founder');
+      assert.strictEqual(author!.attributes.role, 'author');
+      assert.strictEqual(author!.confidence, 0.7);
+    });
+
+    it('should NOT extract article:author when it is a URL', () => {
+      const html = `<html><head>
+        <meta
property="article:author" content="https://facebook.com/someauthor">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const ogAuthors = result.filter(e => e.extractor === 'opengraph');
+      assert.strictEqual(ogAuthors.length, 0);
+    });
+
+    it('should NOT extract article:author when it starts with http', () => {
+      const html = `<html><head>
+        <meta property="article:author" content="http://example.com/author/john">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const ogAuthors = result.filter(e => e.extractor === 'opengraph');
+      assert.strictEqual(ogAuthors.length, 0);
+    });
+
+    it('should handle og:type=profile without first/last name (fallback to title)', () => {
+      const html = `<html><head>
+        <meta property="og:type" content="profile">
+        <meta property="og:title" content="Anonymous User">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const person = result.find(e => e.extractor === 'opengraph');
+      assert.ok(person);
+      assert.strictEqual(person!.name, 'Anonymous User');
+    });
+
+    it('should not create OG entity when og:type is article (no author)', () => {
+      const html = `<html><head>
+        <meta property="og:type" content="article">
+        <meta property="og:title" content="Some Article Title">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const og = result.filter(e => e.extractor === 'opengraph');
+      assert.strictEqual(og.length, 0, 'Article without author should not create OG entity');
+    });
+
+    it('should handle og:type with business prefix variations', () => {
+      const html = `<html><head>
+        <meta property="og:type"
content="business.restaurant">
+        <meta property="og:title" content="Taco Palace">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const org = result.find(e => e.extractor === 'opengraph' && e.entityType === 'organization');
+      assert.ok(org, 'Should match business.restaurant as business type');
+      assert.strictEqual(org!.name, 'Taco Palace');
+    });
+
+    it('should handle both OG profile and article author on same page', () => {
+      const html = `<html><head>
+        <meta property="og:type" content="profile">
+        <meta property="og:title" content="Author Profile">
+        <meta property="profile:first_name" content="John">
+        <meta property="profile:last_name" content="Writer">
+        <meta property="article:author" content="John Writer">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const og = result.filter(e => e.extractor === 'opengraph');
+      // Should get profile person AND article author person
+      assert.ok(og.length >= 2);
+    });
+  });
+
+  // ── Meta tags: Realistic scenarios ────────────────────────────────
+
+  describe('meta tags - realistic scenarios', () => {
+    it('should extract author from WordPress-style meta', () => {
+      const html = `<html><head>
+        <meta name="author" content="WordPress Blogger">
+        <meta name="generator" content="WordPress 6.4">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://myblog.com');
+      const author = result.find(e => e.extractor === 'meta' && e.attributes.role === 'author');
+      assert.ok(author);
+      assert.strictEqual(author!.name, 'WordPress Blogger');
+    });
+
+    it('should extract publisher from news site meta', () => {
+      const html = `<html><head>
+        <meta name="publisher" content="The
Daily Chronicle">
+        <meta name="author" content="Staff Reporter">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://dailychronicle.com');
+      const pub = result.find(e => e.extractor === 'meta' && e.entityType === 'organization');
+      assert.ok(pub);
+      assert.strictEqual(pub!.name, 'The Daily Chronicle');
+      assert.strictEqual(pub!.attributes.role, 'publisher');
+
+      const author = result.find(e => e.extractor === 'meta' && e.entityType === 'person');
+      assert.ok(author);
+      assert.strictEqual(author!.name, 'Staff Reporter');
+    });
+
+    it('should handle author that is a comma-separated list', () => {
+      const html = `<html><head>
+        <meta name="author" content="Alice Smith, Bob Jones">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const authors = result.filter(e => e.extractor === 'meta' && e.attributes.role === 'author');
+      // The extractor treats this as a single author name
+      assert.strictEqual(authors.length, 1);
+      assert.strictEqual(authors[0].name, 'Alice Smith, Bob Jones');
+    });
+
+    it('should skip meta author that is just 2 characters', () => {
+      const html = `<html><head>
+        <meta name="author" content="AB">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const authors = result.filter(e => e.extractor === 'meta' && e.attributes.role === 'author');
+      assert.strictEqual(authors.length, 0);
+    });
+
+    it('should extract meta author with exactly 3 characters', () => {
+      const html = `<html><head>
+        <meta name="author" content="Bob">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const authors = result.filter(e => e.extractor === 'meta' &&
e.attributes.role === 'author');
+      assert.strictEqual(authors.length, 1);
+      assert.strictEqual(authors[0].name, 'Bob');
+    });
+
+    it('should skip meta author that starts with http', () => {
+      const html = `<html><head>
+        <meta name="author" content="http://example.com/profiles/john">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const authors = result.filter(e => e.extractor === 'meta' && e.attributes.role === 'author');
+      assert.strictEqual(authors.length, 0);
+    });
+
+    it('should set confidence 0.7 for meta-extracted entities', () => {
+      const html = `<html><head>
+        <meta name="author" content="Some Author">
+        <meta name="publisher" content="Some Publisher">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      const metaEntities = result.filter(e => e.extractor === 'meta');
+      for (const entity of metaEntities) {
+        assert.strictEqual(entity.confidence, 0.7);
+      }
+    });
+  });
+
+  // ── Combined: Multiple extractors on one page ────────────────────
+
+  describe('combined extractors on realistic pages', () => {
+    it('should extract from page with JSON-LD, OG, and meta tags', () => {
+      const html = `<html><head>
+        <script type="application/ld+json">
+        { "@type": "Person", "name": "Jane Expert", "jobTitle": "Professor" }
+        </script>
+        <meta property="og:type" content="profile">
+        <meta property="og:title" content="Jane Expert">
+        <meta property="profile:first_name" content="Jane">
+        <meta property="profile:last_name" content="Expert">
+        <meta name="author" content="Jane Expert">
+      </head><body></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://university.edu/jane');
+      // Should get entities from all three extractors
+      const jsonLd = result.filter(e => e.extractor === 'json-ld');
+      const og = result.filter(e => e.extractor === 'opengraph');
+      const meta = result.filter(e => e.extractor === 'meta');
+      assert.ok(jsonLd.length >= 1, 'Should have JSON-LD entity');
+      assert.ok(og.length >= 1, 'Should have OG entity');
+      assert.ok(meta.length >= 1, 'Should have meta entity');
+    });
+
+    it('should handle page with only meta tags (no JSON-LD, no OG)', () => {
+      const html = `<html><head>
+        <meta name="author" content="Simple Author">
+        <meta name="description" content="A simple page">
+        <title>Simple Page</title>
+      </head><body><p>Content</p></body></html>`;
+      const result = extractStructuredDataEntities(html, 'https://example.com');
+      assert.ok(result.length >= 1);
+      assert.ok(result.some(e => e.extractor === 'meta'));
+    });
+  });
+});
+
+// ─── Entity Matcher / Dedup: Comprehensive ────────────────────────
+
+describe('Entity Matcher - Comprehensive Dedup', () => {
+
+  function dedup(rawEntities: any[], confidenceThreshold: number) {
+    const seen = new Map();
+    const dedupedEntities: any[] = [];
+
+    for (const entity of rawEntities) {
+      if (entity.confidence < confidenceThreshold) continue;
+      const key = `${normalizeName(entity.name)}:${entity.entityType}`;
+      if (seen.has(key)) {
+        const existing = seen.get(key);
+        if (entity.confidence > existing.confidence) {
+          seen.set(key, entity);
+          const idx = dedupedEntities.findIndex(e =>
+            `${normalizeName(e.name)}:${e.entityType}` === key
+          );
+          if (idx >= 0) dedupedEntities[idx] = entity;
+        }
+      } else {
+        seen.set(key, entity);
+        dedupedEntities.push(entity);
+      }
+    }
+    return dedupedEntities;
+  }
+
+  it('should deduplicate a large batch of 20+ entities with many duplicates', () => {
const entities = [ 2994 + // Person appears from multiple extractors 2995 + { name: 'John Smith', entityType: 'person', confidence: 0.7, extractor: 'meta' }, 2996 + { name: 'John Smith', entityType: 'person', confidence: 0.9, extractor: 'opengraph' }, 2997 + { name: 'John Smith', entityType: 'person', confidence: 1.0, extractor: 'json-ld' }, 2998 + { name: 'john smith', entityType: 'person', confidence: 0.85, extractor: 'regex' }, 2999 + // Organization 3000 + { name: 'Acme Corp', entityType: 'organization', confidence: 0.7, extractor: 'meta' }, 3001 + { name: 'Acme Corp', entityType: 'organization', confidence: 1.0, extractor: 'json-ld' }, 3002 + { name: 'acme corp', entityType: 'organization', confidence: 0.85, extractor: 'opengraph' }, 3003 + // Different people 3004 + { name: 'Alice Johnson', entityType: 'person', confidence: 0.95, extractor: 'microformats' }, 3005 + { name: 'Alice Johnson', entityType: 'person', confidence: 0.7, extractor: 'meta' }, 3006 + { name: 'Bob Williams', entityType: 'person', confidence: 0.9, extractor: 'json-ld' }, 3007 + { name: 'Carol Davis', entityType: 'person', confidence: 0.8, extractor: 'opengraph' }, 3008 + // Events 3009 + { name: 'Annual Conference', entityType: 'event', confidence: 0.95, extractor: 'microformats' }, 3010 + { name: 'Annual Conference', entityType: 'event', confidence: 1.0, extractor: 'json-ld' }, 3011 + { name: 'Summer Meetup', entityType: 'event', confidence: 0.9, extractor: 'json-ld' }, 3012 + // Places 3013 + { name: 'San Francisco, CA, US', entityType: 'place', confidence: 0.9, extractor: 'microformats' }, 3014 + { name: 'san francisco, ca, us', entityType: 'place', confidence: 0.8, extractor: 'json-ld' }, 3015 + // More orgs 3016 + { name: 'Google', entityType: 'organization', confidence: 1.0, extractor: 'json-ld' }, 3017 + { name: 'google', entityType: 'organization', confidence: 0.85, extractor: 'opengraph' }, 3018 + { name: 'Microsoft', entityType: 'organization', confidence: 0.9, extractor: 
'json-ld' }, 3019 + { name: 'Apple', entityType: 'organization', confidence: 0.95, extractor: 'json-ld' }, 3020 + // Creative works 3021 + { name: 'My Article', entityType: 'creative_work', confidence: 1.0, extractor: 'json-ld' }, 3022 + { name: 'My Article', entityType: 'creative_work', confidence: 0.7, extractor: 'meta' }, 3023 + ]; 3024 + const result = dedup(entities, 0.5); 3025 + // Unique name:type combos: John Smith:person, Acme Corp:org, Alice Johnson:person, 3026 + // Bob Williams:person, Carol Davis:person, Annual Conference:event, Summer Meetup:event, 3027 + // San Francisco:place, Google:org, Microsoft:org, Apple:org, My Article:creative_work 3028 + assert.strictEqual(result.length, 12); 3029 + 3030 + // Check highest confidence kept 3031 + const john = result.find(e => normalizeName(e.name) === 'john smith' && e.entityType === 'person'); 3032 + assert.ok(john); 3033 + assert.strictEqual(john!.confidence, 1.0); 3034 + assert.strictEqual(john!.extractor, 'json-ld'); 3035 + 3036 + const acme = result.find(e => normalizeName(e.name) === 'acme corp' && e.entityType === 'organization'); 3037 + assert.ok(acme); 3038 + assert.strictEqual(acme!.confidence, 1.0); 3039 + }); 3040 + 3041 + it('should handle entities with similar names that are actually different', () => { 3042 + const entities = [ 3043 + { name: 'Dr. 
Smith', entityType: 'person', confidence: 0.9 }, 3044 + { name: 'Smith', entityType: 'person', confidence: 0.8 }, 3045 + { name: 'John Smith', entityType: 'person', confidence: 0.95 }, 3046 + { name: 'Jane Smith', entityType: 'person', confidence: 0.9 }, 3047 + ]; 3048 + const result = dedup(entities, 0.5); 3049 + // All four have different normalized names 3050 + assert.strictEqual(result.length, 4); 3051 + }); 3052 + 3053 + it('should handle mixed entity types all deduped together', () => { 3054 + const entities = [ 3055 + { name: 'Apple', entityType: 'organization', confidence: 0.9 }, 3056 + { name: 'Apple', entityType: 'product', confidence: 0.8 }, 3057 + { name: 'Apple', entityType: 'organization', confidence: 0.95 }, 3058 + { name: 'Apple', entityType: 'product', confidence: 0.85 }, 3059 + ]; 3060 + const result = dedup(entities, 0.5); 3061 + assert.strictEqual(result.length, 2); // one org, one product 3062 + const org = result.find(e => e.entityType === 'organization'); 3063 + assert.strictEqual(org!.confidence, 0.95); 3064 + const product = result.find(e => e.entityType === 'product'); 3065 + assert.strictEqual(product!.confidence, 0.85); 3066 + }); 3067 + 3068 + it('should handle confidence threshold exactly at entity confidence', () => { 3069 + const entities = [ 3070 + { name: 'Exact Match', entityType: 'person', confidence: 0.5 }, 3071 + ]; 3072 + // confidence < threshold means filtered out. 0.5 < 0.5 is false, so it stays. 
3073 + const result = dedup(entities, 0.5); 3074 + assert.strictEqual(result.length, 1); 3075 + }); 3076 + 3077 + it('should filter entity just below threshold', () => { 3078 + const entities = [ 3079 + { name: 'Just Below', entityType: 'person', confidence: 0.499 }, 3080 + ]; 3081 + const result = dedup(entities, 0.5); 3082 + assert.strictEqual(result.length, 0); 3083 + }); 3084 + 3085 + it('should handle dedup with diacritics normalization', () => { 3086 + const entities = [ 3087 + { name: 'José García', entityType: 'person', confidence: 0.7 }, 3088 + { name: 'Jose Garcia', entityType: 'person', confidence: 0.95 }, 3089 + { name: 'JOSE GARCIA', entityType: 'person', confidence: 0.8 }, 3090 + ]; 3091 + const result = dedup(entities, 0.5); 3092 + assert.strictEqual(result.length, 1); 3093 + assert.strictEqual(result[0].confidence, 0.95); 3094 + }); 3095 + 3096 + it('should handle dedup with extra whitespace in names', () => { 3097 + const entities = [ 3098 + { name: ' John Doe ', entityType: 'person', confidence: 0.7 }, 3099 + { name: 'John Doe', entityType: 'person', confidence: 0.9 }, 3100 + ]; 3101 + const result = dedup(entities, 0.5); 3102 + assert.strictEqual(result.length, 1); 3103 + assert.strictEqual(result[0].confidence, 0.9); 3104 + }); 3105 + 3106 + it('should preserve order (first occurrence position)', () => { 3107 + const entities = [ 3108 + { name: 'First', entityType: 'person', confidence: 0.8 }, 3109 + { name: 'Second', entityType: 'person', confidence: 0.7 }, 3110 + { name: 'Third', entityType: 'person', confidence: 0.9 }, 3111 + ]; 3112 + const result = dedup(entities, 0.5); 3113 + assert.strictEqual(result[0].name, 'First'); 3114 + assert.strictEqual(result[1].name, 'Second'); 3115 + assert.strictEqual(result[2].name, 'Third'); 3116 + }); 3117 + 3118 + it('should handle single entity', () => { 3119 + const entities = [ 3120 + { name: 'Sole Entity', entityType: 'organization', confidence: 0.9 }, 3121 + ]; 3122 + const result = dedup(entities, 
0.5); 3123 + assert.strictEqual(result.length, 1); 3124 + }); 3125 + }); 3126 + 3127 + // ─── normalizeName: Comprehensive ───────────────────────────────── 3128 + 3129 + describe('normalizeName - Comprehensive', () => { 3130 + it('should handle CJK characters (preserve them)', () => { 3131 + const result = normalizeName('田中太郎'); 3132 + assert.strictEqual(result, '田中太郎'); 3133 + }); 3134 + 3135 + it('should handle mixed CJK and Latin characters', () => { 3136 + const result = normalizeName('John 田中'); 3137 + assert.strictEqual(result, 'john 田中'); 3138 + }); 3139 + 3140 + it('should handle Korean characters (NFD decomposed)', () => { 3141 + const result = normalizeName('김철수'); 3142 + // NFD decomposes Hangul syllables into jamo, so the result won't 3143 + // equal the original composed form. Just verify it's non-empty and stable. 3144 + assert.ok(result.length > 0); 3145 + assert.strictEqual(result, normalizeName('김철수')); 3146 + }); 3147 + 3148 + it('should handle emoji in names (preserve them)', () => { 3149 + const result = normalizeName('John 🎉 Smith'); 3150 + assert.strictEqual(result, 'john 🎉 smith'); 3151 + }); 3152 + 3153 + it('should handle names with only emoji', () => { 3154 + const result = normalizeName('🎭🎪'); 3155 + assert.strictEqual(result, '🎭🎪'); 3156 + }); 3157 + 3158 + it('should handle very long names (100+ characters)', () => { 3159 + const longName = 'A'.repeat(150); 3160 + const result = normalizeName(longName); 3161 + assert.strictEqual(result, 'a'.repeat(150)); 3162 + assert.strictEqual(result.length, 150); 3163 + }); 3164 + 3165 + it('should handle names with HTML entities literally (not decode them)', () => { 3166 + // normalizeName just does string operations, it doesn't decode HTML 3167 + const result = normalizeName('Smith &amp; Jones'); 3168 + assert.strictEqual(result, 'smith &amp; jones'); 3169 + }); 3170 + 3171 + it('should handle names with numbers', () => { 3172 + const result = normalizeName('Agent 007'); 3173 + 
assert.strictEqual(result, 'agent 007'); 3174 + }); 3175 + 3176 + it('should handle names with special characters', () => { 3177 + assert.strictEqual(normalizeName('Müller'), 'muller'); 3178 + assert.strictEqual(normalizeName('Łukasz'), 'łukasz'); // Polish L is not a combining mark 3179 + assert.strictEqual(normalizeName('Ñoño'), 'nono'); // Spanish ñ 3180 + }); 3181 + 3182 + it('should handle names with multiple types of whitespace', () => { 3183 + const result = normalizeName('John\t\t \n\r Doe'); 3184 + assert.strictEqual(result, 'john doe'); 3185 + }); 3186 + 3187 + it('should handle Cyrillic characters', () => { 3188 + const result = normalizeName('Иван Петров'); 3189 + assert.strictEqual(result, 'иван петров'); 3190 + }); 3191 + 3192 + it('should handle Arabic characters', () => { 3193 + const result = normalizeName('محمد علي'); 3194 + assert.strictEqual(result, 'محمد علي'); 3195 + }); 3196 + 3197 + it('should handle string with only whitespace', () => { 3198 + assert.strictEqual(normalizeName(' \t\n '), ''); 3199 + }); 3200 + 3201 + it('should handle French diacritics comprehensively', () => { 3202 + assert.strictEqual(normalizeName('François Lemaître'), 'francois lemaitre'); 3203 + assert.strictEqual(normalizeName('Hélène Bézier'), 'helene bezier'); 3204 + }); 3205 + 3206 + it('should handle Scandinavian diacritics', () => { 3207 + assert.strictEqual(normalizeName('Ångström'), 'angstrom'); 3208 + }); 3209 + 3210 + it('should handle German umlauts', () => { 3211 + assert.strictEqual(normalizeName('Über Straße'), 'uber straße'); 3212 + // Note: ß is not a combining mark, so it stays 3213 + }); 3214 + 3215 + it('should handle Vietnamese diacritics', () => { 3216 + assert.strictEqual(normalizeName('Nguyễn Văn'), 'nguyen van'); 3217 + }); 3218 + 3219 + it('should handle name with dots and hyphens', () => { 3220 + assert.strictEqual(normalizeName('Dr. Mary-Jane O\'Brien'), "dr. 
mary-jane o'brien"); 3221 + }); 3222 + 3223 + it('should handle non-string input types', () => { 3224 + // normalizeName only guards for falsy values (null, undefined, '', 0, false) 3225 + // It does not guard against non-string truthy values like 123, {}, [] 3226 + // which will throw TypeError. Only test the falsy cases it handles. 3227 + assert.strictEqual(normalizeName(0 as any), ''); 3228 + assert.strictEqual(normalizeName(false as any), ''); 3229 + }); 3230 + }); 1077 3231 });
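Note for reviewers: `normalizeName` itself is not part of this diff, but the assertions above constrain it fairly tightly. Below is a minimal sketch that would satisfy them, assuming NFD decomposition followed by stripping the combining diacritical marks block (U+0300 to U+036F). The body is a reconstruction for illustration, not the project's actual code.

```typescript
// Illustrative reconstruction only: inferred from the test expectations,
// not copied from the project's source.
function normalizeName(name: string): string {
  // Falsy guard: null, undefined, '', 0 and false all normalize to ''.
  // Truthy non-strings (123, {}, []) are not guarded and throw a TypeError
  // when .normalize is called on them.
  if (!name) return '';
  return name
    .normalize('NFD')                 // decompose: 'é' becomes 'e' + U+0301
    .replace(/[\u0300-\u036f]/g, '')  // strip combining diacritical marks
    .toLowerCase()
    .replace(/\s+/g, ' ')             // collapse any whitespace run to one space
    .trim();
}

console.log(normalizeName('José García')); // jose garcia
console.log(normalizeName('Über Straße')); // uber straße
console.log(normalizeName('Łukasz'));      // łukasz
```

Letters such as `ß` and `Ł` survive because they have no canonical decomposition, and Hangul changes form because NFD splits syllables into jamo that fall outside the stripped range. Under this sketch, Cyrillic `й` would lose its breve (it decomposes to `и` + U+0306), a case the tests above do not exercise.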
+43 -11
backend/electron/ipc.ts
···
   // Check if this is a web page that will use the transparent canvas
   const isWebPage = url.startsWith('http://') || url.startsWith('https://');

+  // Determine if this web page should use the fullscreen transparent canvas model.
+  // Canvas pages get a fullscreen transparent BrowserWindow where all UI elements
+  // (webview, navbar, resize handle) are positioned via JS on an invisible surface.
+  // Non-canvas web pages (modals, quick-views, overlays) load their URL directly.
+  const isModalOrQuickView = options.modal === true || options.overlay === true;
+  const hasNonContentRole = options.role && !['content', 'child-content'].includes(options.role as string);
+  const useCanvas = isWebPage && !isModalOrQuickView && !hasNonContentRole;
+
   // Use profile-specific session for isolation
   const profileSession = getProfileSession();

   // Prepare browser window options
   const winOptions: Electron.BrowserWindowConstructorOptions = {
-    frame: isWebPage ? false : frameDefault, // Web pages use transparent canvas, no frame
+    frame: isWebPage ? false : frameDefault, // Web pages are always frameless
     ...options,
     width: parseInt(options.width) || APP_DEF_WIDTH,
     height: parseInt(options.height) || APP_DEF_HEIGHT,
     show: isHeadless() ? false : options.show !== false,
-    // Web pages use transparent fullscreen canvas
-    transparent: isWebPage ? true : (options.transparent || false),
-    backgroundColor: (isWebPage || options.transparent) ? undefined : getSystemThemeBackgroundColor(),
+    // Only canvas pages need transparency (for the invisible positioning surface)
+    transparent: useCanvas ? true : (options.transparent || false),
+    backgroundColor: (useCanvas || options.transparent) ? undefined : getSystemThemeBackgroundColor(),
     webPreferences: {
       ...options.webPreferences,
       preload: getPreloadPath(),
       session: profileSession,
-      webviewTag: true // Enable webview for peek://page container
+      webviewTag: useCanvas ? true : (options.webPreferences?.webviewTag || false),
     }
   };

···
   });

   try {
-    // Route http/https URLs through peek://page container
-    // This allows IZUI to manage web pages uniformly
+    // Route canvas web pages through peek://page container.
+    // Canvas pages use a fullscreen transparent window where all UI elements
+    // are positioned via JS. Non-canvas web pages (modals, slides) load directly.
     let loadUrl = url;
-    if (isWebPage) {
+    if (useCanvas) {
       // Pass position/size to the page container so it can position the webview
       // Use the calculated center position (set above) as default
       const pageParams = new URLSearchParams({
···
       });
       loadUrl = `peek://app/page/index.html?${pageParams.toString()}`;
       DEBUG && console.log('Routing web page through peek://app/page:', url, '->', loadUrl);
+
+      // Set up fullscreen transparent canvas — size the window to cover the display
+      // so the page container JS can position elements anywhere on screen
+      if (!isHeadless()) {
+        const display = screen.getDisplayNearestPoint({ x: winOptions.x ?? 0, y: winOptions.y ?? 0 });
+        const { width: sw, height: sh } = display.workAreaSize;
+        const { x: dx, y: dy } = display.workArea;
+        win.setSize(sw, sh);
+        win.setPosition(dx, dy);
+        win.setBackgroundColor('#00000000');
+        DEBUG && console.log('Canvas setup: display', display.id, 'at', dx, dy, 'size', sw, sh);
+      }
     }

     // Add to window manager with modal parameter
···
   //
   // Also adds Cmd+L interception on guest webContents for the floating navbar.
   // (Keystrokes inside the webview never reach the host's before-input-event.)
-  if (url.startsWith('http://') || url.startsWith('https://')) {
+  // Only canvas pages have a <webview> guest — set up popup/Cmd+L handlers
+  if (useCanvas) {
     win.webContents.on('did-attach-webview', (_event, guestWebContents) => {
       console.log(`[webview-popup] Guest webContents attached to window ${win.id}, adding setWindowOpenHandler + Cmd+L`);

···
       // Use profile-specific session for isolation
       const profileSession = getProfileSession();

-      // Create the new BrowserWindow (page container)
+      // Create the new BrowserWindow (page container) with fullscreen canvas
       const popupWin = new BrowserWindow({
         frame: false,
         width: 1024,
···
         },
       });

+      // Set up fullscreen transparent canvas for the popup
+      if (!isHeadless()) {
+        const popupDisplay = screen.getDisplayNearestPoint({ x: parentBounds.x + 30, y: parentBounds.y + 30 });
+        const { width: psw, height: psh } = popupDisplay.workAreaSize;
+        const { x: pdx, y: pdy } = popupDisplay.workArea;
+        popupWin.setSize(psw, psh);
+        popupWin.setPosition(pdx, pdy);
+        popupWin.setBackgroundColor('#00000000');
+      }
+
       // Register in window manager with proper IZUI role
       const popupParams: Record<string, unknown> = {
         address: popupUrl,
···
   // The primary Cmd+L handler is on the webview guest (added via did-attach-webview above).
   // This host-level handler fires when the page container's own DOM has focus (rare,
   // but possible before the webview loads or if the user clicks outside the webview).
-  if (url.startsWith('http://') || url.startsWith('https://')) {
+  if (useCanvas) {
     win.webContents.on('before-input-event', (event, input) => {
       if (input.type !== 'keyDown' || !input.meta) return;
       if (input.key === 'l') {