Add comprehensive optimization plan based on C QuickJS analysis · anil.recoil.org/unpac-myspace-ocaml@f0f3ac2

+503

1 changed file

OPTIMIZATION_PLAN.md

# OCaml QuickJS Optimization Plan

Based on analysis of C QuickJS optimization techniques and the current OCaml implementation.

## Executive Summary

The OCaml implementation is ~640x faster for recursive function calls but ~500-1100x slower for array-intensive operations. The main optimization opportunities are:

1. **Value Representation** - Current ADT is clean but slow
2. **Object Property Access** - No shapes, uses Hashtbl + list
3. **Array Implementation** - No fast path, heavy allocations
4. **Bytecode Dispatch** - Pattern matching vs computed goto
5. **String Handling** - No interning, no rope concatenation

---

## Phase 1: Quick Wins (Low Effort, High Impact)

### 1.1 Unboxed Integer Fast Path

**Problem**: Every `Int` value is a heap-allocated variant block that points to a boxed `int32`.

**Current**:
```ocaml
type value =
  | Int of int32
  | Float of float
  | ...
```

**Solution**: Add an inlined fast path for the `Int`/`Int` case, and use OCaml's unboxed types where possible:
```ocaml
(* For hot paths in arithmetic *)
let[@inline] fast_add a b =
  match a, b with
  | Int ai, Int bi ->
    (* Add in native int (63-bit on 64-bit platforms) so int32 overflow is detectable *)
    let sum = Int32.to_int ai + Int32.to_int bi in
    if sum >= -0x8000_0000 && sum <= 0x7FFF_FFFF then Int (Int32.of_int sum)
    else Float (float_of_int sum) (* Overflow: fall back to a double, as JS requires *)
  | _ -> slow_add a b
```

**Impact**: 2-3x faster arithmetic loops
**Effort**: Low

### 1.2 Array Fast Path

**Problem**: Array access goes through pattern matching and ref dereferencing.

**Current**:
```ocaml
| Object { data = Data_array arr; _ } ->
  let idx = Int32.to_int (to_int32 index) in
  if idx >= 0 && idx < Array.length !arr then !arr.(idx)
```

**Solution**: Add a `fast_array` flag and an inlined bounds check:
```ocaml
type js_object = {
  ...
  mutable fast_array : bool;          (* True if array access is fast *)
  mutable array_values : value array; (* Direct array storage *)
  mutable array_count : int;          (* Actual element count *)
}

let[@inline] get_array_el_fast obj idx =
  if obj.fast_array && idx >= 0 && idx < obj.array_count then
    Array.unsafe_get obj.array_values idx
  else
    get_array_el_slow obj idx
```

**Impact**: 10-50x faster array access
**Effort**: Medium

### 1.3 Local Variable Access Optimization

**Problem**: Every local access goes through an array bounds check.

**Current**:
```ocaml
let get_local frame idx =
  if idx < Array.length frame.locals then frame.locals.(idx)
  else Undefined
```

**Solution**: Use unsafe access for indices already verified by the bytecode compiler:
```ocaml
let[@inline] get_local frame idx =
  (* Safe only if the bytecode compiler guarantees idx < Array.length frame.locals *)
  Array.unsafe_get frame.locals idx
```

**Impact**: 1.5x faster function execution
**Effort**: Low

---

## Phase 2: Medium Effort Optimizations

### 2.1 Atom Table (String Interning)

**Problem**: String comparison is O(n), and property lookup hashes strings repeatedly.
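
To make the repeated hashing concrete, here is a minimal sketch that counts how often a string key is hashed during lookups. The names `hash_calls`, `Counting_key`, `Props` and `lookup_hashes` are hypothetical and exist only for this illustration; the plan's actual property table is a plain `Hashtbl`, which hashes keys the same way but offers no hook for counting.

```ocaml
(* Counts every hash computation performed on a property key. *)
let hash_calls = ref 0

module Counting_key = struct
  type t = string
  let equal = String.equal
  let hash (s : string) = incr hash_calls; Hashtbl.hash s
end

(* A string-keyed table, standing in for obj.properties. *)
module Props = Hashtbl.Make (Counting_key)

let props : int Props.t = Props.create 16
let () = Props.add props "length" 0

(* Number of hash computations triggered by three identical lookups. *)
let lookup_hashes =
  let before = !hash_calls in
  for _i = 1 to 3 do
    ignore (Props.find_opt props "length")
  done;
  !hash_calls - before
```

Each lookup rehashes the same key, so `lookup_hashes` grows linearly with the number of accesses. An interned key pays the string hash once, at interning time.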

**Current**:
```ocaml
Hashtbl.find_opt obj.properties name (* Hashes 'name' every time *)
```

**Solution**: Implement a global atom table:
```ocaml
module Atom : sig
  type t = private int
  val of_string : string -> t
  val to_string : t -> string
  val equal : t -> t -> bool (* O(1) integer comparison *)
end = struct
  type t = int
  let table : (string, int) Hashtbl.t = Hashtbl.create 1024
  let reverse : string array ref = ref [||]
  let counter = ref 0

  let of_string s =
    match Hashtbl.find_opt table s with
    | Some id -> id
    | None ->
      let id = !counter in
      incr counter;
      Hashtbl.add table s id;
      (* Grow reverse table *)
      if id >= Array.length !reverse then begin
        let new_arr = Array.make (max 16 (id * 2)) "" in
        Array.blit !reverse 0 new_arr 0 (Array.length !reverse);
        reverse := new_arr
      end;
      (!reverse).(id) <- s;
      id

  let to_string id = (!reverse).(id)
  let equal a b = a = b
end

(* Property storage uses atoms *)
type js_object = {
  mutable properties : (Atom.t, property) Hashtbl.t;
  ...
}
```

**Impact**: 2-5x faster property access
**Effort**: Medium (requires changing property storage throughout)

### 2.2 Shape System for Objects

**Problem**: Each object has its own property hashtable, with no sharing between objects of the same layout.

**Current**:
```ocaml
type js_object = {
  mutable properties : (string, property) Hashtbl.t;
  mutable property_order : string list;
  ...
}
```

**Solution**: Implement shapes (hidden classes):
```ocaml
type shape = {
  id : int;
  props : shape_property array;                    (* Property descriptors *)
  prop_hash : (Atom.t, int) Hashtbl.t;             (* Atom -> index *)
  mutable transitions : (Atom.t, shape) Hashtbl.t; (* For shape transitions *)
  proto : value;
}

and shape_property = {
  atom : Atom.t;
  flags : prop_flags;
  offset : int; (* Index into object's prop_values array *)
}

type js_object = {
  mutable shape : shape;
  mutable prop_values : value array; (* Dense property values *)
  ...
}

(* Fast property lookup *)
let[@inline] get_own_property_fast obj atom =
  match Hashtbl.find_opt obj.shape.prop_hash atom with
  | Some idx -> Some (Array.unsafe_get obj.prop_values idx)
  | None -> None

(* Adding a property transitions to new shape *)
let add_property obj atom value =
  let new_shape = get_or_create_transition obj.shape atom in
  let new_values = Array.append obj.prop_values [| value |] in
  obj.shape <- new_shape;
  obj.prop_values <- new_values
```

**Impact**: 3-10x faster property access, reduced memory
**Effort**: High

### 2.3 String Rope for Concatenation

**Problem**: String concatenation creates a new string every time.

**Current**:
```ocaml
| `Add ->
  (match a, b with
   | String _, _ | _, String _ -> String (to_string a ^ to_string b)
   | _ -> ...)
```

**Solution**: Implement a rope data structure:
```ocaml
type js_string =
  | Flat of string
  | Rope of { left : js_string; right : js_string; len : int; depth : int }

let length = function Flat s -> String.length s | Rope { len; _ } -> len
let depth = function Flat _ -> 0 | Rope { depth; _ } -> depth

(* Append the characters of a rope to buf, left to right *)
let rec flatten_into buf = function
  | Flat s -> Buffer.add_string buf s
  | Rope { left; right; _ } ->
    flatten_into buf left;
    flatten_into buf right

let flatten s =
  match s with
  | Flat str -> str
  | Rope { len; _ } ->
    let buf = Buffer.create len in
    flatten_into buf s;
    Buffer.contents buf

let concat a b =
  match a, b with
  | Flat "", s | s, Flat "" -> s
  | Flat a, Flat b when String.length a + String.length b < 512 ->
    Flat (a ^ b) (* Small strings: flatten immediately *)
  | _ ->
    let len = length a + length b in
    let depth = 1 + max (depth a) (depth b) in
    let rope = Rope { left = a; right = b; len; depth } in
    (* Depth limit keeps flatten_into's recursion shallow *)
    if depth > 60 then Flat (flatten rope) else rope
```

**Impact**: 10-100x faster repeated concatenation
**Effort**: Medium

---

## Phase 3: Major Architectural Changes

### 3.1 NaN Boxing (Optional)

**Problem**: Every value that carries a payload (`Int`, `Float`, ...) is a heap-allocated OCaml variant.

**Current**: Each such value is a pointer to a tagged heap block holding the variant data.
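
As a rough check of the representation described above, the sketch below uses the standard `Obj` module to ask whether a value is an immediate or a pointer to a block. The `value` type here is a cut-down, hypothetical stand-in for the interpreter's real type (its `Undefined`, `Null` and `Bool` constructors are assumptions), and `is_boxed` is an illustrative helper, not part of the codebase.

```ocaml
(* Cut-down stand-in for the interpreter's value type (assumed constructors). *)
type value =
  | Undefined
  | Null
  | Bool of bool
  | Int of int32
  | Float of float

(* True when the value is represented as a pointer to a block rather than
   as an immediate integer. Obj is for inspection only, not production code. *)
let is_boxed (v : value) : bool = Obj.is_block (Obj.repr v)

let boxed_report =
  [ "Undefined", is_boxed Undefined;
    "Null", is_boxed Null;
    "Bool true", is_boxed (Bool true);
    "Int 1l", is_boxed (Int 1l);
    "Float 1.5", is_boxed (Float 1.5) ]
```

Constant constructors such as `Undefined` and `Null` come out as immediates, while every constructor with an argument is a block, so the allocation cost falls on numbers, booleans and the other payload-carrying variants.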

**Solution**: Pack values into 64-bit integers using NaN boxing:
```ocaml
(* WARNING: This is unsafe and loses type safety *)
type value = int64 (* Raw 64-bit value *)

(* Tags in the upper 16 bits of the 64-bit word *)
let tag_int    = 0x0001_0000_0000_0000L
let tag_bool   = 0x0002_0000_0000_0000L
let tag_null   = 0x0003_0000_0000_0000L
let tag_undef  = 0x0004_0000_0000_0000L
let tag_object = 0x0005_0000_0000_0000L
let tag_string = 0x0006_0000_0000_0000L

let[@inline] get_tag v = Int64.logand v 0xFFFF_0000_0000_0000L
let[@inline] is_int v = Int64.equal (get_tag v) tag_int
let[@inline] get_int v = Int64.to_int32 v (* Low 32 bits *)
let[@inline] make_int i =
  (* Mask to 32 bits so sign extension of negative ints cannot clobber the tag *)
  Int64.logor tag_int (Int64.logand (Int64.of_int32 i) 0xFFFF_FFFFL)
```

**Impact**: 2-5x faster overall, reduced memory
**Effort**: Very High (rewrites entire value system)
**Risk**: Loses OCaml's type safety. `int64` values are themselves boxed when stored in records or arrays, so the gain is not guaranteed.

### 3.2 Bytecode Dispatch Optimization

**Problem**: Pattern matching compiles to a jump table, not direct dispatch.

**Current**:
```ocaml
match op with
| OP_push_i32 -> ...
| OP_add -> ...
```

**Solution**: Use an array of handler functions (a dispatch table):
```ocaml
type handler = frame -> unit

let handlers : handler array = Array.make 256 (fun _ -> ())

let () =
  handlers.(Opcode.to_int OP_push_i32) <- (fun frame ->
    let v = read_i32 frame.func.bytecode frame.pc in
    frame.pc <- frame.pc + 4;
    push_value frame (Int v));
  handlers.(Opcode.to_int OP_add) <- (fun frame ->
    let b = pop_value frame in
    let a = pop_value frame in
    push_value frame (binary_add a b));
  (* ... more handlers ... *)

let[@inline] dispatch frame =
  let op = Char.code (Bytes.unsafe_get frame.func.bytecode frame.pc) in
  frame.pc <- frame.pc + 1;
  (Array.unsafe_get handlers op) frame
```

**Impact**: 1.3-2x faster bytecode execution
**Effort**: Medium

### 3.3 Inline Caching for Property Access

**Problem**: Every property access does a full lookup.

**Solution**: Cache the shape and offset at each property access site:
```ocaml
type inline_cache = {
  mutable cached_shape : shape option;
  mutable cached_offset : int;
}

let get_property_cached obj atom cache =
  match cache.cached_shape with
  | Some shape when shape == obj.shape ->
    (* Cache hit: direct access *)
    Array.unsafe_get obj.prop_values cache.cached_offset
  | _ ->
    (* Cache miss: full lookup and update cache *)
    match Hashtbl.find_opt obj.shape.prop_hash atom with
    | Some offset ->
      cache.cached_shape <- Some obj.shape;
      cache.cached_offset <- offset;
      Array.unsafe_get obj.prop_values offset
    | None ->
      (* Prototype chain lookup *)
      get_property_slow obj atom
```

**Impact**: 5-20x faster repeated property access (hot loops)
**Effort**: High

---

## Phase 4: Memory Optimizations

### 4.1 Property Value Compaction

**Problem**: Property flags are stored with every property of every object.

**Solution**: Store flags in the shape and values in a dense array:
```ocaml
(* Shape stores property metadata *)
type shape_property = {
  atom : Atom.t;
  flags : int; (* Packed: writable | enumerable<<1 | configurable<<2 *)
}

(* Object stores only values *)
type js_object = {
  shape : shape;
  prop_values : value array; (* Dense, indexed by shape offset *)
}
```

**Impact**: 30-50% less memory per object
**Effort**: Medium (part of shape system)

### 4.2 Small Object Optimization

**Problem**: Objects with few properties still allocate a hashtable.

**Solution**: Inline storage for small objects:
```ocaml
type js_object = {
  mutable shape : shape;
  (* Inline storage for first 4 properties *)
  mutable inline0 : value;
  mutable inline1 : value;
  mutable inline2 : value;
  mutable inline3 : value;
  (* Overflow to heap array *)
  mutable overflow : value array option;
}
```

**Impact**: 2x faster small object creation, less GC pressure
**Effort**: Medium

### 4.3 Stack Allocation for Frames

**Problem**: Every function call allocates a new frame.
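
One way to see this cost is to measure minor-heap allocation around a loop that builds a fresh frame per call. The sketch below is an assumption-laden illustration: the `frame` record is a hypothetical three-field stand-in for the real frame type, and `make_frame`, `minor_words_during`, `calls` and `allocated` are names invented here. Only `Gc.minor_words` and `Sys.opaque_identity` are standard library functions.

```ocaml
(* Hypothetical, simplified frame; the real record has more fields. *)
type frame = { mutable pc : int; mutable sp : int; locals : int array }

(* Allocates one record plus one locals array per call. *)
let make_frame n_locals = { pc = 0; sp = 0; locals = Array.make n_locals 0 }

(* Words allocated on the minor heap while running f. *)
let minor_words_during f =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

let calls = 1_000

let allocated =
  minor_words_during (fun () ->
    for _i = 1 to calls do
      (* opaque_identity keeps the optimizer from discarding the allocation *)
      ignore (Sys.opaque_identity (make_frame 4))
    done)
```

With four locals each frame costs a 4-word record and a 5-word array (headers included), so `allocated` should be at least `9 * calls` words. A pooled frame would bring the per-call figure close to zero.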

**Current**:
```ocaml
let frame = make_frame ~func ~args ~this_val ~new_target ~var_refs in
push_frame ctx frame
```

**Solution**: Pre-allocate a frame pool or use a stack:
```ocaml
type frame_stack = {
  mutable frames : frame array;
  mutable depth : int;
  mutable max_depth : int;
}

let push_frame stack =
  if stack.depth >= stack.max_depth then
    grow_frame_stack stack;
  let frame = Array.unsafe_get stack.frames stack.depth in
  stack.depth <- stack.depth + 1;
  frame

let pop_frame stack =
  stack.depth <- stack.depth - 1
```

**Impact**: 2-5x faster function calls
**Effort**: Medium

---

## Prioritized Implementation Order

### Sprint 1: Quick Wins (1-2 weeks)
1. [ ] Unsafe array access for verified bounds
2. [ ] Inline hot arithmetic operations
3. [ ] Fast path for small integers (0-7)
4. [ ] Cache array length in typed arrays

### Sprint 2: Array Optimization (2-3 weeks)
1. [ ] Add fast_array flag to objects
2. [ ] Implement direct array storage
3. [ ] Exponential growth strategy (3/2 factor)
4. [ ] Fast typed array access

### Sprint 3: String Optimization (2-3 weeks)
1. [ ] Implement atom table
2. [ ] Convert property keys to atoms
3. [ ] Implement string rope
4. [ ] Optimize string concatenation in loops

### Sprint 4: Object Optimization (3-4 weeks)
1. [ ] Design shape data structure
2. [ ] Implement shape transitions
3. [ ] Convert object storage to shape-based
4. [ ] Add inline caching for property access

### Sprint 5: Execution Optimization (2-3 weeks)
1. [ ] Implement handler array dispatch
2. [ ] Pre-allocate frame pool
3. [ ] Optimize local variable access
4. [ ] Add fast path for common opcodes

---

## Benchmark Targets

| Workload | Current | Target | Improvement |
|----------|---------|--------|-------------|
| Fibonacci | 170 μs | 100 μs | 1.7x |
| Prime sieve | 68 ms | 15 ms | 4.5x |
| Array sort | 4.7 s | 50 ms | 94x |
| Object property | 15 ms | 3 ms | 5x |
| String concat | 3.5 ms | 500 μs | 7x |

---

## Risk Assessment

| Optimization | Impact | Effort | Risk | Priority |
|--------------|--------|--------|------|----------|
| Unsafe array access | Low | Low | Low | P0 |
| Fast array flag | High | Medium | Low | P0 |
| Atom table | Medium | Medium | Low | P1 |
| String rope | High | Medium | Low | P1 |
| Shape system | Very High | High | Medium | P1 |
| Inline caching | High | High | Medium | P2 |
| NaN boxing | Very High | Very High | High | P3 |
| Bytecode dispatch | Medium | Medium | Low | P2 |

---

## Appendix: C QuickJS Key Optimizations

### From quickjs.c analysis:

1. **NaN Boxing** (lines 144-213): Values packed into 64-bit using IEEE 754 NaN encoding
2. **Shapes** (lines 909-924): Shared property descriptors with hash table
3. **Fast Arrays** (line 943): `fast_array` flag for O(1) indexed access
4. **Inline Property Lookup** (lines 5699-5741): Hash chain walking inlined
5. **String Interning** (lines 243-249): Global atom table
6. **String Ropes** (lines 535-544): Lazy concatenation with depth limiting
7. **Direct Dispatch** (line 53): Computed goto for bytecode
8. **Branch Hints** (lines 36-45): `likely()`/`unlikely()` annotations
9. **Compact Properties** (lines 882-907): Bitfield packing for flags
10. **Specialized Math** (lines 1001-1032): Fast paths for `f_f`, `f_f_f` functions
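
The bitfield packing in item 9 corresponds to the packed `flags : int` field proposed in section 4.1. A minimal OCaml sketch of that packing follows, using the bit layout from the 4.1 comment (writable | enumerable<<1 | configurable<<2). All function and constant names are hypothetical; nothing here is taken from quickjs.c or the existing OCaml code.

```ocaml
(* Bit positions follow the layout given in section 4.1. *)
let writable_bit = 1
let enumerable_bit = 1 lsl 1
let configurable_bit = 1 lsl 2

(* Pack three booleans into a single immediate int. *)
let pack_flags ~writable ~enumerable ~configurable =
  (if writable then writable_bit else 0)
  lor (if enumerable then enumerable_bit else 0)
  lor (if configurable then configurable_bit else 0)

let[@inline] is_writable flags = flags land writable_bit <> 0
let[@inline] is_enumerable flags = flags land enumerable_bit <> 0
let[@inline] is_configurable flags = flags land configurable_bit <> 0

(* Default attributes of an ordinary data property: all three set. *)
let default_flags =
  pack_flags ~writable:true ~enumerable:true ~configurable:true
```

Because an OCaml `int` is an immediate value, a packed flag set adds no heap allocation to a `shape_property`, whereas a record of three `bool` fields would be a separate block.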
