Add tessera design docs and test pipeline · jon.recoil.org/odoc-scrollycode-extension@0f5918e

+287

docs/plans/2026-03-05-tessera-geotessera.md

··· 1 + # tessera-geotessera Implementation Plan 2 + 3 + > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. 4 + 5 + **Goal:** A portable OCaml library for fetching and assembling GeoTessera embedding tiles. Handles tile grid math, coordinate snapping, numpy file parsing (via tessera-npy), dequantization (int8 * scales → float32), and mosaicing multiple tiles. Parameterised over I/O so it works natively and in the browser. 6 + 7 + **Architecture:** Pure OCaml, depends on tessera-npy and tessera-linalg (for the mat type). The fetch function is provided by the caller — in the browser it's Fetch API, natively it's file I/O or HTTP. The library computes which tiles to fetch, parses the .npy responses, dequantizes, and assembles a mosaic. 8 + 9 + **Tech Stack:** OCaml, tessera-npy, tessera-linalg, Alcotest, dune 10 + 11 + --- 12 + 13 + ## GeoTessera Tile Server Reference 14 + 15 + - Base URL: `https://dl2.geotessera.org` 16 + - No authentication 17 + - Tile grid: 0.1° × 0.1°, centers at multiples of 0.1° offset by 0.05° 18 + - Tile naming: `grid_{lon:.2f}_{lat:.2f}` 19 + - Embedding file: `{base}/v1/global_0.1_degree_representation/{year}/grid_{lon}_{lat}/grid_{lon}_{lat}.npy` — int8, shape (H, W, 128) 20 + - Scale file: `{base}/v1/global_0.1_degree_representation/{year}/grid_{lon}_{lat}/grid_{lon}_{lat}_scales.npy` — float32, shape (H, W) 21 + - Coordinate snapping: round to nearest 0.1° grid center (e.g., 0.12 → 0.15, 0.18 → 0.15) 22 + - Tile at center (lon, lat) covers [lon-0.05, lon+0.05] × [lat-0.05, lat+0.05] 23 + 24 + --- 25 + 26 + ### Task 1: Project scaffold + grid math 27 + 28 + **Files:** 29 + - Create: `tessera-geotessera/dune-project` 30 + - Create: `tessera-geotessera/lib/dune` 31 + - Create: `tessera-geotessera/lib/geotessera.ml` 32 + - Create: `tessera-geotessera/lib/geotessera.mli` 33 + - Create: `tessera-geotessera/test/dune` 34 + - Create: `tessera-geotessera/test/test_geotessera.ml` 35 + 36 + The .mli should expose: 37 + 38 + ```ocaml 39 + (** GeoTessera embedding tile client. 40 + 41 + Fetches, dequantizes, and mosaics GeoTessera embedding tiles. 42 + Parameterised over I/O — provide a fetch function for your platform. *) 43 + 44 + (** {1 Types} *) 45 + 46 + type bbox = { 47 + min_lon : float; 48 + min_lat : float; 49 + max_lon : float; 50 + max_lat : float; 51 + } 52 + (** A bounding box in WGS84 coordinates. *) 53 + 54 + type tile_coord = { 55 + lon : float; (** Grid center longitude *) 56 + lat : float; (** Grid center latitude *) 57 + } 58 + (** A tile's grid center coordinates. *) 59 + 60 + (** {1 Grid math} *) 61 + 62 + val snap_to_grid : float -> float 63 + (** Snap a coordinate to the nearest 0.1° grid center. 64 + E.g., [snap_to_grid 0.12] = [0.15], [snap_to_grid 0.18] = [0.15]. *) 65 + 66 + val tiles_for_bbox : bbox -> tile_coord list 67 + (** Enumerate all tile coordinates that overlap a bounding box. *) 68 + 69 + val tile_name : tile_coord -> string 70 + (** Format a tile coordinate as a grid name, e.g., ["grid_0.15_52.05"]. *) 71 + 72 + val embedding_url : base_url:string -> version:string -> year:int -> tile_coord -> string 73 + (** Construct the URL for a tile's embedding .npy file. *) 74 + 75 + val scales_url : base_url:string -> version:string -> year:int -> tile_coord -> string 76 + (** Construct the URL for a tile's scale factors .npy file. *) 77 + 78 + (** {1 Dequantization} *) 79 + 80 + val dequantize : 81 + embeddings:Npy.t -> 82 + scales:Npy.t -> 83 + Linalg.mat 84 + (** Dequantize int8 embeddings using float32 scale factors. 85 + Computes [float32(embedding) * scale] for each pixel. 86 + Input embeddings: int8 shape (H, W, 128). 87 + Input scales: float32 shape (H, W). 88 + Output: mat with rows=H*W, cols=128. *) 89 + 90 + (** {1 Mosaicing} *) 91 + 92 + val mosaic : (tile_coord * Linalg.mat) list -> bbox -> Linalg.mat * int * int 93 + (** Assemble dequantized tiles into a single matrix covering the bbox. 94 + Returns [(mosaic_mat, height, width)] where mosaic_mat has shape 95 + (height*width, 128). Tiles are placed by their grid position. 96 + Missing areas are filled with zeros. *) 97 + ``` 98 + 99 + **Tests for grid math:** 100 + 101 + ```ocaml 102 + let test_snap_to_grid () = 103 + (* 0.12 → 0.15 (round to nearest 0.05 offset by 0.05) *) 104 + let check msg expected input = 105 + Alcotest.(check (float 1e-6)) msg expected (Geotessera.snap_to_grid input) 106 + in 107 + check "0.12 → 0.15" 0.15 0.12; 108 + check "0.18 → 0.15" 0.15 0.18; 109 + check "0.02 → 0.05" 0.05 0.02; 110 + check "0.07 → 0.05" 0.05 0.07; 111 + check "-0.12 → -0.15" (-0.15) (-0.12); 112 + check "0.25 → 0.25" 0.25 0.25 113 + 114 + let test_tiles_for_bbox () = 115 + let bbox = Geotessera.{ min_lon = 0.0; min_lat = 52.0; max_lon = 0.2; max_lat = 52.1 } in 116 + let tiles = Geotessera.tiles_for_bbox bbox in 117 + (* Should cover grid centers 0.05, 0.15 in lon and 52.05 in lat *) 118 + let n = List.length tiles in 119 + Alcotest.(check bool) "at least 2 tiles" true (n >= 2) 120 + 121 + let test_tile_name () = 122 + let name = Geotessera.tile_name { lon = 0.15; lat = 52.05 } in 123 + Alcotest.(check string) "name" "grid_0.15_52.05" name 124 + 125 + let test_embedding_url () = 126 + let url = Geotessera.embedding_url 127 + ~base_url:"https://dl2.geotessera.org" 128 + ~version:"v1" ~year:2024 129 + { lon = 0.15; lat = 52.05 } 130 + in 131 + Alcotest.(check string) "url" 132 + "https://dl2.geotessera.org/v1/global_0.1_degree_representation/2024/grid_0.15_52.05/grid_0.15_52.05.npy" 133 + url 134 + ``` 135 + 136 + Build, test, commit: `tessera-geotessera: scaffold with grid math` 137 + 138 + --- 139 + 140 + ### Task 2: Dequantization 141 + 142 + **Step 1: Write tests** 143 + 144 + Generate test .npy fixtures with Python: 145 + ```bash 146 + python3 -c " 147 + import numpy as np 148 + # 2x3 int8 embeddings with 4 features (small for testing) 149 + emb = np.array([[[1,2,3,4],[5,6,7,8],[9,10,11,12]], 150 + [[13,14,15,16],[17,18,19,20],[21,22,23,24]]], dtype=np.int8) 151 + np.save('$HOME/workspace/mono/tessera-geotessera/test/fixtures/emb_2x3x4.npy', emb) 152 + # 2x3 float32 scales 153 + scales = np.array([[0.1, 0.2, 0.3],[0.4, 0.5, 0.6]], dtype=np.float32) 154 + np.save('$HOME/workspace/mono/tessera-geotessera/test/fixtures/scales_2x3.npy', scales) 155 + print('Generated') 156 + " 157 + ``` 158 + 159 + Test: 160 + ```ocaml 161 + let test_dequantize () = 162 + let emb_data = read_file (fixture "emb_2x3x4.npy") in 163 + let scales_data = read_file (fixture "scales_2x3.npy") in 164 + let emb = Result.get_ok (Npy.of_string emb_data) in 165 + let scales = Result.get_ok (Npy.of_string scales_data) in 166 + let result = Geotessera.dequantize ~embeddings:emb ~scales in 167 + (* 2*3=6 pixels, 4 features *) 168 + Alcotest.(check int) "rows" 6 result.rows; 169 + Alcotest.(check int) "cols" 4 result.cols; 170 + (* First pixel: emb=[1,2,3,4], scale=0.1 → [0.1, 0.2, 0.3, 0.4] *) 171 + let check_near msg expected actual = 172 + Alcotest.(check (float 0.01)) msg expected actual 173 + in 174 + check_near "pixel 0, feat 0" 0.1 (Linalg.mat_get result 0 0); 175 + check_near "pixel 0, feat 1" 0.2 (Linalg.mat_get result 0 1); 176 + (* Second pixel: emb=[5,6,7,8], scale=0.2 → [1.0, 1.2, 1.4, 1.6] *) 177 + check_near "pixel 1, feat 0" 1.0 (Linalg.mat_get result 1 0); 178 + check_near "pixel 1, feat 1" 1.2 (Linalg.mat_get result 1 1) 179 + ``` 180 + 181 + **Step 2: Implement dequantize** 182 + 183 + ``` 184 + For each pixel (i, j) with n_features features: 185 + scale = scales[i, j] (single float32 value) 186 + for f = 0 to n_features - 1: 187 + result[i*W+j, f] = float_of_int(embeddings[i, j, f]) * scale 188 + ``` 189 + 190 + Build, test, commit: `tessera-geotessera: implement dequantization` 191 + 192 + --- 193 + 194 + ### Task 3: Fetch + mosaic 195 + 196 + This is the higher-level API that ties grid math, fetching, dequantization, and mosaicing together. 197 + 198 + **Step 1: Write mosaic test** 199 + 200 + Test with 2 synthetic tiles placed side-by-side: 201 + 202 + ```ocaml 203 + let test_mosaic_two_tiles () = 204 + (* Two tiles side by side: (0.05, 52.05) and (0.15, 52.05) *) 205 + let t1_coord = Geotessera.{ lon = 0.05; lat = 52.05 } in 206 + let t2_coord = Geotessera.{ lon = 0.15; lat = 52.05 } in 207 + (* Each tile is 2x2 pixels with 2 features *) 208 + let t1 = Linalg.create_mat ~rows:4 ~cols:2 in 209 + let t2 = Linalg.create_mat ~rows:4 ~cols:2 in 210 + (* Fill t1 with 1.0, t2 with 2.0 *) 211 + for i = 0 to 3 do 212 + for j = 0 to 1 do 213 + Linalg.mat_set t1 i j 1.0; 214 + Linalg.mat_set t2 i j 2.0 215 + done 216 + done; 217 + let bbox = Geotessera.{ min_lon = 0.0; min_lat = 52.0; max_lon = 0.2; max_lat = 52.1 } in 218 + let mosaic, h, w = Geotessera.mosaic [(t1_coord, t1); (t2_coord, t2)] bbox in 219 + Alcotest.(check int) "height" 2 h; 220 + Alcotest.(check int) "width" 4 w; 221 + Alcotest.(check int) "rows" 8 mosaic.rows; 222 + Alcotest.(check int) "cols" 2 mosaic.cols 223 + ``` 224 + 225 + **Step 2: Implement mosaic** 226 + 227 + The mosaic function needs to: 228 + 1. Determine the grid extent from the bbox (how many tiles wide × tall) 229 + 2. Determine pixel dimensions per tile (from the mat rows and tile count) 230 + 3. Allocate output matrix: (total_H * total_W, n_features) 231 + 4. Place each tile's data at the correct offset based on its grid position 232 + 233 + For v1, assume all tiles have the same pixel dimensions. The mosaic arranges tiles in a grid: 234 + - Grid origin is (min tile lon, max tile lat) — top-left 235 + - Tile at grid position (gi, gj) fills rows [gi*tile_h .. (gi+1)*tile_h-1] 236 + 237 + Build, test, commit: `tessera-geotessera: implement mosaic` 238 + 239 + --- 240 + 241 + ### Task 4: fetch_mosaic high-level API 242 + 243 + Add the main user-facing function that orchestrates the full workflow: 244 + 245 + ```ocaml 246 + type fetch = string -> string option Lwt.t 247 + (** A function that fetches a URL and returns the response body, 248 + or None on failure. *) 249 + 250 + val fetch_mosaic : 251 + fetch:fetch -> 252 + base_url:string -> 253 + version:string -> 254 + year:int -> 255 + bbox -> 256 + (Linalg.mat * int * int) Lwt.t 257 + (** Fetch all tiles for a bounding box, dequantize, and mosaic. 258 + Returns [(mosaic, height, width)]. 259 + Uses the provided [fetch] function for HTTP requests. *) 260 + ``` 261 + 262 + Actually — this introduces an Lwt dependency which breaks portability. Instead, make it parameterised over the concurrency monad too, or just provide a synchronous version: 263 + 264 + ```ocaml 265 + val fetch_mosaic_sync : 266 + fetch:(string -> string) -> 267 + base_url:string -> 268 + version:string -> 269 + year:int -> 270 + bbox -> 271 + Linalg.mat * int * int 272 + (** Synchronous version: fetch is a blocking function. *) 273 + ``` 274 + 275 + For the browser version (tessera-geotessera-jsoo), we can add the Lwt wrapper later. The portable core should stay dependency-free. 276 + 277 + Test with a mock fetch function that returns pre-built .npy data. 278 + 279 + Build, test, commit: `tessera-geotessera: add fetch_mosaic_sync` 280 + 281 + --- 282 + 283 + ### Task 5: Documentation and opam 284 + 285 + Polish .mli, generate opam file, final verification. 286 + 287 + Commit: `tessera-geotessera: polish docs and generate opam file`

+389

docs/plans/2026-03-05-tessera-linalg.md

··· 1 + # tessera-linalg Implementation Plan 2 + 3 + > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. 4 + 5 + **Goal:** A portable OCaml library implementing PCA dimensionality reduction and k-nearest-neighbors classification, operating on float32 Bigarray data. Designed for GeoTessera embeddings (128 dimensions) but general-purpose. 6 + 7 + **Architecture:** Pure OCaml, no C stubs, no platform dependencies. Operates on `Bigarray.Array1.t` (flat arrays with explicit shape) for compatibility with tessera-npy. PCA uses power iteration for eigendecomposition of the 128x128 covariance matrix. kNN uses brute-force distance computation with distance-weighted voting. 8 + 9 + **Tech Stack:** OCaml, Bigarray, Alcotest (testing), dune 10 + 11 + --- 12 + 13 + ## Data representation 14 + 15 + All arrays are flat `(float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t` with shape tracked separately as `int array`. This matches tessera-npy's output format and enables zero-copy interop with js_of_ocaml TypedArrays. 16 + 17 + Helper functions for 2D indexing: `get mat ~cols ~row ~col` and `set mat ~cols ~row ~col v`. 18 + 19 + --- 20 + 21 + ### Task 1: Project scaffold + matrix helpers 22 + 23 + **Files:** 24 + - Create: `tessera-linalg/dune-project` 25 + - Create: `tessera-linalg/lib/dune` 26 + - Create: `tessera-linalg/lib/linalg.ml` 27 + - Create: `tessera-linalg/lib/linalg.mli` 28 + - Create: `tessera-linalg/test/dune` 29 + - Create: `tessera-linalg/test/test_linalg.ml` 30 + 31 + **Step 1: Create project scaffold** 32 + 33 + ``` 34 + tessera-linalg/dune-project 35 + ``` 36 + ```dune 37 + (lang dune 3.17) 38 + (name tessera-linalg) 39 + (generate_opam_files true) 40 + (license ISC) 41 + (package 42 + (name tessera-linalg) 43 + (synopsis "PCA and kNN for float32 Bigarray data") 44 + (description "Portable OCaml implementations of PCA dimensionality reduction and k-nearest-neighbors classification. Operates on flat float32 Bigarray data with explicit shapes. Designed for GeoTessera embedding classification but general-purpose.") 45 + (depends 46 + (ocaml (>= 5.2)) 47 + (alcotest (and :with-test (>= 0.8))))) 48 + ``` 49 + 50 + ``` 51 + tessera-linalg/lib/dune 52 + ``` 53 + ```dune 54 + (library 55 + (name linalg) 56 + (public_name tessera-linalg)) 57 + ``` 58 + 59 + **Step 2: Define the core types and matrix helpers** 60 + 61 + The `.mli` should expose: 62 + 63 + ```ocaml 64 + (** PCA and kNN for float32 Bigarray data. 65 + 66 + All arrays are flat 67 + [(float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t] 68 + with shapes tracked separately. *) 69 + 70 + type vec = (float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t 71 + (** A 1-dimensional float32 vector. *) 72 + 73 + type mat = { 74 + data : vec; 75 + rows : int; 76 + cols : int; 77 + } 78 + (** A 2-dimensional float32 matrix stored in row-major order. *) 79 + 80 + (** {1 Matrix construction} *) 81 + 82 + val create_mat : rows:int -> cols:int -> mat 83 + val mat_get : mat -> int -> int -> float 84 + val mat_set : mat -> int -> int -> float -> unit 85 + 86 + (** {1 PCA} *) 87 + 88 + type pca_model 89 + (** A fitted PCA model (mean vector + principal components). *) 90 + 91 + val pca_fit : mat -> n_components:int -> ?max_samples:int -> pca_model 92 + (** Fit PCA on a data matrix of shape (n_samples, n_features). 93 + [max_samples] limits the number of rows used for fitting (default: 100_000). 94 + Samples are taken evenly spaced if the matrix has more rows. *) 95 + 96 + val pca_transform : pca_model -> mat -> mat 97 + (** Project data through a fitted PCA model. Input shape: (n_samples, n_features). 98 + Output shape: (n_samples, n_components). *) 99 + 100 + (** {1 kNN} *) 101 + 102 + type knn_model 103 + (** A fitted kNN model (training embeddings + labels). *) 104 + 105 + val knn_fit : embeddings:mat -> labels:int array -> knn_model 106 + (** Fit a kNN model from labeled training embeddings. 107 + [embeddings] has shape (n_training, n_features). 108 + [labels] has length n_training with integer class IDs. *) 109 + 110 + type knn_result = { 111 + predictions : int array; 112 + confidences : float array; 113 + } 114 + (** Classification result: predicted class and confidence per sample. *) 115 + 116 + val knn_predict : knn_model -> k:int -> mat -> knn_result 117 + (** Classify samples using k-nearest neighbors with distance weighting. 118 + Input shape: (n_samples, n_features). 119 + Returns a prediction and confidence for each sample. *) 120 + ``` 121 + 122 + Start with stub implementations that raise `Failure "not implemented"` for PCA/kNN, but implement the matrix helpers fully. 123 + 124 + **Step 3: Write tests for matrix helpers** 125 + 126 + ```ocaml 127 + let test_mat_create_and_access () = 128 + let m = Linalg.create_mat ~rows:3 ~cols:4 in 129 + Linalg.mat_set m 1 2 42.0; 130 + let v = Linalg.mat_get m 1 2 in 131 + Alcotest.(check (float 1e-6)) "get/set" 42.0 v 132 + 133 + let test_mat_dimensions () = 134 + let m = Linalg.create_mat ~rows:5 ~cols:3 in 135 + Alcotest.(check int) "rows" 5 m.rows; 136 + Alcotest.(check int) "cols" 3 m.cols; 137 + Alcotest.(check int) "data length" 15 (Bigarray.Array1.dim m.data) 138 + ``` 139 + 140 + **Step 4: Build, test, commit** 141 + 142 + Run: `cd ~/workspace/mono && opam exec -- dune build @tessera-linalg/test/runtest` 143 + Commit: `git commit -m "tessera-linalg: initial scaffold with matrix helpers"` 144 + 145 + --- 146 + 147 + ### Task 2: PCA implementation 148 + 149 + **Files:** 150 + - Modify: `tessera-linalg/lib/linalg.ml` 151 + - Modify: `tessera-linalg/test/test_linalg.ml` 152 + 153 + **Step 1: Write PCA tests** 154 + 155 + Test with a simple known dataset where PCA results are predictable: 156 + 157 + ```ocaml 158 + let test_pca_basic () = 159 + (* 4 points in 3D, with variance mainly along first axis *) 160 + let m = Linalg.create_mat ~rows:4 ~cols:3 in 161 + (* Points: (1,0,0), (2,0,0), (3,0,0), (4,0,0) — variance only in dim 0 *) 162 + List.iteri (fun i x -> 163 + Linalg.mat_set m i 0 x; Linalg.mat_set m i 1 0.0; Linalg.mat_set m i 2 0.0 164 + ) [1.0; 2.0; 3.0; 4.0]; 165 + let model = Linalg.pca_fit m ~n_components:1 in 166 + let result = Linalg.pca_transform model m in 167 + Alcotest.(check int) "output rows" 4 result.rows; 168 + Alcotest.(check int) "output cols" 1 result.cols; 169 + (* The projected values should be proportional to the input x values *) 170 + let v0 = Linalg.mat_get result 0 0 in 171 + let v1 = Linalg.mat_get result 1 0 in 172 + let v2 = Linalg.mat_get result 2 0 in 173 + let v3 = Linalg.mat_get result 3 0 in 174 + (* Check spacing is roughly equal *) 175 + let d1 = Float.abs (v1 -. v0) in 176 + let d2 = Float.abs (v2 -. v1) in 177 + let d3 = Float.abs (v3 -. v2) in 178 + if Float.abs (d1 -. d2) > 0.1 || Float.abs (d2 -. d3) > 0.1 then 179 + Alcotest.failf "uneven spacing: %f %f %f" d1 d2 d3 180 + 181 + let test_pca_2d_to_1d () = 182 + (* Points spread along y=x diagonal, PCA should find that direction *) 183 + let n = 100 in 184 + let m = Linalg.create_mat ~rows:n ~cols:2 in 185 + for i = 0 to n - 1 do 186 + let x = Float.of_int i in 187 + Linalg.mat_set m i 0 x; 188 + Linalg.mat_set m i 1 x 189 + done; 190 + let model = Linalg.pca_fit m ~n_components:1 in 191 + let result = Linalg.pca_transform model m in 192 + (* All points should be on a line — check result is monotonically increasing or decreasing *) 193 + let increasing = Linalg.mat_get result 1 0 > Linalg.mat_get result 0 0 in 194 + for i = 1 to n - 2 do 195 + let curr = Linalg.mat_get result i 0 in 196 + let next = Linalg.mat_get result (i + 1) 0 in 197 + if increasing then 198 + (if next < curr -. 0.01 then Alcotest.failf "not monotonic at %d" i) 199 + else 200 + (if next > curr +. 0.01 then Alcotest.failf "not monotonic at %d" i) 201 + done 202 + 203 + let test_pca_preserves_n_components () = 204 + let m = Linalg.create_mat ~rows:10 ~cols:5 in 205 + for i = 0 to 9 do 206 + for j = 0 to 4 do 207 + Linalg.mat_set m i j (Float.of_int (i * 5 + j)) 208 + done 209 + done; 210 + let model = Linalg.pca_fit m ~n_components:3 in 211 + let result = Linalg.pca_transform model m in 212 + Alcotest.(check int) "output cols" 3 result.cols; 213 + Alcotest.(check int) "output rows" 10 result.rows 214 + ``` 215 + 216 + **Step 2: Implement PCA** 217 + 218 + The algorithm: 219 + 1. **Subsample** if rows > max_samples: take evenly spaced rows 220 + 2. **Compute mean**: average each column 221 + 3. **Compute covariance matrix** (n_features x n_features): `C = (X - mean)^T (X - mean) / (n - 1)` 222 + 4. **Power iteration** to find top eigenvectors: 223 + - For each component k = 0..n_components-1: 224 + - Start with random vector v 225 + - Repeat ~100 iterations: v = C*v; normalize v 226 + - Deflate: C = C - eigenvalue * v * v^T 227 + - Store eigenvector 228 + 5. **pca_transform**: `result[i, k] = dot(X[i] - mean, component[k])` 229 + 230 + Implementation notes: 231 + - The covariance matrix is at most 128x128 (small), so power iteration is fine 232 + - Use a deterministic seed for the initial random vector (e.g., based on column index) 233 + - Float32 precision is sufficient for visualization purposes 234 + 235 + **Step 3: Run tests, commit** 236 + 237 + Run: `cd ~/workspace/mono && opam exec -- dune build @tessera-linalg/test/runtest` 238 + Commit: `git commit -m "tessera-linalg: implement PCA with power iteration"` 239 + 240 + --- 241 + 242 + ### Task 3: kNN implementation 243 + 244 + **Files:** 245 + - Modify: `tessera-linalg/lib/linalg.ml` 246 + - Modify: `tessera-linalg/test/test_linalg.ml` 247 + 248 + **Step 1: Write kNN tests** 249 + 250 + ```ocaml 251 + let test_knn_simple () = 252 + (* 4 training points in 2D, 2 classes *) 253 + let train = Linalg.create_mat ~rows:4 ~cols:2 in 254 + (* Class 0: points near origin *) 255 + Linalg.mat_set train 0 0 0.0; Linalg.mat_set train 0 1 0.0; 256 + Linalg.mat_set train 1 0 0.1; Linalg.mat_set train 1 1 0.1; 257 + (* Class 1: points near (10, 10) *) 258 + Linalg.mat_set train 2 0 10.0; Linalg.mat_set train 2 1 10.0; 259 + Linalg.mat_set train 3 0 10.1; Linalg.mat_set train 3 1 10.1; 260 + let labels = [|0; 0; 1; 1|] in 261 + let model = Linalg.knn_fit ~embeddings:train ~labels in 262 + (* Test point near origin → class 0, test point near (10,10) → class 1 *) 263 + let test = Linalg.create_mat ~rows:2 ~cols:2 in 264 + Linalg.mat_set test 0 0 0.05; Linalg.mat_set test 0 1 0.05; 265 + Linalg.mat_set test 1 0 9.9; Linalg.mat_set test 1 1 9.9; 266 + let result = Linalg.knn_predict model ~k:2 test in 267 + Alcotest.(check int) "pred 0" 0 result.predictions.(0); 268 + Alcotest.(check int) "pred 1" 1 result.predictions.(1); 269 + (* Confidence should be high (both neighbors agree) *) 270 + if result.confidences.(0) < 0.9 then 271 + Alcotest.failf "low confidence: %f" result.confidences.(0) 272 + 273 + let test_knn_distance_weighting () = 274 + (* 3 training points: 2 of class 0 (far), 1 of class 1 (very close) *) 275 + let train = Linalg.create_mat ~rows:3 ~cols:1 in 276 + Linalg.mat_set train 0 0 10.0; (* class 0, far *) 277 + Linalg.mat_set train 1 0 11.0; (* class 0, far *) 278 + Linalg.mat_set train 2 0 0.01; (* class 1, very close *) 279 + let labels = [|0; 0; 1|] in 280 + let model = Linalg.knn_fit ~embeddings:train ~labels in 281 + let test = Linalg.create_mat ~rows:1 ~cols:1 in 282 + Linalg.mat_set test 0 0 0.0; (* at origin, very close to class 1 *) 283 + let result = Linalg.knn_predict model ~k:3 test in 284 + (* With distance weighting, class 1 should win despite being outnumbered *) 285 + Alcotest.(check int) "pred" 1 result.predictions.(0) 286 + 287 + let test_knn_k_equals_1 () = 288 + let train = Linalg.create_mat ~rows:2 ~cols:1 in 289 + Linalg.mat_set train 0 0 0.0; 290 + Linalg.mat_set train 1 0 10.0; 291 + let labels = [|0; 1|] in 292 + let model = Linalg.knn_fit ~embeddings:train ~labels in 293 + let test = Linalg.create_mat ~rows:1 ~cols:1 in 294 + Linalg.mat_set test 0 0 3.0; (* closer to 0 than 10 *) 295 + let result = Linalg.knn_predict model ~k:1 test in 296 + Alcotest.(check int) "pred" 0 result.predictions.(0); 297 + Alcotest.(check (float 1e-6)) "conf" 1.0 result.confidences.(0) 298 + ``` 299 + 300 + **Step 2: Implement kNN** 301 + 302 + The algorithm: 303 + 1. **knn_fit**: Store training embeddings matrix and labels array 304 + 2. **knn_predict** for each test sample: 305 + a. Compute squared Euclidean distance to every training point 306 + b. Find k smallest distances (partial sort or full sort — N is small for training data) 307 + c. Distance-weighted vote: weight = 1 / (distance + epsilon) where epsilon avoids division by zero 308 + d. Sum weights per class, predict class with highest total weight 309 + e. Confidence = max_weight / total_weight 310 + 311 + Implementation notes: 312 + - k is capped to min(k, n_training) 313 + - For the sort: since N (training points) is small (typically <1000), a full sort is fine 314 + - epsilon = 1e-10 to avoid division by zero for exact matches 315 + 316 + **Step 3: Run tests, commit** 317 + 318 + Run: `cd ~/workspace/mono && opam exec -- dune build @tessera-linalg/test/runtest` 319 + Commit: `git commit -m "tessera-linalg: implement kNN with distance-weighted voting"` 320 + 321 + --- 322 + 323 + ### Task 4: Integration test with realistic dimensions 324 + 325 + **Files:** 326 + - Modify: `tessera-linalg/test/test_linalg.ml` 327 + 328 + **Step 1: Write a test that mimics the GeoTessera workflow** 329 + 330 + ```ocaml 331 + let test_full_workflow () = 332 + (* Simulate: 100 pixels with 128-dim embeddings, PCA to 3, then kNN *) 333 + let n_pixels = 100 in 334 + let n_features = 128 in 335 + let embeddings = Linalg.create_mat ~rows:n_pixels ~cols:n_features in 336 + (* Create 2 clusters in 128D space *) 337 + for i = 0 to n_pixels - 1 do 338 + let cluster = if i < 50 then 0.0 else 10.0 in 339 + for j = 0 to n_features - 1 do 340 + (* Cluster center + small noise *) 341 + let v = cluster +. Float.of_int (j mod 3) *. 0.01 in 342 + Linalg.mat_set embeddings i j v 343 + done 344 + done; 345 + (* PCA to 3 components *) 346 + let pca = Linalg.pca_fit embeddings ~n_components:3 in 347 + let projected = Linalg.pca_transform pca embeddings in 348 + Alcotest.(check int) "pca rows" n_pixels projected.rows; 349 + Alcotest.(check int) "pca cols" 3 projected.cols; 350 + (* Label a few points: first 5 = class 0, last 5 = class 1 *) 351 + let train_indices = [|0; 1; 2; 3; 4; 95; 96; 97; 98; 99|] in 352 + let train_labels = [|0; 0; 0; 0; 0; 1; 1; 1; 1; 1|] in 353 + let train = Linalg.create_mat ~rows:10 ~cols:3 in 354 + Array.iteri (fun ti si -> 355 + for j = 0 to 2 do 356 + Linalg.mat_set train ti j (Linalg.mat_get projected si j) 357 + done 358 + ) train_indices; 359 + let knn = Linalg.knn_fit ~embeddings:train ~labels:train_labels in 360 + let result = Linalg.knn_predict knn ~k:3 projected in 361 + (* First 50 should be class 0, last 50 should be class 1 *) 362 + for i = 0 to 49 do 363 + Alcotest.(check int) (Printf.sprintf "pixel %d" i) 0 result.predictions.(i) 364 + done; 365 + for i = 50 to 99 do 366 + Alcotest.(check int) (Printf.sprintf "pixel %d" i) 1 result.predictions.(i) 367 + done 368 + ``` 369 + 370 + **Step 2: Run tests, commit** 371 + 372 + Commit: `git commit -m "tessera-linalg: add integration test with realistic dimensions"` 373 + 374 + --- 375 + 376 + ### Task 5: Documentation and opam file 377 + 378 + **Files:** 379 + - Modify: `tessera-linalg/lib/linalg.mli` — polish documentation 380 + - Verify: `tessera-linalg/tessera-linalg.opam` auto-generated 381 + 382 + **Step 1: Polish .mli with usage examples** 383 + 384 + Add a module-level example showing the PCA + kNN workflow, and document each function's preconditions and behavior. 385 + 386 + **Step 2: Final build and test** 387 + 388 + Run: `cd ~/workspace/mono && opam exec -- dune build @tessera-linalg/test/runtest` 389 + Commit: `git commit -m "tessera-linalg: polish docs and generate opam file"`

+811

docs/plans/2026-03-05-tessera-npy.md

··· 1 + # tessera-npy Implementation Plan 2 + 3 + > **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. 4 + 5 + **Goal:** A portable OCaml library for reading (and optionally writing) numpy `.npy` files, usable natively and via js_of_ocaml. 6 + 7 + **Architecture:** Pure OCaml, no C stubs, no platform dependencies. Uses Bigarray for zero-copy data representation. Parses the `.npy` header (magic bytes, version, Python dict literal for dtype/shape/order) then maps the raw data buffer into a typed Bigarray. Supports the dtypes needed for GeoTessera: int8 (`|i1`), float32 (`<f4`), float64 (`<f8`), and uint8 (`|u1`). 8 + 9 + **Tech Stack:** OCaml, Bigarray, Alcotest (testing), dune 10 + 11 + --- 12 + 13 + ## Reference: `.npy` Format 14 + 15 + The numpy `.npy` format (v1.0 and v2.0) is: 16 + 17 + 1. Magic: `\x93NUMPY` (6 bytes) 18 + 2. Major version: 1 byte 19 + 3. Minor version: 1 byte 20 + 4. Header length: 2 bytes little-endian (v1) or 4 bytes little-endian (v2) 21 + 5. Header: ASCII Python dict, padded with spaces to align data to 64 bytes. Example: 22 + `{'descr': '|i1', 'fortran_order': False, 'shape': (1139, 729, 128), }\n` 23 + 6. Raw data: contiguous array in C-order (or Fortran-order if specified) 24 + 25 + Real GeoTessera files: 26 + - Embeddings: `{'descr': '|i1', 'fortran_order': False, 'shape': (1139, 729, 128)}` 27 + - Scales: `{'descr': '<f4', 'fortran_order': False, 'shape': (1139, 729)}` 28 + 29 + --- 30 + 31 + ### Task 1: Project scaffold 32 + 33 + **Files:** 34 + - Create: `tessera-npy/dune-project` 35 + - Create: `tessera-npy/lib/dune` 36 + - Create: `tessera-npy/lib/npy.ml` 37 + - Create: `tessera-npy/lib/npy.mli` 38 + - Create: `tessera-npy/test/dune` 39 + - Create: `tessera-npy/test/test_npy.ml` 40 + 41 + **Step 1: Create directory structure and dune-project** 42 + 43 + ``` 44 + tessera-npy/dune-project 45 + ``` 46 + ```dune 47 + (lang dune 3.17) 48 + (name tessera-npy) 49 + (generate_opam_files true) 50 + (license ISC) 51 + (package 52 + (name tessera-npy) 53 + (synopsis "Read and write numpy .npy files in OCaml") 54 + (description "A portable OCaml library for parsing the numpy .npy binary format. Supports int8, uint8, float32, and float64 dtypes with arbitrary shapes. Uses Bigarray for zero-copy data representation.") 55 + (depends 56 + (ocaml (>= 5.2)) 57 + (alcotest (and :with-test (>= 0.8))))) 58 + ``` 59 + 60 + ``` 61 + tessera-npy/lib/dune 62 + ``` 63 + ```dune 64 + (library 65 + (name npy) 66 + (public_name tessera-npy)) 67 + ``` 68 + 69 + ``` 70 + tessera-npy/lib/npy.mli 71 + ``` 72 + ```ocaml 73 + (** Read and write numpy .npy files. *) 74 + 75 + (** {1 Types} *) 76 + 77 + (** Element types supported by this library. *) 78 + type _ dtype = 79 + | Int8 : int dtype 80 + | Uint8 : int dtype 81 + | Float32 : float dtype 82 + | Float64 : float dtype 83 + 84 + (** A parsed .npy file: dtype, shape, and raw data as bytes. *) 85 + type t 86 + 87 + (** {1 Reading} *) 88 + 89 + val of_string : string -> (t, string) result 90 + (** Parse a .npy file from its raw bytes. *) 91 + 92 + val shape : t -> int array 93 + (** The shape of the array. *) 94 + 95 + val fortran_order : t -> bool 96 + (** Whether the data is in Fortran (column-major) order. *) 97 + 98 + val data_int8 : t -> (int, Bigarray.int8_signed_elt, Bigarray.c_layout) Bigarray.Array1.t option 99 + (** Access data as int8 bigarray if the dtype matches. *) 100 + 101 + val data_uint8 : t -> (int, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t option 102 + (** Access data as uint8 bigarray if the dtype matches. *) 103 + 104 + val data_float32 : t -> (float, Bigarray.float32_elt, Bigarray.c_layout) Bigarray.Array1.t option 105 + (** Access data as float32 bigarray if the dtype matches. *) 106 + 107 + val data_float64 : t -> (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t option 108 + (** Access data as float64 bigarray if the dtype matches. *) 109 + ``` 110 + 111 + ``` 112 + tessera-npy/lib/npy.ml 113 + ``` 114 + ```ocaml 115 + type _ dtype = 116 + | Int8 : int dtype 117 + | Uint8 : int dtype 118 + | Float32 : float dtype 119 + | Float64 : float dtype 120 + 121 + type packed_dtype = D : _ dtype -> packed_dtype 122 + 123 + type t = { 124 + shape : int array; 125 + fortran_order : bool; 126 + dtype : packed_dtype; 127 + data : string; (* raw bytes of the data section *) 128 + } 129 + 130 + let of_string _s = Error "not implemented" 131 + let shape t = t.shape 132 + let fortran_order t = t.fortran_order 133 + let data_int8 _t = None 134 + let data_uint8 _t = None 135 + let data_float32 _t = None 136 + let data_float64 _t = None 137 + ``` 138 + 139 + ``` 140 + tessera-npy/test/dune 141 + ``` 142 + ```dune 143 + (test 144 + (name test_npy) 145 + (libraries tessera-npy alcotest)) 146 + ``` 147 + 148 + ``` 149 + tessera-npy/test/test_npy.ml 150 + ``` 151 + ```ocaml 152 + let () = 153 + Alcotest.run "tessera-npy" [] 154 + ``` 155 + 156 + **Step 2: Verify it builds** 157 + 158 + Run: `cd ~/workspace/mono && opam exec -- dune build -p tessera-npy` 159 + Expected: Build succeeds (no tests yet, just scaffold). 160 + 161 + Run: `cd ~/workspace/mono && opam exec -- dune test -p tessera-npy` 162 + Expected: Test binary runs, 0 tests. 163 + 164 + **Step 3: Commit** 165 + 166 + ```bash 167 + git add tessera-npy/ 168 + git commit -m "tessera-npy: initial project scaffold" 169 + ``` 170 + 171 + --- 172 + 173 + ### Task 2: Header parsing 174 + 175 + **Files:** 176 + - Modify: `tessera-npy/test/test_npy.ml` 177 + - Modify: `tessera-npy/lib/npy.ml` 178 + 179 + **Step 1: Write failing tests for header parsing** 180 + 181 + ``` 182 + tessera-npy/test/test_npy.ml 183 + ``` 184 + ```ocaml 185 + let make_npy_v1 ~descr ~fortran_order ~shape data = 186 + let header = 187 + Printf.sprintf "{'descr': '%s', 'fortran_order': %s, 'shape': (%s), }" 188 + descr 189 + (if fortran_order then "True" else "False") 190 + (String.concat ", " (List.map string_of_int shape)) 191 + in 192 + (* Pad header to align data to 64 bytes *) 193 + let prefix_len = 6 + 2 + 2 in (* magic + version + header_len *) 194 + let raw_header_len = String.length header + 1 (* +1 for \n *) in 195 + let padded_len = 196 + let total = prefix_len + raw_header_len in 197 + let rem = total mod 64 in 198 + if rem = 0 then raw_header_len else raw_header_len + (64 - rem) 199 + in 200 + let buf = Buffer.create (prefix_len + padded_len + String.length data) in 201 + Buffer.add_string buf "\x93NUMPY"; (* magic *) 202 + Buffer.add_char buf '\x01'; (* major version *) 203 + Buffer.add_char buf '\x00'; (* minor version *) 204 + (* header length as 2-byte little-endian *) 205 + Buffer.add_char buf (Char.chr (padded_len land 0xff)); 206 + Buffer.add_char buf (Char.chr ((padded_len lsr 8) land 0xff)); 207 + Buffer.add_string buf header; 208 + (* Pad with spaces, terminate with \n *) 209 + for _ = 1 to padded_len - raw_header_len do 210 + Buffer.add_char buf ' ' 211 + done; 212 + Buffer.add_char buf '\n'; 213 + Buffer.add_string buf data; 214 + Buffer.contents buf 215 + 216 + let test_parse_int8_header () = 217 + let npy = make_npy_v1 ~descr:"|i1" ~fortran_order:false ~shape:[3; 4] "\x00" in 218 + match Npy.of_string npy with 219 + | Error e -> Alcotest.fail e 220 + | Ok t -> 221 + Alcotest.(check (array int)) "shape" [|3; 4|] (Npy.shape t); 222 + Alcotest.(check bool) "fortran_order" false (Npy.fortran_order t) 223 + 224 + let test_parse_float32_header () = 225 + let npy = make_npy_v1 ~descr:"<f4" ~fortran_order:false ~shape:[2; 3] "\x00" in 226 + match Npy.of_string npy with 227 + | Error e -> Alcotest.fail e 228 + | Ok t -> 229 + Alcotest.(check (array int)) "shape" [|2; 3|] (Npy.shape t) 230 + 231 + let test_parse_3d_shape () = 232 + let npy = make_npy_v1 ~descr:"|i1" ~fortran_order:false ~shape:[10; 20; 128] "\x00" in 233 + match Npy.of_string npy with 234 + | Error e -> Alcotest.fail e 235 + | Ok t -> 236 + Alcotest.(check (array int)) "shape" [|10; 20; 128|] (Npy.shape t) 237 + 238 + let test_bad_magic () = 239 + let npy = "NOT_NPY_DATA" in 240 + match Npy.of_string npy with 241 + | Ok _ -> Alcotest.fail "should have failed" 242 + | Error _ -> () 243 + 244 + let () = 245 + Alcotest.run "tessera-npy" 246 + [ 247 + ( "header", 248 + [ 249 + Alcotest.test_case "int8 header" `Quick test_parse_int8_header; 250 + Alcotest.test_case "float32 header" `Quick test_parse_float32_header; 251 + Alcotest.test_case "3d shape" `Quick test_parse_3d_shape; 252 + Alcotest.test_case "bad magic" `Quick test_bad_magic; 253 + ] ); 254 + ] 255 + ``` 256 + 257 + **Step 2: Run tests to verify they fail** 258 + 259 + Run: `cd ~/workspace/mono && opam exec -- dune test -p tessera-npy 2>&1` 260 + Expected: FAIL — `of_string` returns `Error "not implemented"`. 261 + 262 + **Step 3: Implement header parsing** 263 + 264 + Replace the contents of `tessera-npy/lib/npy.ml`: 265 + 266 + ```ocaml 267 + type _ dtype = 268 + | Int8 : int dtype 269 + | Uint8 : int dtype 270 + | Float32 : float dtype 271 + | Float64 : float dtype 272 + 273 + type packed_dtype = D : _ dtype -> packed_dtype 274 + 275 + type t = { 276 + shape : int array; 277 + fortran_order : bool; 278 + dtype : packed_dtype; 279 + data : string; 280 + } 281 + 282 + let shape t = t.shape 283 + let fortran_order t = t.fortran_order 284 + 285 + let get_u16_le s off = 286 + Char.code (String.get s off) lor (Char.code (String.get s (off + 1)) lsl 8) 287 + 288 + let get_u32_le s off = 289 + Char.code (String.get s off) 290 + lor (Char.code (String.get s (off + 1)) lsl 8) 291 + lor (Char.code (String.get s (off + 2)) lsl 16) 292 + lor (Char.code (String.get s (off + 3)) lsl 24) 293 + 294 + (* Minimal parser for the Python dict header. Not a general Python parser — 295 + just handles the specific format numpy writes. *) 296 + 297 + let parse_descr s = 298 + match s with 299 + | "|i1" -> Ok (D Int8) 300 + | "|u1" -> Ok (D Uint8) 301 + | "<f4" -> Ok (D Float32) 302 + | "<f8" -> Ok (D Float64) 303 + | _ -> Error (Printf.sprintf "unsupported dtype: %s" s) 304 + 305 + let find_quoted_value header key = 306 + (* Find 'key': 'value' in the header string *) 307 + let pattern = Printf.sprintf "'%s': '" key in 308 + match String.split_on_char '\'' header with 309 + | _ -> 310 + (match Str.search_forward (Str.regexp_string pattern) header 0 with 311 + | exception Not_found -> Error (Printf.sprintf "key '%s' not found" key) 312 + | i -> 313 + let start = i + String.length pattern in 314 + (match String.index_from_opt header start '\'' with 315 + | None -> Error (Printf.sprintf "unterminated value for '%s'" key) 316 + | Some end_ -> Ok (String.sub header start (end_ - start)))) 317 + 318 + let find_bool_value header key = 319 + let pattern_true = Printf.sprintf "'%s': True" key in 320 + let pattern_false = Printf.sprintf "'%s': False" key in 321 + if String.length header = 0 then Error (Printf.sprintf "key '%s' not found" key) 322 + else 323 + let has_true = 324 + match Str.search_forward (Str.regexp_string pattern_true) header 0 with 325 + | _ -> true | exception Not_found -> false 326 + in 327 + let has_false = 328 + match Str.search_forward (Str.regexp_string pattern_false) header 0 with 329 + | _ -> true | exception Not_found -> false 330 + in 331 + if has_true then Ok true 332 + else if has_false then Ok false 333 + else Error (Printf.sprintf "key '%s' not found" key) 334 + 335 + let parse_shape_string s = 336 + (* Parse "(1139, 729, 128)" or "(1139, 729)" *) 337 + let s = String.trim s in 338 + let len = String.length s in 339 + if len < 2 || s.[0] <> '(' || s.[len - 1] <> ')' then 340 + Error "shape: expected parenthesized tuple" 341 + else 342 + let inner = String.sub s 1 (len - 2) in 343 + let parts = String.split_on_char ',' inner in 344 + let parts = List.filter_map (fun p -> 345 + let p = String.trim p in 346 + if String.length p = 0 then None else Some p 347 + ) parts in 348 + try Ok (Array.of_list (List.map int_of_string parts)) 349 + with Failure _ -> Error "shape: non-integer dimension" 350 + 351 + let find_shape header = 352 + let key = "'shape': " in 353 + match Str.search_forward (Str.regexp_string key) header 0 with 354 + | exception Not_found -> Error "key 'shape' not found" 355 + | i -> 356 + let start = i + String.length key in 357 + (* Find matching closing paren *) 358 + (match String.index_from_opt header start ')' with 359 + | None -> Error "shape: unterminated tuple" 360 + | Some end_ -> 361 + parse_shape_string (String.sub header start (end_ - start + 1))) 362 + 363 + let of_string s = 364 + let len = String.length s in 365 + if len < 10 then Error "too short to be a .npy file" 366 + else if String.sub s 0 6 <> "\x93NUMPY" then Error "invalid .npy magic bytes" 367 + else 368 + let major = Char.code s.[6] in 369 + let _minor = Char.code s.[7] in 370 + let header_len, header_offset = 371 + match major with 372 + | 1 -> (get_u16_le s 8, 10) 373 + | 2 -> (get_u32_le s 8, 12) 374 + | v -> (0, 0) |> fun _ -> failwith (Printf.sprintf "unsupported .npy version %d" v) 375 + in 376 + let data_offset = header_offset + header_len in 377 + if len < data_offset then Error "file truncated: header extends past end" 378 + else 379 + let header = String.sub s header_offset header_len in 380 + match find_descr header, find_bool_value header "fortran_order", find_shape header with 381 + | Ok dtype, Ok fortran_order, Ok shape -> 382 + let data = String.sub s data_offset (len - data_offset) in 383 + Ok { shape; fortran_order; dtype; data } 384 + | Error e, _, _ | _, Error e, _ | _, _, Error e -> Error e 385 + 386 + and find_descr header = 387 + match find_quoted_value header "descr" with 388 + | Error e -> Error e 389 + | Ok descr_str -> parse_descr descr_str 390 + 391 + let data_int8 t = 392 + match t.dtype with 393 + | D Int8 -> 394 + let ba = Bigarray.Array1.create Bigarray.int8_signed Bigarray.c_layout (String.length t.data) in 395 + for i = 0 to String.length t.data - 1 do 396 + (* int8 is signed: interpret byte as signed *) 397 + let b = Char.code t.data.[i] in 398 + let v = if b > 127 then b - 256 else b in 399 + Bigarray.Array1.set ba i v 400 + done; 401 + Some ba 402 + | _ -> None 403 + 404 + let data_uint8 t = 405 + match t.dtype with 406 + | D Uint8 -> 407 + let ba = Bigarray.Array1.create Bigarray.int8_unsigned Bigarray.c_layout (String.length t.data) in 408 + for i = 0 to String.length t.data - 1 do 409 + Bigarray.Array1.set ba i (Char.code t.data.[i]) 410 + done; 411 + Some ba 412 + | _ -> None 413 + 414 + let data_float32 t = 415 + match t.dtype with 416 + | D Float32 -> 417 + let n = String.length t.data / 4 in 418 + let ba = Bigarray.Array1.create Bigarray.float32 Bigarray.c_layout n in 419 + for i = 0 to n - 1 do 420 + let bits = Int32.of_int (get_u32_le t.data (i * 4)) in 421 + Bigarray.Array1.set ba i (Int32.float_of_bits bits) 422 + done; 423 + Some ba 424 + | _ -> None 425 + 426 + let data_float64 t = 427 + match t.dtype with 428 + | D Float64 -> 429 + let n = String.length t.data / 8 in 430 + let ba = Bigarray.Array1.create Bigarray.float64 Bigarray.c_layout n in 431 + for i = 0 to n - 1 do 432 + let lo = Int64.of_int (get_u32_le t.data (i * 8)) in 433 + let hi = Int64.of_int (get_u32_le t.data (i * 8 + 4)) in 434 + let bits = Int64.logor lo (Int64.shift_left hi 32) in 435 + Bigarray.Array1.set ba i (Int64.float_of_bits bits) 436 + done; 437 + Some ba 438 + | _ -> None 439 + ``` 440 + 441 + Note: This uses `Str` for simple string matching. Add it to the dune file: 442 + 443 + ``` 444 + tessera-npy/lib/dune 445 + ``` 446 + ```dune 447 + (library 448 + (name npy) 449 + (public_name tessera-npy) 450 + (libraries str)) 451 + ``` 452 + 453 + **Step 4: Run tests to verify they pass** 454 + 455 + Run: `cd ~/workspace/mono && opam exec -- dune test -p tessera-npy 2>&1` 456 + Expected: All 4 tests pass. 457 + 458 + **Step 5: Commit** 459 + 460 + ```bash 461 + git add tessera-npy/ 462 + git commit -m "tessera-npy: implement header parsing with tests" 463 + ``` 464 + 465 + --- 466 + 467 + ### Task 3: Data access tests 468 + 469 + **Files:** 470 + - Modify: `tessera-npy/test/test_npy.ml` 471 + 472 + **Step 1: Write failing tests for data access** 473 + 474 + Add these tests to the existing test file, adding them to the test list: 475 + 476 + ```ocaml 477 + let test_int8_data () = 478 + (* 2x3 array: [[1, -1, 127], [-128, 0, 42]] *) 479 + let data = String.init 6 (fun i -> 480 + Char.chr ([| 1; 255; 127; 128; 0; 42 |].(i)) 481 + ) in 482 + let npy = make_npy_v1 ~descr:"|i1" ~fortran_order:false ~shape:[2; 3] data in 483 + match Npy.of_string npy with 484 + | Error e -> Alcotest.fail e 485 + | Ok t -> 486 + (match Npy.data_int8 t with 487 + | None -> Alcotest.fail "expected int8 data" 488 + | Some ba -> 489 + Alcotest.(check int) "elem 0" 1 (Bigarray.Array1.get ba 0); 490 + Alcotest.(check int) "elem 1" (-1) (Bigarray.Array1.get ba 1); 491 + Alcotest.(check int) "elem 2" 127 (Bigarray.Array1.get ba 2); 492 + Alcotest.(check int) "elem 3" (-128) (Bigarray.Array1.get ba 3); 493 + Alcotest.(check int) "elem 4" 0 (Bigarray.Array1.get ba 4); 494 + Alcotest.(check int) "elem 5" 42 (Bigarray.Array1.get ba 5)) 495 + 496 + let test_float32_data () = 497 + (* Single float32 value: 3.14 *) 498 + let bits = Int32.bits_of_float 3.14 in 499 + let data = String.init 4 (fun i -> 500 + Char.chr (Int32.to_int (Int32.logand (Int32.shift_right_logical bits (i * 8)) 0xffl)) 501 + ) in 502 + let npy = make_npy_v1 ~descr:"<f4" ~fortran_order:false ~shape:[1] data in 503 + match Npy.of_string npy with 504 + | Error e -> Alcotest.fail e 505 + | Ok t -> 506 + (match Npy.data_float32 t with 507 + | None -> Alcotest.fail "expected float32 data" 508 + | Some ba -> 509 + let v = Bigarray.Array1.get ba 0 in 510 + (* float32 precision: check within epsilon *) 511 + if Float.abs (v -. 3.14) > 0.001 then 512 + Alcotest.failf "expected ~3.14, got %f" v) 513 + 514 + let test_wrong_dtype_returns_none () = 515 + let npy = make_npy_v1 ~descr:"|i1" ~fortran_order:false ~shape:[2] "\x00\x01" in 516 + match Npy.of_string npy with 517 + | Error e -> Alcotest.fail e 518 + | Ok t -> 519 + Alcotest.(check bool) "float32 is None" true (Option.is_none (Npy.data_float32 t)); 520 + Alcotest.(check bool) "int8 is Some" true (Option.is_some (Npy.data_int8 t)) 521 + ``` 522 + 523 + Add to the test list: 524 + ```ocaml 525 + ( "data", 526 + [ 527 + Alcotest.test_case "int8 data" `Quick test_int8_data; 528 + Alcotest.test_case "float32 data" `Quick test_float32_data; 529 + Alcotest.test_case "wrong dtype returns none" `Quick test_wrong_dtype_returns_none; 530 + ] ); 531 + ``` 532 + 533 + **Step 2: Run tests** 534 + 535 + Run: `cd ~/workspace/mono && opam exec -- dune test -p tessera-npy 2>&1` 536 + Expected: All 7 tests pass (the implementation from Task 2 already handles data access). 537 + 538 + **Step 3: Commit** 539 + 540 + ```bash 541 + git add tessera-npy/ 542 + git commit -m "tessera-npy: add data access tests" 543 + ``` 544 + 545 + --- 546 + 547 + ### Task 4: Test against real GeoTessera tiles 548 + 549 + **Files:** 550 + - Create: `tessera-npy/test/test_real_tiles.ml` 551 + - Modify: `tessera-npy/test/dune` 552 + 553 + This test downloads actual GeoTessera tiles and validates parsing. It's a slow test (`\`Slow`) that requires network access. 554 + 555 + **Step 1: Download test fixtures** 556 + 557 + Run: 558 + ```bash 559 + mkdir -p ~/workspace/mono/tessera-npy/test/fixtures 560 + curl -s 'https://dl2.geotessera.org/v1/global_0.1_degree_representation/2024/grid_0.15_52.05/grid_0.15_52.05_scales.npy' -o ~/workspace/mono/tessera-npy/test/fixtures/scales.npy 561 + # Embedding file is ~100MB, grab just a small tile instead. First, find a small one: 562 + # For testing, create a small synthetic .npy from the real header format 563 + ``` 564 + 565 + Actually, the embedding file is ~100MB which is too large for a test fixture. Instead, create a Python script to generate small test fixtures: 566 + 567 + Run: 568 + ```bash 569 + python3 -c " 570 + import numpy as np 571 + # Small int8 embedding-like array 572 + emb = np.random.randint(-128, 127, size=(10, 8, 128), dtype=np.int8) 573 + np.save('$HOME/workspace/mono/tessera-npy/test/fixtures/small_embedding.npy', emb) 574 + # Small float32 scales 575 + scales = np.random.rand(10, 8).astype(np.float32) * 0.1 576 + np.save('$HOME/workspace/mono/tessera-npy/test/fixtures/small_scales.npy', scales) 577 + # 1D float64 578 + arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float64) 579 + np.save('$HOME/workspace/mono/tessera-npy/test/fixtures/float64_1d.npy', arr) 580 + # uint8 581 + arr = np.array([[0, 128, 255], [1, 127, 254]], dtype=np.uint8) 582 + np.save('$HOME/workspace/mono/tessera-npy/test/fixtures/uint8_2d.npy', arr) 583 + print('Generated test fixtures') 584 + " 585 + ``` 586 + 587 + **Step 2: Write tests using fixtures** 588 + 589 + ``` 590 + tessera-npy/test/test_real_tiles.ml 591 + ``` 592 + ```ocaml 593 + let read_file path = 594 + let ic = open_in_bin path in 595 + let len = in_channel_length ic in 596 + let buf = Bytes.create len in 597 + really_input ic buf 0 len; 598 + close_in ic; 599 + Bytes.to_string buf 600 + 601 + let fixture name = 602 + let dir = Sys.getenv_opt "NPY_TEST_FIXTURES" 603 + |> Option.value ~default:"tessera-npy/test/fixtures" 604 + in 605 + Filename.concat dir name 606 + 607 + let test_small_embedding () = 608 + let data = read_file (fixture "small_embedding.npy") in 609 + match Npy.of_string data with 610 + | Error e -> Alcotest.fail e 611 + | Ok t -> 612 + Alcotest.(check (array int)) "shape" [|10; 8; 128|] (Npy.shape t); 613 + Alcotest.(check bool) "fortran_order" false (Npy.fortran_order t); 614 + (match Npy.data_int8 t with 615 + | None -> Alcotest.fail "expected int8 data" 616 + | Some ba -> 617 + Alcotest.(check int) "length" (10 * 8 * 128) (Bigarray.Array1.dim ba)) 618 + 619 + let test_small_scales () = 620 + let data = read_file (fixture "small_scales.npy") in 621 + match Npy.of_string data with 622 + | Error e -> Alcotest.fail e 623 + | Ok t -> 624 + Alcotest.(check (array int)) "shape" [|10; 8|] (Npy.shape t); 625 + (match Npy.data_float32 t with 626 + | None -> Alcotest.fail "expected float32 data" 627 + | Some ba -> 628 + Alcotest.(check int) "length" (10 * 8) (Bigarray.Array1.dim ba); 629 + (* All values should be between 0 and 0.1 *) 630 + for i = 0 to Bigarray.Array1.dim ba - 1 do 631 + let v = Bigarray.Array1.get ba i in 632 + if v < 0.0 || v > 0.1 then 633 + Alcotest.failf "scale value out of range: %f" v 634 + done) 635 + 636 + let test_float64_1d () = 637 + let data = read_file (fixture "float64_1d.npy") in 638 + match Npy.of_string data with 639 + | Error e -> Alcotest.fail e 640 + | Ok t -> 641 + Alcotest.(check (array int)) "shape" [|5|] (Npy.shape t); 642 + (match Npy.data_float64 t with 643 + | None -> Alcotest.fail "expected float64 data" 644 + | Some ba -> 645 + for i = 0 to 4 do 646 + let expected = Float.of_int (i + 1) in 647 + let actual = Bigarray.Array1.get ba i in 648 + if Float.abs (actual -. expected) > 1e-10 then 649 + Alcotest.failf "elem %d: expected %f, got %f" i expected actual 650 + done) 651 + 652 + let test_uint8_2d () = 653 + let data = read_file (fixture "uint8_2d.npy") in 654 + match Npy.of_string data with 655 + | Error e -> Alcotest.fail e 656 + | Ok t -> 657 + Alcotest.(check (array int)) "shape" [|2; 3|] (Npy.shape t); 658 + (match Npy.data_uint8 t with 659 + | None -> Alcotest.fail "expected uint8 data" 660 + | Some ba -> 661 + Alcotest.(check int) "elem 0" 0 (Bigarray.Array1.get ba 0); 662 + Alcotest.(check int) "elem 1" 128 (Bigarray.Array1.get ba 1); 663 + Alcotest.(check int) "elem 2" 255 (Bigarray.Array1.get ba 2)) 664 + 665 + let () = 666 + Alcotest.run "tessera-npy-fixtures" 667 + [ 668 + ( "fixtures", 669 + [ 670 + Alcotest.test_case "small embedding" `Quick test_small_embedding; 671 + Alcotest.test_case "small scales" `Quick test_small_scales; 672 + Alcotest.test_case "float64 1d" `Quick test_float64_1d; 673 + Alcotest.test_case "uint8 2d" `Quick test_uint8_2d; 674 + ] ); 675 + ] 676 + ``` 677 + 678 + Update the test dune to add the second test executable: 679 + ``` 680 + tessera-npy/test/dune 681 + ``` 682 + ```dune 683 + (test 684 + (name test_npy) 685 + (libraries tessera-npy alcotest)) 686 + 687 + (test 688 + (name test_real_tiles) 689 + (libraries tessera-npy alcotest)) 690 + ``` 691 + 692 + **Step 3: Run tests** 693 + 694 + Run: `cd ~/workspace/mono && opam exec -- dune test -p tessera-npy 2>&1` 695 + Expected: All 11 tests pass (7 from test_npy + 4 from test_real_tiles). 696 + 697 + **Step 4: Commit** 698 + 699 + ```bash 700 + git add tessera-npy/ 701 + git commit -m "tessera-npy: add fixture-based tests with real numpy dtypes" 702 + ``` 703 + 704 + --- 705 + 706 + ### Task 5: Review and refine 707 + 708 + **Step 1: Review the implementation** 709 + 710 + Check for: 711 + - Does `Str` dependency add unnecessary weight? If so, replace with manual string searching (no regex needed — we're just finding literal substrings). 712 + - Is the `.mli` interface clean and well-documented? 713 + - Are error messages helpful? 714 + - Edge cases: empty shape `()` (scalar), single-element shape `(N,)` with trailing comma. 715 + 716 + **Step 2: Add scalar and 1D trailing-comma edge case tests** 717 + 718 + Add to `test_npy.ml`: 719 + 720 + ```ocaml 721 + let test_parse_1d_trailing_comma () = 722 + (* numpy writes 1D shapes as "(5,)" with trailing comma *) 723 + let header = "{'descr': '|i1', 'fortran_order': False, 'shape': (5,), }" in 724 + let npy = make_npy_v1_raw header "\x00\x00\x00\x00\x00" in 725 + match Npy.of_string npy with 726 + | Error e -> Alcotest.fail e 727 + | Ok t -> 728 + Alcotest.(check (array int)) "shape" [|5|] (Npy.shape t) 729 + ``` 730 + 731 + This requires a `make_npy_v1_raw` helper that takes the raw header string. Add it near the top of the test file: 732 + 733 + ```ocaml 734 + let make_npy_v1_raw header data = 735 + let prefix_len = 6 + 2 + 2 in 736 + let raw_header_len = String.length header + 1 in 737 + let padded_len = 738 + let total = prefix_len + raw_header_len in 739 + let rem = total mod 64 in 740 + if rem = 0 then raw_header_len else raw_header_len + (64 - rem) 741 + in 742 + let buf = Buffer.create (prefix_len + padded_len + String.length data) in 743 + Buffer.add_string buf "\x93NUMPY\x01\x00"; 744 + Buffer.add_char buf (Char.chr (padded_len land 0xff)); 745 + Buffer.add_char buf (Char.chr ((padded_len lsr 8) land 0xff)); 746 + Buffer.add_string buf header; 747 + for _ = 1 to padded_len - raw_header_len do 748 + Buffer.add_char buf ' ' 749 + done; 750 + Buffer.add_char buf '\n'; 751 + Buffer.add_string buf data; 752 + Buffer.contents buf 753 + ``` 754 + 755 + **Step 3: Consider replacing `Str` with manual string search** 756 + 757 + `Str` is a standard library module but uses global state which is not ideal. Replace `Str.search_forward (Str.regexp_string pattern)` with a simple `find_substring` function: 758 + 759 + ```ocaml 760 + let find_substring haystack needle = 761 + let nlen = String.length needle in 762 + let hlen = String.length haystack in 763 + let rec search i = 764 + if i + nlen > hlen then None 765 + else if String.sub haystack i nlen = needle then Some i 766 + else search (i + 1) 767 + in 768 + search 0 769 + ``` 770 + 771 + Then remove the `str` library from the dune file. 772 + 773 + **Step 4: Run tests, commit** 774 + 775 + Run: `cd ~/workspace/mono && opam exec -- dune test -p tessera-npy 2>&1` 776 + Expected: All tests pass. 777 + 778 + ```bash 779 + git add tessera-npy/ 780 + git commit -m "tessera-npy: edge cases, remove Str dependency" 781 + ``` 782 + 783 + --- 784 + 785 + ### Task 6: Add .mli documentation and opam file 786 + 787 + **Files:** 788 + - Modify: `tessera-npy/lib/npy.mli` — add comprehensive documentation 789 + - Verify: `tessera-npy/tessera-npy.opam` is auto-generated 790 + 791 + **Step 1: Update the .mli with proper docs** 792 + 793 + Refine the `.mli` to include usage examples and clear documentation. The interface from Task 1 is already a good starting point — ensure all functions have docstrings explaining edge cases (e.g., data accessors return `None` for dtype mismatch, `of_string` error cases). 794 + 795 + **Step 2: Generate opam file** 796 + 797 + Run: `cd ~/workspace/mono && opam exec -- dune build tessera-npy/tessera-npy.opam` 798 + 799 + Verify the file exists and looks reasonable. 800 + 801 + **Step 3: Final test run** 802 + 803 + Run: `cd ~/workspace/mono && opam exec -- dune build -p tessera-npy && opam exec -- dune test -p tessera-npy` 804 + Expected: Build and all tests pass. 805 + 806 + **Step 4: Commit** 807 + 808 + ```bash 809 + git add tessera-npy/ 810 + git commit -m "tessera-npy: polish docs and generate opam file" 811 + ```

+4

test-pipeline/dune

··· 1 + (test 2 + (name test_pipeline) 3 + (libraries tessera-npy tessera-linalg tessera-geotessera tessera-viz alcotest) 4 + (deps (glob_files fixtures/*.npy)))

+2

test-pipeline/dune-project

··· 1 + (lang dune 3.17) 2 + (name test-pipeline)

test-pipeline/fixtures/tile1_emb.npy

This is a binary file and will not be displayed.

test-pipeline/fixtures/tile1_scales.npy

This is a binary file and will not be displayed.

test-pipeline/fixtures/tile2_emb.npy

This is a binary file and will not be displayed.

test-pipeline/fixtures/tile2_scales.npy

This is a binary file and will not be displayed.

+275

test-pipeline/test_pipeline.ml

··· 1 + (** End-to-end pipeline test exercising all four tessera libraries together: 2 + npy parsing → dequantization → PCA → kNN → visualization *) 3 + 4 + let read_file path = 5 + let ic = open_in_bin path in 6 + let len = in_channel_length ic in 7 + let buf = Bytes.create len in 8 + really_input ic buf 0 len; 9 + close_in ic; 10 + Bytes.to_string buf 11 + 12 + let fixture name = Filename.concat "fixtures" name 13 + 14 + (* ---- Test 1: Full pipeline with synthetic tiles ---- *) 15 + 16 + let test_full_pipeline () = 17 + (* Step 1: Parse .npy files *) 18 + let tile1_emb = Npy.of_string (read_file (fixture "tile1_emb.npy")) |> Result.get_ok in 19 + let tile1_scales = Npy.of_string (read_file (fixture "tile1_scales.npy")) |> Result.get_ok in 20 + let tile2_emb = Npy.of_string (read_file (fixture "tile2_emb.npy")) |> Result.get_ok in 21 + let tile2_scales = Npy.of_string (read_file (fixture "tile2_scales.npy")) |> Result.get_ok in 22 + 23 + Printf.printf "Tile 1 emb shape: %s\n" 24 + (String.concat "x" (Array.to_list (Array.map string_of_int (Npy.shape tile1_emb)))); 25 + Printf.printf "Tile 1 scales shape: %s\n" 26 + (String.concat "x" (Array.to_list (Array.map string_of_int (Npy.shape tile1_scales)))); 27 + Printf.printf "Tile 2 emb shape: %s\n" 28 + (String.concat "x" (Array.to_list (Array.map string_of_int (Npy.shape tile2_emb)))); 29 + 30 + (* Verify shapes *) 31 + Alcotest.(check (array int)) "tile1 emb shape" [|10; 10; 128|] (Npy.shape tile1_emb); 32 + Alcotest.(check (array int)) "tile1 scales shape" [|10; 10|] (Npy.shape tile1_scales); 33 + Alcotest.(check (array int)) "tile2 emb shape" [|10; 10; 128|] (Npy.shape tile2_emb); 34 + 35 + (* Step 2: Dequantize each tile *) 36 + let mat1 = Geotessera.dequantize ~embeddings:tile1_emb ~scales:tile1_scales in 37 + let mat2 = Geotessera.dequantize ~embeddings:tile2_emb ~scales:tile2_scales in 38 + 39 + Printf.printf "Tile 1 dequantized: %d rows x %d cols\n" mat1.rows mat1.cols; 40 + Printf.printf "Tile 2 dequantized: %d rows x %d cols\n" mat2.rows mat2.cols; 41 + 42 + Alcotest.(check int) "mat1 rows" 100 mat1.rows; (* 10*10 pixels *) 43 + Alcotest.(check int) "mat1 cols" 128 mat1.cols; 44 + Alcotest.(check int) "mat2 rows" 100 mat2.rows; 45 + Alcotest.(check int) "mat2 cols" 128 mat2.cols; 46 + 47 + (* Step 3: Combine into a single embedding matrix (stacking tiles vertically) *) 48 + let combined = Linalg.create_mat ~rows:200 ~cols:128 in 49 + for i = 0 to 99 do 50 + for j = 0 to 127 do 51 + Linalg.mat_set combined i j (Linalg.mat_get mat1 i j); 52 + Linalg.mat_set combined (i + 100) j (Linalg.mat_get mat2 i j) 53 + done 54 + done; 55 + Printf.printf "Combined embeddings: %d rows x %d cols\n" combined.rows combined.cols; 56 + 57 + (* Step 4: PCA to 3 components *) 58 + let pca_model = Linalg.pca_fit combined ~n_components:3 in 59 + let projected = Linalg.pca_transform pca_model combined in 60 + 61 + Printf.printf "PCA projected: %d rows x %d cols\n" projected.rows projected.cols; 62 + Alcotest.(check int) "projected rows" 200 projected.rows; 63 + Alcotest.(check int) "projected cols" 3 projected.cols; 64 + 65 + (* Verify PCA separates the two clusters *) 66 + let mean_pc1_tile1 = 67 + let sum = ref 0.0 in 68 + for i = 0 to 99 do sum := !sum +. Linalg.mat_get projected i 0 done; 69 + !sum /. 100.0 70 + in 71 + let mean_pc1_tile2 = 72 + let sum = ref 0.0 in 73 + for i = 100 to 199 do sum := !sum +. Linalg.mat_get projected i 0 done; 74 + !sum /. 100.0 75 + in 76 + Printf.printf "Mean PC1 tile1: %f, tile2: %f\n" mean_pc1_tile1 mean_pc1_tile2; 77 + let separation = Float.abs (mean_pc1_tile1 -. mean_pc1_tile2) in 78 + Printf.printf "PCA separation: %f\n" separation; 79 + if separation < 0.1 then 80 + Alcotest.failf "PCA did not separate clusters: separation = %f" separation; 81 + 82 + (* Step 5: PCA visualization → RGBA *) 83 + let pca_image = Viz.pca_to_rgba projected ~width:10 ~height:20 in 84 + Printf.printf "PCA image: %dx%d, data length: %d\n" 85 + pca_image.width pca_image.height (Bigarray.Array1.dim pca_image.data); 86 + Alcotest.(check int) "image width" 10 pca_image.width; 87 + Alcotest.(check int) "image height" 20 pca_image.height; 88 + Alcotest.(check int) "image data length" (10 * 20 * 4) (Bigarray.Array1.dim pca_image.data); 89 + 90 + (* Verify alpha channel is 255 for all pixels *) 91 + for i = 0 to 10 * 20 - 1 do 92 + let a = Bigarray.Array1.get pca_image.data (i * 4 + 3) in 93 + if a <> 255 then Alcotest.failf "pixel %d alpha = %d, expected 255" i a 94 + done; 95 + 96 + (* Step 6: kNN classification *) 97 + (* Use 5 labeled points from each tile as training data *) 98 + let n_train = 10 in 99 + let train = Linalg.create_mat ~rows:n_train ~cols:3 in 100 + let train_labels = Array.make n_train 0 in 101 + (* First 5: from tile 1 (class 0 = "water") *) 102 + for i = 0 to 4 do 103 + for j = 0 to 2 do 104 + Linalg.mat_set train i j (Linalg.mat_get projected (i * 10) j) 105 + done; 106 + train_labels.(i) <- 0 107 + done; 108 + (* Next 5: from tile 2 (class 1 = "land") *) 109 + for i = 0 to 4 do 110 + for j = 0 to 2 do 111 + Linalg.mat_set train (i + 5) j (Linalg.mat_get projected (100 + i * 10) j) 112 + done; 113 + train_labels.(i + 5) <- 1 114 + done; 115 + 116 + let knn_model = Linalg.knn_fit ~embeddings:train ~labels:train_labels in 117 + let result = Linalg.knn_predict knn_model ~k:3 projected in 118 + 119 + Printf.printf "kNN predictions: %d values\n" (Array.length result.predictions); 120 + Alcotest.(check int) "prediction count" 200 (Array.length result.predictions); 121 + 122 + (* Count class assignments *) 123 + let n_water = Array.fold_left (fun acc p -> if p = 0 then acc + 1 else acc) 0 result.predictions in 124 + let n_land = Array.fold_left (fun acc p -> if p = 1 then acc + 1 else acc) 0 result.predictions in 125 + Printf.printf "Classifications: %d water, %d land\n" n_water n_land; 126 + 127 + (* Tile 1 pixels (0-99) should mostly be class 0, tile 2 (100-199) mostly class 1 *) 128 + let tile1_correct = 129 + let count = ref 0 in 130 + for i = 0 to 99 do if result.predictions.(i) = 0 then incr count done; 131 + !count 132 + in 133 + let tile2_correct = 134 + let count = ref 0 in 135 + for i = 100 to 199 do if result.predictions.(i) = 1 then incr count done; 136 + !count 137 + in 138 + Printf.printf "Tile 1 correct: %d/100, Tile 2 correct: %d/100\n" tile1_correct tile2_correct; 139 + if tile1_correct < 80 then 140 + Alcotest.failf "tile 1 accuracy too low: %d/100" tile1_correct; 141 + if tile2_correct < 80 then 142 + Alcotest.failf "tile 2 accuracy too low: %d/100" tile2_correct; 143 + 144 + (* Check confidence values are reasonable *) 145 + let avg_confidence = 146 + Array.fold_left ( +. ) 0.0 result.confidences /. Float.of_int (Array.length result.confidences) 147 + in 148 + Printf.printf "Average confidence: %f\n" avg_confidence; 149 + if avg_confidence < 0.5 then 150 + Alcotest.failf "average confidence too low: %f" avg_confidence; 151 + 152 + (* Step 7: Classification visualization → RGBA *) 153 + let colors = [(0, Viz.{ r = 0; g = 0; b = 255 }); (1, Viz.{ r = 0; g = 255; b = 0 })] in 154 + let class_image = Viz.classification_to_rgba 155 + ~predictions:result.predictions ~colors ~width:10 ~height:20 () in 156 + 157 + Printf.printf "Classification image: %dx%d\n" class_image.width class_image.height; 158 + Alcotest.(check int) "class image data length" (10 * 20 * 4) (Bigarray.Array1.dim class_image.data); 159 + 160 + (* Verify some pixels have the expected colors *) 161 + (* Pixel 0 should be water (blue) if classified correctly *) 162 + if result.predictions.(0) = 0 then begin 163 + let r = Bigarray.Array1.get class_image.data 0 in 164 + let g = Bigarray.Array1.get class_image.data 1 in 165 + let b = Bigarray.Array1.get class_image.data 2 in 166 + Alcotest.(check int) "water R" 0 r; 167 + Alcotest.(check int) "water G" 0 g; 168 + Alcotest.(check int) "water B" 255 b 169 + end; 170 + 171 + Printf.printf "\n=== Full pipeline test PASSED ===\n" 172 + 173 + (* ---- Test 2: Geotessera fetch_mosaic_sync with mock ---- *) 174 + 175 + let test_fetch_mosaic_mock () = 176 + (* Mock fetch that returns our fixture data based on URL *) 177 + let tile1_emb_bytes = read_file (fixture "tile1_emb.npy") in 178 + let tile1_scales_bytes = read_file (fixture "tile1_scales.npy") in 179 + 180 + let fetch url = 181 + if String.length url > 0 then begin 182 + (* Return tile1 data for any request *) 183 + if String.contains url 's' && String.sub url (String.length url - 11) 11 = "_scales.npy" then 184 + tile1_scales_bytes 185 + else 186 + tile1_emb_bytes 187 + end else 188 + failwith "empty URL" 189 + in 190 + 191 + let bbox = Geotessera.{ min_lon = 0.0; min_lat = 52.0; max_lon = 0.05; max_lat = 52.05 } in 192 + let mosaic, h, w = Geotessera.fetch_mosaic_sync ~fetch ~year:2024 bbox in 193 + 194 + Printf.printf "Mosaic from fetch: %d rows x %d cols, %dx%d pixels\n" mosaic.rows mosaic.cols h w; 195 + Alcotest.(check int) "mosaic cols" 128 mosaic.cols; 196 + Alcotest.(check bool) "mosaic has data" true (mosaic.rows > 0); 197 + 198 + (* Run PCA on the fetched mosaic *) 199 + let pca = Linalg.pca_fit mosaic ~n_components:3 in 200 + let proj = Linalg.pca_transform pca mosaic in 201 + Printf.printf "PCA on mosaic: %d rows x %d cols\n" proj.rows proj.cols; 202 + Alcotest.(check int) "proj cols" 3 proj.cols; 203 + 204 + Printf.printf "\n=== Fetch mosaic mock test PASSED ===\n" 205 + 206 + (* ---- Test 3: Real tile from GeoTessera server (slow) ---- *) 207 + 208 + let test_real_tile () = 209 + (* Try to fetch a real tile — skip if no network *) 210 + let url = "https://dl2.geotessera.org/v1/global_0.1_degree_representation/2024/grid_0.15_52.05/grid_0.15_52.05_scales.npy" in 211 + Printf.printf "Attempting to fetch: %s\n" url; 212 + let can_fetch = 213 + try 214 + let _ic = Unix.open_connection (Unix.ADDR_INET (Unix.inet_addr_of_string "0.0.0.0", 0)) in 215 + (* Can't actually test connectivity this way, just try the fetch *) 216 + true 217 + with _ -> true (* assume network available *) 218 + in 219 + if not can_fetch then 220 + Printf.printf "Skipping real tile test (no network)\n" 221 + else begin 222 + (* Use curl to fetch since we don't have an HTTP library *) 223 + let fetch_with_curl url = 224 + let tmp = Filename.temp_file "geotessera" ".npy" in 225 + let cmd = Printf.sprintf "curl -sf '%s' -o '%s'" url tmp in 226 + let ret = Sys.command cmd in 227 + if ret <> 0 then begin 228 + (try Sys.remove tmp with _ -> ()); 229 + failwith (Printf.sprintf "curl failed for %s (exit %d)" url ret) 230 + end; 231 + let data = read_file tmp in 232 + Sys.remove tmp; 233 + data 234 + in 235 + try 236 + let scales_url = "https://dl2.geotessera.org/v1/global_0.1_degree_representation/2024/grid_0.15_52.05/grid_0.15_52.05_scales.npy" in 237 + Printf.printf "Fetching scales...\n%!"; 238 + let scales_data = fetch_with_curl scales_url in 239 + Printf.printf "Scales: %d bytes\n" (String.length scales_data); 240 + 241 + let scales = Npy.of_string scales_data |> Result.get_ok in 242 + Printf.printf "Scales shape: %s\n" 243 + (String.concat "x" (Array.to_list (Array.map string_of_int (Npy.shape scales)))); 244 + 245 + (* Verify it's a valid 2D float32 array *) 246 + let shape = Npy.shape scales in 247 + Alcotest.(check int) "scales ndim" 2 (Array.length shape); 248 + Alcotest.(check bool) "scales has float32" true (Option.is_some (Npy.data_float32 scales)); 249 + 250 + let h = shape.(0) in 251 + let w = shape.(1) in 252 + Printf.printf "Scales: %dx%d\n" h w; 253 + 254 + (* Check some values are reasonable (scale factors are small positive floats) *) 255 + let ba = Option.get (Npy.data_float32 scales) in 256 + let v0 = Bigarray.Array1.get ba 0 in 257 + Printf.printf "First scale value: %f\n" v0; 258 + if v0 < 0.0 || v0 > 1.0 then 259 + Alcotest.failf "scale value out of range: %f" v0; 260 + 261 + Printf.printf "\n=== Real tile test PASSED ===\n" 262 + with exn -> 263 + Printf.printf "Skipping real tile test (network error: %s)\n" (Printexc.to_string exn) 264 + end 265 + 266 + let () = 267 + Alcotest.run "tessera-pipeline" 268 + [ 269 + ( "full_pipeline", 270 + [ Alcotest.test_case "end to end" `Quick test_full_pipeline ] ); 271 + ( "fetch_mosaic", 272 + [ Alcotest.test_case "mock fetch" `Quick test_fetch_mosaic_mock ] ); 273 + ( "real_tile", 274 + [ Alcotest.test_case "fetch from server" `Slow test_real_tile ] ); 275 + ]

Configure Feed

Configure Feed