A research repository into geolabels, not for wide use yet

init

+6856
+6
.gitignore
_build/
*.install
*.swp
*.swo
*~
.merlin
+1
.ocamlformat
version = 0.26.2
+2
README.md
A playground repository where I experiment with label schemas.
Not for any third-party usage just yet!
+1359
docs/plans/gbif-mapping.md
# GBIF to Terradots Mapping Plan

## 1. GBIF and Darwin Core — Primary Sources

- Darwin Core terms: https://dwc.tdwg.org/terms/
- Darwin Core text guide (DwC-A format): https://dwc.tdwg.org/text/
- GBIF occurrence API: https://techdocs.gbif.org/en/openapi/v1/occurrence
- GBIF species/taxonomy API: https://techdocs.gbif.org/en/openapi/v1/species
- GBIF registry API: https://techdocs.gbif.org/en/openapi/v1/registry
- GBIF download formats: https://techdocs.gbif.org/en/data-use/download-formats
- GBIF download API: https://techdocs.gbif.org/en/data-use/api-downloads
- GBIF occurrence issues and flags: https://techdocs.gbif.org/en/data-use/occurrence-issues-and-flags
- GBIF taxonomy interpretation: https://techdocs.gbif.org/en/data-processing/taxonomy-interpretation
- GBIF temporal interpretation: https://techdocs.gbif.org/en/data-processing/temporal-interpretation
- DwC-A guide (IPT manual): https://ipt.gbif.org/manual/en/ipt/latest/dwca-guide
- OccurrenceSearchParameter Java enum: https://gbif.github.io/gbif-api/apidocs/org/gbif/api/model/occurrence/search/OccurrenceSearchParameter.html
- GBIF multimedia extension: https://rs.gbif.org/extension/gbif/1.0/multimedia.xml

---

## 2. Darwin Core Standard

Darwin Core (DwC) is the vocabulary standard maintained by Biodiversity Information Standards (TDWG) that GBIF uses as its primary exchange format. It defines a set of terms for biodiversity data — each term has a stable URI (e.g., `http://rs.tdwg.org/dwc/terms/occurrenceID`) and a simple local name used in practice (e.g., `occurrenceID`).

### 2.1 Record-level Terms

| Term | Description |
|------|-------------|
| `type` | Dublin Core type (typically `PhysicalObject` or `Event`) |
| `modified` | ISO 8601 date-time the record was last changed by the publisher |
| `language` | Language of the record (`en`, etc.) |
| `license` | URI of the license document (e.g. `http://creativecommons.org/licenses/by/4.0/legalcode`) |
| `rightsHolder` | Person or organisation owning or managing rights over the resource |
| `accessRights` | Information about access restrictions |
| `bibliographicCitation` | Citation for the occurrence record itself |
| `references` | URL to a web page or further information about this record |
| `institutionID` | URI identifying the institution that holds the object |
| `collectionID` | URI identifying the collection |
| `datasetID` | Identifier for the dataset within the publisher's system |
| `institutionCode` | Acronym or name of the institution (e.g. `NHM`, `MO`) |
| `collectionCode` | Acronym or name of the collection within the institution |
| `datasetName` | Name of the dataset from which the record was derived |
| `ownerInstitutionCode` | Institution that owns the specimen when it differs from the holding institution |
| `basisOfRecord` | Nature of the record (see §2.6) |
| `informationWithheld` | Free-text note on withheld information |
| `dataGeneralizations` | Actions taken to make the data less specific (e.g. coordinate generalisation) |
| `dynamicProperties` | Additional data as JSON or similar key-value pairs |

### 2.2 Occurrence Terms

| Term | Description |
|------|-------------|
| `occurrenceID` | Globally unique persistent identifier for the occurrence; publisher-assigned |
| `catalogNumber` | Identifier within the collection (e.g. museum accession number) |
| `recordNumber` | Field number given by the collector at time of recording |
| `recordedBy` | Semicolon-separated list of names of people who made the observation/collection |
| `recordedByID` | Semicolon-separated list of identifiers (e.g. ORCID URIs) for `recordedBy` persons |
| `individualCount` | Number of individuals present at the time of the occurrence |
| `organismQuantity` | Quantity of organisms (e.g. `5%` cover) |
| `organismQuantityType` | Units for `organismQuantity` (e.g. `percentageCover`) |
| `sex` | Sex of the biological individual |
| `lifeStage` | Age or life stage of the organism |
| `reproductiveCondition` | Reproductive condition (e.g. `flowering`, `pregnant`) |
| `caste` | Caste (for eusocial organisms) |
| `behavior` | Behaviour exhibited by the organism at the time of occurrence |
| `vitality` | Indication of survival of the organism (`alive`, `dead`) |
| `establishmentMeans` | How the organism came to be at the location (native, introduced, etc.) |
| `degreeOfEstablishment` | Degree to which organism has established in a given place |
| `pathway` | Process by which an organism arrived at a given place |
| `occurrenceStatus` | `present` or `absent` |
| `preparations` | List of preparations and preservation methods for specimens |
| `disposition` | Current disposition of specimen (e.g. `in collection`, `missing`) |
| `associatedMedia` | List of media associated with the occurrence |
| `associatedOccurrences` | List of identifiers of other occurrences associated with this one |
| `associatedReferences` | List of publication identifiers associated with this occurrence |
| `associatedSequences` | List of genetic sequence identifiers |
| `associatedTaxa` | List of taxa associated with the occurrence |
| `otherCatalogNumbers` | List of other catalog numbers for the same item |
| `occurrenceRemarks` | Comments or notes about the occurrence |

### 2.3 Organism Terms

| Term | Description |
|------|-------------|
| `organismID` | Identifier for the organism (persistent across occurrences of the same individual) |
| `organismName` | Text name for the organism (e.g. band number for a bird) |
| `organismScope` | Description of the kind of organism instance |

### 2.4 Event Terms

| Term | Description |
|------|-------------|
| `eventID` | Identifier for the event (a sampling event or visit) |
| `parentEventID` | Identifier of the parent event (e.g. a campaign containing this visit) |
| `eventType` | Type of the event (e.g. `Survey`, `Transect`) |
| `fieldNumber` | Identifier given to the event in the field |
| `eventDate` | ISO 8601 date or interval when the event occurred (see §2.7) |
| `eventTime` | Time of day of the event |
| `startDayOfYear` | Earliest ordinal day of year (1–366) |
| `endDayOfYear` | Latest ordinal day of year (1–366) |
| `year` | Four-digit year |
| `month` | Integer month (1–12) |
| `day` | Integer day of month |
| `verbatimEventDate` | Original verbatim text of the event date |
| `habitat` | Category or description of the habitat |
| `samplingProtocol` | Names, references, or descriptions of methods used to sample the occurrence |
| `sampleSizeValue` | Numeric value of the sample size |
| `sampleSizeUnit` | Units for the sample size value |
| `samplingEffort` | Amount of effort expended during the event |
| `fieldNotes` | Transcription of field notes or reference to their location |
| `eventRemarks` | Comments or notes about the event |

### 2.5 Location Terms

| Term | Description |
|------|-------------|
| `locationID` | Identifier for the location |
| `higherGeographyID` | Identifier for the broader geographic region |
| `higherGeography` | Combination of geographic place names (broader to more specific) |
| `continent` | Name of the continent |
| `waterBody` | Name of the water body |
| `islandGroup` | Name of the island group |
| `island` | Name of the island |
| `country` | Name of the country |
| `countryCode` | ISO 3166-1 alpha-2 country code |
| `stateProvince` | Name of the state, province, or region |
| `county` | Name of the county, shire, or department |
| `municipality` | Name of the city, town, or municipality |
| `locality` | Specific textual description of the place |
| `verbatimLocality` | Original textual description of the place |
| `verbatimElevation` | Original description of the elevation |
| `minimumElevationInMeters` | Lower limit of elevation range in metres |
| `maximumElevationInMeters` | Upper limit of elevation range in metres |
| `verbatimDepth` | Original description of the depth |
| `minimumDepthInMeters` | Minimum depth below the water surface |
| `maximumDepthInMeters` | Maximum depth below the water surface |
| `decimalLatitude` | Latitude in decimal degrees (WGS84); range −90 to +90 |
| `decimalLongitude` | Longitude in decimal degrees (WGS84); range −180 to +180 |
| `geodeticDatum` | Ellipsoid, geodetic datum, or SRS of the coordinates |
| `coordinateUncertaintyInMeters` | Horizontal radius (in metres) of the smallest circle containing the whole location |
| `coordinatePrecision` | Decimal precision of the coordinates (0.0001 = ~11 m precision) |
| `pointRadiusSpatialFit` | Ratio of the area of the supplied point-radius to the true footprint |
| `verbatimCoordinates` | Original verbatim coordinates |
| `verbatimCoordinateSystem` | Coordinate format of the verbatim coordinates (e.g. `decimal degrees`) |
| `verbatimSRS` | Spatial reference system of the verbatim coordinates |
| `footprintWKT` | Well-Known Text representation of the full footprint of the location |
| `footprintSRS` | SRS for the WKT footprint |
| `footprintSpatialFit` | Ratio of the footprint WKT area to the true footprint |
| `georeferencedBy` | Name(s) of the person(s) who georeferenced the location |
| `georeferencedDate` | Date when the location was georeferenced |
| `georeferenceProtocol` | Description of the method used to determine coordinates |
| `georeferenceSources` | Resources used in the georeference |
| `georeferenceRemarks` | Notes on the georeference |

### 2.6 basisOfRecord Values

The `basisOfRecord` term is a controlled vocabulary indicating the nature of the evidence:

| Value | Meaning |
|-------|---------|
| `HumanObservation` | Observation recorded by a human in the field |
| `MachineObservation` | Observation made by a machine (camera trap, acoustic sensor) |
| `PreservedSpecimen` | A preserved specimen in a collection (herbarium, museum) |
| `FossilSpecimen` | A fossil specimen |
| `LivingSpecimen` | A living specimen in cultivation or captivity |
| `MaterialSample` | A sample (e.g. DNA extract, soil sample) |
| `MaterialCitation` | A citation of occurrence in the literature |
| `Occurrence` | Unclassified occurrence record |

### 2.7 eventDate Format

GBIF follows Darwin Core's ISO 8601-1:2019 convention for `eventDate`. Terradots already adopts this same convention for `event_date`. Supported forms:

| Format | Example |
|--------|---------|
| Year only | `2023` |
| Year-Month | `2023-09` |
| Date | `2023-09-18` |
| Date-time (UTC) | `2023-09-18T13:27:00Z` |
| Date-time with offset | `2023-09-18T13:27:00+05:30` |
| Interval | `2023-09-05/2023-09-18` |

The eventDate records when the observation took place, not when it was entered into a database.

### 2.8 Identification Terms

| Term | Description |
|------|-------------|
| `identificationID` | Identifier for the determination of the taxon |
| `verbatimIdentification` | Taxon identification as originally given by the identifier |
| `identificationQualifier` | Brief phrase to express uncertainty about the identification (e.g. `cf.`, `aff.`) |
| `identifiedBy` | Name(s) of the person who assigned the taxon to the specimen/observation |
| `identifiedByID` | Identifier(s) for `identifiedBy` persons (e.g. ORCID URIs) |
| `dateIdentified` | Date the taxonomic determination was made |
| `identificationReferences` | References used in the determination |
| `identificationVerificationStatus` | Categorical assessment of the quality of the identification |
| `identificationRemarks` | Comments or notes about the identification |
| `typeStatus` | Nomenclatural type status of the specimen |

### 2.9 Taxon Terms

| Term | Description |
|------|-------------|
| `taxonID` | Identifier for the taxon concept in the publisher's system |
| `scientificNameID` | Identifier for the nomenclatural details of the name |
| `acceptedNameUsageID` | Identifier for the accepted name, when the record is a synonym |
| `parentNameUsageID` | Identifier for the direct parent in the classification |
| `taxonConceptID` | Identifier for the taxon concept (as opposed to name) |
| `scientificName` | Full scientific name including authorship |
| `acceptedNameUsage` | The accepted name of the taxon if this record is a synonym |
| `parentNameUsage` | The name of the immediate parent taxon |
| `originalNameUsage` | Protonym or basionym |
| `verbatimTaxonRank` | The taxon rank as it appeared in the original record |
| `taxonRank` | The rank of the most specific name in the scientificName (SPECIES, GENUS, etc.) |
| `kingdom` | Kingdom in the classification |
| `phylum` | Phylum (Division) in the classification |
| `class` | Class in the classification |
| `order` | Order in the classification |
| `superfamily` | Superfamily |
| `family` | Family in the classification |
| `subfamily` | Subfamily |
| `tribe` | Tribe |
| `subtribe` | Subtribe |
| `genus` | Genus in the classification |
| `genericName` | Genus portion of the scientific name |
| `subgenus` | Subgenus in the classification |
| `infragenericEpithet` | Infrageneric epithet |
| `specificEpithet` | Species epithet |
| `infraspecificEpithet` | Infraspecific epithet |
| `cultivarEpithet` | Cultivar name |
| `vernacularName` | Common name |
| `nomenclaturalCode` | Nomenclatural code governing the name (ICNafp, ICZN, etc.) |
| `taxonomicStatus` | Whether the name is accepted or a synonym |
| `nomenclaturalStatus` | Status of the name per the relevant nomenclatural code |
| `taxonRemarks` | Comments or notes about the taxon |

---

## 3. GBIF Occurrence Record Structure

GBIF interprets Darwin Core records from publishers and adds its own enrichment fields. A GBIF occurrence record (as returned by the Occurrence API or in a DwC-A download) contains the Darwin Core fields above plus the following GBIF-specific additions.

### 3.1 GBIF-added Identifiers

| Field | Description |
|-------|-------------|
| `gbifID` (or `key`) | GBIF's own integer identifier for this occurrence record; globally unique |
| `datasetKey` | UUID of the dataset in the GBIF registry |
| `publishingOrgKey` | UUID of the publishing organisation |
| `installationKey` | UUID of the IPT/installation that hosts the dataset |
| `hostingOrganizationKey` | UUID of the hosting organisation |
| `networkKeys` | Array of UUIDs of networks the dataset belongs to |
| `protocol` | Protocol used to harvest the data (`DWC_ARCHIVE`, `EML`, etc.) |

### 3.2 GBIF Taxonomy Backbone Fields

GBIF matches every occurrence to its taxonomic backbone and adds integer keys at each taxonomic rank:

| Field | Description |
|-------|-------------|
| `taxonKey` | GBIF backbone key for the matched taxon at its stated rank |
| `acceptedTaxonKey` | Key of the accepted name (equals `taxonKey` if already accepted) |
| `kingdomKey` | Backbone key for kingdom |
| `phylumKey` | Backbone key for phylum |
| `classKey` | Backbone key for class |
| `orderKey` | Backbone key for order |
| `superfamilyKey` | Backbone key for superfamily |
| `familyKey` | Backbone key for family |
| `subfamilyKey` | Backbone key for subfamily |
| `tribeKey` | Backbone key for tribe |
| `subtribeKey` | Backbone key for subtribe |
| `genusKey` | Backbone key for genus |
| `subgenusKey` | Backbone key for subgenus |
| `speciesKey` | Backbone key for the species (even if record is a subspecies) |
| `acceptedScientificName` | Scientific name of the accepted taxon on the backbone |
| `verbatimScientificName` | Scientific name as originally supplied by publisher |
| `verbatimScientificNameAuthorship` | Authorship as originally supplied |
| `taxonomicStatus` | `ACCEPTED`, `SYNONYM`, `DOUBTFUL`, etc. as determined by GBIF |
| `iucnRedListCategory` | IUCN Red List category if available (LC, NT, VU, EN, CR, EW, EX) |

### 3.3 GBIF Geospatial Fields

| Field | Description |
|-------|-------------|
| `hasCoordinate` | Boolean; whether the record has non-null decimal lat/lon |
| `hasGeospatialIssues` | Boolean; whether any geospatial issue flags are present |
| `distanceFromCentroidInMeters` | Distance from the nearest country/region centroid (centroid-mismatch indicator) |
| `repatriated` | Whether the occurrence comes from a country that does not own the dataset |
| `gbifRegion` | GBIF region in which the occurrence is located (derived from its country) |
| `publishedByGbifRegion` | GBIF region of the publisher |
| `level0Gid`, `level0Name` | GADM level 0 (country) identifier (GID) and name |
| `level1Gid`, `level1Name` | GADM level 1 (state/province) identifier (GID) and name |
| `level2Gid`, `level2Name` | GADM level 2 (county/district) identifier (GID) and name |
| `level3Gid`, `level3Name` | GADM level 3 (municipality) identifier (GID) and name |
| `continent` | Interpreted continent name |
| `publishingCountry` | ISO country code of the publishing organisation |

### 3.4 GBIF Processing Metadata

| Field | Description |
|-------|-------------|
| `lastCrawled` | ISO 8601 timestamp when GBIF last crawled the dataset |
| `lastParsed` | ISO 8601 timestamp when GBIF last parsed this record |
| `lastInterpreted` | ISO 8601 timestamp when GBIF last interpreted this record |
| `crawlId` | Internal crawl batch identifier |
| `isInCluster` | Whether the record has been matched to a cluster of duplicate/related records |
| `isSequenced` | Whether a DNA sequence is associated with this record |
| `isInvasive` | Whether the species is listed as invasive in any source |
| `relativeOrganismQuantity` | Normalised organism quantity across the dataset |
| `projectId` | Identifier for the project associated with the occurrence |

### 3.5 GBIF Issue Fields (Download-specific)

In DwC-A and CSV downloads, issue flags appear in the following columns:

| Field | Description |
|-------|-------------|
| `issue` | Comma-separated list of all issue flags present on the record |
| `taxonomicIssue` | Comma-separated list of taxonomy-related issue flags only |
| `nonTaxonomicIssue` | Comma-separated list of all other issue flags |

In the JSON API, `issues` is an array of string enum values.

### 3.6 Complete Field Example (JSON API)

A real GBIF occurrence record (gbifID: 3034438331) from Xeno-canto via the Netherlands Biodiversity Data Centre:

```json
{
  "key": 3034438331,
  "datasetKey": "...",
  "occurrenceID": "https://data.biodiversitydata.nl/xeno-canto/observation/XC165501",
  "basisOfRecord": "HUMAN_OBSERVATION",
  "occurrenceStatus": "PRESENT",
  "scientificName": "Tephrodornis pondicerianus pondicerianus (Gmelin, 1789)",
  "acceptedScientificName": "Tephrodornis pondicerianus pondicerianus",
  "taxonRank": "SUBSPECIES",
  "taxonomicStatus": "ACCEPTED",
  "kingdom": "Animalia", "kingdomKey": 1,
  "class": "Aves", "classKey": 212,
  "order": "Passeriformes", "orderKey": 729,
  "family": "Tephrodornithidae",
  "genus": "Tephrodornis",
  "species": "Tephrodornis pondicerianus",
  "speciesKey": 2489935,
  "decimalLatitude": 18.3669,
  "decimalLongitude": 73.7512,
  "continent": "ASIA",
  "country": "India",
  "countryCode": "IN",
  "stateProvince": "Maharashtra",
  "eventDate": "2026-01-14",
  "recordedBy": "Sarthak Awhad",
  "behavior": "call",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "issues": ["CONTINENT_DERIVED_FROM_COORDINATES"],
  "media": [
    {
      "type": "StillImage",
      "format": "image/png",
      "identifier": "https://...",
      "title": "Spectrogram ...",
      "created": "2026-01-14",
      "creator": "Sarthak Awhad",
      "license": "http://creativecommons.org/licenses/by-nc-sa/3.0/"
    },
    {
      "type": "Sound",
      "format": "audio/mpeg",
      "identifier": "https://...",
      "title": "Bird call recording"
    }
  ],
  "lastInterpreted": "2026-03-03T11:18:18.520Z",
  "isInCluster": false,
  "isSequenced": false
}
```

---

## 4. GBIF Taxonomic Backbone

### 4.1 Overview

The GBIF Backbone Taxonomy (also called the GBIF Taxonomic Backbone or GBIF Checklist) is a single synthetic checklist derived from approximately 100 authoritative taxonomic sources. It provides a stable reference for resolving taxonomic names used in occurrence records.

**Key URL:** `https://api.gbif.org/v1/species/{nubKey}` returns a backbone taxon record.

### 4.2 Taxon Key Fields

Every taxon in the backbone has a unique integer key (`nubKey` or `usageKey`). When GBIF matches an occurrence to the backbone, it populates keys at each taxonomic rank:

| API field | Meaning |
|-----------|---------|
| `nubKey` | The backbone key (also called `usageKey` in the species API) |
| `taxonKey` | On occurrences: the key for the matched rank; equals `nubKey` |
| `speciesKey` | Key for the species-level ancestor (set even for subspecies records) |
| `genusKey` | Key for genus; `familyKey`; `orderKey`; `classKey`; `phylumKey`; `kingdomKey` |
| `acceptedTaxonKey` | Key of the accepted name if the matched name is a synonym |
| `parentKey` | Key of the immediate taxonomic parent in the backbone |
| `basionymKey` | Key of the original name on which this name is based |
| `nameKey` | Key for the name itself (distinct from the usage in the taxonomy) |

### 4.3 Taxonomic Status Values

| Value | Meaning |
|-------|---------|
| `ACCEPTED` | This is the current accepted name |
| `SYNONYM` | The name is a synonym; use `acceptedTaxonKey` for the accepted name |
| `DOUBTFUL` | Uncertain status; may or may not be accepted |
| `HETEROTYPIC_SYNONYM` | Synonym based on a different type |
| `HOMOTYPIC_SYNONYM` | Synonym based on the same type (objective synonym) |
| `PROPARTE_SYNONYM` | Only a part of the concept is synonymised |
| `MISAPPLIED` | The name has been misapplied to a different taxon |

### 4.4 Taxonomy Matching

GBIF's taxonomy interpretation process:

1. Tries to match using the identifier field if present (`taxonID`, `scientificNameID`, `taxonConceptID`) — this is preferred over name-matching.
2. Falls back to matching the `scientificName` string (with authorship stripped if necessary).
3. If a full species match fails, tries genus, family, and higher ranks.
4. Records the quality of the match in `issues` flags (e.g., `TAXON_MATCH_FUZZY`, `TAXON_MATCH_HIGHERRANK`).

GBIF now supports two taxonomies: the legacy GBIF Backbone and the Catalogue of Life Extended Release (COL XR), the latter integrated through ChecklistBank.

### 4.5 Example Backbone Record (Passer domesticus)

```json
{
  "nubKey": 5231190,
  "taxonID": "gbif:5231190",
  "nameKey": 8290258,
  "parentKey": 2492321,
  "kingdom": "Animalia", "kingdomKey": 1,
  "phylum": "Chordata", "phylumKey": 44,
  "class": "Aves", "classKey": 212,
  "order": "Passeriformes", "orderKey": 729,
  "family": "Passeridae", "familyKey": 5264,
  "genus": "Passer", "genusKey": 2492321,
  "species": "Passer domesticus", "speciesKey": 5231190,
  "scientificName": "Passer domesticus (Linnaeus, 1758)",
  "canonicalName": "Passer domesticus",
  "rank": "SPECIES",
  "taxonomicStatus": "ACCEPTED",
  "numDescendants": 15,
  "datasetKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"
}
```

---

## 5. GBIF Dataset Structure

### 5.1 Dataset Record Fields

A GBIF dataset is published by an organisation and registered in the GBIF Registry. Key fields:

| Field | Description |
|-------|-------------|
| `key` (UUID) | The `datasetKey` referenced on every occurrence in the dataset |
| `doi` | Dataset DOI (e.g. `10.15468/rxbp4w`) — citable identifier |
| `title` | Human-readable dataset name |
| `type` | `OCCURRENCE`, `CHECKLIST`, `SAMPLING_EVENT`, or `METADATA` |
| `subtype` | Further specialisation (e.g. `SPECIMEN`, `OBSERVATION`) |
| `publishingOrganizationKey` | UUID of the organisation that published the dataset |
| `license` | License string (e.g. `http://creativecommons.org/licenses/by/4.0/legalcode`) |
| `description` | Free-text description |
| `language` | Primary language of the metadata |
| `version` | Dataset version |
| `modified` | Date of last modification |
| `pubDate` | Publication date |
| `endpoints` | Array of URLs where the data is available (DwC-A endpoint, EML endpoint) |
| `contacts` | List of people with roles (creator, metadata author, point of contact) |
| `identifiers` | Additional identifiers (DOI, UUID, etc.) |

### 5.2 Publishing Organisation

The `publishingOrganizationKey` links to the GBIF Registry entry for the institution. Example keys:
- `90fd6680-349f-11d8-aa2d-b8a03c50a862` — Missouri Botanical Garden
- ROR identifiers can sometimes be found in organisation records

**API:** `GET https://api.gbif.org/v1/organization/{key}` returns organisation details.

### 5.3 Dataset Networks

Datasets may belong to one or more GBIF networks (e.g., eBird, Ocean Biodiversity Information System, Xeno-canto). The `networkKeys` array on occurrence records links to these.

---

## 6. GBIF API

### 6.1 Base URL

All GBIF REST API calls use: `https://api.gbif.org/v1/`

### 6.2 Occurrence Search API

**Endpoint:** `GET https://api.gbif.org/v1/occurrence/search`

**Key parameters:**

| Parameter | Description | Example |
|-----------|-------------|---------|
| `taxonKey` | Backbone taxon key; includes all descendants and synonyms | `taxonKey=5231190` |
| `scientificName` | Fuzzy name search | `scientificName=Passer+domesticus` |
| `datasetKey` | Filter to a specific dataset UUID | |
| `country` | ISO 3166-1 alpha-2 country code | `country=GB` |
| `continent` | One of AFRICA, ANTARCTICA, ASIA, EUROPE, NORTH_AMERICA, OCEANIA, SOUTH_AMERICA | |
| `geometry` | WKT geometry (POLYGON, MULTIPOLYGON, LINESTRING, POINT) for spatial filter | |
| `geoDistance` | Distance filter: `lat,lon,distance` | `geoDistance=51.5,-0.1,10km` |
| `decimalLatitude` | Latitude range | `decimalLatitude=50,60` |
| `decimalLongitude` | Longitude range | |
| `hasCoordinate` | `true` to require coordinates | `hasCoordinate=true` |
| `hasGeospatialIssue` | `false` to exclude records with coordinate problems | `hasGeospatialIssue=false` |
| `basisOfRecord` | Filter by basis of record | `basisOfRecord=HUMAN_OBSERVATION` |
| `occurrenceStatus` | `PRESENT` or `ABSENT` | |
| `mediaType` | `StillImage`, `MovingImage`, or `Sound` | |
| `year` | Year range | `year=2020,2025` |
| `month` | Month (1–12) | |
| `eventDate` | Date or range in ISO 8601 | `eventDate=2023-01-01,2023-12-31` |
| `institutionCode` | Institution code abbreviation | |
| `collectionCode` | Collection code abbreviation | |
| `catalogNumber` | Catalog number in the collection | |
| `recordedBy` | Name of the recorder | |
| `identifiedBy` | Name of the identifier | |
| `typeStatus` | Type status of the specimen | |
| `occurrenceId` | The publisher's `occurrenceID` value | |
| `gadmGid` | GADM geographic unit identifier | |
| `gadmLevel0Gid` | GADM country-level code | |
| `iucnRedListCategory` | IUCN category (LC, NT, VU, EN, CR, EW, EX) | |
| `isSequenced` | `true` for records with DNA sequences | |
| `isInCluster` | `true` for records matched to a cluster | |
| `establishmentMeans` | How the organism arrived (NATIVE, INTRODUCED, etc.) | |
| `degreeOfEstablishment` | Level of establishment | |
| `pathway` | Pathway of introduction | |
| `kingdomKey`, `phylumKey`, `classKey`, `orderKey`, `familyKey`, `genusKey` | Higher taxon filters by backbone key | |
| `acceptedTaxonKey` | Filter by accepted backbone taxon | |
| `verbatimScientificName` | Match on the verbatim name before GBIF interpretation | |

**Pagination:**

| Parameter | Description | Default / Max |
|-----------|-------------|---------------|
| `offset` | Number of results to skip | 0 |
| `limit` | Maximum results per page | 20 / 300 |

The response `count` field gives the total matching records. The `endOfRecords` boolean signals the last page. Pagination is capped at `offset + limit ≤ 100,000` for the search API; use the download API for larger extracts.

**Response structure:**

```json
{
  "offset": 0,
  "limit": 20,
  "endOfRecords": false,
  "count": 1234567,
  "results": [ { /* occurrence record */ }, ... ]
}
```

### 6.3 Single Occurrence Lookup

`GET https://api.gbif.org/v1/occurrence/{gbifID}` — returns full occurrence JSON.

`GET https://api.gbif.org/v1/occurrence/{gbifID}/verbatim` — returns the uninterpreted Darwin Core record as received from the publisher.

`GET https://api.gbif.org/v1/occurrence/{gbifID}/fragment` — returns the raw fragment as harvested.
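
The pagination rules for the occurrence search API in §6.2 (the `offset` and `limit` parameters, the `endOfRecords` flag, and the `offset + limit ≤ 100,000` cap) could be handled by a loop along the following lines. This is only a sketch: `iter_occurrences`, `fetch_page`, and `fake_fetch` are hypothetical names introduced here for illustration, and the HTTP call is deliberately left to the caller so the paging logic can be exercised offline.

```python
from typing import Any, Callable, Dict, Iterator

MAX_PAGE_SIZE = 300     # maximum `limit` quoted in the pagination table
MAX_WINDOW = 100_000    # search API cap on offset + limit

Page = Dict[str, Any]


def iter_occurrences(
    fetch_page: Callable[[int, int], Page],
    page_size: int = MAX_PAGE_SIZE,
) -> Iterator[Dict[str, Any]]:
    """Yield occurrence records page by page.

    `fetch_page(offset, limit)` is a caller-supplied (hypothetical) function
    that returns one parsed search response with `results` and `endOfRecords`.
    """
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    offset = 0
    while offset < MAX_WINDOW:
        # Shrink the final request so offset + limit never exceeds the cap.
        limit = min(page_size, MAX_WINDOW - offset)
        page = fetch_page(offset, limit)
        results = page.get("results", [])
        yield from results
        # Stop on the last page, or defensively if a page comes back empty.
        if page.get("endOfRecords", True) or not results:
            return
        offset += len(results)


def fake_fetch(offset: int, limit: int) -> Page:
    """Offline stand-in for the search endpoint, serving 7 made-up records."""
    data = [{"key": i} for i in range(7)]
    return {
        "offset": offset,
        "limit": limit,
        "endOfRecords": offset + limit >= len(data),
        "count": len(data),
        "results": data[offset:offset + limit],
    }


keys = [record["key"] for record in iter_occurrences(fake_fetch, page_size=3)]
print(keys)
```

In real use, `fetch_page` would presumably wrap a `GET` on `https://api.gbif.org/v1/occurrence/search` with the chosen filters plus `offset` and `limit`. Once the loop reaches the 100,000-record window it simply stops, which is the point at which the download API becomes the better fit.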

### 6.4 Species (Taxonomy) API

**Endpoint:** `GET https://api.gbif.org/v1/species/{nubKey}` — returns a backbone taxon record.

**Name matching:** `GET https://api.gbif.org/v1/species/match?name=Passer+domesticus` — fuzzy match a name to the backbone.

**Children:** `GET https://api.gbif.org/v1/species/{nubKey}/children` — direct children in the taxonomy.

**Vernacular names:** `GET https://api.gbif.org/v1/species/{nubKey}/vernacularNames`

**Distributions:** `GET https://api.gbif.org/v1/species/{nubKey}/distributions`

### 6.5 Dataset API

`GET https://api.gbif.org/v1/dataset/{datasetKey}` — dataset metadata.

`GET https://api.gbif.org/v1/dataset/search?q=birds` — search datasets.

`GET https://api.gbif.org/v1/organization/{orgKey}` — organisation metadata.

### 6.6 Download API

**Create a download request** (authenticated, POST):

```
POST https://api.gbif.org/v1/occurrence/download/request
Content-Type: application/json
Authorization: Basic {base64(user:password)}
```

```json
{
  "creator": "username",
  "notificationAddresses": ["user@example.com"],
  "format": "DWCA",
  "predicate": {
    "type": "and",
    "predicates": [
      { "type": "equals", "key": "TAXON_KEY", "value": "212" },
      { "type": "equals", "key": "HAS_COORDINATE", "value": "true" },
      { "type": "equals", "key": "HAS_GEOSPATIAL_ISSUE", "value": "false" }
    ]
  }
}
```

**Available formats:** `DWCA`, `SIMPLE_CSV`, `SIMPLE_PARQUET`, `SPECIES_LIST`

**Poll status:** `GET https://api.gbif.org/v1/occurrence/download/{downloadKey}`

When `status = "SUCCEEDED"`, the `downloadLink` field provides the download URL and `doi` provides the citable DOI.

---

## 7. GBIF Data Quality Flags (Issues)

GBIF assigns issue flags to occurrence records during interpretation. These indicate data quality problems. More than 60 flags exist, grouped into the following categories.

### 7.1 Geospatial Issues

| Flag | Severity | Meaning |
|------|----------|---------|
| `ZERO_COORDINATE` | Error | Coordinates are exactly 0°N 0°E — likely a null placeholder |
| `COORDINATE_OUT_OF_RANGE` | Error | Latitude outside −90..90 or longitude outside −180..180 |
| `COORDINATE_INVALID` | Error | Coordinate value cannot be interpreted at all |
| `COORDINATE_ROUNDED` | Info | Original coordinates were rounded to 6 decimal places (~11 cm precision) |
| `COORDINATE_REPROJECTED` | Info | Coordinates successfully converted from a non-WGS84 datum to WGS84 |
| `COORDINATE_REPROJECTION_SUSPICIOUS` | Warning | Reprojection succeeded but caused a shift of more than 0.1° |
| `COORDINATE_REPROJECTION_FAILED` | Error | Cannot reproject from the stated datum to WGS84 |
| `GEODETIC_DATUM_ASSUMED_WGS84` | Info | No datum supplied; WGS84 assumed |
| `GEODETIC_DATUM_INVALID` | Warning | Stated datum cannot be matched to a known SRS |
| `FOOTPRINT_SRS_INVALID` | Warning | SRS for footprint WKT cannot be matched |
| `FOOTPRINT_WKT_MISMATCH` | Warning | Footprint WKT conflicts with given decimal coordinates |
| `FOOTPRINT_WKT_INVALID` | Warning | Footprint WKT cannot be parsed |
| `COORDINATE_UNCERTAINTY_METERS_INVALID` | Warning | Uncertainty value is non-numeric or implausible |
| `COORDINATE_PRECISION_INVALID` | Warning | Precision value is non-numeric or unreasonably extreme |
| `PRESUMED_NEGATED_LONGITUDE` | Warning | Negating longitude would resolve country mismatch |
| `PRESUMED_NEGATED_LATITUDE` | Warning | Negating latitude would resolve country mismatch |
| `PRESUMED_SWAPPED_COORDINATE` | Warning | Latitude and longitude appear transposed |
| `COUNTRY_COORDINATE_MISMATCH` | Error | Coordinates do not fall within the stated country |
| `COUNTRY_MISMATCH` | Warning | Country name and country code are contradictory |
| `COUNTRY_DERIVED_FROM_COORDINATES` | Info | Country name was determined from coordinates, not supplied |
| `COUNTRY_INVALID` | Warning | Country name/code does not match known vocabulary |
| `CONTINENT_COORDINATE_MISMATCH` | Warning | Coordinates outside the stated continent |
| `CONTINENT_DERIVED_FROM_COORDINATES` | Info | Continent determined from coordinates |
| `CONTINENT_DERIVED_FROM_COUNTRY` | Info | Continent determined from country |
| `CONTINENT_INVALID` | Warning | Continent does not match known vocabulary |
| `CONTINENT_COUNTRY_MISMATCH` | Warning | Interpreted continent and country do not correspond |
| `DEPTH_MIN_MAX_SWAPPED` | Warning | Minimum depth greater than maximum depth |
| `DEPTH_NON_NUMERIC` | Warning | Depth cannot be interpreted as a number |
| `DEPTH_UNLIKELY` | Warning | Depth outside plausible range (below Mariana Trench) |
| `DEPTH_NOT_METRIC` | Warning | Depth appears to be in feet, not metres |
| `ELEVATION_MIN_MAX_SWAPPED` | Warning | Minimum elevation greater than maximum |
| `ELEVATION_NON_NUMERIC` | Warning | Elevation cannot be interpreted as a number |
| `ELEVATION_UNLIKELY` | Warning | Elevation above 17,000 m or below −11,000 m |
| `ELEVATION_NOT_METRIC` | Warning | Elevation appears to be in feet, not metres |

### 7.2 Taxonomic Issues

| Flag | Meaning |
|------|---------|
| `TAXON_MATCH_HIGHERRANK` | Only a higher-rank match was found on the backbone (e.g.
genus-level match for a species record) | 673 + | `TAXON_MATCH_NONE` | No match found on the backbone | 674 + | `TAXON_MATCH_FUZZY` | Only an imprecise, non-exact match was found | 675 + | `TAXON_MATCH_AGGREGATE` | Match only possible at species aggregate/complex level | 676 + | `SCIENTIFIC_NAME_AND_ID_INCONSISTENT` | Scientific name does not match the name for the supplied identifier | 677 + | `TAXON_MATCH_NAME_AND_ID_AMBIGUOUS` | Backbone ID and name-based lookup give different results | 678 + | `SCIENTIFIC_NAME_ID_NOT_FOUND` | Supplied scientific name ID could not be found in any checklist | 679 + | `TAXON_CONCEPT_ID_NOT_FOUND` | Taxon concept ID not found | 680 + | `TAXON_ID_NOT_FOUND` | Taxon ID not found | 681 + | `TAXON_MATCH_SCIENTIFIC_NAME_ID_IGNORED` | Scientific name ID was not used in matching | 682 + | `TAXON_MATCH_TAXON_CONCEPT_ID_IGNORED` | Taxon concept ID was not used in matching | 683 + | `TAXON_MATCH_TAXON_ID_IGNORED` | Taxon ID was not used in matching | 684 + 685 + ### 7.3 Date Issues 686 + 687 + | Flag | Meaning | 688 + |------|---------| 689 + | `RECORDED_DATE_INVALID` | Event date cannot be interpreted (invalid date, wrong format, missing parts) | 690 + | `RECORDED_DATE_MISMATCH` | Event date string contradicts individual year/month/day fields | 691 + | `RECORDED_DATE_UNLIKELY` | Date is in the future or predates 1600 | 692 + | `IDENTIFIED_DATE_UNLIKELY` | Identification date is in the future or predates 1700 | 693 + | `IDENTIFIED_DATE_INVALID` | Identification date cannot be interpreted | 694 + | `MULTIMEDIA_DATE_INVALID` | Media creation date is invalid | 695 + | `MODIFIED_DATE_INVALID` | Modified date is invalid | 696 + | `MODIFIED_DATE_UNLIKELY` | Modified date is in the future or predates Unix epoch | 697 + | `GEOREFERENCED_DATE_INVALID` | Georeference date is invalid | 698 + | `GEOREFERENCED_DATE_UNLIKELY` | Georeference date is in the future or predates 1700 | 699 + 700 + ### 7.4 Vocabulary Issues 701 + 702 + | Flag | Meaning | 703 + 
|------|---------| 704 + | `BASIS_OF_RECORD_INVALID` | Basis of record value not in the controlled vocabulary | 705 + | `TYPE_STATUS_INVALID` | Type status value not in the controlled vocabulary | 706 + | `OCCURRENCE_STATUS_UNPARSABLE` | Occurrence status cannot be matched to `present` or `absent` | 707 + | `OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT` | `occurrenceStatus` was inferred from `individualCount` rather than explicitly supplied | 708 + 709 + ### 7.5 Collection/Institution Issues 710 + 711 + | Flag | Meaning | 712 + |------|---------| 713 + | `COLLECTION_MATCH_NONE` | Institution and collection codes could not be matched to GRSciColl | 714 + | `COLLECTION_MATCH_FUZZY` | Institution and collection codes matched GRSciColl only fuzzily | 715 + | `INSTITUTION_MATCH_NONE` | Institution code alone could not be matched to GRSciColl | 716 + | `INSTITUTION_MATCH_FUZZY` | Institution code matched GRSciColl only fuzzily | 717 + 718 + ### 7.6 Recommended Quality Filters 719 + 720 + For most ML training purposes, apply these filters in the occurrence search API: 721 + 722 + ``` 723 + hasCoordinate=true 724 + hasGeospatialIssue=false 725 + occurrenceStatus=PRESENT 726 + ``` 727 + 728 + The first two filters can be expressed in the download predicate API as follows (the `occurrenceStatus` filter is not included in this fragment): 729 + 730 + ```json 731 + { "type": "equals", "key": "HAS_COORDINATE", "value": "true" }, 732 + { "type": "equals", "key": "HAS_GEOSPATIAL_ISSUE", "value": "false" } 733 + ``` 734 + 735 + --- 736 + 737 + ## 8. Licensing 738 + 739 + ### 8.1 License Options 740 + 741 + GBIF datasets use one of three Creative Commons options: the CC0 public domain dedication, CC BY 4.0, or CC BY-NC 4.0. The license applies to the occurrence data in the dataset; individual multimedia assets may carry their own separate license.
742 + 743 + | License | SPDX Identifier | Restrictions | 744 + |---------|----------------|--------------| 745 + | CC0 1.0 Universal | `CC0-1.0` | No restrictions; public domain dedication | 746 + | CC BY 4.0 | `CC-BY-4.0` | Attribution required | 747 + | CC BY-NC 4.0 | `CC-BY-NC-4.0` | Attribution + non-commercial use only | 748 + 749 + GBIF stores the license as a full URI in occurrence records: 750 + 751 + | URI | SPDX | 752 + |-----|------| 753 + | `http://creativecommons.org/publicdomain/zero/1.0/legalcode` | `CC0-1.0` | 754 + | `http://creativecommons.org/licenses/by/4.0/legalcode` | `CC-BY-4.0` | 755 + | `http://creativecommons.org/licenses/by-nc/4.0/legalcode` | `CC-BY-NC-4.0` | 756 + 757 + Older datasets may use CC BY-NC-SA or CC BY-SA; these appear as `http://creativecommons.org/licenses/by-nc-sa/3.0/` etc. 758 + 759 + ### 8.2 License per Record 760 + 761 + In a GBIF download, every occurrence row carries the dataset's license in the `license` field. The license is inherited from the dataset, not set per-occurrence. 762 + 763 + ### 8.3 Media Licensing 764 + 765 + Individual media objects (images, sounds) carried in the multimedia extension may have a different license from the occurrence itself. For example, a CC0 dataset might include CC BY-NC-SA spectrograms produced by Xeno-canto. The `license` and `rightsHolder` fields in the multimedia extension apply specifically to each media item. 766 + 767 + ### 8.4 Implications for Training Data 768 + 769 + - **CC0**: Can be used freely including for commercial ML training. 770 + - **CC-BY**: Can be used for commercial ML training; output model or downstream products must attribute the data source. 771 + - **CC-BY-NC**: Cannot be used for commercial ML training. Suitable for academic/research use only. 772 + 773 + A mixed-license Terradots document must propagate the most restrictive license of its constituent labels to the document as a whole, or track per-label licensing. 774 + 775 + --- 776 + 777 + ## 9. 
Multimedia Extensions 778 + 779 + ### 9.1 Overview 780 + 781 + GBIF uses the GBIF Multimedia Extension to attach images, sounds, and videos to occurrence records. In the JSON API, media items are in the `media` array of each occurrence. In DwC-A downloads, they are in the `multimedia.txt` extension file. 782 + 783 + ### 9.2 Multimedia Fields 784 + 785 + | Field | Description | 786 + |-------|-------------| 787 + | `type` | Media type from Dublin Core: `StillImage`, `MovingImage`, `Sound` | 788 + | `format` | MIME type (e.g. `image/jpeg`, `audio/mpeg`, `video/mp4`) | 789 + | `identifier` | Direct URL to the media file | 790 + | `references` | URL to a web page about the media item | 791 + | `title` | Title of the media item | 792 + | `description` | Description of the item (e.g. "Dorsal view of specimen") | 793 + | `source` | Original source of the item | 794 + | `audience` | Target audience | 795 + | `created` | ISO 8601 date-time when the media was created | 796 + | `creator` | Person or organisation who created the media | 797 + | `contributor` | Person who contributed the media | 798 + | `publisher` | Publisher of the media | 799 + | `license` | License URI for this specific media item | 800 + | `rightsHolder` | Person or organisation holding rights over the media | 801 + 802 + ### 9.3 Example Media Object 803 + 804 + ```json 805 + { 806 + "type": "StillImage", 807 + "format": "image/jpeg", 808 + "identifier": "https://data.nhm.ac.uk/media/BMNHE_1900813.jpg", 809 + "title": "BMNHE_1900813_465514", 810 + "created": "2016-03-02", 811 + "license": "http://creativecommons.org/licenses/by/4.0/legalcode", 812 + "rightsHolder": "The Trustees of the Natural History Museum, London" 813 + } 814 + ``` 815 + 816 + ### 9.4 Media Types for Biodiversity 817 + 818 + The three GBIF media types correspond to biodiversity use cases: 819 + 820 + | Type | Example | 821 + |------|---------| 822 + | `StillImage` | Specimen photograph, field photo, drone image, camera trap frame | 823 
+ | `MovingImage` | Camera trap video | 824 + | `Sound` | Xeno-canto bird call recording, bat echolocation recording | 825 + 826 + --- 827 + 828 + ## 10. Darwin Core Archive (DwC-A) Format 829 + 830 + ### 10.1 Overview 831 + 832 + A Darwin Core Archive is a ZIP file containing tab-separated text files and XML metadata. It implements a **star schema**: one core file (e.g., `occurrence.txt`) at the centre, with optional extension files (e.g., `multimedia.txt`) linked by a shared identifier. 833 + 834 + ### 10.2 ZIP Contents (GBIF Download) 835 + 836 + | File | Required | Description | 837 + |------|----------|-------------| 838 + | `occurrence.txt` | Yes | Interpreted occurrence records (after GBIF processing) | 839 + | `verbatim.txt` | No | Original unmodified records as received from publishers | 840 + | `multimedia.txt` | No | Media extension — images, audio, video | 841 + | `meta.xml` | Yes | Archive descriptor: file formats, column names, term URIs | 842 + | `metadata.xml` | Yes | Dataset-level metadata in Ecological Metadata Language (EML) | 843 + | `rights.txt` | No | License and rights information | 844 + | `citations.txt` | No | Citation text for each dataset included | 845 + | `dataset/*.xml` | No | Individual EML files for each contributing dataset | 846 + 847 + ### 10.3 meta.xml Structure 848 + 849 + The `meta.xml` file describes the files in the archive using XML. 
Simplified structure: 850 + 851 + ```xml 852 + <archive xmlns="http://rs.tdwg.org/dwc/text/" 853 + metadata="metadata.xml"> 854 + <core encoding="UTF-8" fieldsTerminatedBy="\t" 855 + linesTerminatedBy="\n" fieldsEnclosedBy="" 856 + ignoreHeaderLines="1" 857 + rowType="http://rs.tdwg.org/dwc/terms/Occurrence"> 858 + <files> 859 + <location>occurrence.txt</location> 860 + </files> 861 + <id index="0"/> <!-- gbifID is the core record identifier --> 862 + <field index="1" term="http://rs.tdwg.org/dwc/terms/datasetKey"/> 863 + <field index="2" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/> 864 + <field index="3" term="http://rs.tdwg.org/dwc/terms/kingdom"/> 865 + <!-- ... one <field> per column ... --> 866 + </core> 867 + <extension encoding="UTF-8" fieldsTerminatedBy="\t" 868 + linesTerminatedBy="\n" fieldsEnclosedBy="" 869 + ignoreHeaderLines="1" 870 + rowType="http://rs.gbif.org/terms/1.0/Multimedia"> 871 + <files> 872 + <location>multimedia.txt</location> 873 + </files> 874 + <coreid index="0"/> <!-- links back to occurrence.txt gbifID --> 875 + <field index="1" term="http://purl.org/dc/terms/type"/> 876 + <field index="2" term="http://purl.org/dc/terms/format"/> 877 + <field index="3" term="http://purl.org/dc/elements/1.1/identifier"/> 878 + <!-- ... 
--> 879 + </extension> 880 + </archive> 881 + ``` 882 + 883 + ### 10.4 occurrence.txt Simple CSV Columns 884 + 885 + The Simple CSV download (or the core of a DwC-A) contains these columns in order: 886 + 887 + `gbifID`, `datasetKey`, `occurrenceID`, `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`, `infraspecificEpithet`, `taxonRank`, `scientificName`, `verbatimScientificName`, `verbatimScientificNameAuthorship`, `countryCode`, `locality`, `stateProvince`, `occurrenceStatus`, `individualCount`, `publishingOrgKey`, `decimalLatitude`, `decimalLongitude`, `coordinateUncertaintyInMeters`, `coordinatePrecision`, `elevation`, `elevationAccuracy`, `depth`, `depthAccuracy`, `eventDate`, `day`, `month`, `year`, `taxonKey`, `speciesKey`, `basisOfRecord`, `institutionCode`, `collectionCode`, `catalogNumber`, `recordNumber`, `identifiedBy`, `dateIdentified`, `license`, `rightsHolder`, `recordedBy`, `typeStatus`, `establishmentMeans`, `lastInterpreted`, `mediaType`, `issue` 888 + 889 + The full DwC-A `occurrence.txt` contains all the above plus approximately 150 additional columns including verbatim fields, georeference fields, geological fields, and all GBIF-added backbone keys. 890 + 891 + ### 10.5 multimedia.txt Columns 892 + 893 + `gbifID`, `type`, `format`, `identifier`, `references`, `title`, `description`, `source`, `audience`, `created`, `creator`, `contributor`, `publisher`, `license`, `rightsHolder` 894 + 895 + The `gbifID` in multimedia.txt is a foreign key referencing `gbifID` in occurrence.txt. 896 + 897 + ### 10.6 Downloading and Citing DwC-A 898 + 899 + GBIF generates a DOI for each download (e.g., `10.15468/dl.xxxxx`). This DOI should be cited in any publication using the data. The `citations.txt` file in the archive contains the suggested citation text for each dataset. 900 + 901 + --- 902 + 903 + ## 11. 
Terradots Mapping Plan 904 + 905 + This section maps every GBIF/Darwin Core concept to the Terradots type system defined in `terradots.mli`. 906 + 907 + ### 11.1 Origin Type 908 + 909 + Every GBIF occurrence is a **direct empirical observation or specimen** — not computed or simulated. It therefore maps to `Measured`. 910 + 911 + ```ocaml 912 + Measured { 913 + observer : string option; (* recordedBy / recordedByID *) 914 + via : string option; (* "gbif:{gbifID}" *) 915 + license : string option; (* SPDX from license URI *) 916 + accuracy_m : float option; (* coordinateUncertaintyInMeters *) 917 + } 918 + ``` 919 + 920 + ### 11.2 Geometry 921 + 922 + | GBIF field | Terradots | Notes | 923 + |------------|-----------|-------| 924 + | `decimalLongitude` | `Point { x = lon; ... }` | x is longitude in EPSG:4326 | 925 + | `decimalLatitude` | `Point { ...; y = lat }` | y is latitude in EPSG:4326 | 926 + | `footprintWKT` | `Polygon` or `Multi` | If present, use footprint geometry instead of point | 927 + 928 + GBIF coordinates are always WGS84 (EPSG:4326) after interpretation. Set `document.crs = "EPSG:4326"`. 929 + 930 + Most GBIF occurrence records are point observations. The `footprintWKT` field, when populated, describes the full spatial extent and should be preferred when it is valid (not flagged with `FOOTPRINT_WKT_INVALID` or `FOOTPRINT_WKT_MISMATCH`). 
931 + 932 + ```ocaml 933 + (* Typical point occurrence *) 934 + let geom = Point { x = decimalLongitude; y = decimalLatitude } 935 + 936 + (* Occurrence with valid footprint *) 937 + let geom = parse_wkt footprintWKT (* → Polygon or Multi *) 938 + ``` 939 + 940 + ### 11.3 Identity: label.id and origin.via 941 + 942 + | GBIF field | Terradots field | Value | 943 + |------------|----------------|-------| 944 + | `gbifID` | `origin.via` | `"gbif:{gbifID}"` (matches the URI scheme in the mli doc) | 945 + | `occurrenceID` | `properties` | `("dwc:occurrenceID", value)` — publisher's ID; not stable as a Terradots id | 946 + | — | `label.id` | Generate UUID at import time for stable local identity | 947 + 948 + The `via` URI format `gbif:4023589127` is already listed as a recognised scheme in the Terradots `.mli` documentation comment: 949 + 950 + ``` 951 + gbif:4023589127 GBIF occurrence 952 + ``` 953 + 954 + The `label.id` should be a UUID generated at import time — do not use `gbifID` directly as the Terradots id, because the gbifID is an integer and UUIDs are recommended for the id field. 955 + 956 + Alternatively use `"gbif-" ^ string_of_int gbifID` as the Terradots id if stability over re-imports is preferred. 957 + 958 + ### 11.4 Scientific Name → class_dist 959 + 960 + | GBIF field | Terradots field | Value | 961 + |------------|----------------|-------| 962 + | `species` | `class_dist` | `[("Passer domesticus", 1.0)]` — canonical binomial at species level | 963 + | `acceptedScientificName` | `class_dist` | Use when `species` is empty (subspecies/genus-level records) | 964 + | `scientificName` | `class_dist` | Fallback when above are empty | 965 + 966 + **Recommendation:** Prefer the species-level name from the `species` column/field for the class, so that subspecies records and species records can be compared and deduplicated. If the match is only at genus or higher (`TAXON_MATCH_HIGHERRANK` flag), use the accepted scientific name at the matched rank. 
967 + 968 + ```ocaml 969 + let class_name = 970 + if species <> "" then species (* "Passer domesticus" *) 971 + else if acceptedScientificName <> "" then acceptedScientificName 972 + else scientificName 973 + in 974 + let class_dist = [(class_name, 1.0)] (* definite classification *) 975 + ``` 976 + 977 + GBIF classifications are determinate — a taxonomist identified the specimen. Confidence = 1.0 for confirmed identifications. 978 + 979 + ### 11.5 eventDate → event_date 980 + 981 + Both Darwin Core `eventDate` and Terradots `event_date` use the same ISO 8601 convention. Pass through verbatim. 982 + 983 + | GBIF field | Terradots field | Notes | 984 + |------------|----------------|-------| 985 + | `eventDate` | `event_date` | Direct passthrough; supports `2023`, `2023-09`, `2023-09-18`, intervals | 986 + | `year`/`month`/`day` | (reconstruct if `eventDate` is absent) | Combine as `"{year}-{month:02}-{day:02}"` | 987 + 988 + ```ocaml 989 + let event_date = 990 + if eventDate <> "" then Some (event_date_of_string eventDate) 991 + else match year, month, day with 992 + | Some y, Some m, Some d -> 993 + Some (event_date_of_string (Printf.sprintf "%04d-%02d-%02d" y m d)) 994 + | Some y, Some m, None -> 995 + Some (event_date_of_string (Printf.sprintf "%04d-%02d" y m)) 996 + | Some y, None, None -> 997 + Some (event_date_of_string (string_of_int y)) 998 + | _ -> None 999 + ``` 1000 + 1001 + ### 11.6 coordinateUncertaintyInMeters → accuracy_m 1002 + 1003 + | GBIF field | Terradots field | Notes | 1004 + |------------|----------------|-------| 1005 + | `coordinateUncertaintyInMeters` | `origin.accuracy_m` | Direct; metres | 1006 + 1007 + When this field is absent (many older records lack it), use `None`. Do not default to zero — zero would be misleading. The `COORDINATE_UNCERTAINTY_METERS_INVALID` issue flag indicates the value was unparseable. 
1008 + 1009 + ```ocaml 1010 + let accuracy_m = 1011 + match coordinateUncertaintyInMeters with 1012 + | Some v when v > 0.0 -> Some v 1013 + | _ -> None 1014 + ``` 1015 + 1016 + ### 11.7 basisOfRecord → properties / origin distinction 1017 + 1018 + `basisOfRecord` does not change the Terradots `origin` type — all GBIF records are `Measured`. However it provides important context about the nature of the measurement. 1019 + 1020 + | basisOfRecord | Recommended handling | 1021 + |---------------|---------------------| 1022 + | `HumanObservation` | `observer` = `recordedByID` or free-text `recordedBy`; accuracy from GPS uncertainty | 1023 + | `MachineObservation` | `observer` = instrument URI or `institutionCode`; camera trap, acoustic sensor | 1024 + | `PreservedSpecimen` | accuracy_m typically large (georeferenced from label/locality); use `coordinateUncertaintyInMeters` | 1025 + | `FossilSpecimen` | Same as PreservedSpecimen; also note in `properties` | 1026 + | `LivingSpecimen` | Living plant/animal in a garden or zoo | 1027 + | `MaterialSample` | A physical sample (e.g. DNA extract, water sample) | 1028 + | `MaterialCitation` | A citation of an occurrence in published literature | 1029 + 1030 + Store in properties for downstream inspection: 1031 + 1032 + ```ocaml 1033 + ("gbif:basisOfRecord", basisOfRecord) 1034 + ``` 1035 + 1036 + ### 11.8 occurrenceID / gbifID → id and via URI 1037 + 1038 + | GBIF field | Terradots field | Value | 1039 + |------------|----------------|-------| 1040 + | `gbifID` | `origin.via` | `"gbif:" ^ string_of_int gbifID` | 1041 + | `occurrenceID` | `properties` | `("dwc:occurrenceID", occurrenceID)` | 1042 + | (generated) | `label.id` | UUID or `"gbif-" ^ string_of_int gbifID` | 1043 + 1044 + Prefer `origin.via` for the GBIF URI because it signals "imported from external registry" and is how the Terradots model is designed for registry imports. The `occurrenceID` (publisher's identifier) is stored as a property for cross-referencing. 
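The identity rules from §11.3 and §11.8 could be sketched as follows. Values such as `4023589127` do not fit in a native `int` on 32-bit OCaml targets, so this sketch keeps `gbifID` as a string of digits instead of calling `string_of_int`. All function names here are illustrative and are not taken from `terradots.mli`.

```ocaml
(* True when [s] is a non-empty string of ASCII digits. *)
let is_digits s =
  s <> "" && String.for_all (fun c -> c >= '0' && c <= '9') s

(* "gbif:{gbifID}" for [origin.via]; [None] if the ID is malformed. *)
let via_of_gbif_id gbif_id =
  if is_digits gbif_id then Some ("gbif:" ^ gbif_id) else None

(* Deterministic alternative to a fresh UUID for [label.id], so that
   re-importing the same occurrence yields the same id. *)
let stable_id_of_gbif_id gbif_id =
  if is_digits gbif_id then Some ("gbif-" ^ gbif_id) else None

(* The publisher's identifier is kept only as a property. *)
let occurrence_id_property occurrence_id =
  if occurrence_id = "" then [] else [ ("dwc:occurrenceID", occurrence_id) ]
```

Returning an option keeps malformed identifiers from silently producing a `via` URI that does not resolve to a GBIF occurrence.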
1045 + 1046 + ### 11.9 recordedBy → origin.observer 1047 + 1048 + | GBIF field | Terradots field | Value | 1049 + |------------|----------------|-------| 1050 + | `recordedByID` (ORCID URIs) | `origin.observer` | Use the first ORCID URI: `"orcid:0000-0001-2345-6789"` | 1051 + | `recordedBy` (free text) | `origin.observer` | Use as fallback: `"gbif:recorder/" ^ url_encode(recordedBy)` | 1052 + | `identifiedBy` | `properties` | `("dwc:identifiedBy", identifiedBy)` | 1053 + | `recordedBy` | `properties` | `("dwc:recordedBy", recordedBy)`; always store the human-readable name | 1054 + 1055 + If `recordedByID` contains multiple ORCID URIs (semicolon-separated), use the first as `observer` and store the full list in `properties`. 1056 + 1057 + For institutionally collected specimens where `recordedBy` is absent, use the institution as the observer: `"https://ror.org/{ror_id}"` if a ROR ID is available, or `"gbif:institution/" ^ institutionCode`. 1058 + 1059 + ### 11.10 license → origin.license 1060 + 1061 + Convert the GBIF license URI to an SPDX identifier: 1062 + 1063 + ```ocaml 1064 + let spdx_of_gbif_license = function 1065 + | s when String.starts_with ~prefix:"http://creativecommons.org/publicdomain/zero/" s -> "CC0-1.0" 1066 + | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by/4.0/" s -> "CC-BY-4.0" 1067 + | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by-nc/4.0/" s -> "CC-BY-NC-4.0" 1068 + | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by-sa/4.0/" s -> "CC-BY-SA-4.0" 1069 + | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by-nc-sa/4.0/" s -> "CC-BY-NC-SA-4.0" 1070 + | s -> s (* store the raw URI if not recognised, e.g. 3.0 licenses *) 1071 + ``` 1072 + 1073 + Note: the license comes from the **dataset**, not the individual occurrence. The record-level `license` field on the occurrence is that dataset license propagated down.
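The observer fallback chain from §11.9 could look like the sketch below. It assumes `recordedByID` arrives as a semicolon-separated string, as stated in that section, and that ORCID identifiers appear as `orcid.org` URIs. The helper names and the `+`-for-space encoding (mirroring `gbif:recorder/Sarthak+Awhad` in the worked example of §11.14) are assumptions, not part of the Terradots API. The ROR lookup is left out because it needs an external registry.

```ocaml
(* Encode free text for a URI path: unreserved characters are kept,
   spaces become '+', everything else is percent-encoded byte by byte. *)
let url_encode s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '-' | '_' | '.' | '~' ->
          Buffer.add_char b c
      | ' ' -> Buffer.add_char b '+'
      | c -> Buffer.add_string b (Printf.sprintf "%%%02X" (Char.code c)))
    s;
  Buffer.contents b

(* Turn an ORCID URI into "orcid:XXXX-XXXX-XXXX-XXXX"; [None] otherwise. *)
let orcid_of_uri uri =
  let strip prefix =
    if String.starts_with ~prefix uri then
      let n = String.length prefix in
      Some ("orcid:" ^ String.sub uri n (String.length uri - n))
    else None
  in
  match strip "https://orcid.org/" with
  | Some _ as r -> r
  | None -> strip "http://orcid.org/"

(* First ORCID, else the encoded free-text name, else the institution. *)
let observer ~recorded_by_id ~recorded_by ~institution_code =
  let ids =
    String.split_on_char ';' recorded_by_id
    |> List.map String.trim
    |> List.filter_map orcid_of_uri
  in
  match ids with
  | first :: _ -> Some first
  | [] ->
      if recorded_by <> "" then Some ("gbif:recorder/" ^ url_encode recorded_by)
      else if institution_code <> "" then
        Some ("gbif:institution/" ^ institution_code)
      else None
```

The human-readable `recordedBy` value should still be stored under `dwc:recordedBy`, since the encoded form is meant only as an identifier.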
1074 + 1075 + ### 11.11 Dataset → activity 1076 + 1077 + A GBIF dataset maps to a Terradots `activity` representing the import of that dataset. 1078 + 1079 + | GBIF dataset field | Terradots `activity` field | Value | 1080 + |--------------------|---------------------------|-------| 1081 + | `datasetKey` | `activity_id` | `"gbif:dataset:" ^ datasetKey` | 1082 + | Organisation/person from dataset | `agent` | `"https://ror.org/..." ^ publishingOrg` or `institutionCode` | 1083 + | Download date or `pubDate` | `date` | ISO 8601 date of the GBIF download or dataset publication | 1084 + | Dataset `title` | `description` | Human-readable dataset name | 1085 + 1086 + All occurrence labels from the same dataset share the same `activity_id`. 1087 + 1088 + ```ocaml 1089 + let act = { 1090 + activity_id = "gbif:dataset:e053ff53-c156-4e2e-b9b5-4462e9625424"; 1091 + agent = "gbif:org/90fd6680-349f-11d8-aa2d-b8a03c50a862"; 1092 + date = "2026-03-04"; (* date of import into Terradots *) 1093 + description = Some "Tropicos Specimens Non-MO (Missouri Botanical Garden)"; 1094 + } 1095 + ``` 1096 + 1097 + Alternatively, when doing a bulk GBIF download across many datasets, create a single activity for the whole download: 1098 + 1099 + ```ocaml 1100 + let act = { 1101 + activity_id = "gbif:download:10.15468/dl.xxxxx"; (* the download DOI *) 1102 + agent = "gbif:download"; 1103 + date = download_date; 1104 + description = Some "GBIF occurrence download, Aves, 2020-2025"; 1105 + } 1106 + ``` 1107 + 1108 + ### 11.12 Issues/Flags → annotations and properties 1109 + 1110 + GBIF issue flags are stored in `label.properties` for programmatic filtering, and may optionally generate Terradots `annotation` records. 
1111 + 1112 + **In properties (always):** 1113 + 1114 + ```ocaml 1115 + ("gbif:issues", String.concat "," issues) 1116 + ``` 1117 + 1118 + **As annotations (optional, for important flags):** 1119 + 1120 + ```ocaml 1121 + (* Example: coordinate mismatch warrants a human-readable annotation *) 1122 + let annotations = 1123 + if List.mem "COUNTRY_COORDINATE_MISMATCH" issues then 1124 + [ { id = uuid (); 1125 + text = "GBIF flag COUNTRY_COORDINATE_MISMATCH: coordinates do not fall within the stated country"; 1126 + anchors = [label.id] } ] 1127 + else [] 1128 + ``` 1129 + 1130 + **Recommended issue filters at import time:** Skip records where `hasGeospatialIssue = true` (i.e., any of `ZERO_COORDINATE`, `COORDINATE_OUT_OF_RANGE`, `COORDINATE_INVALID`, `COUNTRY_COORDINATE_MISMATCH`, `COORDINATE_REPROJECTION_FAILED` is present) unless the use case specifically requires such records. 1131 + 1132 + ### 11.13 Taxonomic Hierarchy → properties 1133 + 1134 + The full taxonomic hierarchy should be stored in `properties` for downstream filtering and analysis: 1135 + 1136 + ```ocaml 1137 + let taxonomy_properties = List.filter_map (fun (k, v) -> 1138 + if v = "" then None else Some (k, v)) 1139 + [ 1140 + ("dwc:kingdom", kingdom); 1141 + ("dwc:phylum", phylum); 1142 + ("dwc:class", class_); 1143 + ("dwc:order", order); 1144 + ("dwc:family", family); 1145 + ("dwc:genus", genus); 1146 + ("dwc:species", species); 1147 + ("dwc:taxonRank", taxonRank); 1148 + ("gbif:taxonKey", string_of_int taxonKey); 1149 + ("gbif:speciesKey", string_of_int speciesKey); 1150 + ("gbif:kingdomKey", string_of_int kingdomKey); 1151 + ("gbif:familyKey", string_of_int familyKey); 1152 + ("gbif:genusKey", string_of_int genusKey); 1153 + ("dwc:taxonomicStatus", taxonomicStatus); 1154 + ("dwc:vernacularName", vernacularName); 1155 + ("gbif:iucnCategory", iucnRedListCategory); 1156 + ] 1157 + ``` 1158 + 1159 + ### 11.14 Complete Import Example 1160 + 1161 + **GBIF occurrence (Common Woodshrike, Xeno-canto):** 1162 +
```json
{
  "gbifID": 3034438331,
  "occurrenceID": "https://data.biodiversitydata.nl/xeno-canto/observation/XC165501",
  "basisOfRecord": "HUMAN_OBSERVATION",
  "species": "Tephrodornis pondicerianus",
  "scientificName": "Tephrodornis pondicerianus pondicerianus (Gmelin, 1789)",
  "decimalLatitude": 18.3669,
  "decimalLongitude": 73.7512,
  "coordinateUncertaintyInMeters": null,
  "eventDate": "2026-01-14",
  "recordedBy": "Sarthak Awhad",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "issues": ["CONTINENT_DERIVED_FROM_COORDINATES"],
  "kingdom": "Animalia", "class": "Aves",
  "order": "Passeriformes", "family": "Tephrodornithidae",
  "datasetKey": "..."
}
```

**Terradots output:**

```ocaml
let label = make_imported
  ~cell:(hilbert_cell ~level:12 ~crs:"EPSG:4326" {x=73.7512; y=18.3669})
  ~id:"<uuid>"
  ~geometry:(Point {x=73.7512; y=18.3669})
  ~via:"gbif:3034438331"
  ~observer:"gbif:recorder/Sarthak+Awhad"  (* no ORCID available *)
  ~license:"CC-BY-NC-4.0"
  (* accuracy_m omitted: coordinateUncertaintyInMeters absent *)
  ~event_date:(event_date_of_string "2026-01-14")
  ~class_dist:[("Tephrodornis pondicerianus", 1.0)]
  ~confidence:1.0
  ~activity:"gbif:dataset:..."
  ~properties:[
    ("dwc:occurrenceID", "https://data.biodiversitydata.nl/xeno-canto/observation/XC165501");
    ("gbif:basisOfRecord", "HUMAN_OBSERVATION");
    ("dwc:kingdom", "Animalia");
    ("dwc:class", "Aves");
    ("dwc:order", "Passeriformes");
    ("dwc:family", "Tephrodornithidae");
    ("dwc:genus", "Tephrodornis");
    ("dwc:species", "Tephrodornis pondicerianus");
    ("dwc:taxonRank", "SUBSPECIES");
    ("dwc:recordedBy", "Sarthak Awhad");
    ("dwc:behavior", "call");
    ("gbif:issues", "CONTINENT_DERIVED_FROM_COORDINATES");
    ("gbif:datasetKey", "...");
  ]
  ()
```

---

## 12. Mapping Summary Table

| GBIF / Darwin Core Concept | Terradots Mapping | Status |
|---|---|---|
| GBIF occurrence | `label` with `Measured` origin | Direct |
| `decimalLongitude` | `Point { x = lon; ... }` in EPSG:4326 | Direct |
| `decimalLatitude` | `Point { ...; y = lat }` in EPSG:4326 | Direct |
| `footprintWKT` | `Polygon` or `Multi` geometry | Direct (if valid) |
| `coordinateUncertaintyInMeters` | `origin.accuracy_m` | Direct |
| `species` / `acceptedScientificName` | `class_dist = [(name, 1.0)]` | Direct |
| `eventDate` | `event_date` | Direct (same ISO 8601 convention) |
| `year`/`month`/`day` | `event_date` (reconstructed) | Direct |
| `basisOfRecord` | `properties[("gbif:basisOfRecord", ...)]` | Property |
| `gbifID` | `origin.via = "gbif:{gbifID}"` | Direct (URI scheme documented) |
| `occurrenceID` | `properties[("dwc:occurrenceID", ...)]` | Property |
| (no GBIF equivalent) | `label.id`: UUID generated at import | Generated |
| `license` URI | `origin.license` (as SPDX string) | Direct (with URI→SPDX conversion) |
| Dataset (`datasetKey`) | `activity` with `activity_id = "gbif:dataset:{key}"` | Direct |
| GBIF download DOI | `activity` with `activity_id = "gbif:download:{doi}"` | Direct |
| `recordedByID` (ORCID) | `origin.observer = "orcid:{orcid}"` | Direct |
| `recordedBy` (free text) | `origin.observer` (with `gbif:recorder/` prefix) | Workaround |
| `institutionCode` | `origin.observer` fallback or `properties` | Partial |
| `kingdom`/`phylum`/`class`/`order`/`family`/`genus` | `properties[("dwc:{rank}", name)]` | Property |
| `taxonKey`/`speciesKey`/etc. backbone keys | `properties[("gbif:{rank}Key", key)]` | Property |
| `taxonomicStatus` | `properties[("dwc:taxonomicStatus", ...)]` | Property |
| `vernacularName` | `properties[("dwc:vernacularName", ...)]` | Property |
| `iucnRedListCategory` | `properties[("gbif:iucnCategory", ...)]` | Property |
| Issue flags | `properties[("gbif:issues", comma_list)]` | Property |
| Issue flags (serious) | `annotation` records | Optional |
| `media` array items | No direct mapping | Gap (§13.1) |
| `occurrenceStatus = ABSENT` | No direct mapping | Gap (§13.2) |
| `individualCount` | `properties[("dwc:individualCount", ...)]` | Property |
| `sex` / `lifeStage` / `behavior` | `properties[("dwc:{field}", ...)]` | Property |
| `locality` / `stateProvince` / `country` | `properties[("dwc:{field}", ...)]` | Property |
| `samplingProtocol` | `properties[("dwc:samplingProtocol", ...)]` | Property |
| `identifiedBy` / `dateIdentified` | `properties[("dwc:{field}", ...)]` | Property |
| `catalogNumber` / `collectionCode` | `properties[("dwc:{field}", ...)]` | Property |
| `isInCluster` | `properties[("gbif:isInCluster", "true")]` | Property |
| `dynamicProperties` | `properties[("dwc:dynamicProperties", json_str)]` | Property |

---

## 13. Gaps and Required Extensions

### 13.1 Multimedia / Media Objects

**Problem:** GBIF occurrences carry a `media` array with URLs to specimen photographs, spectrograms, and audio recordings. These are first-class data in biodiversity ML (e.g., training image classifiers on iNaturalist photos). Terradots has no dedicated field for attached media.

**Options:**
1. **Store media URLs in properties:** `("gbif:media:0:url", "https://...")`, `("gbif:media:0:type", "StillImage")`, etc. Works, but the result is flat and unstructured.
2. **Add `media` field to `label`:** A list of `{ url: string; media_type: string; license: string option; ... }` records. Clean but requires a schema extension.
3. **Create a separate media label:** For each media item, create a second `label` at the same geometry with `class_dist = [(media_type, 1.0)]` and `via` pointing to the media URL. The two labels are linked via a `group`.

**Recommendation:** Short-term, use option 1 with a fixed property naming convention. Longer-term, add a `media : media_item list` field to `label` where `media_item` carries URL, type, format, license, and creator.

### 13.2 Absent Occurrences (occurrenceStatus = ABSENT)

**Problem:** GBIF includes absence records (`occurrenceStatus = ABSENT`) from systematic surveys where a species was looked for but not found. These are important for species distribution modelling. Terradots has no direct way to represent absence.

**Options:**
1. **Property flag:** `("gbif:occurrenceStatus", "ABSENT")`. Requires consumers to filter.
2. **Zero confidence:** Use `confidence = 0.0` to signal absence. Unconventional, and it overloads a field that otherwise describes certainty of presence.
3. **Add a boolean `absent` field to `label`**, or an `OccurrenceStatus` variant to complement `origin`.

**Recommendation:** For now, store `("gbif:occurrenceStatus", status)` in properties and document the convention.
Flag this as requiring a first-class extension for biodiversity SDM workflows.

### 13.3 Observer Lists (Multiple Recorders)

**Problem:** GBIF `recordedBy` is a semicolon-separated list of names (e.g. `"Alice Smith; Bob Jones"`). Terradots `origin.observer` is a single URI string. There is no support for multiple co-observers.

**Options:**
1. **Use the first recorder** as observer and store the full list in properties.
2. **Concatenate** as a custom URI: `"gbif:recorders/" ^ url_encode(recordedBy)`. Ugly.
3. **Add `observers : string list` to `Measured` origin.**

**Recommendation:** Use option 1 short-term. For citizen science datasets (iNaturalist, eBird) with many multi-observer records, a list field would be valuable.

### 13.4 Subspecies and Infraspecific Taxa in class_dist

**Problem:** GBIF may match an occurrence to a subspecies (e.g., `Tephrodornis pondicerianus pondicerianus`). The `species` field gives the species-level name. Using species-level for `class_dist` loses subspecies precision; using the full trinomial makes deduplication harder (different subspecies authorities won't match).

**Recommendation:** Store the species-level name in `class_dist` (for matching/deduplication), and store the full `scientificName` and `taxonRank` in `properties`. When subspecies identity matters, query via the `gbif:taxonKey` property.

### 13.5 Cluster Detection (isInCluster)

**Problem:** GBIF detects clusters of probable duplicate occurrences across datasets (`isInCluster = true`). These should map to Terradots deduplication via `Derived` labels, but the occurrence record only says whether a record is in a cluster; it does not expose the cluster members directly.

**Recommendation:** Store `("gbif:isInCluster", "true")` as a property and flag these records as deduplication candidates. Use Terradots fingerprinting (spatial cell + class) to find the intra-document candidates for review.

### 13.6 Verbatim vs. Interpreted Fields

**Problem:** GBIF provides both the original publisher data (`verbatimScientificName`, `verbatimEventDate`, `verbatimCoordinates`) and its interpreted versions. Terradots has no parallel verbatim/interpreted distinction.

**Recommendation:** Store important verbatim fields in properties with a `verbatim:` prefix: `("verbatim:scientificName", verbatimScientificName)`, `("verbatim:eventDate", verbatimEventDate)`. The interpreted values go into the primary fields.

### 13.7 Sampling Events and Effort

**Problem:** GBIF sampling-event datasets carry structured survey effort information (`samplingProtocol`, `sampleSizeValue`, `sampleSizeUnit`, `samplingEffort`, `eventID`, `parentEventID`). These describe the survey design, not just individual records. Terradots `activity` is close but lacks these structured fields.

**Recommendation:** Store sampling metadata in `properties`. The `eventID` → Terradots `group` mapping (group all occurrences from the same survey event) would be a useful convention but requires `group` to support arbitrary membership links (currently `group.members` is a list of label IDs).

### 13.8 GBIF Download Citation

**Problem:** GBIF requires citation of downloads (DOI). The Terradots `activity` `description` field can hold this, but there is no structured citation field.

**Recommendation:** Store the download DOI in the activity and document the convention:

```ocaml
{
  activity_id = "gbif:download:10.15468/dl.xxxxx";
  agent = "GBIF.org";
  date = "2026-03-04";
  description = Some "GBIF Occurrence Download https://doi.org/10.15468/dl.xxxxx";
}
```

---

## 14. Recommended Property Key Conventions

When storing GBIF/Darwin Core metadata in `label.properties`, use the following prefix conventions to avoid key collisions and aid downstream processing:

| Prefix | Meaning | Example |
|--------|---------|---------|
| `dwc:` | Darwin Core standard term | `dwc:kingdom`, `dwc:recordedBy` |
| `gbif:` | GBIF-added or GBIF-specific field | `gbif:taxonKey`, `gbif:basisOfRecord`, `gbif:issues` |
| `gbif:dataset` | Dataset-level information | `gbif:datasetKey`, `gbif:datasetName` |
| `verbatim:` | Verbatim (pre-interpretation) value | `verbatim:scientificName` |
| `gbif:media:N:` | N-th media item fields (N is a zero-based index) | `gbif:media:0:url`, `gbif:media:0:type` |

---

## 15. Recommended Priority for Extensions

Ordered by impact on faithful GBIF import:

1. **Add `media` field to `label`** (§13.1): critical for photo-based and audio-based biodiversity ML (iNaturalist, Xeno-canto, camera traps). High priority.
2. **Add absence support to `label`** (§13.2): required for species distribution modelling, which depends on both presence and absence records. High priority for SDM use cases.
3. **Add `observers : string list` to `Measured` origin** (§13.3): medium priority for large citizen-science imports.
4. **Add structured sampling-event metadata to `activity`** (§13.7): needed for systematic survey datasets. Medium priority.
5. **Add `properties` to `group`** (already identified in the OSM mapping): needed for GBIF network and dataset-group metadata. Medium priority.
6. **Structured citation field in `activity`** (§13.8): low priority; the description field is an adequate workaround.
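
The short-term media workaround from §13.1 (option 1) could look roughly like the sketch below. It is illustrative only: the `media_item` record and the `media_properties` helper are hypothetical names that do not exist in Terradots, and the `license` key suffix is an assumption that extends the `url` and `type` suffixes named in §13.1.

```ocaml
(* Hypothetical record for one entry of a GBIF occurrence's `media` array.
   The field names are assumptions made for this sketch. *)
type media_item = {
  url : string;
  media_type : string;       (* e.g. "StillImage" or "Sound" *)
  license : string option;   (* absent when the publisher gave none *)
}

(* Flatten media items into (key, value) pairs, using keys of the form
   "gbif:media:<zero-based index>:<field>". *)
let media_properties (items : media_item list) : (string * string) list =
  items
  |> List.mapi (fun i m ->
         let key field = Printf.sprintf "gbif:media:%d:%s" i field in
         let base = [ (key "url", m.url); (key "type", m.media_type) ] in
         match m.license with
         | Some l -> base @ [ (key "license", l) ]
         | None -> base)
  |> List.concat

(* Print the flattened pairs for a single example item. *)
let () =
  media_properties
    [ { url = "https://example.org/a.jpg";
        media_type = "StillImage";
        license = None } ]
  |> List.iter (fun (k, v) -> Printf.printf "%s = %s\n" k v)
```

The resulting pairs can be appended to `label.properties` as they are. Because the index is part of the key, consumers have to scan for consecutive indices to rebuild the list, which is one reason a structured `media` field is preferred in the longer term.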
+990
docs/plans/inaturalist-mapping.md
# iNaturalist to Terradots Mapping Plan

## 1. iNaturalist Data Model Overview

### Primary Sources

- API v1 reference (Swagger UI): https://api.inaturalist.org/v1/docs/
- API recommended practices: https://www.inaturalist.org/pages/api+recommended+practices
- Help (quality grades): https://help.inaturalist.org/en/support/solutions/articles/151000169936-what-is-the-data-quality-assessment-and-how-do-observations-qualify-to-become-research-grade-
- Help (identifications): https://help.inaturalist.org/en/support/solutions/articles/151000194901-how-do-identifications-work-
- Help (community taxon): https://help.inaturalist.org/en/support/solutions/articles/151000173076-what-are-the-community-taxon-and-the-observation-taxon-
- Help (geoprivacy): https://help.inaturalist.org/en/support/solutions/articles/151000169938-what-is-geoprivacy-what-does-it-mean-for-an-observation-to-be-obscured-
- Help (licenses): https://help.inaturalist.org/en/support/solutions/articles/151000173511-how-do-licenses-work-on-inaturalist-should-i-change-my-licenses-
- Help (annotations): https://help.inaturalist.org/en/support/solutions/articles/151000191830-what-are-the-definitions-of-inaturalist-annotations-
- Annotation values: https://www.inaturalist.org/pages/annotationvalues
- Help (projects): https://help.inaturalist.org/en/support/solutions/articles/151000176472-understanding-projects-on-inaturalist
- Open data (S3): https://github.com/inaturalist/inaturalist-open-data
- GBIF occurrence export: https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7

---

## 2. Observation Structure

An observation is the core record in iNaturalist: a single organism (or sign of one) observed at a place and time, with photographic or audio evidence, zero or more community identifications, and extensible metadata.

### 2.1 Top-level Observation Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | Auto-incrementing primary key; stable, globally unique within iNaturalist. |
| `uuid` | UUID string | RFC 4122 UUID; used in some export formats. The `id` field is more commonly used in API calls. |
| `created_at` | ISO 8601 datetime | When the record was created on the server (not when observed). |
| `updated_at` | ISO 8601 datetime | Last server-side modification time. |
| `observed_on` | date string `YYYY-MM-DD` | Calendar date of observation as stated by the observer. |
| `time_observed_at` | ISO 8601 datetime or null | Date+time of observation in UTC, if the observer supplied a time. |
| `observed_time_zone` | IANA timezone string | Observer's local timezone (e.g. `"America/Los_Angeles"`). |
| `created_time_zone` | IANA timezone string | Server's timezone at record creation. |
| `species_guess` | string | Free-text identification entered by the observer; may not match any taxon. |
| `taxon_id` | integer | ID of the taxon the observer believes this is (from their own identification). |
| `community_taxon_id` | integer or null | ID of the taxon the community algorithm has converged on. May differ from `taxon_id`. |
| `quality_grade` | enum string | One of `"casual"`, `"needs_id"`, or `"research"`. |
| `description` | string or null | Free-text notes from the observer. |
| `place_guess` | string or null | Human-readable location name entered by the observer. |
| `location` | string or null | `"latitude,longitude"` in WGS84; null for private observations. |
| `latitude` | float or null | WGS84 decimal degrees. May be randomised if `obscured=true`. |
| `longitude` | float or null | WGS84 decimal degrees. May be randomised if `obscured=true`. |
| `positional_accuracy` | integer (metres) or null | Stated GPS accuracy radius. For obscured observations iNaturalist sets this to the diameter of the obscuration cell (~22,000 m at equator). |
| `geoprivacy` | enum string or null | `"open"`, `"obscured"`, or `"private"`. `null` means the observer has not set it (defaults to open). |
| `taxon_geoprivacy` | enum string or null | Geoprivacy applied automatically because the identified taxon has an at-risk conservation status. Same values as `geoprivacy`. |
| `obscured` | boolean | `true` if coordinates have been offset (either by `geoprivacy` or `taxon_geoprivacy`). |
| `coordinates_obscured` | boolean | Synonym for `obscured` in some API contexts. |
| `map_scale` | integer or null | Map zoom level at which the observation was pinned; rarely used. |
| `captive` | boolean | `true` if the organism is not wild (captive animal, cultivated plant). Captive/cultivated observations receive quality grade `"casual"`. |
| `license_code` | string or null | CC license code for the observation record itself. Separate from photo licenses. Values: `"cc-by"`, `"cc-by-nc"`, `"cc-by-sa"`, `"cc-by-nd"`, `"cc-by-nc-sa"`, `"cc-by-nc-nd"`, `"cc0"`, or `null` for "all rights reserved". |
| `out_of_range` | boolean | Community has flagged that the taxon is outside its expected range at this location. |
| `spam` | boolean | Flagged as spam. |
| `mappable` | boolean | Whether to show on public maps (requires non-private geoprivacy and verifiable status). |
| `uri` | string | Canonical URL: `"https://www.inaturalist.org/observations/{id}"`. |
| `url` | string | Same as `uri` in some contexts. |
| `num_identification_agreements` | integer | Count of identifications that agree with the community taxon. |
| `num_identification_disagreements` | integer | Count of identifications that disagree. |
| `identifications_most_agree` | boolean | `true` if `num_agreements > num_disagreements`. |
| `identifications_most_disagree` | boolean | `true` if `num_disagreements >= num_agreements`. |
| `place_ids` | integer array | IDs of all places (at any admin level) that contain this observation's coordinates. |
| `project_ids` | integer array | IDs of projects that include this observation. |
| `reviewed_by` | integer array | User IDs of people who have "reviewed" this observation. |
| `faves_count` | integer | Number of users who have faved this observation. |
| `comments_count` | integer | Number of comments. |
| `identifications_count` | integer | Total identification records (including withdrawn). |

### 2.2 Nested Objects Embedded in an Observation

- `user`: observer user object (see §7)
- `taxon`: the observer's current identified taxon (see §5)
- `community_taxon`: same structure as `taxon`; the community consensus taxon (may be null)
- `identifications`: array of identification objects (see §3)
- `photos`: array of photo objects (see §6.1)
- `sounds`: array of sound objects (see §6.2)
- `annotations`: array of annotation objects (see §9)
- `observation_field_values`: array of observation field value objects (see §10)
- `ofvs`: shorthand alias for `observation_field_values`
- `tags`: array of free-text tag strings
- `comments`: array of comment objects (user + body + timestamp)
- `faves`: array of fave objects (user + vote score)
- `outlinks`: array of external links (source + url)
- `votes`: array of vote objects (vote_scope, vote_flag, user)
- `quality_metrics`: array of data quality votes cast via the DQA
- `conservation_status`: conservation status info for the identified taxon

---

## 3. Identification System

Each observation can have multiple identifications from different users. Identifications are the mechanism by which iNaturalist assigns a community consensus taxon.

### 3.1 Identification Object Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | Unique identification record ID. |
| `uuid` | UUID string | |
| `created_at` | ISO 8601 datetime | |
| `updated_at` | ISO 8601 datetime | |
| `taxon_id` | integer | Taxon this identification proposes. |
| `taxon` | object | Embedded taxon object. |
| `user` | object | User who made the identification. |
| `user_id` | integer | |
| `observation_id` | integer | |
| `current` | boolean | `false` if the user has since superseded or withdrawn it. |
| `current_taxon` | boolean | `true` if the proposed taxon is still the current observation taxon. |
| `withdrawn` | boolean | User explicitly withdrew this identification. |
| `hidden` | boolean | Hidden by moderators or the observer. |
| `disagreement` | boolean | `true` if this ID explicitly disagrees with an existing broader ID (the user was prompted "does this look like X?" and answered no). |
| `vision` | boolean | Identification was made with the help of iNaturalist's computer vision suggestions. |
| `category` | enum string | One of: `"improving"`, `"supporting"`, `"leading"`, `"maverick"`. See §3.2. |
| `body` | string or null | Optional text comment explaining the identification. |
| `flags` | array | Flags (spam/inappropriate) on this identification. |
| `moderator_actions` | array | Moderator actions taken on this identification. |
| `taxon_change` | object or null | If the taxon was subsequently merged/split, links to the taxon change. |

### 3.2 Identification Categories

iNaturalist classifies each current (non-withdrawn) identification into one of four categories:

| Category | Meaning |
|----------|---------|
| `improving` | The first ID that moved the community taxon to a finer rank than it was before. Typically the first identification at species level when the community was previously at genus. |
| `supporting` | Agrees with the current community taxon at the same or finer rank. Adds weight to the existing consensus. |
| `leading` | The most specific current ID, but the community has not yet converged to that rank (not enough agreeing IDs). Essentially a proposal waiting for support. |
| `maverick` | Disagrees with the community taxon at a rank that conflicts with the majority view. These "outlier" IDs do not count toward the community taxon but are preserved and visible. |

### 3.3 Community Taxon Algorithm

The community taxon is the finest-rank taxon that more than 2/3 of the current (non-withdrawn) identifications agree with. The algorithm:

1. Collect all current identifications.
2. For each identification, the identifier implicitly agrees with all taxa ancestral to their proposed taxon (e.g., an ID of *Panthera leo* implies agreement with *Panthera*, *Felidae*, *Carnivora*, etc.).
3. Starting from the finest rank proposed and working upward, find the most specific taxon where the proportion of agreeing identifiers exceeds 2/3.
4. That taxon becomes the community taxon.
5. If no taxon reaches the 2/3 threshold at any rank, the community taxon is null.

The **observation taxon** (the one displayed as the primary label) is normally set to the community taxon, but if the observer opts out of community ID, it stays as their own identification.

### 3.4 Disagreement Mechanics

When a user adds an ID that is a sibling or cousin of an existing ID (not an ancestor), iNaturalist prompts them: "This is a different {rank} from the previous ID. Is this organism not a {previous taxon}?" If they confirm, the `disagreement` flag is set to `true` on their identification, and the community taxon rolls back to the lowest common ancestor of the two conflicting IDs.
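
The five steps in §3.3 can be sketched as follows. This is a simplified model, not iNaturalist's actual implementation: it assumes each current identification is given as its full lineage (taxon names from the root down to the proposed taxon), and it ignores the explicit `disagreement` flag from §3.4. The names `lineage`, `depth_in` and `community_taxon` are hypothetical.

```ocaml
(* Assumption for this sketch: one identification = the list of taxon names
   from the root down to the proposed taxon. *)
type lineage = string list

(* Position of [taxon] within a lineage (0 = root), if it occurs there. *)
let depth_in (taxon : string) (l : lineage) : int option =
  let rec go i = function
    | [] -> None
    | t :: rest -> if t = taxon then Some i else go (i + 1) rest
  in
  go 0 l

(* Finest taxon that strictly more than 2/3 of the identifications agree
   with, or None when no taxon reaches the threshold. *)
let community_taxon (ids : lineage list) : string option =
  let total = List.length ids in
  let candidates = List.sort_uniq compare (List.concat ids) in
  let supported taxon =
    (* An identification agrees with every taxon on its own lineage. *)
    let agreeing = List.filter (fun l -> List.mem taxon l) ids in
    (* agreeing / total > 2/3, kept in integers to avoid float rounding. *)
    if 3 * List.length agreeing > 2 * total then
      match agreeing with
      | l :: _ -> Option.map (fun d -> (d, taxon)) (depth_in taxon l)
      | [] -> None
    else None
  in
  candidates
  |> List.filter_map supported
  |> List.fold_left
       (fun best (d, t) ->
         match best with
         | Some (bd, _) when bd >= d -> best  (* keep the deeper taxon *)
         | _ -> Some (d, t))
       None
  |> Option.map snd
```

With two identifications of *Panthera leo* and one of *Panthera pardus*, exactly 2/3 agree on the species, which does not exceed the threshold, so the function returns `Some "Panthera"`. A third agreeing *Panthera leo* identification (3 of 4) moves the result to `Some "Panthera leo"`.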

### 3.5 Maverick IDs

An identification marked `"maverick"` is one that is inconsistent with the current community taxon (i.e., the identifier has chosen a taxon that is not a descendant of, or ancestor to, the community taxon). Maverick IDs are retained and visible, and can shift the community taxon if subsequent identifiers agree with them.

---

## 4. Quality Grades

iNaturalist assigns one of three quality grades to every observation.

### 4.1 Grade Definitions

| Grade | Meaning |
|-------|---------|
| `casual` | Fails minimum verifiability criteria, or community has voted it down via the DQA. Not shared with GBIF or other data partners. |
| `needs_id` | Verifiable (has date, location, media, wild organism) but community has not yet reached consensus at species level with sufficient agreement. |
| `research` | Verifiable AND the community taxon is at species rank or finer with more than 2/3 agreement, OR the community has voted "as good as it can be" via the DQA. Shared with GBIF and downstream partners. |

### 4.2 Criteria for "Verifiable" (prerequisite for needs_id and research)

An observation must have all four of:
1. A date
2. Geographic coordinates
3. At least one photo or sound
4. Organism is wild or naturalised (not captive/cultivated)

### 4.3 Data Quality Assessment (DQA) Votes

Any user can vote on a set of quality flags, each of which can push an observation toward `casual`:

| DQA flag | Effect if majority votes "no" |
|----------|------------------------------|
| Date is accurate | → casual |
| Location is accurate | → casual |
| Organism is wild/naturalized | → casual |
| Evidence of organism (not just habitat photo) | → casual |
| Evidence is recent (not > ~100 years old) | → casual |
| Single subject (not multiple unrelated taxa) | → casual |
| No artificial manipulation of image/sound | → casual |
| ID is supported (community can ID to family or below) | → casual |

Through the DQA, users can also vote that the identification is "as good as it can be" (the community agrees no further identification is possible), which allows research grade at genus or higher.

### 4.4 Automatic Captive/Cultivated Voting

If 80% or more of all observations of a taxon in a 0.2° × 0.2° cell are voted captive/cultivated, new observations of that taxon in that cell are automatically flagged captive (pushed to casual) unless the observer asserts otherwise.

---

## 5. Taxon Model

### 5.1 Taxon Object Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | Unique taxon ID within iNaturalist's taxonomy. |
| `uuid` | UUID string | |
| `name` | string | Scientific name (binomial for species, uninomial for higher ranks). |
| `rank` | string | Taxonomic rank: `"kingdom"`, `"phylum"`, `"class"`, `"order"`, `"family"`, `"genus"`, `"species"`, `"subspecies"`, and many intermediate ranks (`"tribe"`, `"subtribe"`, `"variety"`, etc.). |
| `rank_level` | number | Numeric rank for comparison. Lower = finer. Species = 10, genus = 20, family = 30, order = 40, class = 50, phylum = 60, kingdom = 70. A few intermediate ranks use half steps (see §5.2). |
| `ancestry` | string | Slash-separated string of ancestor taxon IDs from root to immediate parent, e.g. `"48460/1/2/355675/3"`. |
| `ancestor_ids` | integer array | Same information as `ancestry` but as an array, in order from root to immediate parent. |
| `parent_id` | integer | Immediate parent taxon ID. |
| `iconic_taxon_id` | integer | ID of the iconic taxon (broad grouping used for display). |
| `iconic_taxon_name` | string | One of: `"Animalia"`, `"Plantae"`, `"Fungi"`, `"Chromista"`, `"Protozoa"`, `"Mollusca"`, `"Reptilia"`, `"Aves"`, `"Amphibia"`, `"Actinopterygii"`, `"Mammalia"`, `"Insecta"`, `"Arachnida"`, `"unknown"`. |
| `preferred_common_name` | string or null | Preferred common name in the user's locale. |
| `names` | array | All common names across all locales (only included if `all_names` param requested). |
| `is_active` | boolean | `false` if taxon has been synonymised, split, or otherwise inactivated. |
| `extinct` | boolean | Taxon is considered extinct. |
| `gbif_id` | integer or null | Corresponding GBIF backbone taxon ID, if linked. |
| `wikipedia_url` | string or null | |
| `wikipedia_summary` | string or null | Short excerpt from Wikipedia. |
| `complete_rank` | string or null | For "complete" taxa (where all children are represented), the finest rank of completeness. |
| `complete_species_count` | integer | Count of species in this taxon's subtree. |
| `observations_count` | integer | Total observations of this taxon and its descendants on iNaturalist. |
| `vision` | boolean | Taxon is included in iNaturalist's computer vision model. |
| `default_photo` | object | A single photo representing this taxon (thumbnail). |
| `taxon_photos` | array | All photos curated for this taxon page. |
| `conservation_status` | object or null | Conservation status in the observer's jurisdiction (IUCN category, source, authority). |
| `listed_taxa` | array | Species list associations. |
| `establishment_means` | object or null | Whether the taxon is native, introduced, or endemic at the observation location. |

### 5.2 Rank Level Reference

```
stateofmatter = 100
kingdom = 70
subkingdom = 67
phylum = 60
subphylum = 57
superclass = 53
class = 50
subclass = 47
infraclass = 45
subterclass = 44
superorder = 43
order = 40
suborder = 37
infraorder = 35
parvorder = 34.5
zoosection = 34
zoosubsection = 33.5
superfamily = 33
epifamily = 32
family = 30
subfamily = 27
supertribe = 26
tribe = 25
subtribe = 24
genus = 20
subgenus = 15
section = 13
subsection = 12
species = 10
subspecies = 5
variety = 5
form = 5
```

---

## 6. Photo and Sound Attachments

### 6.1 Photo Object

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | Photo record ID. |
| `uuid` | UUID string | |
| `url` | string | URL of the square thumbnail (75×75 px). Pattern: `"https://static.inaturalist.org/photos/{id}/square.jpg?..."`. Replace `square` with `thumb` (100px), `small` (240px), `medium` (500px), `large` (1024px), or `original` for other sizes. |
| `original_dimensions` | object | `{ width: int, height: int }` of the original image. |
| `license_code` | string or null | Per-photo CC license code. May differ from the observation's `license_code`. Values: `"cc-by"`, `"cc-by-nc"`, `"cc-by-sa"`, `"cc-by-nd"`, `"cc-by-nc-sa"`, `"cc-by-nc-nd"`, `"cc0"`, or `null` ("all rights reserved"). |
| `attribution` | string | Human-readable attribution string, e.g. `"(c) Alice Smith, some rights reserved (CC BY-NC)"`. |
| `flags` | array | Moderation flags. |
| `native_photo_id` | string or null | ID in the originating service (Flickr, etc.) if synced. |

### 6.2 Sound Object

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | Sound record ID. |
| `uuid` | UUID string | |
| `file_url` | string | URL to the audio file (typically MP3 or WAV). |
| `file_content_type` | string | MIME type, e.g. `"audio/mpeg"`. |
| `license_code` | string or null | Same CC license codes as photos. |
| `attribution` | string | Human-readable attribution. |

### 6.3 iNaturalist Open Data (AWS S3)

iNaturalist publishes a monthly snapshot of openly-licensed photos to a public S3 bucket (`s3://inaturalist-open-data/`). The snapshot has four metadata files, which are tab-separated despite the `.csv` extension:

- `observations.csv`: `observation_uuid`, `observer_id`, `latitude`, `longitude`, `positional_accuracy`, `taxon_id`, `quality_grade`, `observed_on`
- `photos.csv`: `photo_uuid`, `photo_id`, `observation_uuid`, `observer_id`, `extension`, `license`, `width`, `height`, `position`
- `taxa.csv`: `taxon_id`, `ancestry`, `rank_level`, `rank`, `name`, `active`
- `observers.csv`: `observer_id`, `login`, `name`

Photo files are at `s3://inaturalist-open-data/photos/{photo_id}/{size}.{ext}` where size ∈ `{square, small, medium, large, original}`.

---

## 7. User Model

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | User ID (stable, numeric). |
| `uuid` | UUID string | |
| `login` | string | Username (URL-safe, changeable but rarely changed in practice). |
| `name` | string or null | Display name (free text). |
| `icon` | string or null | URL to profile thumbnail. |
| `icon_url` | string or null | Same as `icon` in some contexts. |
| `observations_count` | integer | Total non-deleted observations. |
| `identifications_count` | integer | Total identifications made. |
| `journal_posts_count` | integer | |
| `species_count` | integer | Distinct species observed. |
| `created_at` | ISO 8601 datetime | Account creation date. |
| `site_id` | integer | Which iNaturalist network node (iNat.org=1, iNat.ca=3, etc.) the user belongs to. |
| `roles` | array of strings | Site roles, e.g. `"curator"`, `"admin"`. |
| `orcid` | string or null | ORCID iD if the user has linked one. |

---

## 8. Projects

iNaturalist has three project types:

### 8.1 Traditional Projects

- Observations must be manually added by a project member.
- The observer must join the project to add observations.
- Administrators can access private/obscured coordinates of member observations.
- Can require observation fields to be filled in.
- Can maintain a species checklist.

### 8.2 Collection Projects

- Automatically aggregates observations matching a saved filter (species, place, date range, etc.).
- Observations are not explicitly added; membership is not required.
- Administrators cannot access private/obscured coordinates.
- Equivalent to a bookmarked search with a project page.

### 8.3 Umbrella Projects

- A collection of other projects (traditional or collection).
- Aggregates their observations into one page.

### 8.4 Project Object Fields (key fields)

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | Project ID. |
| `title` | string | Human-readable name. |
| `slug` | string | URL-safe identifier: `inaturalist.org/projects/{slug}`. |
| `project_type` | string | `"collection"`, `"umbrella"`, or `null` (traditional). |
| `description` | string | |
| `icon` | string | URL to project icon. |
| `banner_color` | string | Hex colour. |
| `location` | string | Lat/lng if pinned to a location. |
| `place_id` | integer or null | Associated place. |
| `user` | object | Project admin/creator. |
| `members_count` | integer | |
| `observations_count` | integer | |
| `species_count` | integer | |

---

## 9. Annotations

Annotations are structured key-value tags applied to observations. Each attribute has a controlled vocabulary of allowed values. Annotations are voted on; the displayed value is the one with the most positive votes.

### 9.1 Current Annotation Attributes and Values

| Attribute | Applies to | Allowed Values |
|-----------|-----------|----------------|
| **Life Stage** | Animals | Adult, Egg, Juvenile, Larva, Nymph, Pupa, Subimago, Teneral |
| **Sex** | Animals | Female, Male, Cannot Be Determined |
| **Plant Phenology** | Plants | Flowering, Flower Budding, Fruiting, No Evidence of Flowering |
| **Alive or Dead** | Animals | Alive, Dead, Cannot Be Determined |
| **Evidence of Presence** | Non-plant, non-human | Bone, Construction, Feather, Egg, Gall, Hair, Leafmine, Molt, Organism, Scat, Track |
| **Established** | Amphibians and Reptiles | Not Established (for escaped/vagrant animals outside established populations) |

### 9.2 Annotation Object Fields

| Field | Type | Description |
|-------|------|-------------|
| `uuid` | UUID string | |
| `controlled_attribute_id` | integer | ID of the attribute (Life Stage=1, Sex=9, Plant Phenology=12, Alive or Dead=17, Evidence of Presence=22, Established=35, etc.). |
| `controlled_attribute` | object | `{ id, label, uri }`: the attribute definition.
| 395 + | `controlled_value_id` | integer | ID of the selected value. | 396 + | `controlled_value` | object | `{ id, label, uri }` — the value definition. | 397 + | `user` | object | User who added the annotation. | 398 + | `user_id` | integer | | 399 + | `votes` | array | Vote records on this annotation. Each vote has `{ user_id, vote_flag: bool }`. | 400 + | `vote_score` | integer | Net upvotes minus downvotes. | 401 + | `concatenated_attr_val` | string | Convenience string: `"{attribute_label}={value_label}"`, e.g. `"Life Stage=Adult"`. | 402 + 403 + Annotations are retrieved via `GET /controlled_terms` (returns all attributes) and `GET /controlled_terms/for_taxon?taxon_id={id}` (returns applicable attributes). 404 + 405 + --- 406 + 407 + ## 10. Observation Fields 408 + 409 + Observation fields (OFVs) are project-defined or community-defined custom key-value fields. Any user can define a field and apply it to any observation. 410 + 411 + ### 10.1 Observation Field Object 412 + 413 + ```json 414 + { 415 + "id": 25, 416 + "name": "Associated species", 417 + "description": "A second species that was seen near this organism", 418 + "datatype": "taxon", 419 + "allowed_values": null, 420 + "units": null, 421 + "users_count": 14823, 422 + "values_count": 34021 423 + } 424 + ``` 425 + 426 + | Field | Type | Description | 427 + |-------|------|-------------| 428 + | `id` | integer | Field definition ID. | 429 + | `name` | string | Field name (not unique globally; users can create duplicates). | 430 + | `description` | string | What the field captures. | 431 + | `datatype` | string | One of: `"text"`, `"numeric"`, `"date"`, `"time"`, `"datetime"`, `"taxon"`, `"dna"`. | 432 + | `allowed_values` | string or null | Pipe-delimited list of allowed values for text fields acting as enumerations, e.g. `"yes\|no\|maybe"`. | 433 + | `units` | string or null | Unit label for numeric fields. 
| 434 + 435 + ### 10.2 Observation Field Value Object (on an observation) 436 + 437 + ```json 438 + { 439 + "id": 12345678, 440 + "uuid": "...", 441 + "field_id": 25, 442 + "observation_field": { "id": 25, "name": "Associated species", "datatype": "taxon" }, 443 + "value": "Quercus robur", 444 + "taxon": { ... }, 445 + "user": { ... }, 446 + "created_at": "2023-05-10T12:00:00Z", 447 + "updated_at": "2023-05-10T12:00:00Z" 448 + } 449 + ``` 450 + 451 + --- 452 + 453 + ## 11. iNaturalist API 454 + 455 + ### 11.1 Base URL and Authentication 456 + 457 + - Base URL: `https://api.inaturalist.org/v1/` 458 + - Authentication: JSON Web Token (JWT) in `Authorization` header. Obtain via `/users/api_token` using an OAuth2 access token from the v1 Rails app. 459 + - Rate limits: ~1 request/second, ~10,000 requests/day per IP. Bulk data should use the open data S3 snapshot or GBIF exports instead. 460 + 461 + ### 11.2 Key Endpoints 462 + 463 + | Endpoint | Method | Description | 464 + |----------|--------|-------------| 465 + | `/observations` | GET | Search observations with 100+ filter parameters. Key params: `taxon_id`, `taxon_name`, `place_id`, `user_id`, `quality_grade`, `license`, `photos`, `sounds`, `geoprivacy`, `lat`, `lng`, `radius`, `d1`, `d2`, `per_page` (max 200), `page`. | 466 + | `/observations/{id}` | GET | Single observation by ID. Returns full detail including all nested objects. | 467 + | `/observations/species_counts` | GET | Aggregated species counts for a set of observations. | 468 + | `/observations/identifiers` | GET | Users who have identified, with counts. | 469 + | `/observations/observers` | GET | Observers, with counts. | 470 + | `/observations/histogram` | GET | Time series histogram. | 471 + | `/taxa` | GET | Search taxa. Params: `q`, `rank`, `iconic_taxa`, `per_page`. | 472 + | `/taxa/{id}` | GET | Taxon by ID (accepts comma-separated list). | 473 + | `/taxa/autocomplete` | GET | Autocomplete for taxon names; used for ID suggestions. 
| 474 + | `/identifications` | GET | Search identifications. Params: `observation_id`, `user_id`, `taxon_id`, `category`, `current`. | 475 + | `/identifications/{id}` | GET | Single identification. | 476 + | `/places/{id}` | GET | Place by ID or slug. | 477 + | `/places/nearby` | GET | Places near a bounding box. | 478 + | `/projects` | GET | Search projects. | 479 + | `/projects/{id}` | GET | Project details. | 480 + | `/controlled_terms` | GET | All annotation attribute definitions. | 481 + | `/controlled_terms/for_taxon` | GET | Annotation attributes applicable to a taxon. | 482 + | `/users/{id}` | GET | User profile. | 483 + | `/search` | GET | Site-wide search (taxa, places, projects, users). | 484 + 485 + ### 11.3 Pagination 486 + 487 + - Default `per_page` = 30, maximum = 200. 488 + - Pagination via `page` parameter. 489 + - Hard cap: only the first 10,000 results are accessible via pagination (offset limit in Elasticsearch). For larger datasets use the open data export. 490 + 491 + ### 11.4 Response Envelope 492 + 493 + ```json 494 + { 495 + "total_results": 1234, 496 + "page": 1, 497 + "per_page": 30, 498 + "results": [ ... ] 499 + } 500 + ``` 501 + 502 + --- 503 + 504 + ## 12. Export Formats 505 + 506 + ### 12.1 CSV Export (inaturalist.org/observations/export) 507 + 508 + Available to any logged-in user. 
Columns include: 509 + 510 + | Column | Description | 511 + |--------|-------------| 512 + | `id` | Observation ID | 513 + | `observed_on_string` | Observation date as entered | 514 + | `observed_on` | `YYYY-MM-DD` | 515 + | `time_observed_at` | ISO 8601 UTC datetime or blank | 516 + | `user_id` | Observer user ID | 517 + | `user_login` | Observer username | 518 + | `created_at` | Record creation datetime | 519 + | `updated_at` | Last update datetime | 520 + | `quality_grade` | `casual`, `needs_id`, or `research` | 521 + | `license` | License code for the observation | 522 + | `url` | Canonical URL | 523 + | `image_url` | URL of first photo (square size) | 524 + | `sound_url` | URL of first sound | 525 + | `tag_list` | Comma-delimited free-text tags | 526 + | `description` | Observer notes | 527 + | `num_identification_agreements` | | 528 + | `num_identification_disagreements` | | 529 + | `captive_cultivated` | Boolean | 530 + | `oauth_application_id` | App used to create observation | 531 + | `place_guess` | Free-text location name | 532 + | `latitude` | Potentially obscured decimal degrees | 533 + | `longitude` | Potentially obscured decimal degrees | 534 + | `positional_accuracy` | Metres (inflated for obscured) | 535 + | `geoprivacy` | | 536 + | `taxon_geoprivacy` | | 537 + | `coordinates_obscured` | Boolean | 538 + | `positioning_method` | GPS, manual, etc. | 539 + | `positioning_device` | Device description | 540 + | `species_guess` | Free-text species name | 541 + | `scientific_name` | Scientific name of community taxon | 542 + | `common_name` | Common name of community taxon | 543 + | `iconic_taxon_name` | Broad group (Aves, Insecta, etc.) | 544 + | `taxon_id` | Community taxon ID | 545 + 546 + ### 12.2 Darwin Core Archive (DwC-A) to GBIF 547 + 548 + iNaturalist publishes all research-grade observations with open licenses to GBIF weekly as a Darwin Core Archive. The GBIF dataset is at https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7. 
549 + 550 + Key Darwin Core term mappings from iNaturalist: 551 + 552 + | DwC Term | iNaturalist Source | 553 + |----------|-------------------| 554 + | `occurrenceID` | `https://www.inaturalist.org/observations/{id}` | 555 + | `basisOfRecord` | `HumanObservation` | 556 + | `scientificName` | Community taxon `name` | 557 + | `taxonRank` | Community taxon `rank` | 558 + | `kingdom`, `phylum`, `class`, `order`, `family`, `genus` | Extracted from taxon ancestry | 559 + | `decimalLatitude` | `latitude` | 560 + | `decimalLongitude` | `longitude` | 561 + | `coordinateUncertaintyInMeters` | `positional_accuracy` | 562 + | `eventDate` | `observed_on` or `time_observed_at` | 563 + | `recordedBy` | Observer `login` or `name` | 564 + | `license` | Observation `license_code` | 565 + | `rightsHolder` | Observer name | 566 + | `datasetName` | `"iNaturalist Research-grade Observations"` | 567 + | `collectionCode` | `"Observations"` | 568 + | `institutionCode` | `"iNaturalist"` | 569 + | `gbifID` | Assigned by GBIF on ingestion | 570 + 571 + --- 572 + 573 + ## 13. Licensing 574 + 575 + ### 13.1 License Levels 576 + 577 + iNaturalist applies licenses at two independent levels: 578 + 1. **Observation record** (metadata: species, date, location) — controlled by `license_code` on the observation object. 579 + 2. **Photos and sounds** — each photo/sound has its own `license_code`. 580 + 581 + Both can be set independently by the observer. 582 + 583 + ### 13.2 License Values 584 + 585 + | iNaturalist `license_code` | SPDX Identifier | Description | 586 + |---------------------------|----------------|-------------| 587 + | `cc0` | `CC0-1.0` | Public domain dedication; no rights reserved. | 588 + | `cc-by` | `CC-BY-4.0` | Attribution required. | 589 + | `cc-by-nc` | `CC-BY-NC-4.0` | Attribution + non-commercial use only. **Default for new accounts.** | 590 + | `cc-by-sa` | `CC-BY-SA-4.0` | Attribution + share-alike. | 591 + | `cc-by-nd` | `CC-BY-ND-4.0` | Attribution + no derivatives. 
| 592 + | `cc-by-nc-sa` | `CC-BY-NC-SA-4.0` | Attribution + non-commercial + share-alike. | 593 + | `cc-by-nc-nd` | `CC-BY-NC-ND-4.0` | Attribution + non-commercial + no derivatives. | 594 + | `null` | *(none)* | All rights reserved; no reuse without explicit permission. | 595 + 596 + ### 13.3 GBIF Eligibility 597 + 598 + Only observations (and their photos) with `cc0`, `cc-by`, or `cc-by-nc` licenses are shared with GBIF. 599 + 600 + --- 601 + 602 + ## 14. Geoprivacy 603 + 604 + ### 14.1 Settings 605 + 606 + | `geoprivacy` value | Behaviour | 607 + |--------------------|-----------| 608 + | `"open"` (or `null`) | Exact coordinates are public. `obscured=false`. | 609 + | `"obscured"` | Coordinates are randomised to a point within a 0.2° × 0.2° bounding box containing the true location. `obscured=true`. `positional_accuracy` is set to the diameter of this cell (~22,000 m at equator; smaller at higher latitudes). The true coordinates are stored in private fields (`private_latitude`, `private_longitude`) not visible publicly. | 610 + | `"private"` | No coordinates appear in the public API response at all. `latitude` and `longitude` are `null`. Not shared with GBIF or data partners. | 611 + 612 + ### 14.2 Taxon Geoprivacy 613 + 614 + If the community taxon has an at-risk conservation status (e.g. IUCN Vulnerable or higher, or a national red list status), iNaturalist automatically obscures coordinates regardless of the observer's geoprivacy setting. The `taxon_geoprivacy` field records this. The effective geoprivacy is the more restrictive of `geoprivacy` and `taxon_geoprivacy`. 615 + 616 + ### 14.3 Trusted Access 617 + 618 + The true coordinates are visible to: 619 + - The observer themselves 620 + - Users the observer has granted "trust" to 621 + - Curators of traditional projects that the observation belongs to (if the observer granted the project trust) 622 + 623 + --- 624 + 625 + ## 15. 
Place System 626 + 627 + Places are named geographic regions in iNaturalist, used for filtering observations and for associating species checklists with a region. 628 + 629 + ### 15.1 Place Types 630 + 631 + Standard admin levels: `"Country"` (admin_level=0), `"State"` (10), `"County"` (20), `"Town"` (30). Also: `"Open Space"`, `"National Park"`, `"Continent"`, `"Island"`, `"Point of Interest"`, and community-created custom places. 632 + 633 + ### 15.2 Place Object Fields 634 + 635 + | Field | Type | Description | 636 + |-------|------|-------------| 637 + | `id` | integer | Place ID. | 638 + | `name` | string | | 639 + | `display_name` | string | Full name with country/state context. | 640 + | `place_type` | string | Type name. | 641 + | `admin_level` | integer or null | 0=country, 10=state, 20=county, 30=town. | 642 + | `ancestry` | string | Slash-delimited parent place IDs. | 643 + | `parent_id` | integer or null | Immediate parent place ID. | 644 + | `bbox_area` | float | Area of bounding box in square degrees. | 645 + | `latitude` | float | Centroid latitude. | 646 + | `longitude` | float | Centroid longitude. | 647 + | `swlat`, `swlng`, `nelat`, `nelng` | float | Bounding box corners. | 648 + | `geometry_geojson` | object | Full polygon geometry as GeoJSON. | 649 + | `check_list_id` | integer | ID of the default species checklist for this place. | 650 + 651 + ### 15.3 Observations and Places 652 + 653 + Each observation carries a `place_ids` array containing the IDs of all places (at any hierarchy level) whose polygon contains the observation's coordinates. This enables filtering by any place, including deeply nested administrative subdivisions. 654 + 655 + --- 656 + 657 + ## 16. Mapping Plan: iNaturalist → Terradots 658 + 659 + This section defines the mapping from each iNaturalist concept to the Terradots label store data model (`Terradots.mli`). 660 + 661 + ### 16.1 Observation → Label 662 + 663 + Each iNaturalist observation maps to a single Terradots `label` constructed via `make_imported`.
664 + 665 + ### 16.2 Field-by-Field Mapping 666 + 667 + | iNaturalist Field | Terradots Field | Mapping Notes | 668 + |-------------------|----------------|---------------| 669 + | `latitude`, `longitude` | `geometry = Point { x=longitude; y=latitude }` | WGS84 (EPSG:4326) Point. x=lon, y=lat per Terradots convention. | 670 + | `positional_accuracy` | `origin.Measured.accuracy_m` | Direct mapping. Already in metres. For obscured observations this will be ~22,000 m, which faithfully captures the uncertainty introduced by obscuration. | 671 + | `observed_on` / `time_observed_at` | `event_date` | Use `time_observed_at` (ISO 8601 datetime) if available, otherwise `observed_on` (date only). Both are valid Darwin Core `eventDate` formats. | 672 + | `id` | `origin.Measured.via` = `"inaturalist:observation/{id}"` | The `via` URI uniquely identifies the source record. Matches the URI scheme already defined in `Terradots.mli`: `inaturalist:observation/12345`. | 673 + | `uuid` | `properties [("inaturalist_uuid", uuid)]` | Preserve as property for round-trip fidelity. | 674 + | `user.id` | `origin.Measured.observer` = `"inaturalist:user/{user_id}"` | Encodes the observer as a URI in the iNaturalist namespace. If the user has an ORCID (`user.orcid`), prefer `"orcid:{orcid}"` instead. | 675 + | `license_code` | `origin.Measured.license` | Map to SPDX: `cc0` → `"CC0-1.0"`, `cc-by` → `"CC-BY-4.0"`, `cc-by-nc` → `"CC-BY-NC-4.0"`, etc. `null` → `None` (no free license). | 676 + | Community taxon + identifications | `class_dist` | **The key mapping.** See §16.3 below. | 677 + | `quality_grade` | `confidence` | Map `"research"` → `0.95`, `"needs_id"` → `0.5`, `"casual"` → `0.1` (or omit), matching the table in §16.4. Also store the raw value in `properties`. | 678 + | `obscured` | `properties [("geoprivacy", "obscured")]` | Store geoprivacy status in properties. When `obscured=true` the `positional_accuracy` is already inflated, so `accuracy_m` correctly reflects the uncertainty.
| 679 + | `geoprivacy` / `taxon_geoprivacy` | `properties` | Store both raw values for downstream consumers that need to know why coordinates were obscured. | 680 + | Photos | `properties [("inaturalist_photos", ...)]` | See §16.5. | 681 + | Sounds | `properties [("inaturalist_sounds", ...)]` | See §16.5. | 682 + | `place_guess` | `properties [("place_guess", place_guess)]` | Human-readable location hint. | 683 + | `captive` | `properties [("captive", "true")]` | Flag only when `true`. | 684 + | `description` | `properties [("description", description)]` | Observer notes. | 685 + | `species_guess` | `properties [("species_guess", species_guess)]` | Original free-text species guess. | 686 + | `taxon.iconic_taxon_name` | `properties [("iconic_taxon", name)]` | Broad group (Aves, Insecta, etc.). | 687 + | `taxon.rank` | `properties [("taxon_rank", rank)]` | | 688 + | `taxon.ancestry` | `properties [("taxon_ancestry", ancestry)]` | Slash-delimited ancestor IDs. | 689 + | Annotations | `properties` | See §16.6. | 690 + | Observation fields | `properties` | See §16.7. | 691 + | Project membership | `groups` | See §16.8. | 692 + 693 + ### 16.3 Identifications → class_dist 694 + 695 + This is the most semantically rich mapping. The Terradots `class_dist` field is a probability distribution over class names — exactly what the iNaturalist identification system computes. 696 + 697 + **Algorithm:** 698 + 699 + 1. Collect all `current=true` (non-withdrawn) identifications. 700 + 2. For each identification, the identifier's implicit weight in the consensus is 1 vote for the proposed taxon and all its ancestors. 701 + 3. Compute per-taxon vote counts from all current identifications. 702 + 4. Compute the total voter count N = number of distinct current identifiers. 703 + 5. Derive probabilities: `p(taxon) = votes_for_taxon / N`. 704 + 6. Use the **scientific name** of each taxon as the class name string. 705 + 7. 
Only include taxa at species rank or finer in the distribution (higher ranks can be included as properties or annotations separately). 706 + 707 + **Example:** 3 identifiers propose *Calochortus vestae*, 1 proposes *Calochortus venustus*. Both taxa have *Calochortus* as parent. 708 + 709 + - *Calochortus vestae*: 3/4 = 0.75 710 + - *Calochortus venustus*: 1/4 = 0.25 711 + - *Calochortus* (genus): 4/4 = 1.0 (but we may omit genus-level entries in favour of species) 712 + 713 + Recommended: include only the finest-rank taxa proposed by at least one identifier, normalised to sum to 1.0: 714 + 715 + ``` 716 + class_dist = [("Calochortus vestae", 0.75); ("Calochortus venustus", 0.25)] 717 + ``` 718 + 719 + **Single-taxon research grade case:** All identifiers agree → `class_dist = [("Calochortus vestae", 1.0)]`. 720 + 721 + **Degenerate case (community taxon at genus level):** Report the genus as the sole class: 722 + 723 + ``` 724 + class_dist = [("Calochortus", 1.0)] 725 + ``` 726 + 727 + **No identifications or community taxon only at family or above:** `class_dist = []` (unclassified) or `class_dist = [("Calochortaceae", 1.0)]` depending on the use case. 728 + 729 + **Alternative simpler approach** (for implementors who only want the community consensus): 730 + 731 + Use `community_taxon.name` as the sole class with `confidence` derived from `quality_grade`: 732 + 733 + ``` 734 + class_dist = [("Calochortus vestae", 1.0)] 735 + confidence = 1.0 (* research grade *) 736 + ``` 737 + 738 + The richer approach using all identifications is recommended because it preserves uncertainty information that is lost in the simpler approach. 739 + 740 + ### 16.4 quality_grade → confidence 741 + 742 + | `quality_grade` | Suggested `confidence` | Rationale | 743 + |-----------------|----------------------|-----------| 744 + | `"research"` | `0.95` (or `1.0`) | Community consensus at species level with ≥ 2/3 agreement; high reliability. 
| 745 + | `"needs_id"` | `0.5` | Some identification activity but no consensus yet. | 746 + | `"casual"` | `0.1` (or omit) | May lack date, location, or evidence; or organism not wild. Very low reliability. | 747 + 748 + Also store the raw `quality_grade` string in `properties [("quality_grade", quality_grade)]` so downstream systems can apply their own thresholds. 749 + 750 + ### 16.5 Photos and Sounds → properties 751 + 752 + Terradots has no dedicated media field. Store photo information as properties: 753 + 754 + ``` 755 + properties = [ 756 + ("inaturalist_photo_count", "2"); 757 + ("inaturalist_photo_0_id", "61482854"); 758 + ("inaturalist_photo_0_url", "https://static.inaturalist.org/photos/61482854/medium.jpg"); 759 + ("inaturalist_photo_0_license","CC-BY-NC-4.0"); 760 + ("inaturalist_photo_0_attr", "(c) Alice Smith, some rights reserved (CC BY-NC)"); 761 + ("inaturalist_sound_count", "0"); 762 + ] 763 + ``` 764 + 765 + **Limitation:** The Terradots `properties` field is `(string * string) list`, which means we linearise the photo array. A proper extension would add a `media` field typed as a list of `{ url; license; attribution; role }` records — see §17.1. 766 + 767 + ### 16.6 Annotations → properties 768 + 769 + Store each annotation as a property pair using the concatenated attribute/value label: 770 + 771 + ``` 772 + properties = [ 773 + ("annotation_life_stage", "Adult"); 774 + ("annotation_sex", "Female"); 775 + ("annotation_plant_phenology", "Flowering"); 776 + ("annotation_alive_or_dead", "Alive"); 777 + ("annotation_evidence", "Organism"); 778 + ] 779 + ``` 780 + 781 + This is lossy in that it discards vote counts. If vote scores are needed, include them: 782 + 783 + ``` 784 + ("annotation_life_stage_votes", "3"); 785 + ``` 786 + 787 + **Better approach:** The Terradots `annotation` type (free-text `text` anchored to label IDs) is not a natural fit for structured controlled-vocabulary annotations. See §17.2 for the recommended extension. 
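The flattening of annotations into property pairs described in §16.6 could be sketched as follows. This is a minimal illustration, not part of `Terradots.mli`: the `annotation` record and the helpers `property_key`, `properties_of_annotation`, and `properties_of_annotations` are hypothetical names, and the record keeps only the `concatenated_attr_val` and `vote_score` fields from §9.2.

```ocaml
(* Hypothetical, trimmed-down view of an iNaturalist annotation (see §9.2).
   Only the two fields needed for the flattening are kept. *)
type annotation = {
  concatenated_attr_val : string; (* e.g. "Life Stage=Adult" *)
  vote_score : int;               (* net upvotes minus downvotes *)
}

(* Build a property key from an attribute label: lowercase ASCII letters and
   digits are kept, every other character becomes an underscore. *)
let property_key (attribute_label : string) : string =
  let normalise c =
    match Char.lowercase_ascii c with
    | ('a' .. 'z' | '0' .. '9') as lc -> lc
    | _ -> '_'
  in
  "annotation_" ^ String.map normalise attribute_label

(* Split "Attribute=Value" at the first '=' and emit the value pair plus a
   companion "_votes" pair. Strings without '=' yield no properties. *)
let properties_of_annotation (a : annotation) : (string * string) list =
  let s = a.concatenated_attr_val in
  match String.index_opt s '=' with
  | None -> []
  | Some i ->
    let attribute = String.sub s 0 i in
    let value = String.sub s (i + 1) (String.length s - i - 1) in
    let key = property_key attribute in
    [ (key, value); (key ^ "_votes", string_of_int a.vote_score) ]

(* Flatten a whole annotation list into one property list. *)
let properties_of_annotations (annotations : annotation list) :
    (string * string) list =
  List.concat_map properties_of_annotation annotations
```

Note that this mechanical normalisation produces `annotation_evidence_of_presence` for the "Evidence of Presence" attribute, not the shortened `annotation_evidence` key used in the example above; an importer that wants the short keys would need an explicit lookup table instead.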
788 + 789 + ### 16.7 Observation Fields → properties 790 + 791 + Store as `("ofv_{field_name}", value)` pairs: 792 + 793 + ``` 794 + properties = [ 795 + ("ofv_associated_species", "Quercus robur"); 796 + ("ofv_microhabitat", "bark"); 797 + ("ofv_behaviour", "foraging"); 798 + ] 799 + ``` 800 + 801 + Sanitise `field_name` by lowercasing and replacing spaces/punctuation with underscores. Collisions are possible (two fields with the same name after normalisation); suffix with field ID if necessary. 802 + 803 + ### 16.8 Projects → groups 804 + 805 + Each project that includes the observation (from `project_ids`) should be represented as a Terradots `group`: 806 + 807 + ```ocaml 808 + { 809 + id = "inaturalist:project/4"; 810 + activity = Some activity_id; 811 + members = [ label_id_1; label_id_2; ... ]; 812 + } 813 + ``` 814 + 815 + **Collection projects** are automatically computed by iNaturalist and are equivalent to saved searches; these may not need to be materialised as Terradots groups unless the use case requires it. **Traditional projects** with explicit membership are more meaningful to preserve as groups. 816 + 817 + ### 16.9 Import Activity Record 818 + 819 + Create one `activity` record for the import batch: 820 + 821 + ```ocaml 822 + { 823 + activity_id = "import-inaturalist-2025-09-01"; 824 + agent = "inaturalist-importer:v1"; 825 + date = "2025-09-01"; 826 + description = Some "Bulk import of iNaturalist research-grade observations for taxon 64083"; 827 + } 828 + ``` 829 + 830 + --- 831 + 832 + ## 17. Concepts That Don't Map Cleanly 833 + 834 + ### 17.1 Media Attachments 835 + 836 + **Problem:** Terradots has no `media` field. Photos and sounds are first-class scientific evidence in iNaturalist — the photo IS the observation evidence, not just metadata. 
837 + 838 + **Recommended extension:** 839 + 840 + ```ocaml 841 + type media_item = { 842 + url : string; 843 + license : string option; (* SPDX *) 844 + attribution : string; 845 + role : string; (* "photo" | "sound" *) 846 + position : int; (* ordering within observation *) 847 + } 848 + ``` 849 + 850 + Add `media : media_item list` to the `label` type, or model it as a separate document-level collection keyed by label ID. 851 + 852 + ### 17.2 Structured Annotations (Controlled Vocabulary) 853 + 854 + **Problem:** The Terradots `annotation` type is free-text (`text : string`) intended for human commentary. iNaturalist annotations are structured (attribute ID + value ID + vote count) controlled vocabulary terms. Storing them as flat properties loses the structure. 855 + 856 + **Recommended extension:** Add a `structured_annotation` type: 857 + 858 + ```ocaml 859 + type structured_annotation = { 860 + label_id : string; 861 + attribute : string; (* e.g. "Life Stage" *) 862 + value : string; (* e.g. "Adult" *) 863 + votes : int; (* net positive votes *) 864 + source : string; (* e.g. "inaturalist:annotation/12345" *) 865 + } 866 + ``` 867 + 868 + Or model annotations as a sub-`class_dist` over trait categories. 869 + 870 + ### 17.3 Identification History and Withdrawn IDs 871 + 872 + **Problem:** Terradots only stores the current state. iNaturalist's identification history (which IDs were proposed, withdrawn, and in what order, including maverick IDs) is a rich audit trail. 873 + 874 + **Partial solution:** Store the full identification JSON blob in a property: 875 + 876 + ``` 877 + properties = [("inaturalist_identifications_json", "...")] 878 + ``` 879 + 880 + Or represent each historical identification as a separate `Derived` label with `method_="inaturalist-identification"`, sourced from the observation label. This adds complexity but preserves full provenance. 881 + 882 + ### 17.4 Community Taxon vs. 
Observer Taxon Divergence 883 + 884 + **Problem:** When an observer opts out of community ID, `taxon_id` (the observer's own identification) and `community_taxon_id` may differ. The Terradots `class_dist` naturally represents the community view; the observer's dissent is an outlier. 885 + 886 + **Solution:** Record both: 887 + - `class_dist`: derived from community identifications 888 + - `properties [("observer_taxon", taxon.name)]`: the observer's own identification 889 + - `properties [("observer_opted_out_of_community_id", "true")]`: flag when relevant 890 + 891 + ### 17.5 Obscured Coordinates and Downstream ML Use 892 + 893 + **Problem:** When `obscured=true`, the `latitude`/`longitude` in the API response are randomised within a ~0.2° cell. The true coordinates are not available. Storing these as a Point geometry gives a false sense of precision. 894 + 895 + **Solutions (in order of preference):** 896 + 1. Set `accuracy_m` to `positional_accuracy` (which iNaturalist sets to the cell diameter, ~22,000 m). This correctly signals that the point location is unreliable. 897 + 2. Store the 0.2° bounding box as a `Polygon` geometry (if the cell corners can be computed), making the uncertainty geometrically explicit. 898 + 3. Tag with `properties [("geoprivacy", "obscured")]` and filter these out of precision-sensitive analyses. 899 + 900 + **Private observations** (`geoprivacy="private"`) have no coordinates at all; they cannot be imported as Point labels. They can be imported as unlocated metadata with a null/stub geometry, or skipped entirely. 901 + 902 + ### 17.6 Taxon Changes (Synonymisation, Splits, Lumps) 903 + 904 + **Problem:** iNaturalist's taxonomy is dynamic. A taxon may be synonymised into another, split into two, or lumped. The `taxon.is_active=false` flag signals this, and `current_synonymous_taxon_ids` points to the replacement. Historical labels imported at one point in time may have stale taxon names.
905 + 906 + **Solution:** Record the taxon ID and name at import time: 907 + 908 + ``` 909 + properties = [ 910 + ("inaturalist_taxon_id", "64083"); 911 + ("inaturalist_taxon_name", "Calochortus vestae"); 912 + ("inaturalist_taxon_rank", "species"); 913 + ("inaturalist_taxon_is_active", "true"); 914 + ] 915 + ``` 916 + 917 + When re-importing, compare against stored taxon ID and emit a `Derived` label if the taxon has changed. 918 + 919 + ### 17.7 Observation Field Namespace Collisions 920 + 921 + **Problem:** iNaturalist observation fields are user-created and have no global namespace. Two fields named "Behaviour" by different users may mean different things. Field IDs are globally unique but names are not. 922 + 923 + **Solution:** Key properties by field ID rather than field name: 924 + 925 + ``` 926 + properties = [ 927 + ("ofv_id_25", "Quercus robur"); (* field 25: Associated species *) 928 + ("ofv_id_25_name", "Associated species"); 929 + ] 930 + ``` 931 + 932 + This is verbose but unambiguous. 933 + 934 + ### 17.8 Vote/Fave Metadata 935 + 936 + iNaturalist observations carry `faves_count` and vote data. These have no Terradots equivalent and are social engagement metrics rather than scientific metadata. Store as properties if needed: 937 + 938 + ``` 939 + properties = [("inaturalist_faves_count", "12")] 940 + ``` 941 + 942 + ### 17.9 Species Checklists and Range Data 943 + 944 + iNaturalist places have associated species checklists ("this species has been recorded in this region"). These are not observations and have no direct Terradots equivalent. They could be imported as `Derived` labels (derived from the observation corpus) or handled at the document/groups level as regional metadata. 945 + 946 + --- 947 + 948 + ## 18. 
Summary Mapping Table 949 + 950 + ``` 951 + iNaturalist concept → Terradots field / type 952 + ─────────────────────────────────────────────────────────────────────────── 953 + observation → label (via make_imported) 954 + latitude, longitude → geometry = Point { x=lon; y=lat } 955 + positional_accuracy → origin.Measured.accuracy_m 956 + observed_on / time_observed_at → event_date 957 + id → origin.Measured.via = "inaturalist:observation/{id}" 958 + uuid → properties [("inaturalist_uuid", ...)] 959 + user.id → origin.Measured.observer = "inaturalist:user/{id}" 960 + user.orcid → origin.Measured.observer = "orcid:{orcid}" (preferred) 961 + license_code → origin.Measured.license (SPDX mapped) 962 + community taxon + identifications → class_dist (probability distribution) 963 + quality_grade = "research" → confidence = 0.95 964 + quality_grade = "needs_id" → confidence = 0.5 965 + quality_grade = "casual" → confidence = 0.1 966 + quality_grade (raw) → properties [("quality_grade", ...)] 967 + geoprivacy / obscured → properties [("geoprivacy", ...)] 968 + taxon.name → class_dist primary class name string 969 + taxon.iconic_taxon_name → properties [("iconic_taxon", ...)] 970 + taxon.rank → properties [("taxon_rank", ...)] 971 + taxon.ancestry → properties [("taxon_ancestry", ...)] 972 + taxon.gbif_id → properties [("gbif_taxon_id", ...)] 973 + photos (array) → properties [("inaturalist_photo_N_...", ...)] ★ EXTENSION NEEDED 974 + sounds (array) → properties [("inaturalist_sound_N_...", ...)] ★ EXTENSION NEEDED 975 + annotations (controlled vocab) → properties [("annotation_{attr}", value)] ★ EXTENSION NEEDED 976 + observation_field_values → properties [("ofv_id_{id}", value)] 977 + project_ids (traditional) → groups 978 + description → properties [("description", ...)] 979 + place_guess → properties [("place_guess", ...)] 980 + captive=true → properties [("captive", "true")] 981 + species_guess → properties [("species_guess", ...)] 982 + import batch → activity record 
983 + ``` 984 + 985 + ### Items requiring Terradots model extensions (★) 986 + 987 + 1. **Media field** (`media : media_item list`) — photos and sounds as structured list with URL, license, attribution, role. 988 + 2. **Structured annotations** — controlled-vocabulary attribute/value pairs with vote scores, distinct from free-text `annotation` records. 989 + 3. **Polygon geometry for obscured observations** — represent the 0.2° × 0.2° obscuration cell as a Polygon rather than a misleading Point. 990 + ```
+809
docs/plans/osm-mapping.md
··· 1 + # OSM to Terradots Mapping Plan 2 + 3 + ## 1. OSM Data Model Overview 4 + 5 + ### Primary Sources 6 + 7 + - Data model: https://wiki.openstreetmap.org/wiki/Elements 8 + - Tagging: https://wiki.openstreetmap.org/wiki/Tags 9 + - XML format: https://wiki.openstreetmap.org/wiki/OSM_XML 10 + - API v0.6: https://wiki.openstreetmap.org/wiki/API_v0.6 11 + - PBF format: https://wiki.openstreetmap.org/wiki/PBF_Format 12 + - Changesets: https://wiki.openstreetmap.org/wiki/Changesets 13 + - OsmChange: https://wiki.openstreetmap.org/wiki/OsmChange 14 + - Relations: https://wiki.openstreetmap.org/wiki/Relation 15 + - Multipolygon: https://wiki.openstreetmap.org/wiki/Multipolygon 16 + - ODbL license: https://wiki.openstreetmap.org/wiki/Open_Database_License 17 + - Contributor terms: https://osmfoundation.org/wiki/Licence/Contributor_Terms 18 + - Overpass QL: https://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_QL 19 + 20 + --- 21 + 22 + ## 2. OSM Data Primitives 23 + 24 + OSM has exactly three element types. Everything in the database is one of these. 25 + 26 + ### 2.1 Node 27 + 28 + A node is a single point on the earth's surface. 29 + 30 + **XML structure:** 31 + 32 + ```xml 33 + <node id="298884269" 34 + lat="54.0901746" lon="12.2482632" 35 + user="SvenHRO" uid="46882" 36 + visible="true" 37 + version="1" 38 + changeset="676636" 39 + timestamp="2008-09-21T21:37:45Z"> 40 + <tag k="name" v="Neu Broderstorf"/> 41 + <tag k="traffic_sign" v="city_limit"/> 42 + </node> 43 + ``` 44 + 45 + **Fields:** 46 + 47 + | Field | Type | Description | 48 + |-------|------|-------------| 49 + | `id` | int64 (≥1) | Globally unique among nodes. Negative IDs are temporary (unsaved). | 50 + | `lat` | float (−90..90) | WGS84 latitude, 7 decimal places (~1 cm precision). | 51 + | `lon` | float (−180..180) | WGS84 longitude, 7 decimal places. | 52 + | `version` | int | Incremented on every edit. | 53 + | `changeset` | int | ID of the changeset that last modified this node. 
| 54 + | `timestamp` | ISO 8601 | Time of last modification. | 55 + | `uid` | int | Numeric user ID of last modifier. | 56 + | `user` | string | Display name of last modifier (denormalised; can change). | 57 + | `visible` | bool | `false` for deleted nodes (still present in history). | 58 + 59 + **Tags:** zero or more `<tag k="..." v="..."/>` children. 60 + 61 + **Coordinate system:** WGS84 (EPSG:4326). Stored as fixed-point integers in PBF (nanodegrees, divide by 10^7 or 10^9 depending on granularity setting). 62 + 63 + **Role:** Standalone features (bench, ATM) or geometry vertices for ways. 64 + 65 + ### 2.2 Way 66 + 67 + An ordered list of 2–2000 node references, forming a polyline. If the first and last `<nd>` reference the same node ID, the way is closed. 68 + 69 + **XML structure:** 70 + 71 + ```xml 72 + <way id="26659127" 73 + user="Masch" uid="55988" 74 + visible="true" 75 + version="5" 76 + changeset="4142606" 77 + timestamp="2010-01-18T17:15:48Z"> 78 + <nd ref="292403538"/> 79 + <nd ref="298884390"/> 80 + <nd ref="261728686"/> 81 + <nd ref="292403538"/> <!-- same as first: closed way --> 82 + <tag k="highway" v="unclassified"/> 83 + <tag k="name" v="Pastower Straße"/> 84 + </way> 85 + ``` 86 + 87 + **Fields:** Same metadata attributes as node (`id`, `version`, `changeset`, `timestamp`, `uid`, `user`, `visible`). No coordinate fields; geometry is computed from the referenced nodes. 88 + 89 + **Children:** 90 + - `<nd ref="..."/>`: ordered node references (IDs only, not inline). 91 + - `<tag k="..." v="..."/>`: arbitrary number of tags. 92 + 93 + **Open vs closed vs area semantics:** 94 + 95 + - Open way (first ≠ last node): linear feature. Example: `highway=residential` (a road segment). 96 + - Closed way, linear tags: closed loop, still treated as a line. Example: `highway=pedestrian` roundabout. 97 + - Closed way, area tags: polygon. Example: `building=yes`, `landuse=forest`. 
98 + - The `area=yes` tag forces area interpretation on an otherwise ambiguous closed way. 99 + 100 + ### 2.3 Relation 101 + 102 + An ordered list of members (nodes, ways, or other relations), each with an optional `role` string. 103 + 104 + **XML structure:** 105 + 106 + ```xml 107 + <relation id="56688" 108 + user="kmvar" uid="56190" 109 + visible="true" 110 + version="28" 111 + changeset="6947401" 112 + timestamp="2011-01-12T14:23:49Z"> 113 + <member type="node" ref="294942404" role=""/> 114 + <member type="node" ref="364933006" role="stop"/> 115 + <member type="way" ref="4579143" role=""/> 116 + <member type="node" ref="249673494" role="stop"/> 117 + <tag k="name" v="Bus 566: Eindhoven - Hamont"/> 118 + <tag k="type" v="route"/> 119 + <tag k="route" v="bus"/> 120 + </relation> 121 + ``` 122 + 123 + **Fields:** Same metadata attributes as node and way. 124 + 125 + **Children:** 126 + - `<member type="node|way|relation" ref="..." role="..."/>`: ordered member list. 127 + - `<tag k="..." v="..."/>`: tags. A `type=*` tag is required by convention. 128 + 129 + **Hard limits:** 32,000 members per relation. Best practice: keep under 300 to reduce conflict risk. 130 + 131 + **Common relation types** (distinguished by `type=*` tag): 132 + 133 + | `type` value | Purpose | Key member roles | 134 + |---|---|---| 135 + | `multipolygon` | Area with holes or multiple parts | `outer`, `inner` | 136 + | `route` | Transportation route (bus, cycle, hiking) | `stop`, `platform`, (no role for way segments) | 137 + | `boundary` | Administrative border | `outer`, `inner`, `subarea` | 138 + | `restriction` | Turn restriction | `from` (way), `via` (node/way), `to` (way) | 139 + | `waterway` | River system | `main_stream`, `side_stream` | 140 + | `public_transport` | Transit stops/lines | `stop`, `platform` | 141 + | `network` | Numbered hiking/cycling network | varies | 142 + 143 + --- 144 + 145 + ## 3. 
Tagging System 146 + 147 + ### 3.1 Structure 148 + 149 + Tags are `key=value` pairs. Each element can have an unlimited number of tags, but each key must be unique per element. Keys and values are free-form Unicode strings, each up to 255 characters. 150 + 151 + **No schema enforcement.** Tags are established by community convention, documented on the wiki, and approved informally through votes or discussion. As of 2025, OSM contains over 99,000 distinct keys and over 168 million distinct tags across all elements. 152 + 153 + ### 3.2 Key Conventions 154 + 155 + - Lowercase, underscores for spaces. Examples: `opening_hours`, `traffic_sign`. 156 + - Colon namespace separator for sub-categories and language variants. Examples: 157 + - `addr:street`, `addr:housenumber` (address sub-keys) 158 + - `name:en`, `name:de` (language variants of a name) 159 + - `source:date` (provenance sub-key) 160 + - Multi-value fields: semicolon-separated. Example: `cuisine=italian;pizza`. 161 + 162 + ### 3.3 Major Key Categories (Map Features) 163 + 164 + | Key | What it tags | 165 + |-----|-------------| 166 + | `highway` | Roads, paths, tracks, footways | 167 + | `building` | Structures (yes, house, commercial, …) | 168 + | `natural` | Natural features (wood, water, peak, coastline) | 169 + | `landuse` | Human land use (residential, farmland, forest) | 170 + | `amenity` | Public facilities (cafe, school, hospital, parking) | 171 + | `shop` | Retail establishments | 172 + | `leisure` | Recreation (park, pitch, swimming_pool) | 173 + | `tourism` | Tourism features (hotel, viewpoint, museum) | 174 + | `waterway` | Water bodies and channels (river, stream, canal) | 175 + | `railway` | Rail infrastructure | 176 + | `boundary` | Administrative boundaries | 177 + | `historic` | Historical sites and monuments | 178 + | `barrier` | Fences, walls, gates | 179 + | `power` | Electrical infrastructure | 180 + | `man_made` | Artificial structures | 181 + | `name` | Primary name (universal, applies to any 
element) | 182 + | `ref` | Reference code (road number, postcode) | 183 + | `operator` | Organisation operating the feature | 184 + | `source` | Data source attribution | 185 + | `note` | Human-readable comment for mappers | 186 + | `fixme` | Flags data quality issues needing attention | 187 + 188 + ### 3.4 Classification Semantics 189 + 190 + OSM tags encode class through primary keys. A `highway=motorway` node/way is unambiguously a motorway. The primary classification key is whichever key carries the semantic meaning: `highway`, `building`, `natural`, `amenity`, etc. A single element can carry multiple primary keys, though this is unusual. 191 + 192 + --- 193 + 194 + ## 4. Metadata Fields (Provenance) 195 + 196 + Every element version records: 197 + 198 + | Field | Type | Semantics | 199 + |-------|------|-----------| 200 + | `id` | int64 | Permanent once created; reused after deletion is avoided in practice. | 201 + | `version` | int ≥1 | Monotonically increasing. Version 1 is creation. | 202 + | `changeset` | int64 | The changeset that created this version. | 203 + | `timestamp` | ISO 8601 UTC | Exact second this version was saved. | 204 + | `uid` | int | Numeric user ID (stable; display name can change). | 205 + | `user` | string | Display name at time of edit (snapshot, may be outdated). | 206 + | `visible` | bool | `false` only in history for deleted elements; current version omits this or sets `true`. | 207 + 208 + ### 4.1 Immutability of History 209 + 210 + Once committed, a version is permanent. The only exception is **redaction** (OSMF legal/copyright procedure, very rare): redacted versions are hidden from history but the version number is still visible as a gap. 211 + 212 + --- 213 + 214 + ## 5. Changesets 215 + 216 + A changeset is an atomic group of edits by a single user, bounded by time. 217 + 218 + **Constraints:** 219 + - Maximum 10,000 elements per changeset. 220 + - Auto-closed after 24 hours of opening or 1 hour of inactivity. 
221 + - Each changeset has a numeric ID and a bounding box computed from all modified elements. 222 + 223 + **Changeset tags (key=value, same format as element tags):** 224 + 225 + | Tag | Purpose | 226 + |-----|---------| 227 + | `comment` | Human-readable description of the edit | 228 + | `created_by` | Editor software and version (e.g. `iD 2.30.0`) | 229 + | `imagery_used` | Background imagery referenced during mapping | 230 + | `source` | External data source used | 231 + | `bot=yes` | Automated edit flag | 232 + | `review_requested=yes` | Flags for community review | 233 + 234 + **Changeset XML (for read operations):** 235 + 236 + ```xml 237 + <changeset id="10" user="fred" uid="123" created_at="2008-11-08T19:07:39Z" 238 + closed_at="2008-11-08T20:07:39Z" open="false" 239 + min_lat="51.5023" min_lon="-0.1682" 240 + max_lat="51.5024" max_lon="-0.1681" 241 + comments_count="0" changes_count="10"> 242 + <tag k="created_by" v="JOSM 1.61"/> 243 + <tag k="comment" v="Just adding some streetnames"/> 244 + </changeset> 245 + ``` 246 + 247 + **Changeset discussions:** Each changeset can have a public discussion thread (users posting comments). Accessible via `GET /api/0.6/changeset/<id>?include_discussion=true`. 248 + 249 + --- 250 + 251 + ## 6. OSM Data Formats 252 + 253 + ### 6.1 OSM XML (.osm) 254 + 255 + UTF-8 encoded XML. Root element `<osm version="0.6">`. Blocks appear in strict order: nodes first, then ways, then relations. Each element block is complete and self-contained. 256 + 257 + **Characteristics:** 258 + - Human-readable; suitable for small extracts and debugging. 259 + - No official XSD schema (unofficial schemas exist). 260 + - Can include a `<bounds>` element describing the geographic extent. 261 + - Widely supported by all OSM tooling. 262 + 263 + ### 6.2 PBF (Protocol Buffer Binary Format) 264 + 265 + Binary format based on Google Protocol Buffers. 266 + 267 + **Advantages over XML:** 268 + - ~50% smaller than gzipped OSM XML. 
269 + - 5–6x faster read/write. 270 + - Independently decodable file blocks (~8,000 entities each). 271 + - Delta-encoding for node IDs and coordinates (stores diffs, not absolutes). 272 + - Centralized string table for tag keys and values. 273 + 274 + **Structure:** sequential `BlobHeader + Blob` pairs. Each blob decompresses to a `PrimitiveBlock` containing `PrimitiveGroup`s for nodes (dense or normal), ways, and relations. 275 + 276 + **Coordinate encoding:** stored as integers in nanodegrees (lat/lon × 10^9), with a configurable granularity divisor (default 100 → ~1 cm). 277 + 278 + **Metadata optionality:** version, timestamp, changeset, uid, user are stored in optional `Info`/`DenseInfo` structures and can be omitted in stripped files. 279 + 280 + ### 6.3 OsmChange (.osc) 281 + 282 + Diff format for uploading changesets to the API. Wraps elements in `<create>`, `<modify>`, or `<delete>` action blocks. 283 + 284 + ```xml 285 + <osmChange version="0.6" generator="JOSM"> 286 + <create> 287 + <node id="-1" changeset="1234" lat="51.5" lon="-0.1" version="0"> 288 + <tag k="amenity" v="bench"/> 289 + </node> 290 + </create> 291 + <modify> 292 + <way id="987" changeset="1234" version="3"> 293 + <nd ref="111"/> <nd ref="222"/> <nd ref="111"/> 294 + <tag k="name" v="High Street"/> 295 + </way> 296 + </modify> 297 + <delete> 298 + <node id="555" changeset="1234" version="2"/> 299 + </delete> 300 + </osmChange> 301 + ``` 302 + 303 + - Negative IDs in `<create>` are placeholders; server replies with the real ID. 304 + - `<delete if-unused="true"/>` makes deletion conditional (safe deletes). 305 + - Modifications must include the complete tag set (no partial-tag updates). 306 + - Uploads are atomic: all succeed or all fail. 307 + 308 + ### 6.4 Overpass API 309 + 310 + Read-only query interface for complex spatial/tag queries. Does not require an account. 
311 + 312 + **Base URL:** `https://overpass-api.de/api/interpreter` 313 + 314 + **Query language (Overpass QL) example:** 315 + 316 + ``` 317 + [out:json][timeout:30][bbox:51.5,-0.2,51.6,-0.1]; 318 + ( 319 + node[amenity=cafe]; 320 + way[amenity=cafe]; 321 + relation[amenity=cafe]; 322 + ); 323 + out body; 324 + >; 325 + out skel qt; 326 + ``` 327 + 328 + - `[out:json]`: JSON output (alternatives: `xml`, `csv`). 329 + - `[bbox:...]`: bounding box filter. 330 + - `out meta`: include full metadata (user, uid, changeset, timestamp, version). 331 + - `out geom`: inline node coordinates into ways/relations (avoids separate node fetch). 332 + - `>` recurse-down: fetch all referenced nodes/ways. 333 + 334 + **Overpass XML output** mirrors OSM XML but adds `<remark>` and `<note>` elements. 335 + 336 + --- 337 + 338 + ## 7. Element History 339 + 340 + ### 7.1 Versioning Model 341 + 342 + Every save to an element increments its `version` integer. The full version history is permanently stored on the OSM server. 343 + 344 + **API endpoints for history:** 345 + 346 + ``` 347 + GET /api/0.6/node/{id}/history # all versions as XML 348 + GET /api/0.6/node/{id}/{version} # specific version 349 + GET /api/0.6/way/{id}/history 350 + GET /api/0.6/relation/{id}/history 351 + ``` 352 + 353 + The history response is identical in format to the current element response but lists multiple versions sequentially. 354 + 355 + ### 7.2 Deletion in History 356 + 357 + When an element is deleted, a new version is created with `visible="false"`. The element's ID, version, changeset, timestamp, uid, and user are all preserved; coordinates and tags from the last visible version are retained in the history but suppressed from normal reads. 358 + 359 + Deleted elements still appear in `GET /api/0.6/node/{id}/history`. Fetching the current version of a deleted element returns HTTP 410 Gone. 360 + 361 + ### 7.3 Redaction 362 + 363 + Copyright-violating versions can be legally redacted by the OSMF. 
Redacted versions are hidden (HTTP 403 on direct access) but their version number creates a visible gap in the sequence. This is rare and requires OSMF action. 364 + 365 + --- 366 + 367 + ## 8. API v0.6 Endpoints 368 + 369 + **Base URL:** `https://api.openstreetmap.org/api/0.6/` 370 + 371 + Authentication: OAuth 2.0 (write operations). Read operations are public. 372 + 373 + ### 8.1 Element CRUD 374 + 375 + | Operation | Method | URL | Notes | 376 + |-----------|--------|-----|-------| 377 + | Read | GET | `/node/{id}` | Returns current version | 378 + | Create | POST | `/nodes` | Body: OSM XML of new element; returns new ID | 379 + | Update | PUT | `/node/{id}` | Full element in body; must include correct version | 380 + | Delete | DELETE | `/node/{id}` | Requires matching version + open changeset | 381 + | Read history | GET | `/node/{id}/history` | All versions | 382 + | Read version | GET | `/node/{id}/{ver}` | Specific version | 383 + | Read multiple | GET | `/nodes?nodes=id1,id2,id3` | Comma-separated IDs | 384 + | Full | GET | `/way/{id}/full` | Way + all member nodes | 385 + | Full | GET | `/relation/{id}/full` | Relation + all members recursively | 386 + 387 + Same patterns apply to `/way/` and `/relation/`. 388 + 389 + ### 8.2 Changesets 390 + 391 + | Operation | Method | URL | 392 + |-----------|--------|-----| 393 + | Create | PUT | `/changeset/create` | 394 + | Read | GET | `/changeset/{id}` | 395 + | Update tags | PUT | `/changeset/{id}` | 396 + | Close | PUT | `/changeset/{id}/close` | 397 + | Upload diff | POST | `/changeset/{id}/upload` | 398 + | Query | GET | `/changesets?bbox=...&user=...&time=...` | 399 + | Add comment | POST | `/changeset/{id}/comment` | 400 + 401 + ### 8.3 Map/Bbox Query 402 + 403 + ``` 404 + GET /map?bbox={left},{bottom},{right},{top} 405 + ``` 406 + 407 + Returns all nodes, ways, and relations in the bounding box as OSM XML. Capped at ~50,000 nodes. 
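The bounding-box request in §8.3 is easy to get wrong because the parameter order is longitude-first. A minimal sketch of a URL builder follows; `api_base` and `map_url` are hypothetical names chosen for illustration and are not part of any existing OSM client library.

```ocaml
(* Hypothetical helper for illustration only. *)
let api_base = "https://api.openstreetmap.org/api/0.6"

(* Build the /map request URL. The API expects the order
   left (min lon), bottom (min lat), right (max lon), top (max lat). *)
let map_url ~min_lon ~min_lat ~max_lon ~max_lat =
  if min_lon > max_lon || min_lat > max_lat then
    invalid_arg "map_url: empty bounding box";
  (* Seven decimals matches the coordinate precision OSM stores. *)
  Printf.sprintf "%s/map?bbox=%.7f,%.7f,%.7f,%.7f"
    api_base min_lon min_lat max_lon max_lat
```

Labelled arguments are used so that a caller cannot silently swap latitude and longitude.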
408 + 409 + ### 8.4 Response Formats 410 + 411 + The API supports both XML (default) and JSON (`Accept: application/json` header or `.json` suffix on URL). JSON output mirrors XML structure with camelCase keys. 412 + 413 + --- 414 + 415 + ## 9. Licensing 416 + 417 + ### 9.1 ODbL 1.0 418 + 419 + OSM data is licensed under the **Open Database License (ODbL) 1.0**. 420 + 421 + **Key provisions:** 422 + 423 + - **Attribution required:** Any public use must credit "© OpenStreetMap contributors." 424 + - **Share-alike:** Any database derived from OSM (a "Derivative Database") must be distributed under ODbL or a compatible license. 425 + - **Produced works:** Maps, images, or other products produced from OSM data (not the database itself) can be released under any license, provided credit is given. 426 + - **No DRM:** You may not impose additional technical restrictions. 427 + 428 + The database contents (individual facts) are covered by the **Database Contents License (DbCL) 1.0**. 429 + 430 + ### 9.2 Contributor Terms 431 + 432 + Contributors grant OSMF a worldwide, royalty-free, irrevocable, perpetual licence to use contributions under ODbL, CC-BY-SA 2.0, or another open licence approved by a 2/3 majority vote. OSMF cannot relicense to a closed licence without that supermajority. 433 + 434 + ### 9.3 Implications for Terradots 435 + 436 + When importing OSM data into a Terradots label store: 437 + 438 + - **License field:** Set `origin.license = "ODbL-1.0"`. 439 + - **Attribution:** Any downstream use of the document must preserve the OSM attribution. This should be noted in the document metadata. 440 + - **Share-alike:** A Terradots document containing OSM-derived labels that constitutes a "Derivative Database" must itself be released under ODbL or a compatible licence if distributed publicly. 
441 + - **Produced works exception:** If Terradots labels are used to train a model (the model is a "Produced Work" of OSM data), the model weights may be released under any licence, provided the training data credit is documented. 442 + 443 + --- 444 + 445 + ## 10. Terradots Mapping Plan 446 + 447 + This section maps every OSM concept to the Terradots `terradots.ml` type system. 448 + 449 + ### 10.1 Element Type → Geometry 450 + 451 + | OSM element | Terradots `geometry` | Notes | 452 + |-------------|---------------------|-------| 453 + | Node (standalone feature) | `Point { x = lon; y = lat }` | lon→x, lat→y for EPSG:4326 | 454 + | Node (way vertex only) | Not a label | Geometry-only; no label created | 455 + | Open way (linear) | `Polygon` ring with 2..2000 pts | Use as degenerate polygon or extend to `LineString` (see §11.1) | 456 + | Closed way (area) | `Polygon ring` | Close the ring: last point = first point | 457 + | Multipolygon relation (`type=multipolygon`) | `Multi [Polygon outer; Polygon inner; …]` | Outer rings positive area, inner rings holes (see §11.2) | 458 + | Route/boundary relation | `group` of labels | Members become individual labels; relation becomes `group` (see §10.8) | 459 + | Other relation types | `group` | Generic grouping | 460 + 461 + **CRS:** Always `EPSG:4326` (OSM is always WGS84). Set `document.crs = "EPSG:4326"`. 
462 + 463 + ### 10.2 Element Identity → label.id and label.via 464 + 465 + | OSM field | Terradots field | Value | 466 + |-----------|----------------|-------| 467 + | `node/{id}` | `origin.via` | `"osm:node/{id}"` | 468 + | `way/{id}` | `origin.via` | `"osm:way/{id}"` | 469 + | `relation/{id}` | `origin.via` | `"osm:relation/{id}"` | 470 + | `node/{id}` (local) | `label.id` | Generate UUID or use `"osm-node-{id}"` | 471 + | `way/{id}` (local) | `label.id` | Generate UUID or use `"osm-way-{id}"` | 472 + | `relation/{id}` (local) | `label.id` | Generate UUID or use `"osm-relation-{id}"` | 473 + 474 + The `via` URI format `osm:node/123456` is already specified as a recognised scheme in the Terradots `.mli` URI table. The `id` field should be a UUID generated at import time (not the OSM integer ID) to remain stable if the element is re-imported. 475 + 476 + Alternatively, the importer may use `"osm-node-{id}"` directly as the Terradots `id`, accepting that ID stability is tied to OSM ID stability (which is generally good — OSM IDs are persistent once created). 477 + 478 + ### 10.3 Tags → class_dist and properties 479 + 480 + OSM tags carry both classification and attribute information. The mapping strategy: 481 + 482 + **Classification tags** (map to `class_dist`): 483 + 484 + Identify the primary semantic key for the element type. Construct the class string as `"{primary_key}={value}"`. 485 + 486 + ``` 487 + highway=residential → class_dist = [("highway=residential", 1.0)] 488 + building=yes → class_dist = [("building=yes", 1.0)] 489 + natural=wood → class_dist = [("natural=wood", 1.0)] 490 + amenity=cafe → class_dist = [("amenity=cafe", 1.0)] 491 + ``` 492 + 493 + Because OSM classification is always definite (a human chose the tag), confidence = 1.0 and class probability = 1.0. 494 + 495 + **Attribute tags** (map to `properties`): 496 + 497 + All remaining tags become `(key, value)` entries in `label.properties`. 
498 + 499 + ``` 500 + name=Pastower Straße → ("name", "Pastower Straße") 501 + maxspeed=50 → ("maxspeed", "50") 502 + addr:street=High Street → ("addr:street", "High Street") 503 + opening_hours=Mo-Fr 09:00-18:00 → ("opening_hours", "Mo-Fr 09:00-18:00") 504 + source=survey → ("osm:source", "survey") 505 + note=Check this later → ("osm:note", "Check this later") 506 + ``` 507 + 508 + **Prefix convention:** To avoid collisions with non-OSM properties, OSM-specific meta-tags (`source`, `note`, `fixme`, `created_by`) should be prefixed `osm:` in `properties`. 509 + 510 + **Determining the primary classification key:** 511 + 512 + Priority order for choosing which tag becomes the class: 513 + 1. `highway`, `railway`, `waterway`, `aeroway`, `aerialway` (infrastructure, usually on ways) 514 + 2. `building`, `landuse`, `natural`, `leisure`, `amenity`, `shop`, `tourism`, `historic`, `barrier`, `power`, `man_made`, `craft`, `emergency`, `healthcare` (area/point features) 515 + 3. `boundary` (on relations) 516 + 4. `type` (on relations, e.g. `type=route`) 517 + 5. First tag alphabetically (fallback) 518 + 519 + If an element has multiple primary-class-candidate keys, include all as equal-probability entries in `class_dist`: 520 + 521 + ``` 522 + amenity=cafe + shop=bakery → class_dist = [("amenity=cafe", 0.5); ("shop=bakery", 0.5)] 523 + ``` 524 + 525 + In practice this is rare; most elements have a single dominant classification key. Modifier tags such as `area=yes` are not classification keys and stay in `properties`. 526 + 527 + ### 10.4 OSM User → origin.observer 528 + 529 + | OSM field | Terradots field | Value | 530 + |-----------|----------------|-------| 531 + | `uid` (numeric) | `origin.observer` | `"osm:user/{uid}"` | 532 + | `user` (display name) | `properties` | `("osm:user_name", "{user}")` | 533 + 534 + Use `uid` not `user` for the observer URI, because display names can change while UIDs are stable. Store the display name in `properties` for human readability. 
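The tag split from §10.3 and the observer URI from §10.4 can be sketched together as below. This is a rough illustration under stated assumptions: `primary_keys` is an abbreviated copy of the priority list, the `type` rule for relations and the alphabetical fallback are left out, and the function names `split_tags` and `observer_of_uid` are hypothetical rather than taken from `terradots.ml`.

```ocaml
(* Keys treated as classification keys (abbreviated from §10.3). *)
let primary_keys =
  [ "highway"; "railway"; "waterway"; "building"; "landuse"; "natural";
    "leisure"; "amenity"; "shop"; "tourism"; "boundary" ]

(* Meta-tags that receive the "osm:" prefix in properties. *)
let meta_keys = [ "source"; "note"; "fixme"; "created_by" ]

(* Split raw OSM tags into (class_dist, properties). Every primary key
   found on the element gets an equal share of the probability mass. *)
let split_tags (tags : (string * string) list) =
  let is_primary (k, _) = List.mem k primary_keys in
  let classes, attrs = List.partition is_primary tags in
  let p =
    if classes = [] then 0.0
    else 1.0 /. float_of_int (List.length classes)
  in
  let class_dist = List.map (fun (k, v) -> (k ^ "=" ^ v, p)) classes in
  let properties =
    List.map
      (fun (k, v) -> ((if List.mem k meta_keys then "osm:" ^ k else k), v))
      attrs
  in
  (class_dist, properties)

(* Observer URI built from the stable numeric uid, not the display name. *)
let observer_of_uid (uid : int) = Printf.sprintf "osm:user/%d" uid
```

An element with no primary key yields an empty `class_dist` here; a real importer would need to apply the fallback rule or reject the element.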
535 + 536 + ### 10.5 OSM Timestamp → event_date 537 + 538 + | OSM field | Terradots field | Value | 539 + |-----------|----------------|-------| 540 + | `timestamp` | `event_date` | ISO 8601 datetime, pass through verbatim | 541 + 542 + The OSM timestamp records when the element version was last modified, not when the real-world feature came into existence. This is an approximation of the observation date. For imported data, this is the best available proxy. 543 + 544 + Example: `timestamp="2010-01-18T17:15:48Z"` → `event_date = "2010-01-18T17:15:48Z"`. 545 + 546 + ### 10.6 OSM Version → properties 547 + 548 + OSM element version has no direct Terradots equivalent (Terradots does not version individual labels). Record it as a property. 549 + 550 + | OSM field | Terradots field | Value | 551 + |-----------|----------------|-------| 552 + | `version` | `properties` | `("osm:version", "5")` | 553 + 554 + ### 10.7 OSM Changeset → activity 555 + 556 + An OSM changeset maps to a Terradots `activity`. One activity per changeset imported. 557 + 558 + | OSM changeset field | Terradots `activity` field | Notes | 559 + |---------------------|---------------------------|-------| 560 + | `changeset.id` | `activity_id` | `"osm:changeset:{id}"` | 561 + | `changeset.user` / `uid` | `agent` | `"osm:user/{uid}"` (see §10.4) | 562 + | `changeset.created_at` | `date` | ISO 8601 date | 563 + | `changeset.tags["comment"]` | `description` | Free-text comment | 564 + 565 + Additional changeset tags go into a dedicated label property or are recorded in the `description` field. 566 + 567 + All labels from the same changeset reference the same `activity_id`. This mirrors how OSM groups edits. 
568 + 569 + ```ocaml 570 + let act = { 571 + activity_id = "osm:changeset:6947401"; 572 + agent = "osm:user/56190"; 573 + date = "2011-01-12"; 574 + description = Some "Bus 566 route update"; 575 + } 576 + ``` 577 + 578 + Labels referencing this changeset: `label.activity = Some "osm:changeset:6947401"`. 579 + 580 + ### 10.8 OSM Relation → group 581 + 582 + Simple mapping: 583 + 584 + ``` 585 + OSM Relation id=56688 → group { id = "osm:relation:56688"; members = [...] } 586 + ``` 587 + 588 + Members are the Terradots label IDs for each member element. The relation's tags (name, type, route, etc.) can be stored as a special "relation descriptor" label with `Point` geometry at the centroid, or in a `properties` map on the group (which Terradots `group` currently lacks — see §11 on gaps). 589 + 590 + ### 10.9 OSM Positional Accuracy 591 + 592 + OSM does not store positional accuracy per element. For GPS traces the practical accuracy is ~2–5 m; for hand-traced imagery it depends on imagery resolution. A reasonable default: 593 + 594 + ```ocaml 595 + accuracy_m = None (* or Some 5.0 for GPS-derived nodes *) 596 + ``` 597 + 598 + Importers may want to set `accuracy_m` based on heuristics: if `source=survey` and `gps_source` tags are present, use a small value; for imagery-traced ways, use a larger value based on known imagery resolution. 
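The heuristic described above might look like the following sketch. The exact-match tag values and the default of 10 m for imagery are illustrative assumptions only; real `source` values are free-form strings, so a production importer would need fuzzier matching. The name `accuracy_m_of_tags` is hypothetical.

```ocaml
(* Guess accuracy_m from an element's tags.
   ASSUMPTION: 5.0 m for surveyed data and a caller-supplied figure for
   imagery are placeholder values, not numbers defined by OSM. *)
let accuracy_m_of_tags ?(imagery_accuracy_m = 10.0) tags =
  match List.assoc_opt "source" tags with
  | Some "survey" -> Some 5.0
  | Some "imagery" -> Some imagery_accuracy_m
  | _ -> None (* unknown provenance: leave accuracy unset *)
```

Returning `None` for unknown provenance keeps the importer from inventing precision that the source data does not claim.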
599 + 600 + ### 10.10 License 601 + 602 + ```ocaml 603 + origin = Measured { 604 + observer = Some "osm:user/{uid}"; 605 + via = Some "osm:node/{id}"; 606 + license = Some "ODbL-1.0"; 607 + accuracy_m = None; 608 + } 609 + ``` 610 + 611 + ### 10.11 Complete Import Example 612 + 613 + **OSM input:** 614 + 615 + ```xml 616 + <node id="298884269" 617 + lat="54.0901746" lon="12.2482632" 618 + user="SvenHRO" uid="46882" 619 + visible="true" version="1" 620 + changeset="676636" 621 + timestamp="2008-09-21T21:37:45Z"> 622 + <tag k="name" v="Neu Broderstorf"/> 623 + <tag k="traffic_sign" v="city_limit"/> 624 + </node> 625 + ``` 626 + 627 + **Terradots output:** 628 + 629 + ```ocaml 630 + let label = make_imported 631 + ~cell:(hilbert_cell ~level:12 ~crs:"EPSG:4326" {x=12.2482632; y=54.0901746}) 632 + ~id:"<uuid>" 633 + ~geometry:(Point {x=12.2482632; y=54.0901746}) 634 + ~via:"osm:node/298884269" 635 + ~observer:"osm:user/46882" 636 + ~license:"ODbL-1.0" 637 + ~event_date:(event_date_of_string "2008-09-21T21:37:45Z") 638 + ~class_dist:[("traffic_sign=city_limit", 1.0)] 639 + ~activity:"osm:changeset:676636" 640 + ~properties:[ 641 + ("name", "Neu Broderstorf"); 642 + ("osm:version", "1"); 643 + ("osm:user_name", "SvenHRO"); 644 + ] 645 + () 646 + ``` 647 + 648 + **OSM Way → Terradots Polygon example:** 649 + 650 + ```xml 651 + <way id="26659127" uid="55988" user="Masch" 652 + timestamp="2010-01-18T17:15:48Z" version="5" changeset="4142606"> 653 + <nd ref="292403538"/> 654 + <nd ref="298884390"/> 655 + <nd ref="261728686"/> 656 + <nd ref="292403538"/> 657 + <tag k="highway" v="unclassified"/> 658 + <tag k="name" v="Pastower Straße"/> 659 + </way> 660 + ``` 661 + 662 + The importer must resolve node IDs to coordinates (fetched separately). 
For a closed way with `highway=` tag, treat as a `Polygon` (closed loop road): 663 + 664 + ```ocaml 665 + let ring = [{x=12.24; y=54.09}; {x=12.25; y=54.09}; 666 + {x=12.25; y=54.10}; {x=12.24; y=54.09}] in 667 + let label = make_imported 668 + ~cell:(hilbert_cell ~level:12 ~crs:"EPSG:4326" (centroid (Polygon ring))) 669 + ~id:"<uuid>" 670 + ~geometry:(Polygon ring) 671 + ~via:"osm:way/26659127" 672 + ~observer:"osm:user/55988" 673 + ~license:"ODbL-1.0" 674 + ~event_date:(event_date_of_string "2010-01-18T17:15:48Z") 675 + ~class_dist:[("highway=unclassified", 1.0)] 676 + ~activity:"osm:changeset:4142606" 677 + ~properties:[ 678 + ("name", "Pastower Straße"); 679 + ("osm:version", "5"); 680 + ("osm:user_name", "Masch"); 681 + ] 682 + () 683 + ``` 684 + 685 + --- 686 + 687 + ## 11. Gaps and Required Extensions 688 + 689 + ### 11.1 LineString Geometry 690 + 691 + **Problem:** Terradots `geometry` has `Point`, `Polygon`, and `Multi`. There is no `LineString`. Open OSM ways (roads, rivers, fences) are linear features that are not polygons. 692 + 693 + **Options:** 694 + 1. **Extend the geometry type:** Add `LineString of point list` to `Terradots.geometry`. This is the cleanest solution and consistent with OGC Simple Features. 695 + 2. **Degenerate polygon:** Represent a linestring as a `Polygon` where the ring is the point sequence without closing (the last point is not repeated). This violates the polygon semantics documented in the type. 696 + 3. **Store as Multi of Points:** Lossy; loses connectivity. 697 + 4. **Ignore open ways:** Only import area-forming closed ways and nodes. Many OSM features (roads, rivers, boundaries) would be lost. 698 + 699 + **Recommendation:** Add `LineString of point list` to `geometry` and document the OGC alignment. This is a small, backward-compatible extension. 
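A minimal sketch of the recommended extension, assuming the existing variants look roughly as described in §11.1. The type definitions here are stand-ins written for this example and may not match `terradots.ml`; `geometry_of_way` is a hypothetical helper.

```ocaml
type point = { x : float; y : float }

(* Stand-in for the Terradots geometry type; only LineString is new
   relative to the variants named in §11.1. *)
type geometry =
  | Point of point
  | LineString of point list
  | Polygon of point list
  | Multi of geometry list

(* Choose a geometry for a way whose node refs are already resolved.
   [area] states whether the tags imply area semantics
   (building=*, landuse=*, area=yes, ...). *)
let geometry_of_way ~(area : bool) (pts : point list) : geometry =
  let closed =
    match pts with
    | first :: _ :: _ -> List.nth pts (List.length pts - 1) = first
    | _ -> false
  in
  if closed && area then Polygon pts else LineString pts
```

Existing pattern matches over `geometry` would need a new `LineString` case, which the compiler's exhaustiveness check will flag.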
700 + 701 + ### 11.2 Polygon Interior Rings (Holes) 702 + 703 + **Problem:** Terradots `Polygon` is documented as "Exterior ring, closed" with no support for interior rings (holes). OSM multipolygon relations frequently have inner rings: a lake within a forest, a courtyard within a building. 704 + 705 + **Options:** 706 + 1. **Extend Polygon:** `Polygon of { outer: point list; inners: point list list }`. 707 + 2. **Use Multi:** Represent the hole as a separate label with a `class_dist` indicating it is an exclusion. Requires a convention (e.g. `("osm:role", "inner")`). 708 + 3. **Approximate:** Ignore inner rings. Suitable only for small holes relative to the outer area. 709 + 4. **Keep as relation:** Represent the multipolygon as a `group` with members tagged by role. Loses the single-geometry semantics. 710 + 711 + **Recommendation:** Extend `Polygon` to carry optional inner rings, or add `MultiPolygon of { outer: point list; inners: point list list } list`. Option (2) using Multi with role properties is a reasonable interim approach. 712 + 713 + ### 11.3 Group Metadata / Tags 714 + 715 + **Problem:** Terradots `group` has `id`, `activity`, and `members`. It carries no tags/properties. OSM relations carry rich tag metadata (name, route number, operator, network, etc.) that describe the relation itself. 716 + 717 + **Options:** 718 + 1. **Add properties to group:** `group.properties : (string * string) list`. 719 + 2. **Create a "descriptor label":** A zero-area `Point` label at the centroid of the relation, carrying the relation's tags. The group references this label as a member. 720 + 3. **Embed in activity description:** Serialise relation tags as JSON in `activity.description`. Poor discoverability. 721 + 722 + **Recommendation:** Add `properties : (string * string) list` to `group`. This is a minimal extension consistent with the pattern used on `label`. 
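The recommended `group` extension could be sketched as follows. The record is a stand-in: the field types (in particular `activity : string option`) are assumptions, and `group_of_relation` is a hypothetical constructor.

```ocaml
(* Stand-in for the extended group record. Only [properties] is the
   proposed addition; the other field types are assumed. *)
type group = {
  id : string;
  activity : string option;
  members : string list;
  properties : (string * string) list;
}

(* Build a group from an OSM relation, keeping the relation's own tags. *)
let group_of_relation ~relation_id ~member_ids ~tags =
  { id = Printf.sprintf "osm:relation:%d" relation_id;
    activity = None;
    members = member_ids;
    properties = tags }
```

With this shape, the route number from a relation is reachable as `List.assoc_opt "ref" g.properties` without a separate descriptor label.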
723 + 724 + ### 11.4 OSM History / Versions 725 + 726 + **Problem:** Terradots has no concept of element versioning or history. Each label is a snapshot. OSM tracks the full edit history (every version of every element with its full metadata). 727 + 728 + **Implications for import:** Importing the full history of an OSM element would require creating one Terradots label per version, with `event_date` set to the version's timestamp. The labels would share the same `via` URI but differ in `properties["osm:version"]` and `id`. 729 + 730 + **Implications for round-trip:** Writing Terradots labels back to OSM requires knowing the current OSM version number for optimistic locking (`PUT` requires matching `version`). This must be stored in `properties["osm:version"]`. 731 + 732 + **No design change needed**, but the import/export code must handle versioning explicitly. 733 + 734 + ### 11.5 OSM Changeset Bounding Box and Element Count 735 + 736 + **Problem:** OSM changesets carry a geographic bounding box (min/max lat/lon) and a `changes_count`. These are computed fields on the changeset, not part of `activity`. 737 + 738 + **Mapping:** Store in activity description or as separate properties on a group. No clean Terradots equivalent. If needed, extend `activity` with optional `bbox` and `count` fields. 739 + 740 + ### 11.6 OSM Deletion (Tombstones) 741 + 742 + **Problem:** When an OSM element is deleted, `visible=false` is set on the new version. Terradots has no concept of deletion/tombstones. A deleted OSM element would need to be represented as an absent label, but there is no way to signal "this label was deleted in OSM" to downstream consumers. 743 + 744 + **Recommendation:** Add a `deleted : bool` field to `label` or support a "tombstone" label type. For now, deletion can be indicated via `properties[("osm:deleted", "true")]` and a convention that importers filter these out. 
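The interim convention from §11.6 can be sketched as a simple filter. The `label` record below is a cut-down stand-in with only the fields needed here, and `is_tombstone` and `live_labels` are hypothetical names.

```ocaml
(* Sketch only: a cut-down label with just an id and its properties. *)
type label = { id : string; properties : (string * string) list }

(* A label is treated as a tombstone when it carries
   ("osm:deleted", "true") in its properties. *)
let is_tombstone l =
  List.assoc_opt "osm:deleted" l.properties = Some "true"

(* Drop tombstones before handing labels to downstream consumers. *)
let live_labels labels =
  List.filter (fun l -> not (is_tombstone l)) labels
```

Because the marker lives in `properties`, nothing stops a consumer from forgetting the filter, which is one reason a dedicated `deleted` field or tombstone type may be preferable in the longer term.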
745 + 746 + ### 11.7 Relation Member Roles 747 + 748 + **Problem:** OSM relation members carry a `role` string (e.g. `"outer"`, `"inner"`, `"stop"`, `"platform"`, `"from"`, `"to"`, `"via"`). Terradots `group.members` is a plain list of label IDs with no role annotation. 749 + 750 + **Recommendation:** Extend `group` with `members : (string * string option) list` where the second element is the optional role. Alternatively, store roles as `properties` on member labels, although a label that belongs to several relations may need a different role in each. 751 + 752 + ### 11.8 OSM `ref` and Route Numbers 753 + 754 + **Problem:** OSM route relations use `ref=*` for official route numbers (e.g. `ref=566` for bus line 566). This is a relation-level tag. There is no clean Terradots home for route identifiers separate from classification. 755 + 756 + **Mapping:** Store as `properties[("ref", "566")]` on the group descriptor label (§11.3 option 2). This is sufficient for most purposes. 757 + 758 + ### 11.9 Overpass API Integration 759 + 760 + The Overpass API output can be in OSM XML format (`[out:xml]`) or JSON (`[out:json]`). Importer code should accept both. Key Overpass-specific considerations: 761 + 762 + - Request `out meta` to get `uid`, `user`, `changeset`, `version`, `timestamp`. 763 + - Request `out geom` to get inline node coordinates in ways and relations (avoids a second fetch for node resolution). 764 + - The `type` and `id` fields of each JSON element correspond to OSM element type and ID. 765 + 766 + --- 767 + 768 + ## 12. 
Mapping Summary Table 769 + 770 + | OSM Concept | Terradots Mapping | Status | 771 + |---|---|---| 772 + | Node (point feature) | `label` with `Point` geometry | Direct | 773 + | Open way (linestring) | `label` with `LineString` geometry | Requires `LineString` extension (§11.1) | 774 + | Closed way (area) | `label` with `Polygon` geometry | Direct | 775 + | Multipolygon relation | `label` with `Multi` of `Polygon`s | Inner rings require extension (§11.2) | 776 + | Route/boundary relation | `group` of labels | Direct (roles need extension, §11.7) | 777 + | Element `id` | `origin.via = "osm:{type}/{id}"` | Direct (URI scheme documented) | 778 + | Element `uid` | `origin.observer = "osm:user/{uid}"` | Direct | 779 + | Element `user` | `properties[("osm:user_name", name)]` | Direct | 780 + | Element `timestamp` | `event_date` | Direct (ISO 8601 passthrough) | 781 + | Element `version` | `properties[("osm:version", v)]` | Direct | 782 + | Element `changeset` | `label.activity = "osm:changeset:{id}"` | Direct | 783 + | Element `visible=false` | `properties[("osm:deleted", "true")]` | Workaround; tombstone support needed (§11.6) | 784 + | Primary tag (e.g. 
`highway=*`) | `class_dist = [("{k}={v}", 1.0)]` | Direct | 785 + | Attribute tags | `properties[(k, v)]` | Direct | 786 + | Changeset | `activity` | Direct | 787 + | Changeset comment | `activity.description` | Direct | 788 + | Changeset `created_by` | `properties[("osm:created_by", v)]` | On activity or related label | 789 + | Changeset `imagery_used` | `properties[("osm:imagery_used", v)]` | On activity or related label | 790 + | ODbL license | `origin.license = "ODbL-1.0"` | Direct (SPDX identifier) | 791 + | Relation member role | No direct equivalent | Requires `group` extension (§11.7) | 792 + | Relation tags/name | No `group.properties` | Requires `group` extension (§11.3) | 793 + | Polygon inner ring | No interior ring support | Requires `Polygon` extension (§11.2) | 794 + | Changeset bbox | Not representable | Could extend `activity` (§11.5) | 795 + | Full element history | Multiple labels by version | Convention only; no history type | 796 + | Positional accuracy | `origin.accuracy_m = None` | No OSM source; heuristic only | 797 + 798 + --- 799 + 800 + ## 13. Recommended Priority for Extensions 801 + 802 + Ordered by impact on faithful OSM import: 803 + 804 + 1. **Add `LineString of point list` to `geometry`** (§11.1) — blocks import of all roads, rivers, and other linear features. High priority. 805 + 2. **Add `properties` to `group`** (§11.3) — blocks capturing relation names, route numbers, and type metadata. High priority. 806 + 3. **Add role to `group.members`** (§11.7) — needed for multipolygon inner/outer distinction and route stop ordering. Medium priority. 807 + 4. **Add interior ring support to `Polygon`** (§11.2) — needed for accurate multipolygon areas. Medium priority. 808 + 5. **Add `deleted` flag to `label`** (§11.6) — needed for incremental import/sync. Low priority initially. 809 + 6. **Extend `activity` with bbox and count** (§11.5) — nice-to-have for changeset fidelity. Low priority.
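The tag rows of the summary table in §12 (primary tag to `class_dist`, remaining tags to `properties`) can be sketched as a single split. Which keys count as primary is importer policy; the `primary_keys` list and the `split_tags` helper below are illustrative assumptions, not part of Terradots.

```ocaml
(* Assumption: an illustrative, incomplete list of primary OSM keys. *)
let primary_keys = [ "highway"; "building"; "natural"; "landuse"; "waterway" ]

(* Split OSM tags into a class distribution and attribute properties.
   The first primary tag found becomes the class with probability 1.0;
   every other tag, including further primary tags, is kept as a property.
   A way with no primary tag yields an empty class_dist (unclassified). *)
let split_tags tags =
  let is_primary (k, _) = List.mem k primary_keys in
  match List.partition is_primary tags with
  | [], rest -> ([], rest)
  | (k, v) :: extra, rest -> ([ (k ^ "=" ^ v, 1.0) ], extra @ rest)
```

Applied to the Pastower Straße example, the `highway` tag would become the single class entry and `name` would stay in `properties`.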
+1753
docs/spec.html
··· 1 + <!DOCTYPE html> 2 + <html lang="en"> 3 + <head> 4 + <meta charset="UTF-8"> 5 + <meta name="viewport" content="width=device-width, initial-scale=1.0"> 6 + <title>Terradots Label Store — Specification</title> 7 + <style> 8 + /* ═══════════════════════════════════════════════════════════ 9 + CSS — self-contained, no external dependencies 10 + ═══════════════════════════════════════════════════════════ */ 11 + :root { 12 + --fg: #1a1a2e; 13 + --bg: #ffffff; 14 + --bg-alt: #f8f9fc; 15 + --muted: #6b7280; 16 + --border: #e2e8f0; 17 + --accent: #2563eb; 18 + --accent-light: #eff6ff; 19 + --code-bg: #f1f5f9; 20 + --header-bg: #0f172a; 21 + --header-fg: #f1f5f9; 22 + 23 + /* layer colours */ 24 + --c-camera: #ea580c; 25 + --c-gps: #2563eb; 26 + --c-gbif: #16a34a; 27 + --c-inat: #0d9488; 28 + --c-iucn: #dc2626; 29 + --c-sim: #7c3aed; 30 + --c-habitat: #65a30d; 31 + --c-habitat-fill: rgba(101,163,13,0.25); 32 + --c-range: #3b82f6; 33 + --c-aoh: #16a34a; 34 + --c-aoh-fill: rgba(22,163,74,0.3); 35 + 36 + --font-mono: 'SF Mono', 'Cascadia Code', 'Fira Code', 'Consolas', monospace; 37 + --font-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif; 38 + } 39 + 40 + * { margin: 0; padding: 0; box-sizing: border-box; } 41 + 42 + body { 43 + font-family: var(--font-sans); 44 + color: var(--fg); 45 + background: var(--bg); 46 + line-height: 1.7; 47 + font-size: 15px; 48 + } 49 + 50 + /* ── Header ────────────────────────────────────────────── */ 51 + header { 52 + background: var(--header-bg); 53 + color: var(--header-fg); 54 + padding: 2.5rem 2rem 2rem; 55 + } 56 + header .inner { 57 + max-width: 1200px; 58 + margin: 0 auto; 59 + } 60 + header h1 { 61 + font-size: 1.8rem; 62 + font-weight: 700; 63 + letter-spacing: -0.02em; 64 + } 65 + header .subtitle { 66 + color: #94a3b8; 67 + font-size: 1rem; 68 + margin-top: 0.4rem; 69 + } 70 + header .meta { 71 + margin-top: 0.8rem; 72 + font-size: 0.8rem; 73 + color: #64748b; 74 + } 75 + 76 + /* ── Layout 
────────────────────────────────────────────── */ 77 + main { 78 + max-width: 1200px; 79 + margin: 0 auto; 80 + padding: 0 2rem 6rem; 81 + } 82 + section { 83 + margin-top: 2.5rem; 84 + } 85 + h2 { 86 + font-size: 1.35rem; 87 + font-weight: 700; 88 + margin-bottom: 0.8rem; 89 + padding-bottom: 0.4rem; 90 + border-bottom: 2px solid var(--border); 91 + letter-spacing: -0.01em; 92 + } 93 + h3 { 94 + font-size: 1.05rem; 95 + font-weight: 600; 96 + margin-top: 1.5rem; 97 + margin-bottom: 0.5rem; 98 + color: #334155; 99 + } 100 + h4 { 101 + font-size: 0.95rem; 102 + font-weight: 600; 103 + margin-top: 1.2rem; 104 + margin-bottom: 0.4rem; 105 + } 106 + p { margin-bottom: 0.9rem; } 107 + a { color: var(--accent); text-decoration: none; } 108 + a:hover { text-decoration: underline; } 109 + ul, ol { margin-bottom: 0.9rem; padding-left: 1.5rem; } 110 + li { margin-bottom: 0.3rem; } 111 + strong { font-weight: 600; } 112 + 113 + /* ── Code blocks ───────────────────────────────────────── */ 114 + code { 115 + font-family: var(--font-mono); 116 + font-size: 0.88em; 117 + background: var(--code-bg); 118 + padding: 0.15em 0.4em; 119 + border-radius: 4px; 120 + } 121 + pre { 122 + background: var(--code-bg); 123 + border: 1px solid var(--border); 124 + border-radius: 6px; 125 + padding: 1rem 1.2rem; 126 + overflow-x: auto; 127 + font-family: var(--font-mono); 128 + font-size: 0.82rem; 129 + line-height: 1.6; 130 + margin-bottom: 1rem; 131 + } 132 + pre code { 133 + background: none; 134 + padding: 0; 135 + } 136 + 137 + /* ── Tables ────────────────────────────────────────────── */ 138 + table { 139 + width: 100%; 140 + border-collapse: collapse; 141 + margin-bottom: 1rem; 142 + font-size: 0.88rem; 143 + } 144 + th, td { 145 + text-align: left; 146 + padding: 0.5rem 0.8rem; 147 + border-bottom: 1px solid var(--border); 148 + } 149 + th { 150 + font-weight: 600; 151 + background: var(--bg-alt); 152 + } 153 + td code { 154 + font-size: 0.85em; 155 + } 156 + 157 + /* ── Map container 
─────────────────────────────────────── */ 158 + .map-container { 159 + display: grid; 160 + grid-template-columns: 1fr 280px; 161 + gap: 0; 162 + border: 1px solid var(--border); 163 + border-radius: 8px; 164 + overflow: hidden; 165 + background: var(--bg-alt); 166 + margin-bottom: 1.5rem; 167 + } 168 + @media (max-width: 800px) { 169 + .map-container { grid-template-columns: 1fr; } 170 + } 171 + 172 + .map-svg-wrap { 173 + position: relative; 174 + min-height: 520px; 175 + background: #f0f4f8; 176 + } 177 + .map-svg-wrap svg { 178 + width: 100%; 179 + height: 100%; 180 + display: block; 181 + } 182 + 183 + .map-sidebar { 184 + background: #fff; 185 + border-left: 1px solid var(--border); 186 + padding: 1rem; 187 + overflow-y: auto; 188 + max-height: 620px; 189 + font-size: 0.82rem; 190 + } 191 + .map-sidebar h3 { 192 + font-size: 0.95rem; 193 + margin-top: 0; 194 + margin-bottom: 0.6rem; 195 + color: var(--fg); 196 + } 197 + 198 + /* ── Layer toggles ─────────────────────────────────────── */ 199 + .layer-controls { 200 + display: flex; 201 + flex-wrap: wrap; 202 + gap: 0.3rem 0.8rem; 203 + padding: 0.7rem 1rem; 204 + background: #fff; 205 + border-bottom: 1px solid var(--border); 206 + font-size: 0.82rem; 207 + } 208 + .layer-controls label { 209 + display: inline-flex; 210 + align-items: center; 211 + gap: 0.3rem; 212 + cursor: pointer; 213 + white-space: nowrap; 214 + } 215 + .layer-controls .swatch { 216 + display: inline-block; 217 + width: 12px; 218 + height: 12px; 219 + border-radius: 3px; 220 + border: 1px solid rgba(0,0,0,0.15); 221 + } 222 + 223 + /* ── Stats bar ─────────────────────────────────────────── */ 224 + .stats-grid { 225 + display: grid; 226 + grid-template-columns: repeat(auto-fit, minmax(100px, 1fr)); 227 + gap: 0.5rem; 228 + margin-bottom: 1rem; 229 + } 230 + .stat-card { 231 + background: var(--bg-alt); 232 + border: 1px solid var(--border); 233 + border-radius: 6px; 234 + padding: 0.6rem 0.7rem; 235 + text-align: center; 236 + } 237 + 
.stat-card .val { 238 + font-size: 1.3rem; 239 + font-weight: 700; 240 + color: var(--accent); 241 + } 242 + .stat-card .lbl { 243 + font-size: 0.72rem; 244 + color: var(--muted); 245 + text-transform: uppercase; 246 + letter-spacing: 0.04em; 247 + } 248 + 249 + /* ── Detail panel ──────────────────────────────────────── */ 250 + .detail-panel { 251 + background: #fff; 252 + border: 1px solid var(--border); 253 + border-radius: 6px; 254 + padding: 0.8rem; 255 + margin-top: 0.5rem; 256 + font-size: 0.8rem; 257 + line-height: 1.5; 258 + display: none; 259 + } 260 + .detail-panel.show { display: block; } 261 + .detail-panel .dp-title { 262 + font-weight: 700; 263 + font-size: 0.9rem; 264 + margin-bottom: 0.4rem; 265 + color: var(--fg); 266 + } 267 + .detail-panel .dp-row { 268 + display: flex; 269 + gap: 0.5rem; 270 + padding: 0.2rem 0; 271 + border-bottom: 1px solid #f1f5f9; 272 + } 273 + .detail-panel .dp-key { 274 + font-weight: 600; 275 + min-width: 90px; 276 + color: #475569; 277 + font-family: var(--font-mono); 278 + font-size: 0.78rem; 279 + } 280 + .detail-panel .dp-val { 281 + color: #334155; 282 + word-break: break-all; 283 + } 284 + 285 + /* ── SVG interactive ───────────────────────────────────── */ 286 + svg .map-point { cursor: pointer; transition: r 0.15s; } 287 + svg .map-point:hover { r: 7; } 288 + svg .map-poly { cursor: pointer; } 289 + svg .map-poly:hover { opacity: 0.85; } 290 + 291 + /* ── Provenance DAG ────────────────────────────────────── */ 292 + .dag-container { 293 + background: var(--bg-alt); 294 + border: 1px solid var(--border); 295 + border-radius: 8px; 296 + overflow-x: auto; 297 + padding: 1rem; 298 + margin-bottom: 1.5rem; 299 + } 300 + .dag-container svg { 301 + display: block; 302 + margin: 0 auto; 303 + } 304 + 305 + /* ── Training cycle diagram ────────────────────────────── */ 306 + .cycle-diagram { 307 + background: var(--bg-alt); 308 + border: 1px solid var(--border); 309 + border-radius: 8px; 310 + overflow-x: auto; 311 + 
padding: 1rem; 312 + margin-bottom: 1.5rem; 313 + } 314 + 315 + /* ── Badges ────────────────────────────────────────────── */ 316 + .badge { 317 + display: inline-block; 318 + font-size: 0.7rem; 319 + font-family: var(--font-mono); 320 + padding: 0.15em 0.5em; 321 + border-radius: 3px; 322 + vertical-align: middle; 323 + } 324 + .badge-measured { background: #dbeafe; color: #1e40af; } 325 + .badge-derived { background: #fef3c7; color: #92400e; } 326 + .badge-simulated { background: #ede9fe; color: #5b21b6; } 327 + .badge-abstract { background: #f1f5f9; color: #475569; } 328 + 329 + /* ── TOC ───────────────────────────────────────────────── */ 330 + .toc { 331 + background: var(--bg-alt); 332 + border: 1px solid var(--border); 333 + border-radius: 8px; 334 + padding: 1.2rem 1.5rem; 335 + margin-bottom: 2rem; 336 + } 337 + .toc h3 { 338 + font-size: 0.9rem; 339 + margin-top: 0; 340 + margin-bottom: 0.6rem; 341 + text-transform: uppercase; 342 + letter-spacing: 0.05em; 343 + color: var(--muted); 344 + } 345 + .toc ol { 346 + padding-left: 1.2rem; 347 + font-size: 0.88rem; 348 + } 349 + .toc li { 350 + margin-bottom: 0.2rem; 351 + } 352 + .toc a { 353 + color: var(--fg); 354 + } 355 + .toc a:hover { 356 + color: var(--accent); 357 + } 358 + 359 + /* ── Principle box ─────────────────────────────────────── */ 360 + .principle { 361 + background: var(--accent-light); 362 + border-left: 3px solid var(--accent); 363 + padding: 0.8rem 1rem; 364 + margin-bottom: 1rem; 365 + border-radius: 0 6px 6px 0; 366 + } 367 + .principle strong { 368 + color: var(--accent); 369 + } 370 + </style> 371 + </head> 372 + <body> 373 + 374 + <!-- ═══════════════════════════════════════════════════════ 375 + HEADER 376 + ═══════════════════════════════════════════════════════ --> 377 + <header> 378 + <div class="inner"> 379 + <h1>Terradots Label Store &mdash; Specification</h1> 380 + <div class="subtitle">A data model for geospatial labels with full provenance, uncertainty, and spatial 
indexing</div> 381 + <div class="meta">Worked example: Area of Habitat for <em>Panthera leo</em> in the Serengeti &middot; 23 labels &middot; 10 activities &middot; CRS EPSG:4326 &middot; Hilbert level 12</div> 382 + </div> 383 + </header> 384 + 385 + <main> 386 + 387 + <!-- ═══════════════════════════════════════════════════════ 388 + TABLE OF CONTENTS 389 + ═══════════════════════════════════════════════════════ --> 390 + <section> 391 + <div class="toc"> 392 + <h3>Contents</h3> 393 + <ol> 394 + <li><a href="#map">Interactive Map &mdash; AOH Worked Example</a></li> 395 + <li><a href="#provenance">Provenance Graph</a></li> 396 + <li><a href="#training-cycle">Training Cycle</a></li> 397 + <li><a href="#design">Design Principles</a></li> 398 + <li><a href="#types">Type Specification</a></li> 399 + <li><a href="#constructors">Constructors</a></li> 400 + <li><a href="#accessors">Accessors</a></li> 401 + <li><a href="#fingerprinting">Fingerprinting</a></li> 402 + <li><a href="#storage">Storage Layer</a></li> 403 + </ol> 404 + </div> 405 + </section> 406 + 407 + <!-- ═══════════════════════════════════════════════════════ 408 + 1. INTERACTIVE MAP 409 + ═══════════════════════════════════════════════════════ --> 410 + <section id="map"> 411 + <h2>1. Interactive Map &mdash; AOH Worked Example</h2> 412 + 413 + <p>This map shows all 23 labels from the <em>Panthera leo</em> Area of Habitat example plotted at their real 414 + WGS 84 coordinates in the Serengeti ecosystem. Click any label to see its full metadata. 
Use the 415 + layer toggles below to show or hide each data source.</p> 416 + 417 + <!-- Statistics --> 418 + <div class="stats-grid"> 419 + <div class="stat-card"><div class="val">23</div><div class="lbl">Total Labels</div></div> 420 + <div class="stat-card"><div class="val">14</div><div class="lbl">Measured</div></div> 421 + <div class="stat-card"><div class="val">3</div><div class="lbl">Simulated</div></div> 422 + <div class="stat-card"><div class="val">6</div><div class="lbl">Derived</div></div> 423 + <div class="stat-card"><div class="val">3,420</div><div class="lbl">AOH km&sup2;</div></div> 424 + <div class="stat-card"><div class="val">70.5%</div><div class="lbl">Habitat</div></div> 425 + </div> 426 + 427 + <!-- Layer toggles --> 428 + <div class="layer-controls" id="layerControls"> 429 + <label><input type="checkbox" data-layer="camera" checked><span class="swatch" style="background:var(--c-camera)"></span> Camera Traps</label> 430 + <label><input type="checkbox" data-layer="gps" checked><span class="swatch" style="background:var(--c-gps)"></span> GPS Collars</label> 431 + <label><input type="checkbox" data-layer="gbif" checked><span class="swatch" style="background:var(--c-gbif)"></span> GBIF</label> 432 + <label><input type="checkbox" data-layer="inat" checked><span class="swatch" style="background:var(--c-inat)"></span> iNaturalist</label> 433 + <label><input type="checkbox" data-layer="iucn" checked><span class="swatch" style="background:var(--c-iucn)"></span> IUCN Range</label> 434 + <label><input type="checkbox" data-layer="iucn-hab" checked><span class="swatch" style="background:var(--c-iucn)"></span> IUCN Habitat</label> 435 + <label><input type="checkbox" data-layer="sim" checked><span class="swatch" style="background:var(--c-sim)"></span> Simulated (LV)</label> 436 + <label><input type="checkbox" data-layer="habitat" checked><span class="swatch" style="background:var(--c-habitat)"></span> Habitat Tiles</label> 437 + <label><input type="checkbox" 
data-layer="range" checked><span class="swatch" style="background:var(--c-range)"></span> Species Range</label> 438 + <label><input type="checkbox" data-layer="aoh" checked><span class="swatch" style="background:var(--c-aoh)"></span> AOH Patches</label> 439 + </div> 440 + 441 + <!-- Map + sidebar --> 442 + <div class="map-container"> 443 + <div class="map-svg-wrap" id="mapWrap"> 444 + <svg id="mapSvg" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg"> 445 + </svg> 446 + </div> 447 + <div class="map-sidebar" id="mapSidebar"> 448 + <h3>Label Details</h3> 449 + <p style="color:var(--muted); font-size:0.82rem;">Click a label on the map to view its metadata, origin, classification, and properties.</p> 450 + <div class="detail-panel" id="detailPanel"></div> 451 + </div> 452 + </div> 453 + </section> 454 + 455 + <!-- ═══════════════════════════════════════════════════════ 456 + 2. PROVENANCE GRAPH 457 + ═══════════════════════════════════════════════════════ --> 458 + <section id="provenance"> 459 + <h2>2. Provenance Graph</h2> 460 + <p>The provenance DAG traces how the AOH label (<code>aoh-001</code>) was derived from upstream 461 + sources. Every arrow means &ldquo;was computed from&rdquo;. Measured labels (leaves) have no incoming edges. 462 + Simulated labels are marked with dashed borders. 
Derived labels show their method.</p> 463 + 464 + <div class="dag-container" id="dagContainer"> 465 + <svg id="dagSvg" xmlns="http://www.w3.org/2000/svg"></svg> 466 + </div> 467 + 468 + <p>The full provenance tree as text:</p> 469 + <pre><code>AOH polygon (aoh-001) method: aoh:iucn-2022:range-intersect-habitat 470 + +-- species_range (range-001) method: alpha-shape:alpha-0.005 471 + | +-- ct-001 Camera trap (34.82, -2.33) Measured 472 + | +-- ct-002 Camera trap (34.83, -2.32) Measured 473 + | +-- ct-003 Camera trap (35.01, -2.15) Measured 474 + | +-- gps-001 GPS leo-007 (34.81, -2.34) Measured (via Movebank) 475 + | +-- gps-002 GPS leo-007 (34.84, -2.31) Measured (via Movebank) 476 + | +-- gps-003 GPS leo-007 (34.91, -2.28) Measured (via Movebank) 477 + | +-- gps-004 GPS leo-012 (35.05, -2.10) Measured (via Movebank) 478 + | +-- gbif-001 GBIF (34.85, -2.35) Measured (via GBIF) 479 + | +-- gbif-002 GBIF (35.40, -2.50) Measured (via GBIF) 480 + | +-- inat-001 iNaturalist (34.95, -2.20) Measured (via iNat) 481 + +-- iucn-range-001 IUCN expert range Measured (via IUCN) 482 + +-- hab-001 Habitat tile: core savanna Derived from IUCN hab prefs 483 + | +-- iucn-hab-001 Savanna preference Measured (via IUCN) 484 + | +-- iucn-hab-002 Shrubland preference Measured (via IUCN) 485 + +-- hab-002 Habitat tile: savanna-shrubland Derived from IUCN hab prefs 486 + +-- iucn-hab-001 (shared) 487 + +-- iucn-hab-002 (shared)</code></pre> 488 + 489 + <p>Note: The training set (<code>ts-001</code>) feeds into the habitat classifier but is not a direct 490 + source of the AOH label. It includes all measured observations plus 3 synthetic (Lotka-Volterra) 491 + labels for a 23% synthetic fraction.</p> 492 + </section> 493 + 494 + <!-- ═══════════════════════════════════════════════════════ 495 + 3. TRAINING CYCLE 496 + ═══════════════════════════════════════════════════════ --> 497 + <section id="training-cycle"> 498 + <h2>3. 
Training Cycle</h2> 499 + <p>Labels flow through a <strong>training/inference cycle</strong>, not a streaming recomputation 500 + pipeline. The cycle has discrete phases, each producing labels that become inputs to the next.</p> 501 + 502 + <div class="cycle-diagram" id="cycleContainer"> 503 + <svg id="cycleSvg" xmlns="http://www.w3.org/2000/svg"></svg> 504 + </div> 505 + 506 + <h3>Cycle Phases</h3> 507 + <ol> 508 + <li><strong>Observations accumulate.</strong> Camera traps trigger, GPS collars record fixes, citizen 509 + scientists upload sightings, museum records are digitised. Each produces a <span class="badge badge-measured">Measured</span> label. 510 + Registry imports (GBIF, Movebank, iNaturalist) carry their <code>via</code> URI for provenance.</li> 511 + 512 + <li><strong>Training set assembled.</strong> A <span class="badge badge-derived">Derived</span> label 513 + (<code>ts-001</code>) records exactly which observations were selected, plus any 514 + <span class="badge badge-simulated">Simulated</span> augmentation. The synthetic fraction (here 23%) 515 + is stored in <code>properties</code> for transparency.</li> 516 + 517 + <li><strong>Model trained.</strong> The TESSERA habitat classifier trains on the assembled set. 518 + The activity record links the training run to its notebook, parameters, and timestamp.</li> 519 + 520 + <li><strong>Habitat classified.</strong> Each landscape tile receives a suitability classification 521 + (savanna 78%, shrubland 13%, etc.) expressed as <code>class_dist</code>. Tiles above the threshold 522 + become suitable; those below (cropland, settlement) are excluded.</li> 523 + 524 + <li><strong>Species range computed.</strong> An alpha-shape from <em>measured-only</em> observations 525 + (simulated labels excluded via <code>is_simulated</code>). 
This ensures the range reflects where 526 + lions have actually been observed.</li> 527 + 528 + <li><strong>AOH computed.</strong> The intersection of species range with suitable habitat tiles, 529 + validated against the IUCN expert range. Result: 3,420 km&sup2; of suitable habitat out of 530 + 4,850 km&sup2; total range (70.5%).</li> 531 + </ol> 532 + 533 + <h3>Recomputation</h3> 534 + <p>When new observations arrive or the model retrains (TESSERA v3.1 &rarr; v3.2), downstream 535 + derivations recompute. Each recomputation produces a <em>new</em> label with a new activity record; 536 + the old version is retained for comparison. This is not streaming &mdash; it is a deliberate 537 + batch cycle where each phase completes before the next begins.</p> 538 + </section> 539 + 540 + <!-- ═══════════════════════════════════════════════════════ 541 + 4. DESIGN PRINCIPLES 542 + ═══════════════════════════════════════════════════════ --> 543 + <section id="design"> 544 + <h2>4. Design Principles</h2> 545 + 546 + <div class="principle"> 547 + <strong>Coordinates live in CRS space.</strong> 548 + All coordinates are in the document's native Coordinate Reference System. Pixel-space mapping 549 + (affine transforms, SVG viewBox) is a serialisation concern, not a data model concern. The CRS 550 + is specified per document as any string that <a href="https://proj.org/">PROJ</a> can resolve: 551 + EPSG codes (<code>"EPSG:4326"</code>), WKT2 strings, or PROJ pipeline definitions. 552 + </div> 553 + 554 + <div class="principle"> 555 + <strong>Origin distinguishes measured, derived, and simulated.</strong> 556 + Every label records how it was produced. <em>Measured</em> labels come from direct observation or 557 + registry import. <em>Derived</em> labels are computed from other labels (convex hulls, buffers, 558 + merges). 
<em>Simulated</em> labels come from theoretical models and must remain identifiable as 559 + synthetic &mdash; they augment training data but do not represent real-world observations. 560 + </div> 561 + 562 + <div class="principle"> 563 + <strong>URIs identify observers and registries.</strong> 564 + Observers (sensors, humans) and external registries are identified by URI. The URI scheme encodes 565 + the kind of source. Adding a new kind requires no code changes &mdash; just use a new URI scheme. 566 + </div> 567 + 568 + <table> 569 + <thead><tr><th>URI</th><th>Meaning</th></tr></thead> 570 + <tbody> 571 + <tr><td><code>orcid:0000-0001-2345-6789</code></td><td>Human observer (ORCID)</td></tr> 572 + <tr><td><code>https://ror.org/035dkdb55</code></td><td>Institution (ROR)</td></tr> 573 + <tr><td><code>urn:sensor:gps:trimble-r12-0042</code></td><td>GPS receiver</td></tr> 574 + <tr><td><code>urn:sensor:camera-trap:ct-0042</code></td><td>Camera trap</td></tr> 575 + <tr><td><code>gbif:4023589127</code></td><td>GBIF occurrence record</td></tr> 576 + <tr><td><code>inaturalist:observation/12345</code></td><td>iNaturalist observation</td></tr> 577 + <tr><td><code>osm:node/123456</code></td><td>OpenStreetMap node</td></tr> 578 + <tr><td><code>movebank:study/1234/individual/leo-007/event/98001</code></td><td>Movebank GPS event</td></tr> 579 + <tr><td><code>iucn:redlist:22/Panthera-leo:range:2024.1</code></td><td>IUCN range polygon</td></tr> 580 + <tr><td><code>fairground:notebook/lotka-volterra-serengeti:v4</code></td><td>Simulation model</td></tr> 581 + </tbody> 582 + </table> 583 + 584 + <div class="principle"> 585 + <strong>Identity and spatial indexing are separate.</strong> 586 + A label has two name components: <code>cell</code> (Hilbert curve cell, recomputed on reprojection) 587 + and <code>id</code> (UUID, stable forever). Concatenating <code>cell-id</code> gives a spatially-sortable 588 + unique name. Any sorted index gets spatial clustering for free. 
589 + </div> 590 + 591 + <div class="principle"> 592 + <strong>Classification is a probability distribution.</strong> 593 + A label's class is expressed through <code>class_dist</code>, a list of 594 + <code>(class_name, probability)</code> pairs ordered by decreasing probability. 595 + A definite classification is <code>[("Panthera leo", 1.0)]</code>. 596 + An uncertain classification distributes probability across candidates. 597 + An unclassified label has an empty list. 598 + </div> 599 + 600 + <div class="principle"> 601 + <strong>Temporal data follows Darwin Core.</strong> 602 + The <code>event_date</code> field follows the Darwin Core temporal interpretation convention 603 + (ISO 8601-1:2019). It records when the observation was made, not when the label was imported. 604 + Supported formats: precise dates (<code>"2023-09-18"</code>), imprecise dates (<code>"2023-09"</code> 605 + or <code>"2023"</code>), date-times (<code>"2023-09-18T13:27:00Z"</code>), and intervals 606 + (<code>"2023-09-05/2023-09-18"</code>). 607 + </div> 608 + 609 + <div class="principle"> 610 + <strong>Deduplication is a derivation.</strong> 611 + Labels imported from multiple sources may refer to the same real-world feature. Dedup is modelled 612 + as a derivation: find candidate matches (same Hilbert cell, class agreement, temporal overlap), 613 + let an expert decide, then merge via <code>Derived { sources = [a; b]; method_ = "manual-merge" }</code>. 614 + Both originals are kept for full provenance. 615 + </div> 616 + 617 + <div class="principle"> 618 + <strong>Training cycle, not streaming recomputation.</strong> 619 + Downstream derivations (habitat classification, species range, AOH) are batch operations that 620 + recompute when inputs change. Each run gets a new activity record. Old and new versions coexist 621 + for comparison. This is deliberate: ecological models need stable training windows, not 622 + continuous flux. 
623 + </div> 624 + </section> 625 + 626 + <!-- ═══════════════════════════════════════════════════════ 627 + 5. TYPE SPECIFICATION 628 + ═══════════════════════════════════════════════════════ --> 629 + <section id="types"> 630 + <h2>5. Type Specification</h2> 631 + <p>All types are defined in the <code>Terradots</code> OCaml module. The data model is independent of 632 + any serialisation format (SVG, GeoJSON, GeoParquet, etc.).</p> 633 + 634 + <h3 id="type-crs">Coordinate Reference Systems</h3> 635 + <pre><code>(** Any string that PROJ can resolve: EPSG codes, WKT2, PROJ pipelines. *) 636 + type crs = string 637 + 638 + val wgs84 : crs (* "EPSG:4326" -- lon/lat in degrees *) 639 + val web_mercator : crs (* "EPSG:3857" -- metres, for web tiles *)</code></pre> 640 + <p>The CRS determines the units and meaning of all <code>point</code> coordinates in the document. 641 + For EPSG:4326, <code>x</code> is longitude (degrees east), <code>y</code> is latitude (degrees north). 642 + For projected CRS (UTM, Web Mercator), <code>x</code> is easting (metres), <code>y</code> is 643 + northing (metres).</p> 644 + 645 + <h3 id="type-temporal">Temporal</h3> 646 + <pre><code>(** Abstract type -- construct with event_date_of_string, 647 + inspect with string_of_event_date. *) 648 + type event_date 649 + 650 + val event_date_of_string : string -&gt; event_date 651 + val string_of_event_date : event_date -&gt; string</code></pre> 652 + <p>Valid forms: precise dates (<code>"2023-09-18"</code>), imprecise dates (<code>"2023-09"</code>, 653 + <code>"2023"</code>), date-times (<code>"2023-09-18T13:27:00Z"</code>), intervals 654 + (<code>"2023-09-05/2023-09-18"</code>). The abstraction boundary allows future parsing, validation, 655 + and temporal overlap queries. <span class="badge badge-abstract">abstract</span></p> 656 + 657 + <h3 id="type-cell">Spatial Indexing (Hilbert Cell)</h3> 658 + <pre><code>(** Abstract type -- a hex-encoded Hilbert curve cell index. 
*) 659 + type cell 660 + 661 + val cell_of_string : string -&gt; cell 662 + val string_of_cell : cell -&gt; string</code></pre> 663 + <p>The Hilbert curve maps 2D coordinates to a 1D index preserving spatial locality. Nearby points 664 + in CRS space usually get nearby cell values. <span class="badge badge-abstract">abstract</span></p> 665 + 666 + <h4>Hilbert Level Table (EPSG:4326)</h4> 667 + <table> 668 + <thead><tr><th>Level</th><th>Cell size</th><th>Hex chars</th><th>Use case</th></tr></thead> 669 + <tbody> 670 + <tr><td>8</td><td>~1.4 km</td><td>2</td><td>Coarse regional indexing</td></tr> 671 + <tr><td>12</td><td>~88 m</td><td>3</td><td>Standard (this example)</td></tr> 672 + <tr><td>16</td><td>~5.5 m</td><td>4</td><td>High-resolution surveys</td></tr> 673 + <tr><td>20</td><td>~0.3 m</td><td>5</td><td>Sub-metre precision</td></tr> 674 + </tbody> 675 + </table> 676 + 677 + <h3 id="type-geometry">Geometry</h3> 678 + <pre><code>(** A point in the document's native CRS. *) 679 + type point = { x : float; y : float } 680 + 681 + (** Follows OGC Simple Features / ISO 19125. *) 682 + type geometry = 683 + | Point of point 684 + | Polygon of point list (* exterior ring, closed *) 685 + | Multi of geometry list (* GeometryCollection / Multi* *) 686 + 687 + (** Representative point for spatial indexing. *) 688 + val centroid : geometry -&gt; point</code></pre> 689 + <p>Centroid computation: <strong>Point</strong> returns itself. <strong>Polygon</strong> returns the 690 + arithmetic mean of ring vertices. 
<strong>Multi</strong> returns the centroid of centroids 691 + (unweighted &mdash; sufficient for indexing, not for area-weighted analysis).</p> 692 + 693 + <h3 id="type-origin">Origin</h3> 694 + <pre><code>type origin = 695 + | Measured of { 696 + observer : string option; (* URI of observer *) 697 + via : string option; (* URI of registry record *) 698 + license : string option; (* SPDX identifier *) 699 + accuracy_m : float option; (* positional uncertainty, metres *) 700 + } 701 + | Derived of { 702 + sources : string list; (* IDs of source labels *) 703 + method_ : string; (* algorithm identifier *) 704 + } 705 + | Simulated of { 706 + model : string; (* URI of simulation model *) 707 + run_id : string; (* unique run identifier *) 708 + }</code></pre> 709 + 710 + <table> 711 + <thead><tr><th>Variant</th><th>Fields</th><th>Description</th></tr></thead> 712 + <tbody> 713 + <tr> 714 + <td><span class="badge badge-measured">Measured</span></td> 715 + <td><code>observer</code>, <code>via</code>, <code>license</code>, <code>accuracy_m</code></td> 716 + <td>Direct observation or registry import. <code>observer</code> is required for direct obs, optional for imports. 717 + <code>via</code> is the registry URI (GBIF, Movebank, iNat). <code>accuracy_m</code> is positional uncertainty in metres.</td> 718 + </tr> 719 + <tr> 720 + <td><span class="badge badge-derived">Derived</span></td> 721 + <td><code>sources</code>, <code>method_</code></td> 722 + <td>Computed from other labels. <code>sources</code> are label IDs within the same document. 723 + <code>method_</code> identifies the algorithm (e.g. <code>"convex-hull"</code>, <code>"manual-merge"</code>, 724 + <code>"alpha-shape:alpha-0.005"</code>).</td> 725 + </tr> 726 + <tr> 727 + <td><span class="badge badge-simulated">Simulated</span></td> 728 + <td><code>model</code>, <code>run_id</code></td> 729 + <td>Produced by a theoretical model. <code>model</code> URI identifies the code (e.g. a Fairground notebook). 
730 + <code>run_id</code> links all labels from the same execution.</td> 731 + </tr> 732 + </tbody> 733 + </table> 734 + 735 + <h3 id="type-activity">Activity (Provenance Audit Record)</h3> 736 + <pre><code>type activity = { 737 + activity_id : string; 738 + agent : string; (* who/what: URI, email, tool *) 739 + date : string; (* ISO 8601 *) 740 + description : string option; (* free-text note *) 741 + }</code></pre> 742 + <p>An activity captures the &ldquo;who&rdquo; and &ldquo;when&rdquo; of label creation or derivation. Multiple labels may 743 + share the same activity (e.g. a batch import). Labels reference activities via their 744 + <code>activity</code> field.</p> 745 + 746 + <h3 id="type-label">Label</h3> 747 + <pre><code>type label = { 748 + cell : cell; (* Hilbert cell index *) 749 + id : string; (* unique identifier *) 750 + geometry : geometry; (* spatial extent *) 751 + origin : origin; (* how produced *) 752 + event_date : event_date option; (* when observed *) 753 + confidence : float option; (* semantic confidence in [0,1] *) 754 + class_dist : (string * float) list; (* probability distribution *) 755 + activity : string option; (* activity ID *) 756 + properties : (string * string) list; (* extensible metadata *) 757 + } 758 + 759 + val label_name : label -&gt; string (* cell ^ "-" ^ id *)</code></pre> 760 + <p>The <code>label</code> type is the central data structure. All fields except <code>cell</code>, 761 + <code>id</code>, <code>geometry</code>, and <code>origin</code> are optional or may be empty.</p> 762 + 763 + <h3 id="type-annotation">Annotation</h3> 764 + <pre><code>type annotation = { 765 + id : string; 766 + text : string; (* free-text content *) 767 + anchors : string list; (* label IDs this annotates *) 768 + }</code></pre> 769 + <p>Annotations provide commentary, corrections, or contextual notes without modifying labels. 
770 + An annotation may span multiple labels.</p> 771 + 772 + <h3 id="type-group">Group</h3> 773 + <pre><code>type group = { 774 + id : string; 775 + activity : string option; (* activity that created this group *) 776 + members : string list; (* label IDs *) 777 + }</code></pre> 778 + <p>Groups organise labels into logical collections (field campaigns, seasonal surveys, thematic subsets). 779 + Purely organisational &mdash; they do not affect label semantics. A label may belong to multiple groups.</p> 780 + 781 + <h3 id="type-document">Document</h3> 782 + <pre><code>type document = { 783 + crs : crs; 784 + level : int; (* Hilbert curve level *) 785 + provenance : activity list; 786 + labels : label list; 787 + annotations : annotation list; 788 + groups : group list; 789 + } 790 + 791 + val empty_document : crs:crs -&gt; ?level:int -&gt; unit -&gt; document</code></pre> 792 + <p>The top-level container: a set of labels in a common CRS, with provenance records, annotations, 793 + and groups. The <code>level</code> parameter defaults to 12 (~88 m cells for EPSG:4326).</p> 794 + </section> 795 + 796 + <!-- ═══════════════════════════════════════════════════════ 797 + 6. CONSTRUCTORS 798 + ═══════════════════════════════════════════════════════ --> 799 + <section id="constructors"> 800 + <h2>6. Constructors</h2> 801 + <p>Convenience functions that enforce common patterns. All require <code>~cell</code> and <code>~id</code>. 802 + Classification is always via <code>~class_dist</code>.</p> 803 + 804 + <h3>make_point</h3> 805 + <pre><code>val make_point : 806 + cell:cell -&gt; id:string -&gt; 807 + x:float -&gt; y:float -&gt; 808 + observer:string -&gt; 809 + ?accuracy_m:float -&gt; 810 + ?event_date:event_date -&gt; ?confidence:float -&gt; 811 + ?class_dist:(string * float) list -&gt; 812 + ?activity:string -&gt; 813 + ?properties:(string * string) list -&gt; 814 + unit -&gt; label</code></pre> 815 + <p>Construct a measured point label from a direct observation. 
Requires an <code>observer</code> URI.</p> 816 + 817 + <h3>make_polygon</h3> 818 + <pre><code>val make_polygon : 819 + cell:cell -&gt; id:string -&gt; 820 + ring:point list -&gt; 821 + observer:string -&gt; 822 + ?accuracy_m:float -&gt; 823 + ?event_date:event_date -&gt; ?confidence:float -&gt; 824 + ?class_dist:(string * float) list -&gt; 825 + ?activity:string -&gt; 826 + ?properties:(string * string) list -&gt; 827 + unit -&gt; label</code></pre> 828 + <p>Construct a measured polygon label. The ring must be closed (last point = first point).</p> 829 + 830 + <h3>make_imported</h3> 831 + <pre><code>val make_imported : 832 + cell:cell -&gt; id:string -&gt; 833 + geometry:geometry -&gt; 834 + via:string -&gt; 835 + ?observer:string -&gt; ?license:string -&gt; 836 + ?accuracy_m:float -&gt; 837 + ?event_date:event_date -&gt; ?confidence:float -&gt; 838 + ?class_dist:(string * float) list -&gt; 839 + ?activity:string -&gt; 840 + ?properties:(string * string) list -&gt; 841 + unit -&gt; label</code></pre> 842 + <p>Construct a label imported from an external registry. The <code>via</code> URI identifies the 843 + registry record. Observer is optional (many registries do not expose the original collector).</p> 844 + 845 + <h3>make_derived</h3> 846 + <pre><code>val make_derived : 847 + cell:cell -&gt; id:string -&gt; 848 + geometry:geometry -&gt; 849 + sources:string list -&gt; 850 + method_:string -&gt; 851 + ?event_date:event_date -&gt; ?confidence:float -&gt; 852 + ?class_dist:(string * float) list -&gt; 853 + ?activity:string -&gt; 854 + ?properties:(string * string) list -&gt; 855 + unit -&gt; label</code></pre> 856 + <p>Construct a derived label. 
Deduplication merges are a special case: 857 + <code>make_derived ~sources:["a";"b"] ~method_:"manual-merge" ...</code></p> 858 + 859 + <h3>make_simulated</h3> 860 + <pre><code>val make_simulated : 861 + cell:cell -&gt; id:string -&gt; 862 + geometry:geometry -&gt; 863 + model:string -&gt; 864 + run_id:string -&gt; 865 + ?event_date:event_date -&gt; ?confidence:float -&gt; 866 + ?class_dist:(string * float) list -&gt; 867 + ?activity:string -&gt; 868 + ?properties:(string * string) list -&gt; 869 + unit -&gt; label</code></pre> 870 + <p>Construct a simulated label. The <code>model</code> URI identifies the simulation code; 871 + <code>run_id</code> links all labels from the same execution.</p> 872 + </section> 873 + 874 + <!-- ═══════════════════════════════════════════════════════ 875 + 7. ACCESSORS 876 + ═══════════════════════════════════════════════════════ --> 877 + <section id="accessors"> 878 + <h2>7. Accessors</h2> 879 + <pre><code>(** Most likely class from class_dist, or None if empty. *) 880 + val primary_class : label -&gt; string option 881 + 882 + (** Positional accuracy in metres, if Measured. *) 883 + val accuracy_of : label -&gt; float option 884 + 885 + (** Source label IDs, if Derived. Empty otherwise. *) 886 + val sources_of : label -&gt; string list 887 + 888 + (** Registry URI, if imported via a registry. *) 889 + val via_of : label -&gt; string option 890 + 891 + (** True for Simulated labels. *) 892 + val is_simulated : label -&gt; bool</code></pre> 893 + <p>These accessors provide safe pattern-matching over the <code>origin</code> variant without 894 + exposing internal structure. <code>is_simulated</code> is used by the species-range pipeline to 895 + exclude synthetic observations.</p> 896 + </section> 897 + 898 + <!-- ═══════════════════════════════════════════════════════ 899 + 8. FINGERPRINTING 900 + ═══════════════════════════════════════════════════════ --> 901 + <section id="fingerprinting"> 902 + <h2>8. 
Fingerprinting</h2> 903 + <pre><code>(** Coarse key for deduplication candidates. 904 + Returns cell ^ "|" ^ primary_class (or "_" if unclassified). *) 905 + val fingerprint : label -&gt; string</code></pre> 906 + <p>A fingerprint combines the Hilbert cell (spatial locality) with the primary class. Two labels 907 + with the same fingerprint are worth comparing for potential deduplication. Different fingerprints 908 + only mean the labels fall in different cells or differ in primary class; nearby labels that straddle a cell boundary are not paired.</p> 909 + <p>The <code>event_date</code> is deliberately excluded: the same real-world feature observed at 910 + different times should still match as a candidate, so a human reviewer can decide whether they 911 + represent the same feature.</p> 912 + </section> 913 + 914 + <!-- ═══════════════════════════════════════════════════════ 915 + 9. STORAGE LAYER 916 + ═══════════════════════════════════════════════════════ --> 917 + <section id="storage"> 918 + <h2>9. Storage Layer</h2> 919 + <p>The data model is independent of how labels are stored and indexed. This section specifies the 920 + contract between the core types and a storage backend.</p> 921 + 922 + <h3>Hilbert Cell Computation</h3> 923 + <p>The <code>cell</code> field on each label is a hex-encoded Hilbert curve cell index, computed 924 + from the label's <code>centroid</code> at the document's <code>level</code>. The storage layer 925 + must provide:</p> 926 + <pre><code>val hilbert_cell : level:int -&gt; crs:crs -&gt; point -&gt; cell</code></pre> 927 + 928 + <h3>Why Hilbert, not Geohash</h3> 929 + <p>Geohash uses a Z-order (Morton) curve. Z-order curves have discontinuities at certain cell 930 + boundaries: two points close in 2D space can receive very different hash values when they fall 931 + on opposite sides of a major subdivision. The Hilbert curve avoids these jumps: cells that are consecutive on 932 + the curve are <em>always</em> spatially adjacent. 
This gives more uniform spatial clustering and 933 + fewer edge-case misses in proximity queries.</p> 934 + 935 + <h3>Reprojection</h3> 936 + <p>When a document's CRS changes, all <code>cell</code> values must be recomputed from the 937 + (reprojected) geometries. The <code>id</code> fields remain stable &mdash; identity is independent 938 + of coordinate system.</p> 939 + 940 + <h3>Sorted Keys</h3> 941 + <p>Concatenating <code>cell ^ "-" ^ id</code> (via <code>label_name</code>) produces a key that 942 + sorts spatially. Any system that maintains sorted order (B-tree, LSM tree, lexicographic file 943 + listing) gets spatial clustering for free: a prefix scan on a cell value retrieves all labels 944 + in that spatial neighbourhood.</p> 945 + </section> 946 + 947 + </main> 948 + 949 + <!-- ═══════════════════════════════════════════════════════ 950 + JAVASCRIPT -- all interactive behaviour 951 + ═══════════════════════════════════════════════════════ --> 952 + <script> 953 + (function() { 954 + "use strict"; 955 + 956 + // ═══════════════════════════════════════════════════════ 957 + // LABEL DATA -- all 23 labels from the AOH example 958 + // ═══════════════════════════════════════════════════════ 959 + 960 + var labels = [ 961 + // Camera traps 962 + {id:"ct-001", layer:"camera", cell:"b7a", type:"point", x:34.82, y:-2.33, 963 + origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-17", 964 + accuracy_m:5.0, confidence:0.97, 965 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-12T05:42:00Z", 966 + activity:"act-field-2024", 967 + props:{image_uri:"s3://slp/ct17/IMG_4821.jpg", individual_count:"3", behaviour:"resting"}}, 968 + {id:"ct-002", layer:"camera", cell:"b7a", type:"point", x:34.83, y:-2.32, 969 + origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-17", 970 + accuracy_m:5.0, confidence:0.92, 971 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-14T19:15:00Z", 972 + activity:"act-field-2024", 973 + 
props:{image_uri:"s3://slp/ct17/IMG_4903.jpg", individual_count:"1", behaviour:"walking"}}, 974 + {id:"ct-003", layer:"camera", cell:"b7c", type:"point", x:35.01, y:-2.15, 975 + origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-42", 976 + accuracy_m:5.0, confidence:0.88, 977 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-18T03:22:00Z", 978 + activity:"act-field-2024", 979 + props:{image_uri:"s3://slp/ct42/IMG_1207.jpg", individual_count:"2"}}, 980 + {id:"ct-004", layer:"camera", cell:"b7d", type:"point", x:35.22, y:-2.45, 981 + origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-55", 982 + accuracy_m:null, confidence:null, 983 + class_dist:[], event_date:"2024-06-20T22:10:00Z", 984 + activity:"act-field-2024", 985 + props:{image_uri:"s3://slp/ct55/IMG_0891.jpg", trigger:"motion", species_detected:"none"}}, 986 + 987 + // GPS collars -- leo-007 988 + {id:"gps-001", layer:"gps", cell:"b7a", type:"point", x:34.81, y:-2.34, 989 + origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-007", 990 + via:"movebank:study/1234/individual/leo-007/event/98001", license:"CC-BY-NC-4.0", 991 + accuracy_m:3.5, confidence:null, 992 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-10T06:00:00Z", 993 + activity:"act-movebank-import", 994 + props:{individual_id:"leo-007", fix_type:"3D", hdop:"0.9"}}, 995 + {id:"gps-002", layer:"gps", cell:"b7a", type:"point", x:34.84, y:-2.31, 996 + origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-007", 997 + via:"movebank:study/1234/individual/leo-007/event/98002", license:"CC-BY-NC-4.0", 998 + accuracy_m:4.2, confidence:null, 999 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-10T12:00:00Z", 1000 + activity:"act-movebank-import", 1001 + props:{individual_id:"leo-007", fix_type:"3D", hdop:"1.1"}}, 1002 + {id:"gps-003", layer:"gps", cell:"b7b", type:"point", x:34.91, y:-2.28, 1003 + origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-007", 1004 + 
via:"movebank:study/1234/individual/leo-007/event/98003", license:"CC-BY-NC-4.0", 1005 + accuracy_m:5.1, confidence:null, 1006 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-11T06:00:00Z", 1007 + activity:"act-movebank-import", 1008 + props:{individual_id:"leo-007", fix_type:"3D", hdop:"1.4"}}, 1009 + 1010 + // GPS collar -- leo-012 1011 + {id:"gps-004", layer:"gps", cell:"b7c", type:"point", x:35.05, y:-2.10, 1012 + origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-012", 1013 + via:"movebank:study/1234/individual/leo-012/event/98501", license:"CC-BY-NC-4.0", 1014 + accuracy_m:3.0, confidence:null, 1015 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-12T06:00:00Z", 1016 + activity:"act-movebank-import", 1017 + props:{individual_id:"leo-012"}}, 1018 + 1019 + // GBIF 1020 + {id:"gbif-001", layer:"gbif", cell:"b7a", type:"point", x:34.85, y:-2.35, 1021 + origin:"Measured", via:"gbif:4023589127", license:"CC-BY-4.0", 1022 + accuracy_m:100.0, confidence:null, 1023 + class_dist:[["Panthera leo",1.0]], event_date:"2022-08-14", 1024 + activity:"act-gbif-import", 1025 + props:{gbif_dataset:"serengeti-biodiversity-survey", basis_of_record:"HUMAN_OBSERVATION", 1026 + recorded_by:"Tanzania Wildlife Research Institute"}}, 1027 + {id:"gbif-002", layer:"gbif", cell:"b7e", type:"point", x:35.40, y:-2.50, 1028 + origin:"Measured", via:"gbif:4023589999", license:"CC-BY-4.0", 1029 + accuracy_m:500.0, confidence:null, 1030 + class_dist:[["Panthera leo",1.0]], event_date:"2021", 1031 + activity:"act-gbif-import", 1032 + props:{gbif_dataset:"ngorongoro-mammal-survey", basis_of_record:"HUMAN_OBSERVATION"}}, 1033 + 1034 + // iNaturalist 1035 + {id:"inat-001", layer:"inat", cell:"b7b", type:"point", x:34.95, y:-2.20, 1036 + origin:"Measured", observer:"inaturalist:user/safari_dave", 1037 + via:"inaturalist:observation/182345678", license:"CC-BY-NC-4.0", 1038 + accuracy_m:50.0, confidence:0.95, 1039 + class_dist:[["Panthera leo",1.0]], 
event_date:"2023-07-22T16:30:00Z", 1040 + activity:"act-inat-import", 1041 + props:{quality_grade:"research", num_identifications:"5"}}, 1042 + 1043 + // IUCN range 1044 + {id:"iucn-range-001", layer:"iucn", cell:"b70", type:"polygon", 1045 + ring:[[34,-3],[36,-3],[36,-1],[34,-1],[34,-3]], 1046 + origin:"Measured", via:"iucn:redlist:22/Panthera-leo:range:2024.1", license:"CC-BY-NC-4.0", 1047 + accuracy_m:null, confidence:null, 1048 + class_dist:[["Panthera leo",1.0]], event_date:"2024", 1049 + activity:"act-iucn-import", 1050 + props:{iucn_status:"VU", iucn_criteria:"A2abcd", population_trend:"decreasing", 1051 + range_type:"extant:resident", habitat_codes:"1.5;1.6;2;3;14.1"}}, 1052 + 1053 + // IUCN habitat preferences (point markers at 35,-2) 1054 + {id:"iucn-hab-001", layer:"iucn-hab", cell:"b70", type:"point", x:35.0, y:-2.0, 1055 + origin:"Measured", via:"iucn:redlist:22/Panthera-leo:habitat:2", license:"CC-BY-NC-4.0", 1056 + accuracy_m:null, confidence:0.95, 1057 + class_dist:[["habitat-preference:savanna",1.0]], event_date:null, 1058 + activity:"act-iucn-import", 1059 + props:{iucn_habitat_code:"2", suitability:"Suitable", major_importance:"Yes"}}, 1060 + {id:"iucn-hab-002", layer:"iucn-hab", cell:"b70", type:"point", x:35.0, y:-2.0, 1061 + origin:"Measured", via:"iucn:redlist:22/Panthera-leo:habitat:3", license:"CC-BY-NC-4.0", 1062 + accuracy_m:null, confidence:0.70, 1063 + class_dist:[["habitat-preference:shrubland",1.0]], event_date:null, 1064 + activity:"act-iucn-import", 1065 + props:{iucn_habitat_code:"3", suitability:"Suitable", major_importance:"No"}}, 1066 + 1067 + // Simulated (Lotka-Volterra) 1068 + {id:"sim-001", layer:"sim", cell:"b7d", type:"point", x:35.20, y:-2.50, 1069 + origin:"Simulated", model:"fairground:notebook/lotka-volterra-serengeti:v4", 1070 + run_id:"lv-run-42", 1071 + accuracy_m:null, confidence:0.60, 1072 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-15T00:00:00Z", 1073 + activity:"act-sim-lv-001", 1074 + 
props:{scenario:"baseline-2024", time_step:"150", prey_density_km2:"45.2", seed:"42"}}, 1075 + {id:"sim-002", layer:"sim", cell:"b7d", type:"point", x:35.18, y:-2.48, 1076 + origin:"Simulated", model:"fairground:notebook/lotka-volterra-serengeti:v4", 1077 + run_id:"lv-run-42", 1078 + accuracy_m:null, confidence:0.60, 1079 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-15T06:00:00Z", 1080 + activity:"act-sim-lv-001", 1081 + props:{scenario:"baseline-2024", time_step:"151", prey_density_km2:"44.8", seed:"42"}}, 1082 + {id:"sim-003", layer:"sim", cell:"b7e", type:"point", x:35.45, y:-2.55, 1083 + origin:"Simulated", model:"fairground:notebook/lotka-volterra-serengeti:v4", 1084 + run_id:"lv-run-42", 1085 + accuracy_m:null, confidence:0.55, 1086 + class_dist:[["Panthera leo",1.0]], event_date:"2024-06-16T00:00:00Z", 1087 + activity:"act-sim-lv-001", 1088 + props:{scenario:"drought-2024", time_step:"152", prey_density_km2:"28.1", seed:"42"}}, 1089 + 1090 + // Habitat tiles (derived) 1091 + {id:"hab-001", layer:"habitat", cell:"b7a", type:"polygon", 1092 + ring:[[34.80,-2.40],[34.90,-2.40],[34.90,-2.30],[34.80,-2.30],[34.80,-2.40]], 1093 + origin:"Derived", sources:["iucn-hab-001","iucn-hab-002"], 1094 + method_:"habitat-classify:tessera-v3.1:threshold-0.6", 1095 + accuracy_m:null, confidence:0.91, 1096 + class_dist:[["savanna",0.78],["shrubland",0.13],["other",0.09]], event_date:null, 1097 + activity:"act-habitat-2024", 1098 + props:{tessera_tile:"b7a:034.80:-002.40", dominant_landcover:"savanna"}}, 1099 + {id:"hab-002", layer:"habitat", cell:"b7d", type:"polygon", 1100 + ring:[[35.10,-2.60],[35.20,-2.60],[35.20,-2.50],[35.10,-2.50],[35.10,-2.60]], 1101 + origin:"Derived", sources:["iucn-hab-001","iucn-hab-002"], 1102 + method_:"habitat-classify:tessera-v3.1:threshold-0.6", 1103 + accuracy_m:null, confidence:0.68, 1104 + class_dist:[["savanna",0.45],["shrubland",0.30],["cropland",0.25]], event_date:null, 1105 + activity:"act-habitat-2024", 1106 + 
props:{tessera_tile:"b7d:035.10:-002.60", dominant_landcover:"savanna-shrubland-mosaic"}}, 1107 + {id:"hab-003", layer:"habitat", cell:"b7f", type:"polygon", 1108 + ring:[[35.80,-1.20],[35.90,-1.20],[35.90,-1.10],[35.80,-1.10],[35.80,-1.20]], 1109 + origin:"Derived", sources:["iucn-hab-001","iucn-hab-002"], 1110 + method_:"habitat-classify:tessera-v3.1:threshold-0.6", 1111 + accuracy_m:null, confidence:0.12, 1112 + class_dist:[["cropland",0.72],["settlement",0.18],["savanna",0.10]], event_date:null, 1113 + activity:"act-habitat-2024", 1114 + props:{tessera_tile:"b7f:035.80:-001.20", dominant_landcover:"cropland"}}, 1115 + 1116 + // Species range (derived) 1117 + {id:"range-001", layer:"range", cell:"b70", type:"polygon", 1118 + ring:[[34.75,-2.60],[35.50,-2.60],[35.50,-2.00],[35.10,-1.90],[34.75,-2.10],[34.75,-2.60]], 1119 + origin:"Derived", 1120 + sources:["ct-001","ct-002","ct-003","gps-001","gps-002","gps-003","gps-004","gbif-001","gbif-002","inat-001"], 1121 + method_:"alpha-shape:alpha-0.005", 1122 + accuracy_m:null, confidence:null, 1123 + class_dist:[["range:Panthera leo",1.0]], event_date:null, 1124 + activity:"act-range-2024", 1125 + props:{range_km2:"4850", n_occurrences:"10", excludes_synthetic:"true"}}, 1126 + 1127 + // AOH (derived, multi-polygon) 1128 + {id:"aoh-001", layer:"aoh", cell:"b70", type:"multi", 1129 + patches:[ 1130 + [[34.80,-2.40],[35.20,-2.40],[35.20,-2.10],[34.80,-2.10],[34.80,-2.40]], 1131 + [[35.10,-2.60],[35.40,-2.60],[35.40,-2.40],[35.10,-2.40],[35.10,-2.60]] 1132 + ], 1133 + origin:"Derived", 1134 + sources:["range-001","iucn-range-001","hab-001","hab-002"], 1135 + method_:"aoh:iucn-2022:range-intersect-habitat", 1136 + accuracy_m:null, confidence:null, 1137 + class_dist:[["aoh:Panthera leo",1.0]], event_date:null, 1138 + activity:"act-aoh-2024", 1139 + props:{aoh_km2:"3420", range_km2:"4850", habitat_proportion:"0.705", 1140 + unsuitable_excluded_km2:"1430", dominant_exclusion:"cropland", 1141 + iucn_status:"VU", 
iucn_criteria:"A2abcd", population_trend:"decreasing", 1142 + tessera_model:"tessera:v3.1:east-africa", 1143 + synthetic_in_sdm_training:"true", synthetic_fraction_in_training:"0.23"}}, 1144 + 1145 + // Training set (derived) -- not shown on map but in data 1146 + {id:"ts-001", layer:"_none", cell:"b70", type:"polygon", 1147 + ring:[[34,-3],[36,-3],[36,-1],[34,-1],[34,-3]], 1148 + origin:"Derived", 1149 + sources:["ct-001","ct-002","ct-003","gps-001","gps-002","gps-003","gps-004","gbif-001","gbif-002","inat-001","sim-001","sim-002","sim-003"], 1150 + method_:"training-set:balanced-spatial-sample", 1151 + accuracy_m:null, confidence:null, 1152 + class_dist:[["training-set:Panthera-leo:sdm-2024",1.0]], event_date:null, 1153 + activity:"act-training-2024", 1154 + props:{n_measured:"10", n_synthetic:"3", synthetic_fraction:"0.23", 1155 + spatial_extent:"34.0,-3.0,36.0,-1.0", temporal_window:"2021/2024", 1156 + tessera_model:"tessera:v3.1:east-africa"}} 1157 + ]; 1158 + 1159 + // ═══════════════════════════════════════════════════════ 1160 + // MAP RENDERING 1161 + // ═══════════════════════════════════════════════════════ 1162 + 1163 + // Map extent (WGS84) 1164 + var mapExtent = { minLon: 33.8, maxLon: 36.2, minLat: -3.2, maxLat: -0.8 }; 1165 + var svgW = 800, svgH = 600; 1166 + var pad = 30; 1167 + 1168 + function lonToX(lon) { 1169 + return pad + (lon - mapExtent.minLon) / (mapExtent.maxLon - mapExtent.minLon) * (svgW - 2*pad); 1170 + } 1171 + function latToY(lat) { 1172 + return pad + (mapExtent.maxLat - lat) / (mapExtent.maxLat - mapExtent.minLat) * (svgH - 2*pad); 1173 + } 1174 + function ringToPoints(ring) { 1175 + return ring.map(function(c) { return lonToX(c[0]) + "," + latToY(c[1]); }).join(" "); 1176 + } 1177 + 1178 + function buildMap() { 1179 + var svg = document.getElementById("mapSvg"); 1180 + svg.innerHTML = ""; 1181 + 1182 + function el(tag, attrs) { 1183 + var e = document.createElementNS("http://www.w3.org/2000/svg", tag); 1184 + for (var k in attrs) 
e.setAttribute(k, attrs[k]); 1185 + return e; 1186 + } 1187 + 1188 + // Background 1189 + svg.appendChild(el("rect", {x:0, y:0, width:svgW, height:svgH, fill:"#e8edf3"})); 1190 + 1191 + // Water hint 1192 + svg.appendChild(el("ellipse", {cx: lonToX(33.9), cy: latToY(-2.0), rx: 30, ry: 50, fill:"#b3d4f0", opacity:"0.5"})); 1193 + 1194 + // Graticule 1195 + var grat = el("g", {stroke:"#c8d0dc", "stroke-width":"0.5", fill:"none", opacity:"0.6"}); 1196 + for (var lon = 34; lon <= 36; lon += 0.5) { 1197 + grat.appendChild(el("line", {x1:lonToX(lon), y1:latToY(mapExtent.minLat), x2:lonToX(lon), y2:latToY(mapExtent.maxLat)})); 1198 + } 1199 + for (var lat = -3; lat <= -1; lat += 0.5) { 1200 + grat.appendChild(el("line", {x1:lonToX(mapExtent.minLon), y1:latToY(lat), x2:lonToX(mapExtent.maxLon), y2:latToY(lat)})); 1201 + } 1202 + svg.appendChild(grat); 1203 + 1204 + // Axis labels 1205 + var axG = el("g", {"font-size":"9", fill:"#6b7280", "font-family":"var(--font-mono)"}); 1206 + for (var alon = 34; alon <= 36; alon += 0.5) { 1207 + var t = el("text", {x:lonToX(alon), y:svgH-8, "text-anchor":"middle"}); 1208 + t.textContent = alon.toFixed(1) + "\u00B0E"; 1209 + axG.appendChild(t); 1210 + } 1211 + for (var alat = -3; alat <= -1; alat += 0.5) { 1212 + var t2 = el("text", {x:12, y:latToY(alat)+3, "text-anchor":"middle"}); 1213 + t2.textContent = Math.abs(alat).toFixed(1) + "\u00B0S"; 1214 + axG.appendChild(t2); 1215 + } 1216 + svg.appendChild(axG); 1217 + 1218 + // Title on map 1219 + var title = el("text", {x:svgW/2, y:20, "text-anchor":"middle", "font-size":"13", 1220 + "font-weight":"600", fill:"#334155", "font-family":"var(--font-sans)"}); 1221 + title.textContent = "Serengeti Ecosystem \u2014 Panthera leo AOH"; 1222 + svg.appendChild(title); 1223 + 1224 + // Layer groups (render order: back to front) 1225 + var layerGroups = {}; 1226 + var layerOrder = ["iucn","aoh","range","habitat","iucn-hab","sim","gbif","inat","gps","camera"]; 1227 + layerOrder.forEach(function(ly) { 
1228 + layerGroups[ly] = el("g", {"data-layer": ly}); 1229 + }); 1230 + 1231 + labels.forEach(function(lb) { 1232 + if (lb.layer === "_none") return; 1233 + var g = layerGroups[lb.layer]; 1234 + if (!g) return; 1235 + var col = {camera:"#ea580c", gps:"#2563eb", gbif:"#16a34a", inat:"#0d9488", 1236 + iucn:"#dc2626", "iucn-hab":"#dc2626", sim:"#7c3aed", 1237 + habitat:"#65a30d", range:"#3b82f6", aoh:"#16a34a"}[lb.layer] || "#888"; 1238 + 1239 + if (lb.type === "polygon" && lb.ring) { 1240 + var fillOp = "0.15", sw = "1.5", sd = ""; 1241 + if (lb.layer === "iucn") { fillOp = "0.06"; sw = "2"; sd = "6,3"; } 1242 + if (lb.layer === "range") { fillOp = "0.08"; sw = "2"; sd = "4,2"; } 1243 + if (lb.layer === "habitat") { fillOp = "0.3"; } 1244 + 1245 + var poly = el("polygon", { 1246 + points: ringToPoints(lb.ring), 1247 + fill: col, "fill-opacity": fillOp, 1248 + stroke: col, "stroke-width": sw, "stroke-opacity": "0.8", 1249 + "class": "map-poly", "data-id": lb.id 1250 + }); 1251 + if (sd) poly.setAttribute("stroke-dasharray", sd); 1252 + (function(label) { 1253 + poly.addEventListener("click", function() { showDetail(label); }); 1254 + })(lb); 1255 + g.appendChild(poly); 1256 + 1257 + // Habitat tile label 1258 + if (lb.layer === "habitat") { 1259 + var cx = lb.ring.reduce(function(a,c){return a+c[0];},0)/lb.ring.length; 1260 + var cy = lb.ring.reduce(function(a,c){return a+c[1];},0)/lb.ring.length; 1261 + var txt = el("text", {x:lonToX(cx), y:latToY(cy)+3, "text-anchor":"middle", 1262 + "font-size":"8", fill:"#3f6212", "font-weight":"600", 1263 + "pointer-events":"none"}); 1264 + txt.textContent = lb.id; 1265 + g.appendChild(txt); 1266 + } 1267 + } 1268 + else if (lb.type === "multi" && lb.patches) { 1269 + lb.patches.forEach(function(patch, pi) { 1270 + var poly = el("polygon", { 1271 + points: ringToPoints(patch), 1272 + fill: col, "fill-opacity": "0.35", 1273 + stroke: col, "stroke-width": "2.5", "stroke-opacity": "0.9", 1274 + "class": "map-poly", "data-id": lb.id 
1275 + }); 1276 + (function(label) { 1277 + poly.addEventListener("click", function() { showDetail(label); }); 1278 + })(lb); 1279 + g.appendChild(poly); 1280 + 1281 + var cx2 = patch.reduce(function(a,c){return a+c[0];},0)/patch.length; 1282 + var cy2 = patch.reduce(function(a,c){return a+c[1];},0)/patch.length; 1283 + var txt2 = el("text", {x:lonToX(cx2), y:latToY(cy2)+3, "text-anchor":"middle", 1284 + "font-size":"9", fill:"#065f46", "font-weight":"700", 1285 + "pointer-events":"none"}); 1286 + txt2.textContent = "AOH P" + (pi+1); 1287 + g.appendChild(txt2); 1288 + }); 1289 + } 1290 + else if (lb.type === "point") { 1291 + var r = 5, sd2 = ""; 1292 + if (lb.layer === "sim") { sd2 = "2,2"; } 1293 + if (lb.layer === "iucn-hab") { r = 7; } 1294 + 1295 + var circ = el("circle", { 1296 + cx: lonToX(lb.x), cy: latToY(lb.y), r: r, 1297 + fill: col, "fill-opacity": "0.85", 1298 + stroke: "#fff", "stroke-width": "1.5", 1299 + "class": "map-point", "data-id": lb.id 1300 + }); 1301 + if (sd2) { 1302 + circ.setAttribute("stroke", col); 1303 + circ.setAttribute("stroke-dasharray", sd2); 1304 + circ.setAttribute("fill-opacity", "0.5"); 1305 + circ.setAttribute("fill", "#ede9fe"); 1306 + } 1307 + if (lb.layer === "iucn-hab") { 1308 + circ.setAttribute("fill", "#fecdd3"); 1309 + circ.setAttribute("stroke", "#dc2626"); 1310 + circ.setAttribute("stroke-width", "2"); 1311 + } 1312 + (function(label) { 1313 + circ.addEventListener("click", function() { showDetail(label); }); 1314 + })(lb); 1315 + g.appendChild(circ); 1316 + 1317 + var ptLabel = el("text", { 1318 + x: lonToX(lb.x) + (lb.layer === "iucn-hab" ? 
10 : 8), 1319 + y: latToY(lb.y) + 3, 1320 + "font-size": "8", fill: "#475569", 1321 + "font-family": "var(--font-mono)", 1322 + "pointer-events": "none" 1323 + }); 1324 + ptLabel.textContent = lb.id; 1325 + g.appendChild(ptLabel); 1326 + } 1327 + }); 1328 + 1329 + // GPS track line (leo-007) 1330 + var gpsTrack = el("polyline", { 1331 + points: [lonToX(34.81)+","+latToY(-2.34), 1332 + lonToX(34.84)+","+latToY(-2.31), 1333 + lonToX(34.91)+","+latToY(-2.28)].join(" "), 1334 + fill: "none", stroke: "#2563eb", "stroke-width": "1.5", 1335 + "stroke-dasharray": "4,3", opacity: "0.6" 1336 + }); 1337 + layerGroups["gps"].insertBefore(gpsTrack, layerGroups["gps"].firstChild); 1338 + 1339 + // Append layer groups in order 1340 + layerOrder.forEach(function(ly) { svg.appendChild(layerGroups[ly]); }); 1341 + 1342 + // Scale bar 1343 + var scaleG = el("g", {transform: "translate(" + (svgW - 120) + "," + (svgH - 30) + ")"}); 1344 + var degPx = (svgW - 2*pad) / (mapExtent.maxLon - mapExtent.minLon); 1345 + var km50 = (50/111.32) * degPx; 1346 + scaleG.appendChild(el("line", {x1:0, y1:0, x2:km50, y2:0, stroke:"#334155", "stroke-width":"2"})); 1347 + scaleG.appendChild(el("line", {x1:0, y1:-3, x2:0, y2:3, stroke:"#334155", "stroke-width":"2"})); 1348 + scaleG.appendChild(el("line", {x1:km50, y1:-3, x2:km50, y2:3, stroke:"#334155", "stroke-width":"2"})); 1349 + var scaleText = el("text", {x:km50/2, y:-5, "text-anchor":"middle", "font-size":"9", fill:"#334155"}); 1350 + scaleText.textContent = "50 km"; 1351 + scaleG.appendChild(scaleText); 1352 + svg.appendChild(scaleG); 1353 + } 1354 + 1355 + // ═══════════════════════════════════════════════════════ 1356 + // DETAIL PANEL 1357 + // ═══════════════════════════════════════════════════════ 1358 + 1359 + function showDetail(lb) { 1360 + var panel = document.getElementById("detailPanel"); 1361 + panel.className = "detail-panel show"; 1362 + 1363 + var badge = lb.origin === "Simulated" ? 
"badge-simulated" : 1364 + lb.origin === "Derived" ? "badge-derived" : "badge-measured"; 1365 + 1366 + var html = '<div class="dp-title">' + escHtml(lb.id) + ' <span class="badge ' + badge + '">' + escHtml(lb.origin) + '</span></div>'; 1367 + 1368 + function row(k, v) { 1369 + if (v === null || v === undefined || v === "") return ""; 1370 + return '<div class="dp-row"><span class="dp-key">' + escHtml(k) + '</span><span class="dp-val">' + escHtml(String(v)) + '</span></div>'; 1371 + } 1372 + 1373 + html += row("cell", lb.cell); 1374 + html += row("label_name", lb.cell + "-" + lb.id); 1375 + 1376 + if (lb.type === "point") { 1377 + html += row("geometry", "Point(" + lb.x + ", " + lb.y + ")"); 1378 + } else if (lb.type === "polygon" && lb.ring) { 1379 + html += row("geometry", "Polygon [" + lb.ring.length + " vertices]"); 1380 + } else if (lb.type === "multi" && lb.patches) { 1381 + html += row("geometry", "Multi [" + lb.patches.length + " patches]"); 1382 + } 1383 + 1384 + if (lb.observer) html += row("observer", lb.observer); 1385 + if (lb.via) html += row("via", lb.via); 1386 + if (lb.license) html += row("license", lb.license); 1387 + if (lb.model) html += row("model", lb.model); 1388 + if (lb.run_id) html += row("run_id", lb.run_id); 1389 + if (lb.sources) html += row("sources", lb.sources.join(", ")); 1390 + if (lb.method_) html += row("method", lb.method_); 1391 + 1392 + if (lb.class_dist && lb.class_dist.length > 0) { 1393 + var cdStr = lb.class_dist.map(function(cd) { return cd[0] + ": " + cd[1].toFixed(2); }).join("; "); 1394 + html += row("class_dist", cdStr); 1395 + } else { 1396 + html += row("class_dist", "(empty -- unclassified)"); 1397 + } 1398 + 1399 + html += row("confidence", lb.confidence !== null ? lb.confidence : null); 1400 + html += row("accuracy_m", lb.accuracy_m !== null ? 
lb.accuracy_m + " m" : null); 1401 + html += row("event_date", lb.event_date); 1402 + html += row("activity", lb.activity); 1403 + 1404 + if (lb.props && Object.keys(lb.props).length > 0) { 1405 + html += '<div style="margin-top:0.5rem;font-weight:600;font-size:0.78rem;color:#475569;">Properties</div>'; 1406 + for (var pk in lb.props) { 1407 + html += row(pk, lb.props[pk]); 1408 + } 1409 + } 1410 + 1411 + panel.innerHTML = html; 1412 + } 1413 + 1414 + function escHtml(s) { 1415 + var d = document.createElement("div"); 1416 + d.appendChild(document.createTextNode(s)); 1417 + return d.innerHTML; 1418 + } 1419 + 1420 + // ═══════════════════════════════════════════════════════ 1421 + // LAYER TOGGLES 1422 + // ═══════════════════════════════════════════════════════ 1423 + 1424 + function setupToggles() { 1425 + var controls = document.querySelectorAll("#layerControls input[data-layer]"); 1426 + controls.forEach(function(cb) { 1427 + cb.addEventListener("change", function() { 1428 + var layer = this.getAttribute("data-layer"); 1429 + var groups = document.querySelectorAll("#mapSvg g[data-layer='" + layer + "']"); 1430 + groups.forEach(function(g) { 1431 + g.style.display = cb.checked ? 
"" : "none"; 1432 + }); 1433 + }); 1434 + }); 1435 + } 1436 + 1437 + // ═══════════════════════════════════════════════════════ 1438 + // PROVENANCE DAG 1439 + // ═══════════════════════════════════════════════════════ 1440 + 1441 + function buildDAG() { 1442 + var dagSvg = document.getElementById("dagSvg"); 1443 + var W = 940, H = 520; 1444 + dagSvg.setAttribute("viewBox", "0 0 " + W + " " + H); 1445 + dagSvg.setAttribute("width", "100%"); 1446 + dagSvg.setAttribute("height", H); 1447 + dagSvg.innerHTML = ""; 1448 + 1449 + function el(tag, attrs) { 1450 + var e = document.createElementNS("http://www.w3.org/2000/svg", tag); 1451 + for (var k in attrs) e.setAttribute(k, attrs[k]); 1452 + return e; 1453 + } 1454 + 1455 + // Node positions 1456 + var nodes = { 1457 + "aoh-001": {x:370, y:20, w:200, h:32, color:"#16a34a", label:"aoh-001 (AOH)", dash:false}, 1458 + "range-001": {x:130, y:100, w:200, h:28, color:"#3b82f6", label:"range-001 (Species Range)", dash:false}, 1459 + "iucn-range-001":{x:400, y:100, w:200, h:28, color:"#dc2626", label:"iucn-range-001 (IUCN Range)", dash:false}, 1460 + "hab-001": {x:650, y:100, w:180, h:28, color:"#65a30d", label:"hab-001 (Savanna tile)", dash:false}, 1461 + "hab-002": {x:650, y:150, w:200, h:28, color:"#65a30d", label:"hab-002 (Mosaic tile)", dash:false}, 1462 + "ts-001": {x:370, y:280, w:200, h:28, color:"#a16207", label:"ts-001 (Training Set)", dash:false}, 1463 + "iucn-hab-001": {x:740, y:230, w:150, h:26, color:"#dc2626", label:"iucn-hab-001", dash:false}, 1464 + "iucn-hab-002": {x:740, y:270, w:150, h:26, color:"#dc2626", label:"iucn-hab-002", dash:false}, 1465 + "ct-001": {x:10, y:210, w:82, h:24, color:"#ea580c", label:"ct-001", dash:false}, 1466 + "ct-002": {x:100, y:210, w:82, h:24, color:"#ea580c", label:"ct-002", dash:false}, 1467 + "ct-003": {x:190, y:210, w:82, h:24, color:"#ea580c", label:"ct-003", dash:false}, 1468 + "gps-001": {x:10, y:250, w:82, h:24, color:"#2563eb", label:"gps-001", dash:false}, 1469 + 
"gps-002": {x:100, y:250, w:82, h:24, color:"#2563eb", label:"gps-002", dash:false}, 1470 + "gps-003": {x:190, y:250, w:82, h:24, color:"#2563eb", label:"gps-003", dash:false}, 1471 + "gps-004": {x:10, y:290, w:82, h:24, color:"#2563eb", label:"gps-004", dash:false}, 1472 + "gbif-001": {x:100, y:290, w:82, h:24, color:"#16a34a", label:"gbif-001", dash:false}, 1473 + "gbif-002": {x:190, y:290, w:82, h:24, color:"#16a34a", label:"gbif-002", dash:false}, 1474 + "inat-001": {x:100, y:330, w:82, h:24, color:"#0d9488", label:"inat-001", dash:false}, 1475 + "sim-001": {x:330, y:390, w:82, h:24, color:"#7c3aed", label:"sim-001", dash:true}, 1476 + "sim-002": {x:420, y:390, w:82, h:24, color:"#7c3aed", label:"sim-002", dash:true}, 1477 + "sim-003": {x:510, y:390, w:82, h:24, color:"#7c3aed", label:"sim-003", dash:true} 1478 + }; 1479 + 1480 + // Edges 1481 + var edges = [ 1482 + ["range-001", "aoh-001"], 1483 + ["iucn-range-001", "aoh-001"], 1484 + ["hab-001", "aoh-001"], 1485 + ["hab-002", "aoh-001"], 1486 + ["ct-001","range-001"], ["ct-002","range-001"], ["ct-003","range-001"], 1487 + ["gps-001","range-001"], ["gps-002","range-001"], ["gps-003","range-001"], 1488 + ["gps-004","range-001"], 1489 + ["gbif-001","range-001"], ["gbif-002","range-001"], 1490 + ["inat-001","range-001"], 1491 + ["iucn-hab-001","hab-001"], ["iucn-hab-002","hab-001"], 1492 + ["iucn-hab-001","hab-002"], ["iucn-hab-002","hab-002"], 1493 + ["ct-001","ts-001"], ["ct-002","ts-001"], ["ct-003","ts-001"], 1494 + ["gps-001","ts-001"], ["gps-002","ts-001"], ["gps-003","ts-001"], ["gps-004","ts-001"], 1495 + ["gbif-001","ts-001"], ["gbif-002","ts-001"], ["inat-001","ts-001"], 1496 + ["sim-001","ts-001"], ["sim-002","ts-001"], ["sim-003","ts-001"], 1497 + ["iucn-hab-001","ts-001"], ["iucn-hab-002","ts-001"] 1498 + ]; 1499 + 1500 + // Arrowhead defs 1501 + var defs = el("defs", {}); 1502 + var marker = el("marker", {id:"arrowhead", markerWidth:"8", markerHeight:"6", 1503 + refX:"8", refY:"3", orient:"auto"}); 
1504 + marker.appendChild(el("polygon", {points:"0 0, 8 3, 0 6", fill:"#94a3b8"})); 1505 + defs.appendChild(marker); 1506 + var marker2 = el("marker", {id:"arrowhead-light", markerWidth:"6", markerHeight:"5", 1507 + refX:"6", refY:"2.5", orient:"auto"}); 1508 + marker2.appendChild(el("polygon", {points:"0 0, 6 2.5, 0 5", fill:"#cbd5e1"})); 1509 + defs.appendChild(marker2); 1510 + dagSvg.appendChild(defs); 1511 + 1512 + // Draw edges 1513 + var edgeG = el("g", {}); 1514 + edges.forEach(function(e) { 1515 + var from = nodes[e[0]], to = nodes[e[1]]; 1516 + if (!from || !to) return; 1517 + var x1 = from.x + from.w/2, y1 = from.y; 1518 + var x2 = to.x + to.w/2, y2 = to.y + to.h; 1519 + var isTS = e[1] === "ts-001"; 1520 + var line = el("line", { 1521 + x1:x1, y1:y1, x2:x2, y2:y2, 1522 + stroke: isTS ? "#e2e8f0" : "#94a3b8", 1523 + "stroke-width": isTS ? "1" : "1.5", 1524 + "marker-end": isTS ? "url(#arrowhead-light)" : "url(#arrowhead)" 1525 + }); 1526 + if (isTS) line.setAttribute("stroke-dasharray", "3,3"); 1527 + edgeG.appendChild(line); 1528 + }); 1529 + dagSvg.appendChild(edgeG); 1530 + 1531 + // Draw nodes 1532 + var nodeG = el("g", {}); 1533 + for (var nid in nodes) { 1534 + var n = nodes[nid]; 1535 + var rect = el("rect", { 1536 + x: n.x, y: n.y, width: n.w, height: n.h, 1537 + rx: "4", ry: "4", 1538 + fill: "#fff", stroke: n.color, "stroke-width": n.dash ? "2" : "1.5" 1539 + }); 1540 + if (n.dash) rect.setAttribute("stroke-dasharray", "4,3"); 1541 + nodeG.appendChild(rect); 1542 + 1543 + nodeG.appendChild(el("rect", { 1544 + x: n.x, y: n.y, width: "4", height: n.h, 1545 + rx: "2", fill: n.color 1546 + })); 1547 + 1548 + var txt = el("text", { 1549 + x: n.x + n.w/2, y: n.y + n.h/2 + 4, 1550 + "text-anchor": "middle", "font-size": n.h > 26 ? 
"10" : "9", 1551 + fill: "#334155", "font-family": "var(--font-mono)" 1552 + }); 1553 + txt.textContent = n.label; 1554 + nodeG.appendChild(txt); 1555 + } 1556 + dagSvg.appendChild(nodeG); 1557 + 1558 + // Legend 1559 + var legG = el("g", {transform: "translate(20," + (H - 100) + ")"}); 1560 + legG.appendChild(el("rect", {x:0, y:0, width:200, height:90, rx:6, fill:"#fff", stroke:"#e2e8f0"})); 1561 + var legTitle = el("text", {x:10, y:16, "font-size":"10", "font-weight":"600", fill:"#334155"}); 1562 + legTitle.textContent = "Legend"; 1563 + legG.appendChild(legTitle); 1564 + 1565 + function legItem(y, col, text, dash) { 1566 + var r = el("rect", {x:10, y:y, width:20, height:12, rx:2, fill:"#fff", stroke:col, "stroke-width":"1.5"}); 1567 + if (dash) r.setAttribute("stroke-dasharray", "3,2"); 1568 + legG.appendChild(r); 1569 + var t = el("text", {x:36, y:y+10, "font-size":"9", fill:"#475569"}); 1570 + t.textContent = text; 1571 + legG.appendChild(t); 1572 + } 1573 + legItem(26, "#ea580c", "Camera trap (Measured)", false); 1574 + legItem(42, "#2563eb", "GPS collar (Measured)", false); 1575 + legItem(58, "#16a34a", "Registry import (Measured)", false); 1576 + legItem(74, "#7c3aed", "Simulated (LV)", true); 1577 + dagSvg.appendChild(legG); 1578 + 1579 + // Method annotations 1580 + var methG = el("g", {"font-size":"8", fill:"#6b7280", "font-style":"italic"}); 1581 + function methodLabel(x, y, text) { 1582 + var t = el("text", {x:x, y:y, "text-anchor":"middle"}); 1583 + t.textContent = text; 1584 + methG.appendChild(t); 1585 + } 1586 + methodLabel(470, 68, "aoh:iucn-2022:range-intersect-habitat"); 1587 + methodLabel(180, 190, "alpha-shape:alpha-0.005"); 1588 + methodLabel(700, 142, "habitat-classify:tessera-v3.1"); 1589 + methodLabel(470, 268, "training-set:balanced-spatial-sample"); 1590 + dagSvg.appendChild(methG); 1591 + } 1592 + 1593 + // ═══════════════════════════════════════════════════════ 1594 + // TRAINING CYCLE DIAGRAM 1595 + // 
═══════════════════════════════════════════════════════ 1596 + 1597 + function buildCycleDiagram() { 1598 + var svg = document.getElementById("cycleSvg"); 1599 + var W = 880, H = 310; 1600 + svg.setAttribute("viewBox", "0 0 " + W + " " + H); 1601 + svg.setAttribute("width", "100%"); 1602 + svg.setAttribute("height", H); 1603 + svg.innerHTML = ""; 1604 + 1605 + function el(tag, attrs) { 1606 + var e = document.createElementNS("http://www.w3.org/2000/svg", tag); 1607 + for (var k in attrs) e.setAttribute(k, attrs[k]); 1608 + return e; 1609 + } 1610 + 1611 + // Arrow defs 1612 + var defs = el("defs", {}); 1613 + var m = el("marker", {id:"cycle-arrow", markerWidth:"10", markerHeight:"7", 1614 + refX:"10", refY:"3.5", orient:"auto"}); 1615 + m.appendChild(el("polygon", {points:"0 0, 10 3.5, 0 7", fill:"#64748b"})); 1616 + defs.appendChild(m); 1617 + var m2 = el("marker", {id:"cycle-arrow-red", markerWidth:"10", markerHeight:"7", 1618 + refX:"10", refY:"3.5", orient:"auto"}); 1619 + m2.appendChild(el("polygon", {points:"0 0, 10 3.5, 0 7", fill:"#dc2626"})); 1620 + defs.appendChild(m2); 1621 + svg.appendChild(defs); 1622 + 1623 + // Phases 1624 + var phases = [ 1625 + {x:20, y:80, w:120, h:80, color:"#ea580c", title:"Observations", sub:"Accumulate", 1626 + detail:["Camera traps, GPS,","GBIF, iNat, IUCN"]}, 1627 + {x:170, y:80, w:120, h:80, color:"#7c3aed", title:"Synthetic", sub:"Augment", 1628 + detail:["Lotka-Volterra","simulation"]}, 1629 + {x:320, y:80, w:120, h:80, color:"#a16207", title:"Training Set", sub:"Assemble", 1630 + detail:["Balanced sample","23% synthetic"]}, 1631 + {x:470, y:80, w:120, h:80, color:"#0891b2", title:"Train Model", sub:"TESSERA v3.1", 1632 + detail:["Habitat classifier","from embeddings"]}, 1633 + {x:620, y:80, w:120, h:80, color:"#65a30d", title:"Classify", sub:"Habitat tiles", 1634 + detail:["Savanna, shrubland,","cropland tiles"]}, 1635 + {x:770, y:80, w:90, h:80, color:"#16a34a", title:"AOH", sub:"Compute", 1636 + detail:["Range \u2229 
suitable","habitat"]} 1637 + ]; 1638 + 1639 + phases.forEach(function(p, i) { 1640 + svg.appendChild(el("rect", { 1641 + x:p.x+2, y:p.y+2, width:p.w, height:p.h, rx:8, fill:"#e2e8f0" 1642 + })); 1643 + svg.appendChild(el("rect", { 1644 + x:p.x, y:p.y, width:p.w, height:p.h, rx:8, 1645 + fill:"#fff", stroke:p.color, "stroke-width":"2" 1646 + })); 1647 + svg.appendChild(el("rect", { 1648 + x:p.x, y:p.y, width:p.w, height:6, rx:3, fill:p.color 1649 + })); 1650 + 1651 + var t = el("text", {x:p.x+p.w/2, y:p.y+28, "text-anchor":"middle", 1652 + "font-size":"12", "font-weight":"700", fill:p.color}); 1653 + t.textContent = p.title; 1654 + svg.appendChild(t); 1655 + 1656 + var s = el("text", {x:p.x+p.w/2, y:p.y+42, "text-anchor":"middle", 1657 + "font-size":"9", fill:"#64748b"}); 1658 + s.textContent = p.sub; 1659 + svg.appendChild(s); 1660 + 1661 + p.detail.forEach(function(line, li) { 1662 + var d = el("text", {x:p.x+p.w/2, y:p.y+56+li*12, "text-anchor":"middle", 1663 + "font-size":"8", fill:"#94a3b8"}); 1664 + d.textContent = line; 1665 + svg.appendChild(d); 1666 + }); 1667 + 1668 + // Phase number 1669 + var numC = el("circle", {cx:p.x+12, cy:p.y-10, r:10, fill:p.color}); 1670 + svg.appendChild(numC); 1671 + var numT = el("text", {x:p.x+12, y:p.y-6, "text-anchor":"middle", 1672 + "font-size":"10", "font-weight":"700", fill:"#fff"}); 1673 + numT.textContent = String(i+1); 1674 + svg.appendChild(numT); 1675 + }); 1676 + 1677 + // Forward arrows 1678 + [[0,2],[1,2],[2,3],[3,4],[4,5]].forEach(function(pair) { 1679 + var from = phases[pair[0]], to = phases[pair[1]]; 1680 + svg.appendChild(el("line", { 1681 + x1: from.x + from.w + 4, y1: from.y + from.h/2, 1682 + x2: to.x - 4, y2: to.y + to.h/2, 1683 + stroke:"#64748b", "stroke-width":"2", 1684 + "marker-end":"url(#cycle-arrow)" 1685 + })); 1686 + }); 1687 + 1688 + // Species range bypass arc 1689 + var obs = phases[0], aoh = phases[5]; 1690 + svg.appendChild(el("path", { 1691 + d: "M " + (obs.x + obs.w/2) + " " + (obs.y + 
obs.h) + 1692 + " Q " + (obs.x + obs.w/2) + " " + (obs.y + obs.h + 65) + 1693 + " " + 620 + " " + (obs.y + obs.h + 65) + 1694 + " Q " + (aoh.x + aoh.w/2) + " " + (obs.y + obs.h + 65) + 1695 + " " + (aoh.x + aoh.w/2) + " " + (aoh.y + aoh.h), 1696 + fill:"none", stroke:"#3b82f6", "stroke-width":"1.5", "stroke-dasharray":"5,3", 1697 + "marker-end":"url(#cycle-arrow)" 1698 + })); 1699 + var rangeLabel = el("text", {x:400, y:obs.y + obs.h + 60, "text-anchor":"middle", 1700 + "font-size":"9", fill:"#3b82f6", "font-weight":"600"}); 1701 + rangeLabel.textContent = "Species Range (measured-only, alpha-shape)"; 1702 + svg.appendChild(rangeLabel); 1703 + 1704 + // Recomputation feedback 1705 + svg.appendChild(el("path", { 1706 + d: "M " + (aoh.x + aoh.w) + " " + (aoh.y + 20) + 1707 + " C " + (aoh.x + aoh.w + 30) + " " + (aoh.y + 20) + 1708 + " " + (aoh.x + aoh.w + 30) + " 22 " + 1709 + " " + (phases[2].x + phases[2].w/2) + " 22" + 1710 + " L " + (phases[2].x + phases[2].w/2) + " " + (phases[2].y - 2), 1711 + fill:"none", stroke:"#dc2626", "stroke-width":"1.5", "stroke-dasharray":"4,3", 1712 + "marker-end":"url(#cycle-arrow-red)" 1713 + })); 1714 + var recompLabel = el("text", {x:590, y:16, "text-anchor":"middle", 1715 + "font-size":"9", fill:"#dc2626", "font-weight":"600"}); 1716 + recompLabel.textContent = "Recompute when new observations arrive or model retrains"; 1717 + svg.appendChild(recompLabel); 1718 + 1719 + // Origin type labels 1720 + [ 1721 + {x:80, y:175, text:"Measured", color:"#ea580c"}, 1722 + {x:230, y:175, text:"Simulated", color:"#7c3aed"}, 1723 + {x:380, y:175, text:"Derived", color:"#a16207"} 1724 + ].forEach(function(ol) { 1725 + var t = el("text", {x:ol.x, y:ol.y, "text-anchor":"middle", 1726 + "font-size":"8", fill:ol.color, "font-weight":"600"}); 1727 + t.textContent = ol.text; 1728 + svg.appendChild(t); 1729 + }); 1730 + 1731 + // Bottom note 1732 + var note = el("text", {x:W/2, y:H - 15, "text-anchor":"middle", 1733 + "font-size":"10", 
fill:"#64748b", "font-style":"italic"}); 1734 + note.textContent = "Each phase produces labels that become inputs to the next. Old versions are retained for comparison."; 1735 + svg.appendChild(note); 1736 + } 1737 + 1738 + // ═══════════════════════════════════════════════════════ 1739 + // INITIALISE 1740 + // ═══════════════════════════════════════════════════════ 1741 + 1742 + document.addEventListener("DOMContentLoaded", function() { 1743 + buildMap(); 1744 + setupToggles(); 1745 + buildDAG(); 1746 + buildCycleDiagram(); 1747 + }); 1748 + 1749 + })(); 1750 + </script> 1751 + 1752 + </body> 1753 + </html>
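A note on the label placement in the map script above: the habitat tile and AOH patch labels are positioned at the plain average of the ring's vertices. If a ring repeats its first vertex to close itself, as the polygon literals in `example/aoh_example.ml` do, that vertex is counted twice and the label is pulled toward it. The sketch below is a hypothetical helper (`ringCentre` is not part of the page script) that drops a duplicated closing vertex before averaging. It assumes rings are arrays of `[lon, lat]` pairs, as the `reduce` calls in the script suggest.

```javascript
// Hypothetical helper, not part of the original page script.
// Returns the vertex-average centre of a ring given as [lon, lat] pairs,
// ignoring a closing vertex that merely repeats the first one.
function ringCentre(ring) {
  var n = ring.length;
  if (n === 0) return null;
  var first = ring[0], last = ring[n - 1];
  // A closed ring lists its first vertex again at the end; skip that copy.
  if (n > 1 && first[0] === last[0] && first[1] === last[1]) n -= 1;
  var sx = 0, sy = 0;
  for (var i = 0; i < n; i++) {
    sx += ring[i][0];
    sy += ring[i][1];
  }
  return [sx / n, sy / n];
}

// The simplified IUCN range rectangle from the OCaml example, closed form.
var closedRing = [[34.0, -3.0], [36.0, -3.0], [36.0, -1.0], [34.0, -1.0], [34.0, -3.0]];
var centre = ringCentre(closedRing);
```

For this rectangle, averaging all five listed vertices would give a point shifted toward the repeated corner, while skipping the duplicate gives the rectangle's middle. This is still a vertex average rather than an area-weighted centroid, which is usually adequate for placing a small text label.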
+15
dune-project
··· 1 + (lang dune 3.16) 2 + (name terradots) 3 + (generate_opam_files true) 4 + 5 + (package 6 + (name terradots) 7 + (synopsis "Geospatial label store for planetary observation data") 8 + (description 9 + "A data model for geospatial labels — human observations, registry 10 + imports, simulation outputs, and derived annotations used to train 11 + geospatial foundation models. Supports full provenance tracking, 12 + Hilbert curve spatial indexing, and Darwin Core temporal conventions.") 13 + (license ISC) 14 + (depends 15 + (ocaml (>= 5.2))))
+846
example/aoh_example.ml
··· 1 + (** AOH worked example: {i Panthera leo} in the Serengeti ecosystem. 2 + 3 + Demonstrates the full label pipeline from raw observations 4 + through synthetic simulation to Area of Habitat calculation, 5 + integrating data from: 6 + 7 + - Camera traps (Serengeti Lion Project grid) 8 + - GPS collars (Movebank study 1234) 9 + - GBIF occurrence records 10 + - iNaturalist citizen science observations 11 + - IUCN Red List expert range and habitat preferences 12 + - Lotka-Volterra population simulation (synthetic) 13 + - TESSERA v3.1 habitat classification 14 + 15 + The provenance graph: 16 + {v 17 + AOH polygon 18 + ├── species_range (alpha-shape, measured-only) 19 + │ ├── camera trap detections 20 + │ ├── GPS collar fixes (Movebank) 21 + │ ├── GBIF occurrences 22 + │ └── iNaturalist observations 23 + ├── IUCN expert range (validation) 24 + └── habitat suitability tiles (TESSERA) 25 + └── training set 26 + ├── all measured occurrences 27 + ├── IUCN habitat preferences 28 + └── synthetic augmentation (Lotka-Volterra) 29 + v} *) 30 + 31 + open Terradots 32 + 33 + let ed = event_date_of_string 34 + let c = cell_of_string 35 + 36 + (* ══════════════════════════════════════════════════════════ 37 + 1. Activities — the audit trail 38 + 39 + Each activity links a batch of labels to who/what produced 40 + them and when. The [agent] field points to Fairground 41 + notebook URIs where applicable. 
42 + ══════════════════════════════════════════════════════════ *) 43 + 44 + let act_field_survey = 45 + { activity_id = "act-field-2024"; 46 + agent = "orcid:0000-0002-1234-5678"; 47 + date = "2024-06-15T08:00:00Z"; 48 + description = Some "Serengeti Lion Project 2024 dry-season \ 49 + camera trap survey" } 50 + 51 + let act_movebank_import = 52 + { activity_id = "act-movebank-import"; 53 + agent = "fairground:notebook/movebank-ingest:v2"; 54 + date = "2024-07-01T12:00:00Z"; 55 + description = Some "Bulk import of GPS collar data from \ 56 + Movebank study 1234, individuals leo-007 \ 57 + and leo-012" } 58 + 59 + let act_gbif_import = 60 + { activity_id = "act-gbif-import"; 61 + agent = "fairground:notebook/gbif-ingest:v3"; 62 + date = "2024-07-02T10:00:00Z"; 63 + description = Some "GBIF Panthera leo occurrences, East Africa, \ 64 + 2020-2024" } 65 + 66 + let act_inat_import = 67 + { activity_id = "act-inat-import"; 68 + agent = "fairground:notebook/inat-ingest:v1"; 69 + date = "2024-07-02T14:00:00Z"; 70 + description = Some "iNaturalist research-grade P. leo observations, \ 71 + Serengeti-Mara ecosystem" } 72 + 73 + let act_iucn_import = 74 + { activity_id = "act-iucn-import"; 75 + agent = "fairground:notebook/iucn-ingest:v1"; 76 + date = "2024-07-03T09:00:00Z"; 77 + description = Some "IUCN Red List Panthera leo assessment: expert \ 78 + range polygon and habitat preference codes" } 79 + 80 + let act_simulation = 81 + { activity_id = "act-sim-lv-001"; 82 + agent = "fairground:notebook/lotka-volterra-serengeti:v4@cell-7"; 83 + date = "2024-07-10T16:00:00Z"; 84 + description = Some "Lotka-Volterra predator-prey simulation, \ 85 + lion-zebra-wildebeest, Serengeti parameterisation, \ 86 + 100-year projection, seed=42" } 87 + 88 + let act_training_set = 89 + { activity_id = "act-training-2024"; 90 + agent = "fairground:notebook/sdm-training:v2"; 91 + date = "2024-07-15T10:00:00Z"; 92 + description = Some "Assemble training set for P. 
leo SDM: \ 93 + balanced spatial sample with synthetic \ 94 + augmentation from Lotka-Volterra run" } 95 + 96 + let act_habitat = 97 + { activity_id = "act-habitat-2024"; 98 + agent = "fairground:notebook/habitat-classify:v3"; 99 + date = "2024-07-16T09:00:00Z"; 100 + description = Some "Habitat suitability classification from \ 101 + TESSERA v3.1 land-cover embeddings, \ 102 + thresholded against IUCN habitat codes" } 103 + 104 + let act_range = 105 + { activity_id = "act-range-2024"; 106 + agent = "fairground:notebook/species-range:v2"; 107 + date = "2024-07-16T11:00:00Z"; 108 + description = Some "Alpha-shape species range from all verified \ 109 + occurrences (measured-only, no synthetic)" } 110 + 111 + let act_aoh = 112 + { activity_id = "act-aoh-2024"; 113 + agent = "fairground:notebook/aoh-iucn:v3"; 114 + date = "2024-07-16T14:00:00Z"; 115 + description = Some "IUCN Area of Habitat: species range intersected \ 116 + with suitable habitat tiles" } 117 + 118 + (* ══════════════════════════════════════════════════════════ 119 + 2. Camera trap observations — Serengeti Lion Project 120 + 121 + Fixed sensors in the Serengeti NP grid. Each trigger 122 + produces a Point at the trap's surveyed coordinates. 123 + Hilbert cells b7a–b7f cover the Serengeti at level 12. 
124 + ══════════════════════════════════════════════════════════ *) 125 + 126 + let trap_01 = 127 + make_point 128 + ~cell:(c "b7a") ~id:"ct-001" 129 + ~x:34.82 ~y:(-2.33) 130 + ~observer:"urn:sensor:camera-trap:serengeti-node-17" 131 + ~class_dist:[("Panthera leo", 1.0)] ~accuracy_m:5.0 ~confidence:0.97 132 + ~event_date:(ed "2024-06-12T05:42:00Z") 133 + ~activity:"act-field-2024" 134 + ~properties:[ 135 + ("image_uri", "s3://slp/ct17/IMG_4821.jpg"); 136 + ("individual_count", "3"); 137 + ("behaviour", "resting")] 138 + () 139 + 140 + let trap_02 = 141 + make_point 142 + ~cell:(c "b7a") ~id:"ct-002" 143 + ~x:34.83 ~y:(-2.32) 144 + ~observer:"urn:sensor:camera-trap:serengeti-node-17" 145 + ~class_dist:[("Panthera leo", 1.0)] ~accuracy_m:5.0 ~confidence:0.92 146 + ~event_date:(ed "2024-06-14T19:15:00Z") 147 + ~activity:"act-field-2024" 148 + ~properties:[ 149 + ("image_uri", "s3://slp/ct17/IMG_4903.jpg"); 150 + ("individual_count", "1"); 151 + ("behaviour", "walking")] 152 + () 153 + 154 + let trap_03 = 155 + make_point 156 + ~cell:(c "b7c") ~id:"ct-003" 157 + ~x:35.01 ~y:(-2.15) 158 + ~observer:"urn:sensor:camera-trap:serengeti-node-42" 159 + ~class_dist:[("Panthera leo", 1.0)] ~accuracy_m:5.0 ~confidence:0.88 160 + ~event_date:(ed "2024-06-18T03:22:00Z") 161 + ~activity:"act-field-2024" 162 + ~properties:[ 163 + ("image_uri", "s3://slp/ct42/IMG_1207.jpg"); 164 + ("individual_count", "2")] 165 + () 166 + 167 + (** Non-detection: trap triggered by motion but no lion present. 168 + These matter for occupancy models — absence data is data. 
*) 169 + let trap_04 = 170 + make_point 171 + ~cell:(c "b7d") ~id:"ct-004" 172 + ~x:35.22 ~y:(-2.45) 173 + ~observer:"urn:sensor:camera-trap:serengeti-node-55" 174 + ~event_date:(ed "2024-06-20T22:10:00Z") 175 + ~activity:"act-field-2024" 176 + ~properties:[ 177 + ("image_uri", "s3://slp/ct55/IMG_0891.jpg"); 178 + ("trigger", "motion"); 179 + ("species_detected", "none")] 180 + () 181 + 182 + (* ══════════════════════════════════════════════════════════ 183 + 3. GPS collar tracks — Movebank study 1234 184 + 185 + Imported via the Movebank registry. Each fix is a Point 186 + with the collar as observer and the Movebank event URI as 187 + the registry record ([via]). 188 + ══════════════════════════════════════════════════════════ *) 189 + 190 + (** Individual leo-007: three fixes showing movement NE. *) 191 + let gps_01 = 192 + make_imported 193 + ~cell:(c "b7a") ~id:"gps-001" 194 + ~geometry:(Point { x = 34.81; y = -2.34 }) 195 + ~via:"movebank:study/1234/individual/leo-007/event/98001" 196 + ~observer:"urn:sensor:gps:vectronic-vertex-plus-007" 197 + ~license:"CC-BY-NC-4.0" 198 + ~accuracy_m:3.5 199 + ~class_dist:[("Panthera leo", 1.0)] 200 + ~event_date:(ed "2024-06-10T06:00:00Z") 201 + ~activity:"act-movebank-import" 202 + ~properties:[ 203 + ("individual_id", "leo-007"); 204 + ("fix_type", "3D"); ("hdop", "0.9")] 205 + () 206 + 207 + let gps_02 = 208 + make_imported 209 + ~cell:(c "b7a") ~id:"gps-002" 210 + ~geometry:(Point { x = 34.84; y = -2.31 }) 211 + ~via:"movebank:study/1234/individual/leo-007/event/98002" 212 + ~observer:"urn:sensor:gps:vectronic-vertex-plus-007" 213 + ~license:"CC-BY-NC-4.0" 214 + ~accuracy_m:4.2 215 + ~class_dist:[("Panthera leo", 1.0)] 216 + ~event_date:(ed "2024-06-10T12:00:00Z") 217 + ~activity:"act-movebank-import" 218 + ~properties:[ 219 + ("individual_id", "leo-007"); 220 + ("fix_type", "3D"); ("hdop", "1.1")] 221 + () 222 + 223 + let gps_03 = 224 + make_imported 225 + ~cell:(c "b7b") ~id:"gps-003" 226 + ~geometry:(Point { x = 
34.91; y = -2.28 }) 227 + ~via:"movebank:study/1234/individual/leo-007/event/98003" 228 + ~observer:"urn:sensor:gps:vectronic-vertex-plus-007" 229 + ~license:"CC-BY-NC-4.0" 230 + ~accuracy_m:5.1 231 + ~class_dist:[("Panthera leo", 1.0)] 232 + ~event_date:(ed "2024-06-11T06:00:00Z") 233 + ~activity:"act-movebank-import" 234 + ~properties:[ 235 + ("individual_id", "leo-007"); 236 + ("fix_type", "3D"); ("hdop", "1.4")] 237 + () 238 + 239 + (** Individual leo-012: separate pride, further east. *) 240 + let gps_04 = 241 + make_imported 242 + ~cell:(c "b7c") ~id:"gps-004" 243 + ~geometry:(Point { x = 35.05; y = -2.10 }) 244 + ~via:"movebank:study/1234/individual/leo-012/event/98501" 245 + ~observer:"urn:sensor:gps:vectronic-vertex-plus-012" 246 + ~license:"CC-BY-NC-4.0" 247 + ~accuracy_m:3.0 248 + ~class_dist:[("Panthera leo", 1.0)] 249 + ~event_date:(ed "2024-06-12T06:00:00Z") 250 + ~activity:"act-movebank-import" 251 + ~properties:[("individual_id", "leo-012")] 252 + () 253 + 254 + (* ══════════════════════════════════════════════════════════ 255 + 4. GBIF occurrence records 256 + 257 + Museum specimens and field surveys aggregated through GBIF. 258 + Note the varying accuracy — the 2021 record has 500 m 259 + uncertainty (flagged for review in annotations). 
260 + ══════════════════════════════════════════════════════════ *) 261 + 262 + let gbif_01 = 263 + make_imported 264 + ~cell:(c "b7a") ~id:"gbif-001" 265 + ~geometry:(Point { x = 34.85; y = -2.35 }) 266 + ~via:"gbif:4023589127" 267 + ~license:"CC-BY-4.0" 268 + ~accuracy_m:100.0 269 + ~class_dist:[("Panthera leo", 1.0)] 270 + ~event_date:(ed "2022-08-14") 271 + ~activity:"act-gbif-import" 272 + ~properties:[ 273 + ("gbif_dataset", "serengeti-biodiversity-survey"); 274 + ("basis_of_record", "HUMAN_OBSERVATION"); 275 + ("recorded_by", "Tanzania Wildlife Research Institute")] 276 + () 277 + 278 + let gbif_02 = 279 + make_imported 280 + ~cell:(c "b7e") ~id:"gbif-002" 281 + ~geometry:(Point { x = 35.40; y = -2.50 }) 282 + ~via:"gbif:4023589999" 283 + ~license:"CC-BY-4.0" 284 + ~accuracy_m:500.0 285 + ~class_dist:[("Panthera leo", 1.0)] 286 + ~event_date:(ed "2021") 287 + ~activity:"act-gbif-import" 288 + ~properties:[ 289 + ("gbif_dataset", "ngorongoro-mammal-survey"); 290 + ("basis_of_record", "HUMAN_OBSERVATION")] 291 + () 292 + 293 + (* ══════════════════════════════════════════════════════════ 294 + 5. iNaturalist citizen science 295 + 296 + Research-grade observations from the iNaturalist platform. 297 + The observer is a user URI; the record is the observation URI. 298 + ══════════════════════════════════════════════════════════ *) 299 + 300 + let inat_01 = 301 + make_imported 302 + ~cell:(c "b7b") ~id:"inat-001" 303 + ~geometry:(Point { x = 34.95; y = -2.20 }) 304 + ~via:"inaturalist:observation/182345678" 305 + ~observer:"inaturalist:user/safari_dave" 306 + ~license:"CC-BY-NC-4.0" 307 + ~accuracy_m:50.0 308 + ~class_dist:[("Panthera leo", 1.0)] 309 + ~confidence:0.95 310 + ~event_date:(ed "2023-07-22T16:30:00Z") 311 + ~activity:"act-inat-import" 312 + ~properties:[ 313 + ("quality_grade", "research"); 314 + ("num_identifications", "5")] 315 + () 316 + 317 + (* ══════════════════════════════════════════════════════════ 318 + 6. 
IUCN Red List — expert range and habitat preferences 319 + 320 + The IUCN assessment provides two things: 321 + (a) An expert-drawn range polygon for the species. 322 + (b) Habitat preference codes (IUCN Habitats Classification 323 + Scheme) with suitability ratings. 324 + 325 + The range polygon validates the data-driven range; the 326 + habitat codes drive the suitability classification. 327 + ══════════════════════════════════════════════════════════ *) 328 + 329 + (** Expert-drawn range polygon (simplified to bounding extent). *) 330 + let iucn_range = 331 + make_imported 332 + ~cell:(c "b70") ~id:"iucn-range-001" 333 + ~geometry:(Polygon [ 334 + { x = 34.0; y = -3.0 }; 335 + { x = 36.0; y = -3.0 }; 336 + { x = 36.0; y = -1.0 }; 337 + { x = 34.0; y = -1.0 }; 338 + { x = 34.0; y = -3.0 }; 339 + ]) 340 + ~via:"iucn:redlist:22/Panthera-leo:range:2024.1" 341 + ~license:"CC-BY-NC-4.0" 342 + ~class_dist:[("Panthera leo", 1.0)] 343 + ~event_date:(ed "2024") 344 + ~activity:"act-iucn-import" 345 + ~properties:[ 346 + ("iucn_status", "VU"); 347 + ("iucn_criteria", "A2abcd"); 348 + ("population_trend", "decreasing"); 349 + ("range_type", "extant:resident"); 350 + ("habitat_codes", "1.5;1.6;2;3;14.1")] 351 + () 352 + 353 + (** Habitat preference: savanna (IUCN code 2) — major habitat. *) 354 + let iucn_hab_savanna = 355 + make_imported 356 + ~cell:(c "b70") ~id:"iucn-hab-001" 357 + ~geometry:(Point { x = 35.0; y = -2.0 }) 358 + ~via:"iucn:redlist:22/Panthera-leo:habitat:2" 359 + ~license:"CC-BY-NC-4.0" 360 + ~class_dist:[("habitat-preference:savanna", 1.0)] 361 + ~confidence:0.95 362 + ~activity:"act-iucn-import" 363 + ~properties:[ 364 + ("iucn_habitat_code", "2"); 365 + ("suitability", "Suitable"); 366 + ("major_importance", "Yes")] 367 + () 368 + 369 + (** Habitat preference: shrubland (IUCN code 3) — minor habitat. 
*) 370 + let iucn_hab_shrubland = 371 + make_imported 372 + ~cell:(c "b70") ~id:"iucn-hab-002" 373 + ~geometry:(Point { x = 35.0; y = -2.0 }) 374 + ~via:"iucn:redlist:22/Panthera-leo:habitat:3" 375 + ~license:"CC-BY-NC-4.0" 376 + ~class_dist:[("habitat-preference:shrubland", 1.0)] 377 + ~confidence:0.70 378 + ~activity:"act-iucn-import" 379 + ~properties:[ 380 + ("iucn_habitat_code", "3"); 381 + ("suitability", "Suitable"); 382 + ("major_importance", "No")] 383 + () 384 + 385 + (* ══════════════════════════════════════════════════════════ 386 + 7. Synthetic simulation — Lotka-Volterra population dynamics 387 + 388 + Agent-based Lotka-Volterra model producing simulated lion 389 + positions in under-sampled areas (Ngorongoro corridor). 390 + These augment the SDM training set but are NEVER included 391 + in the measured species range. 392 + 393 + The [Simulated] origin keeps them type-level distinct from 394 + real observations. Properties carry the scenario parameters 395 + for reproducibility. 
396 + ══════════════════════════════════════════════════════════ *) 397 + 398 + let sim_01 = 399 + make_simulated 400 + ~cell:(c "b7d") ~id:"sim-001" 401 + ~geometry:(Point { x = 35.20; y = -2.50 }) 402 + ~model:"fairground:notebook/lotka-volterra-serengeti:v4" 403 + ~run_id:"lv-run-42" 404 + ~class_dist:[("Panthera leo", 1.0)] 405 + ~event_date:(ed "2024-06-15T00:00:00Z") 406 + ~confidence:0.60 407 + ~activity:"act-sim-lv-001" 408 + ~properties:[ 409 + ("scenario", "baseline-2024"); 410 + ("time_step", "150"); 411 + ("prey_density_km2", "45.2"); 412 + ("seed", "42")] 413 + () 414 + 415 + let sim_02 = 416 + make_simulated 417 + ~cell:(c "b7d") ~id:"sim-002" 418 + ~geometry:(Point { x = 35.18; y = -2.48 }) 419 + ~model:"fairground:notebook/lotka-volterra-serengeti:v4" 420 + ~run_id:"lv-run-42" 421 + ~class_dist:[("Panthera leo", 1.0)] 422 + ~event_date:(ed "2024-06-15T06:00:00Z") 423 + ~confidence:0.60 424 + ~activity:"act-sim-lv-001" 425 + ~properties:[ 426 + ("scenario", "baseline-2024"); 427 + ("time_step", "151"); 428 + ("prey_density_km2", "44.8"); 429 + ("seed", "42")] 430 + () 431 + 432 + (** Drought scenario — prey density drops, lion shifts south. *) 433 + let sim_03 = 434 + make_simulated 435 + ~cell:(c "b7e") ~id:"sim-003" 436 + ~geometry:(Point { x = 35.45; y = -2.55 }) 437 + ~model:"fairground:notebook/lotka-volterra-serengeti:v4" 438 + ~run_id:"lv-run-42" 439 + ~class_dist:[("Panthera leo", 1.0)] 440 + ~event_date:(ed "2024-06-16T00:00:00Z") 441 + ~confidence:0.55 442 + ~activity:"act-sim-lv-001" 443 + ~properties:[ 444 + ("scenario", "drought-2024"); 445 + ("time_step", "152"); 446 + ("prey_density_km2", "28.1"); 447 + ("seed", "42")] 448 + () 449 + 450 + (* ══════════════════════════════════════════════════════════ 451 + 8. Derivation: training set assembly 452 + 453 + The training set is itself a derived label — it records 454 + exactly which labels (measured + synthetic) were selected 455 + for model training, and the synthetic fraction. 
456 + 457 + This is the provenance anchor for the SDM: you can always 458 + ask "which observations trained this model?" 459 + ══════════════════════════════════════════════════════════ *) 460 + 461 + let all_measured_ids = 462 + [ "ct-001"; "ct-002"; "ct-003"; 463 + "gps-001"; "gps-002"; "gps-003"; "gps-004"; 464 + "gbif-001"; "gbif-002"; 465 + "inat-001" ] 466 + 467 + let all_synthetic_ids = 468 + [ "sim-001"; "sim-002"; "sim-003" ] 469 + 470 + let training_set = 471 + make_derived 472 + ~cell:(c "b70") ~id:"ts-001" 473 + ~geometry:(Polygon [ 474 + { x = 34.0; y = -3.0 }; 475 + { x = 36.0; y = -3.0 }; 476 + { x = 36.0; y = -1.0 }; 477 + { x = 34.0; y = -1.0 }; 478 + { x = 34.0; y = -3.0 }; 479 + ]) 480 + ~sources:(all_measured_ids @ all_synthetic_ids) 481 + ~method_:"training-set:balanced-spatial-sample" 482 + ~class_dist:[("training-set:Panthera-leo:sdm-2024", 1.0)] 483 + ~activity:"act-training-2024" 484 + ~properties:[ 485 + ("n_measured", string_of_int (List.length all_measured_ids)); 486 + ("n_synthetic", string_of_int (List.length all_synthetic_ids)); 487 + ("synthetic_fraction", "0.23"); 488 + ("spatial_extent", "34.0,-3.0,36.0,-1.0"); 489 + ("temporal_window", "2021/2024"); 490 + ("tessera_model", "tessera:v3.1:east-africa")] 491 + () 492 + 493 + (* ══════════════════════════════════════════════════════════ 494 + 9. Derivation: habitat suitability from TESSERA 495 + 496 + Each TESSERA tile is classified as suitable or unsuitable 497 + based on its land-cover embedding and the IUCN habitat 498 + preference codes. The [sources] link back to the IUCN 499 + habitat labels. 500 + ══════════════════════════════════════════════════════════ *) 501 + 502 + let hab_sources = ["iucn-hab-001"; "iucn-hab-002"] 503 + 504 + (** Core Serengeti savanna — highly suitable. 
*) 505 + let hab_01 = 506 + make_derived 507 + ~cell:(c "b7a") ~id:"hab-001" 508 + ~geometry:(Polygon [ 509 + { x = 34.80; y = -2.40 }; 510 + { x = 34.90; y = -2.40 }; 511 + { x = 34.90; y = -2.30 }; 512 + { x = 34.80; y = -2.30 }; 513 + { x = 34.80; y = -2.40 }; 514 + ]) 515 + ~sources:hab_sources 516 + ~method_:"habitat-classify:tessera-v3.1:threshold-0.6" 517 + ~confidence:0.91 518 + ~class_dist:[("savanna", 0.78); ("shrubland", 0.13); ("other", 0.09)] 519 + ~activity:"act-habitat-2024" 520 + ~properties:[ 521 + ("tessera_tile", "b7a:034.80:-002.40"); 522 + ("dominant_landcover", "savanna")] 523 + () 524 + 525 + (** Savanna-shrubland mosaic — moderate suitability. *) 526 + let hab_02 = 527 + make_derived 528 + ~cell:(c "b7d") ~id:"hab-002" 529 + ~geometry:(Polygon [ 530 + { x = 35.10; y = -2.60 }; 531 + { x = 35.20; y = -2.60 }; 532 + { x = 35.20; y = -2.50 }; 533 + { x = 35.10; y = -2.50 }; 534 + { x = 35.10; y = -2.60 }; 535 + ]) 536 + ~sources:hab_sources 537 + ~method_:"habitat-classify:tessera-v3.1:threshold-0.6" 538 + ~confidence:0.68 539 + ~class_dist:[("savanna", 0.45); ("shrubland", 0.30); ("cropland", 0.25)] 540 + ~activity:"act-habitat-2024" 541 + ~properties:[ 542 + ("tessera_tile", "b7d:035.10:-002.60"); 543 + ("dominant_landcover", "savanna-shrubland-mosaic")] 544 + () 545 + 546 + (** Agricultural land — unsuitable, excluded from AOH. 
*) 547 + let hab_03 = 548 + make_derived 549 + ~cell:(c "b7f") ~id:"hab-003" 550 + ~geometry:(Polygon [ 551 + { x = 35.80; y = -1.20 }; 552 + { x = 35.90; y = -1.20 }; 553 + { x = 35.90; y = -1.10 }; 554 + { x = 35.80; y = -1.10 }; 555 + { x = 35.80; y = -1.20 }; 556 + ]) 557 + ~sources:hab_sources 558 + ~method_:"habitat-classify:tessera-v3.1:threshold-0.6" 559 + ~confidence:0.12 560 + ~class_dist:[("cropland", 0.72); ("settlement", 0.18); ("savanna", 0.10)] 561 + ~activity:"act-habitat-2024" 562 + ~properties:[ 563 + ("tessera_tile", "b7f:035.80:-001.20"); 564 + ("dominant_landcover", "cropland")] 565 + () 566 + 567 + (* ══════════════════════════════════════════════════════════ 568 + 10. Derivation: species range from occurrences 569 + 570 + Alpha-shape computed from measured-only data. Synthetic 571 + labels are explicitly excluded — the range must reflect 572 + where lions have actually been observed. 573 + 574 + The [is_simulated] accessor is used by the range pipeline 575 + to filter out synthetic augmentation. 576 + ══════════════════════════════════════════════════════════ *) 577 + 578 + let species_range = 579 + make_derived 580 + ~cell:(c "b70") ~id:"range-001" 581 + ~geometry:(Polygon [ 582 + { x = 34.75; y = -2.60 }; 583 + { x = 35.50; y = -2.60 }; 584 + { x = 35.50; y = -2.00 }; 585 + { x = 35.10; y = -1.90 }; 586 + { x = 34.75; y = -2.10 }; 587 + { x = 34.75; y = -2.60 }; 588 + ]) 589 + ~sources:all_measured_ids (* no sim-* labels *) 590 + ~method_:"alpha-shape:alpha-0.005" 591 + ~class_dist:[("range:Panthera leo", 1.0)] 592 + ~activity:"act-range-2024" 593 + ~properties:[ 594 + ("range_km2", "4850"); 595 + ("n_occurrences", string_of_int (List.length all_measured_ids)); 596 + ("excludes_synthetic", "true")] 597 + () 598 + 599 + (* ══════════════════════════════════════════════════════════ 600 + 11. 
Derivation: Area of Habitat 601 + 602 + The final AOH is the intersection of: 603 + - the data-driven species range (measured-only) 604 + - the TESSERA habitat suitability tiles (suitable only) 605 + - validated against the IUCN expert range 606 + 607 + The result is a Multi polygon — disconnected habitat 608 + patches within the range. Properties carry the IUCN 609 + assessment metadata and the key metrics. 610 + 611 + When TESSERA is retrained (v3.1 → v3.2), the habitat 612 + tiles change, so AOH recomputes. The new AOH label gets 613 + a new activity; both versions coexist for comparison. 614 + ══════════════════════════════════════════════════════════ *) 615 + 616 + let aoh = 617 + make_derived 618 + ~cell:(c "b70") ~id:"aoh-001" 619 + ~geometry:(Multi [ 620 + (* Patch 1: core Serengeti savanna *) 621 + Polygon [ 622 + { x = 34.80; y = -2.40 }; 623 + { x = 35.20; y = -2.40 }; 624 + { x = 35.20; y = -2.10 }; 625 + { x = 34.80; y = -2.10 }; 626 + { x = 34.80; y = -2.40 }; 627 + ]; 628 + (* Patch 2: southern extension into Ngorongoro *) 629 + Polygon [ 630 + { x = 35.10; y = -2.60 }; 631 + { x = 35.40; y = -2.60 }; 632 + { x = 35.40; y = -2.40 }; 633 + { x = 35.10; y = -2.40 }; 634 + { x = 35.10; y = -2.60 }; 635 + ]; 636 + ]) 637 + ~sources:[ 638 + "range-001"; (* data-driven species range *) 639 + "iucn-range-001"; (* IUCN expert range — validation *) 640 + "hab-001"; "hab-002"; (* suitable habitat tiles *) 641 + (* hab-003 excluded: unsuitable cropland *) 642 + ] 643 + ~method_:"aoh:iucn-2022:range-intersect-habitat" 644 + ~class_dist:[("aoh:Panthera leo", 1.0)] 645 + ~activity:"act-aoh-2024" 646 + ~properties:[ 647 + (* AOH metrics *) 648 + ("aoh_km2", "3420"); 649 + ("range_km2", "4850"); 650 + ("habitat_proportion", "0.705"); 651 + ("unsuitable_excluded_km2", "1430"); 652 + ("dominant_exclusion", "cropland"); 653 + (* IUCN assessment context *) 654 + ("iucn_status", "VU"); 655 + ("iucn_criteria", "A2abcd"); 656 + ("population_trend", "decreasing"); 657 + (* 
Model provenance *) 658 + ("tessera_model", "tessera:v3.1:east-africa"); 659 + ("synthetic_in_sdm_training", "true"); 660 + ("synthetic_fraction_in_training", "0.23")] 661 + () 662 + 663 + (* ══════════════════════════════════════════════════════════ 664 + 12. Document assembly 665 + ══════════════════════════════════════════════════════════ *) 666 + 667 + let doc = 668 + { crs = wgs84; 669 + level = 12; 670 + provenance = [ 671 + act_field_survey; 672 + act_movebank_import; 673 + act_gbif_import; 674 + act_inat_import; 675 + act_iucn_import; 676 + act_simulation; 677 + act_training_set; 678 + act_habitat; 679 + act_range; 680 + act_aoh; 681 + ]; 682 + labels = [ 683 + (* Camera traps *) 684 + trap_01; trap_02; trap_03; trap_04; 685 + (* GPS collars — Movebank *) 686 + gps_01; gps_02; gps_03; gps_04; 687 + (* GBIF *) 688 + gbif_01; gbif_02; 689 + (* iNaturalist *) 690 + inat_01; 691 + (* IUCN Red List *) 692 + iucn_range; iucn_hab_savanna; iucn_hab_shrubland; 693 + (* Synthetic — Lotka-Volterra *) 694 + sim_01; sim_02; sim_03; 695 + (* Derivations *) 696 + training_set; 697 + hab_01; hab_02; hab_03; 698 + species_range; 699 + aoh; 700 + ]; 701 + annotations = [ 702 + { id = "ann-001"; 703 + text = "Camera trap ct-001 and GPS fix gps-001 are 1.4 km \ 704 + apart on the same day — likely same pride. Consider \ 705 + merge after dry-season survey completes."; 706 + anchors = ["ct-001"; "gps-001"] }; 707 + { id = "ann-002"; 708 + text = "GBIF gbif-002 has 500 m uncertainty and only year-level \ 709 + temporal precision. Flag for review before including \ 710 + in high-resolution analyses."; 711 + anchors = ["gbif-002"] }; 712 + { id = "ann-003"; 713 + text = "Synthetic labels sim-001..sim-003 augment the under-sampled \ 714 + Ngorongoro corridor. Weight reduced to 0.5x in training \ 715 + set assembly. 
Not included in species range computation."; 716 + anchors = ["sim-001"; "sim-002"; "sim-003"] }; 717 + { id = "ann-004"; 718 + text = "AOH shows 70.5% of range is suitable habitat. Main \ 719 + exclusion is cropland encroachment on the eastern boundary. \ 720 + Compare with IUCN 2019 assessment (was 78%)."; 721 + anchors = ["aoh-001"] }; 722 + ]; 723 + groups = [ 724 + { id = "grp-field-2024"; 725 + activity = Some "act-field-2024"; 726 + members = ["ct-001"; "ct-002"; "ct-003"; "ct-004"] }; 727 + { id = "grp-leo-007-track"; 728 + activity = Some "act-movebank-import"; 729 + members = ["gps-001"; "gps-002"; "gps-003"] }; 730 + { id = "grp-leo-012-track"; 731 + activity = Some "act-movebank-import"; 732 + members = ["gps-004"] }; 733 + { id = "grp-synthetic-lv42"; 734 + activity = Some "act-sim-lv-001"; 735 + members = ["sim-001"; "sim-002"; "sim-003"] }; 736 + { id = "grp-iucn-habitat-prefs"; 737 + activity = Some "act-iucn-import"; 738 + members = ["iucn-hab-001"; "iucn-hab-002"] }; 739 + ]; 740 + } 741 + 742 + (* ══════════════════════════════════════════════════════════ 743 + 13. Queries — demonstrating the provenance graph 744 + 745 + These functions show how a wiki renderer or analysis 746 + pipeline would traverse the label graph. 747 + ══════════════════════════════════════════════════════════ *) 748 + 749 + (** Find a label by ID. *) 750 + let find id = 751 + List.find (fun (l : label) -> l.id = id) doc.labels 752 + 753 + (** All labels in a Hilbert cell. *) 754 + let in_cell c = 755 + List.filter (fun (l : label) -> l.cell = c) doc.labels 756 + 757 + (** All measured (non-synthetic, non-derived) labels. *) 758 + let measured_only () = 759 + List.filter (fun (l : label) -> 760 + match l.origin with Measured _ -> true | _ -> false) 761 + doc.labels 762 + 763 + (** All simulated labels. *) 764 + let synthetic_only () = 765 + List.filter is_simulated doc.labels 766 + 767 + (** Immediate sources of a derived label. 
*) 768 + let sources_of_label l = 769 + List.filter_map 770 + (fun src_id -> 771 + match List.find_opt (fun (l : label) -> l.id = src_id) doc.labels with 772 + | Some src -> Some src 773 + | None -> None) 774 + (sources_of l) 775 + 776 + (** Transitive closure: all labels reachable through [sources]. *) 777 + let rec all_ancestors l = 778 + let immediate = sources_of_label l in 779 + let deeper = List.concat_map all_ancestors immediate in 780 + immediate @ deeper 781 + 782 + (** How many synthetic labels influenced this derivation? *) 783 + let synthetic_ancestor_count l = 784 + all_ancestors l 785 + |> List.filter is_simulated 786 + |> List.length 787 + 788 + (** Activity record for a label. *) 789 + let activity_of (l : label) = 790 + match l.activity with 791 + | None -> None 792 + | Some aid -> 793 + List.find_opt (fun a -> a.activity_id = aid) doc.provenance 794 + 795 + (* ══════════════════════════════════════════════════════════ 796 + 14. Main — exercise the provenance queries 797 + ══════════════════════════════════════════════════════════ *) 798 + 799 + let () = 800 + let n_labels = List.length doc.labels in 801 + let n_measured = List.length (measured_only ()) in 802 + let n_synthetic = List.length (synthetic_only ()) in 803 + let n_derived = n_labels - n_measured - n_synthetic in 804 + Printf.printf "Terradots AOH Example: Panthera leo, Serengeti\n"; 805 + Printf.printf "══════════════════════════════════════════════\n"; 806 + Printf.printf "CRS: %s Hilbert level: %d\n" doc.crs doc.level; 807 + Printf.printf "Labels: %d total (%d measured, %d synthetic, %d derived)\n" 808 + n_labels n_measured n_synthetic n_derived; 809 + Printf.printf "Activities: %d\n" (List.length doc.provenance); 810 + Printf.printf "Annotations: %d\n" (List.length doc.annotations); 811 + Printf.printf "Groups: %d\n\n" (List.length doc.groups); 812 + 813 + (* AOH provenance *) 814 + let aoh_label = find "aoh-001" in 815 + Printf.printf "AOH label: %s\n" (label_name aoh_label); 816 + 
let props key = 817 + List.assoc_opt key aoh_label.properties 818 + |> Option.value ~default:"?" in 819 + Printf.printf " AOH: %s km² / %s km² range = %s suitable\n" 820 + (props "aoh_km2") (props "range_km2") (props "habitat_proportion"); 821 + Printf.printf " IUCN status: %s (%s), trend: %s\n" 822 + (props "iucn_status") (props "iucn_criteria") 823 + (props "population_trend"); 824 + Printf.printf " TESSERA model: %s\n" (props "tessera_model"); 825 + Printf.printf " Synthetic in training: %s (fraction: %s)\n\n" 826 + (props "synthetic_in_sdm_training") 827 + (props "synthetic_fraction_in_training"); 828 + 829 + (* Provenance depth *) 830 + let ancestors = all_ancestors aoh_label in 831 + let n_syn_ancestors = synthetic_ancestor_count aoh_label in 832 + Printf.printf "Provenance graph from AOH:\n"; 833 + Printf.printf " Reachable labels: %d\n" (List.length ancestors); 834 + Printf.printf " Of which synthetic: %d\n" n_syn_ancestors; 835 + 836 + (* Activity for AOH *) 837 + (match activity_of aoh_label with 838 + | Some a -> 839 + Printf.printf " Activity: %s\n" a.activity_id; 840 + Printf.printf " Agent: %s\n" a.agent; 841 + Printf.printf " Date: %s\n" a.date 842 + | None -> ()); 843 + 844 + (* Spatial query *) 845 + Printf.printf "\nLabels in cell b7a: %d\n" 846 + (List.length (in_cell (c "b7a")))
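A note on the `all_ancestors` query in the example above: it concatenates the ancestors found along each path, so a label reachable through two derivations is counted once per path. In this document `iucn-hab-001` and `iucn-hab-002` are sources of both `hab-001` and `hab-002`, which are both sources of `aoh-001`, so the "Reachable labels" figure includes them twice. The recursion also would not terminate if a `sources` list ever formed a cycle. A minimal sketch of a visited-set traversal follows. The `node` record and the `ancestors` function are hypothetical stand-ins for illustration, not part of the Terradots API; adapting them to `label` and `sources_of` is left as an assumption.

```ocaml
(* Sketch only: [node] is a hypothetical stand-in for a Terradots label,
   reduced to the two fields the traversal needs. *)
module SS = Set.Make (String)

type node = { id : string; sources : string list }

(* Distinct IDs of every node reachable from [start] through [sources].
   Source IDs that do not resolve to a node are skipped, mirroring the
   filter in [sources_of_label]. The result is sorted by ID. *)
let ancestors (nodes : node list) (start : node) : string list =
  let by_id = Hashtbl.create 16 in
  List.iter (fun (n : node) -> Hashtbl.replace by_id n.id n) nodes;
  let rec visit seen id =
    if SS.mem id seen then seen (* already expanded: stops cycles and repeats *)
    else
      match Hashtbl.find_opt by_id id with
      | None -> seen (* dangling source ID: ignore *)
      | Some n -> List.fold_left visit (SS.add id seen) n.sources
  in
  (* Seed the set with the start node so a cycle back to it is not re-expanded,
     then drop it from the answer. *)
  List.fold_left visit (SS.singleton start.id) start.sources
  |> SS.remove start.id
  |> SS.elements
```

Counting synthetic ancestors then becomes a filter over this list, and each label contributes at most once regardless of how many derivation paths lead to it.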
+3
example/dune
··· 1 + (executable 2 + (name aoh_example) 3 + (libraries terradots))
+3
lib/dune
··· 1 + (library 2 + (name terradots) 3 + (public_name terradots))
+315
lib/terradots.ml
··· 1 + (** Terradots Label Store — core data model. 2 + 3 + Coordinates are always in the document's native CRS (e.g. lon/lat 4 + for EPSG:4326, metres for UTM). Pixel-space mapping (affine 5 + transforms, viewBox) is a serialisation concern handled by format 6 + encoders/decoders, not by this module. *) 7 + 8 + (** {1 Coordinate Reference Systems} *) 9 + 10 + (** Any string that PROJ can resolve: ["EPSG:4326"], WKT2, etc. *) 11 + type crs = string 12 + 13 + let wgs84 = "EPSG:4326" 14 + let web_mercator = "EPSG:3857" 15 + 16 + (** {1 Temporal} *) 17 + 18 + type event_date = string 19 + let event_date_of_string s = s 20 + let string_of_event_date s = s 21 + 22 + (** {1 Spatial indexing} *) 23 + 24 + type cell = string 25 + let cell_of_string s = s 26 + let string_of_cell s = s 27 + 28 + (** {1 Geometry} 29 + 30 + Points and closed polygons, following OGC Simple Features (ISO 19125). 31 + Coordinates are in the document's native CRS units. *) 32 + 33 + type point = { x : float; y : float } 34 + 35 + type geometry = 36 + | Point of point 37 + | Polygon of point list (** Exterior ring, closed. *) 38 + | Multi of geometry list (** GeometryCollection / Multi* *) 39 + 40 + (** Representative point for spatial indexing. *) 41 + let rec centroid = function 42 + | Point p -> p 43 + | Polygon ring -> 44 + let n = Float.of_int (List.length ring) in 45 + let sx = List.fold_left (fun acc p -> acc +. p.x) 0.0 ring in 46 + let sy = List.fold_left (fun acc p -> acc +. p.y) 0.0 ring in 47 + { x = sx /. n; y = sy /. n } 48 + | Multi gs -> 49 + let cs = List.map centroid gs in 50 + centroid (Polygon cs) 51 + 52 + (** {1 Origin} 53 + 54 + How a label was produced. This is the only part that varies 55 + by kind — confidence and classification are universal. 56 + 57 + Observers and registries are identified by URI. 
The scheme 58 + tells you what kind of source it is: 59 + 60 + {v 61 + URI meaning 62 + ──────────────────────────────────── ──────────────────── 63 + orcid:0000-0001-2345-6789 human (ORCID) 64 + https://ror.org/035dkdb55 institution (ROR) 65 + urn:sensor:gps:trimble-r12-0042 GPS receiver 66 + urn:sensor:camera-trap:ct-0042 camera trap 67 + gbif:4023589127 GBIF occurrence 68 + inaturalist:observation/12345 iNaturalist record 69 + osm:node/123456 OpenStreetMap node 70 + v} 71 + 72 + - {b Measured}: observation by someone/something ([observer]), 73 + possibly imported via a registry ([via]). 74 + - {b Derived}: computed from other labels (convex hull, buffer). 75 + Positional accuracy propagates from sources. 76 + - {b Simulated}: produced by a theoretical model (population 77 + dynamics, agent-based simulation). Must remain identifiable 78 + as synthetic — never mixed with measured observations in 79 + analyses that require ground truth. *) 80 + 81 + type origin = 82 + | Measured of { 83 + observer : string option; (** URI of the observer *) 84 + via : string option; (** URI of the registry record *) 85 + license : string option; (** SPDX identifier, e.g. ["CC-BY-4.0"] *) 86 + accuracy_m : float option; (** Positional uncertainty radius (m) *) 87 + } 88 + | Derived of { 89 + sources : string list; (** IDs of source labels *) 90 + method_ : string; (** e.g. ["convex-hull"], ["buffer-10m"] *) 91 + } 92 + | Simulated of { 93 + model : string; (** URI of the simulation model/notebook *) 94 + run_id : string; (** Unique simulation run identifier *) 95 + } 96 + 97 + (** {1 Provenance} 98 + 99 + An [activity] is the audit record for how labels were produced: 100 + who, when, and (for derivations) which inputs and method. 101 + A label's {!field-label.origin} is the structural summary; 102 + the optional {!field-label.activity} links to the full record. 
*) 103 + 104 + type activity = { 105 + activity_id : string; 106 + agent : string; (** Who or what: email, tool/version, etc. *) 107 + date : string; (** ISO 8601 *) 108 + description : string option; (** Free-text note on what was done *) 109 + } 110 + 111 + (** {1 Labels} 112 + 113 + {b Identity and spatial indexing.} A label has two name 114 + components: [cell] and [id]. 115 + 116 + - [cell] is a Hilbert curve cell index computed from the 117 + label's {!centroid} at the document's {!field-document.level}. 118 + It encodes area and resolution — you can read off where a 119 + label is from its cell. Recomputed on reprojection. 120 + 121 + - [id] is a unique identifier (e.g. UUID) within the cell. 122 + Stable, never changes. 123 + 124 + Together, [cell ^ "-" ^ id] gives a spatially-sortable unique 125 + name. Sorting by this composite groups nearby labels. 126 + 127 + {b Classification.} A label's class is expressed through 128 + [class_dist] — a probability distribution over class names. 129 + A definite classification is [class_dist = \[("Panthera leo", 1.0)\]]. 130 + An uncertain classification distributes probability across 131 + candidates. An unclassified label has [class_dist = \[\]]. 132 + The {!primary_class} accessor returns the most likely class. 133 + 134 + {b Deduplication.} Dedup across sources is a derivation: find 135 + candidate matches (same [cell] + class agreement + temporal 136 + overlap), let an expert decide, and merge via 137 + [Derived { sources = \[a; b\]; method_ = "manual-merge" }]. 138 + Both originals are kept for provenance. 139 + 140 + {b Temporal.} [event_date] follows the Darwin Core 141 + {{: https://techdocs.gbif.org/en/data-processing/temporal-interpretation} 142 + temporal interpretation} convention (ISO 8601-1:2019): 143 + precise dates (["2023-09-18"]), imprecise dates (["2023-09"], 144 + ["2023"]), date-times (["2023-09-18T13:27:00Z"]), or intervals 145 + (["2023-09-05/2023-09-18"]). 
This records when the observation 146 + was made, not when the label was imported. *) 147 + type label = { 148 + cell : cell; (** Hilbert cell — spatial index hint *) 149 + id : string; (** Unique identifier (e.g. UUID) *) 150 + geometry : geometry; 151 + origin : origin; 152 + event_date : event_date option; (** Darwin Core eventDate, ISO 8601 *) 153 + confidence : float option; (** Semantic confidence ∈ \[0, 1\] *) 154 + class_dist : (string * float) list; (** Per-class probability distribution *) 155 + activity : string option; (** Activity ID for full provenance *) 156 + properties : (string * string) list; (** Extensible key-value metadata *) 157 + } 158 + 159 + (** The full spatially-sortable name of a label. *) 160 + let label_name l = string_of_cell l.cell ^ "-" ^ l.id 161 + 162 + (** A free-text annotation anchored to one or more labels. *) 163 + type annotation = { 164 + id : string; 165 + text : string; 166 + anchors : string list; (** IDs of labels this annotates *) 167 + } 168 + 169 + (** A named group of labels (e.g. a field campaign). 
*) 170 + type group = { 171 + id : string; 172 + activity : string option; (** Activity ID for this group's provenance *) 173 + members : string list; (** IDs of labels in this group *) 174 + } 175 + 176 + (** {1 Document} *) 177 + 178 + type document = { 179 + crs : crs; 180 + level : int; (** Hilbert curve level for {!field-label.cell} *) 181 + provenance : activity list; 182 + labels : label list; 183 + annotations : annotation list; 184 + groups : group list; 185 + } 186 + 187 + (** {1 Constructors} *) 188 + 189 + let make_point ~cell ~id ~x ~y ~observer ?accuracy_m 190 + ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () = 191 + { cell; id; geometry = Point { x; y }; 192 + origin = Measured { observer = Some observer; via = None; 193 + license = None; accuracy_m }; 194 + event_date; confidence; class_dist; activity; properties } 195 + 196 + let make_polygon ~cell ~id ~ring ~observer ?accuracy_m 197 + ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () = 198 + { cell; id; geometry = Polygon ring; 199 + origin = Measured { observer = Some observer; via = None; 200 + license = None; accuracy_m }; 201 + event_date; confidence; class_dist; activity; properties } 202 + 203 + let make_imported ~cell ~id ~geometry ~via ?observer ?license 204 + ?accuracy_m ?event_date ?confidence ?(class_dist = []) 205 + ?activity ?(properties = []) () = 206 + { cell; id; geometry; 207 + origin = Measured { observer; via = Some via; license; accuracy_m }; 208 + event_date; confidence; class_dist; activity; properties } 209 + 210 + let make_derived ~cell ~id ~geometry ~sources ~method_ 211 + ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () = 212 + { cell; id; geometry; 213 + origin = Derived { sources; method_ }; 214 + event_date; confidence; class_dist; activity; properties } 215 + 216 + let make_simulated ~cell ~id ~geometry ~model ~run_id 217 + ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = 
[]) () = 218 + { cell; id; geometry; 219 + origin = Simulated { model; run_id }; 220 + event_date; confidence; class_dist; activity; properties } 221 + 222 + (** {1 Accessors} *) 223 + 224 + let primary_class l = 225 + match l.class_dist with 226 + | [] -> None 227 + | (c, _) :: _ -> 228 + Some (List.fold_left 229 + (fun (bc, bp) (c, p) -> if p > bp then (c, p) else (bc, bp)) 230 + (c, 0.0) l.class_dist 231 + |> fst) 232 + 233 + let accuracy_of l = 234 + match l.origin with 235 + | Measured { accuracy_m; _ } -> accuracy_m 236 + | Derived _ | Simulated _ -> None 237 + 238 + let sources_of l = 239 + match l.origin with 240 + | Measured _ | Simulated _ -> [] 241 + | Derived { sources; _ } -> sources 242 + 243 + (** Registry URI, if this label was imported via a registry. *) 244 + let via_of l = 245 + match l.origin with 246 + | Measured { via; _ } -> via 247 + | Derived _ | Simulated _ -> None 248 + 249 + (** Is this label synthetic (from a simulation)? *) 250 + let is_simulated l = 251 + match l.origin with 252 + | Simulated _ -> true 253 + | Measured _ | Derived _ -> false 254 + 255 + (** {1 Fingerprinting} 256 + 257 + A fingerprint is [cell + primary class] — a coarse key for 258 + finding dedup candidates. Event date is deliberately excluded: 259 + the same feature observed at different times should still match 260 + as a candidate for human review. *) 261 + 262 + let fingerprint l = 263 + let class_str = Option.value ~default:"_" (primary_class l) in 264 + string_of_cell l.cell ^ "|" ^ class_str 265 + 266 + let empty_document ~crs ?(level = 12) () = 267 + { crs; level; provenance = []; labels = []; 268 + annotations = []; groups = [] } 269 + 270 + (** {1 Storage Layer} 271 + 272 + The data model above is independent of how labels are stored 273 + and indexed. This section defines the interface between the 274 + core types and a storage backend. 
275 + 276 + {b Hilbert cell computation.} The [cell] field on each label 277 + is a hex-encoded Hilbert curve cell index, computed by the 278 + storage layer from the label's {!centroid} at the document's 279 + {!field-document.level}. The Hilbert curve maps 2D coordinates 280 + to a 1D index that preserves spatial locality: nearby points 281 + get nearby indices. 282 + 283 + Level [n] divides each axis into [2{^n}] cells: 284 + 285 + {v 286 + Level EPSG:4326 cell width (at equator) Hex chars 287 + 8 ~1.4° (~157 km) 2 288 + 12 ~0.088° (~9.8 km) 3 289 + 16 ~0.0055° (~610 m) 4 290 + 20 ~0.00034° (~38 m) 5 291 + v} 292 + 293 + The storage layer must provide: 294 + 295 + {v 296 + val hilbert_cell : level:int -> crs:crs -> point -> cell 297 + v} 298 + 299 + which computes the cell for a point in the document's CRS. 300 + The {!centroid} function gives the representative point for 301 + any geometry. 302 + 303 + {b Why Hilbert, not Geohash.} Geohash uses a Z-order (Morton) 304 + curve, which has discontinuities at cell boundaries: two 305 + points close in space can have very different hashes. The 306 + Hilbert curve has better locality: adjacent cells on the curve 307 + are always spatially adjacent. 308 + 309 + {b Reprojection.} When a document's CRS changes, all [cell] 310 + values must be recomputed. The [id] fields remain stable. 311 + 312 + {b Sorted keys.} Concatenating [cell ^ "-" ^ id] (see 313 + {!label_name}) gives a key that sorts spatially. Any system 314 + that maintains sorted order (B-tree, LSM, lexicographic file 315 + listing) gets spatial clustering for free. *)
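The storage-layer comment above asks for a `hilbert_cell` function but the module does not ship one. The following is a minimal sketch of how such a function might look for EPSG:4326 only, using the standard iterative xy-to-index Hilbert mapping. Everything here is an assumption for illustration: the names `hilbert_index` and `hilbert_cell_wgs84`, the `~bits` parameter (bits per axis), the fixed lon/lat extent, and the zero-padded hex encoding. How the document's `level` maps onto bits per axis is not settled by the source, so it is left as a parameter.

```ocaml
(* Sketch only: a local copy of the point record so the block stands alone. *)
type point = { x : float; y : float }

(* Hilbert curve index of grid cell (x, y) on a [side] x [side] grid.
   [side] must be a power of two. Standard iterative xy -> d mapping. *)
let hilbert_index ~side x y =
  let x = ref x and y = ref y and d = ref 0 in
  let s = ref (side / 2) in
  while !s > 0 do
    let rx = if !x land !s > 0 then 1 else 0 in
    let ry = if !y land !s > 0 then 1 else 0 in
    d := !d + (!s * !s * ((3 * rx) lxor ry));
    (* Rotate/flip the quadrant so the sub-curve is oriented correctly. *)
    if ry = 0 then begin
      if rx = 1 then begin
        x := side - 1 - !x;
        y := side - 1 - !y
      end;
      let t = !x in
      x := !y;
      y := t
    end;
    s := !s / 2
  done;
  !d

(* Hypothetical cell key for a lon/lat point: [bits] bits per axis over the
   full extent lon in [-180, 180], lat in [-90, 90], hex-encoded and
   zero-padded so that keys of the same [bits] sort lexicographically. *)
let hilbert_cell_wgs84 ~bits { x = lon; y = lat } =
  let side = 1 lsl bits in
  let to_grid v lo hi =
    let t = (v -. lo) /. (hi -. lo) in
    let i = int_of_float (t *. float_of_int side) in
    max 0 (min (side - 1) i) (* clamp points on or beyond the upper edge *)
  in
  let d = hilbert_index ~side (to_grid lon (-180.0) 180.0) (to_grid lat (-90.0) 90.0) in
  let width = ((2 * bits) + 3) / 4 in
  let hex = Printf.sprintf "%x" d in
  String.make (max 0 (width - String.length hex)) '0' ^ hex
```

The zero padding matters for the sorted-key property: without a fixed width, `"f"` would sort after `"10"`. A real implementation would also need to handle projected CRSs, where the extent is not a fixed lon/lat box.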
+726
lib/terradots.mli
··· 1 + (** {0 Terradots Label Store} 2 + 3 + A data model for geospatial labels — human observations and 4 + derived annotations used to train geospatial foundation models. 5 + 6 + This module defines the core types for representing labelled 7 + geographic features with full provenance, uncertainty, and 8 + spatial indexing. It is independent of any serialisation format 9 + (SVG, GeoJSON, GeoParquet, etc.); format encoders and decoders 10 + operate over these types. 11 + 12 + {1 Design Principles} 13 + 14 + {2 Coordinates live in CRS space} 15 + 16 + All coordinates are in the document's native Coordinate Reference 17 + System (CRS). Pixel-space mapping (affine transforms, SVG 18 + viewBox) is a serialisation concern, not a data model concern. 19 + The CRS is specified per document as any string that 20 + {{: https://proj.org/} PROJ} can resolve: EPSG codes 21 + (["EPSG:4326"]), WKT2 strings, or PROJ pipeline definitions. 22 + 23 + {2 Origin distinguishes measured, derived, and simulated} 24 + 25 + Every label records how it was produced. {b Measured} labels 26 + come from direct observation — a GPS receiver, a human expert 27 + digitising on imagery, or an import from an external registry 28 + (GBIF, iNaturalist, OpenStreetMap). {b Derived} labels are 29 + computed from other labels — convex hulls, buffers, manual 30 + merges during deduplication. {b Simulated} labels are 31 + produced by theoretical models (population dynamics, 32 + agent-based simulations, climate projections) and must remain 33 + identifiable as synthetic — they augment training data or 34 + explore counterfactual scenarios but do not represent 35 + real-world observations. 36 + 37 + Measured labels carry positional accuracy (metres). Derived 38 + labels do not — their accuracy propagates from the source 39 + labels via the derivation method. Simulated labels carry 40 + confidence reflecting model reliability, typically lower 41 + than measured data. 
42 + 43 + Confidence and classification are universal: you can be 44 + confident (or uncertain) about any label regardless of its 45 + origin. 46 + 47 + {2 URIs identify observers and registries} 48 + 49 + Observers (sensors, humans) and external registries are 50 + identified by URI. The URI scheme encodes the kind of source: 51 + 52 + {v 53 + URI meaning 54 + ──────────────────────────────────── ──────────────────── 55 + orcid:0000-0001-2345-6789 human (ORCID) 56 + https://ror.org/035dkdb55 institution (ROR) 57 + urn:sensor:gps:trimble-r12-0042 GPS receiver 58 + urn:sensor:camera-trap:ct-0042 camera trap 59 + gbif:4023589127 GBIF occurrence 60 + inaturalist:observation/12345 iNaturalist record 61 + osm:node/123456 OpenStreetMap node 62 + v} 63 + 64 + Adding a new kind of observer or registry requires no code 65 + changes — just use a new URI scheme. 66 + 67 + {2 Identity and spatial indexing are separate} 68 + 69 + A label has two name components: 70 + 71 + - {b cell}: a Hilbert curve cell index encoding spatial 72 + locality. Computed from the label's centroid at the 73 + document's Hilbert level. Recomputed on reprojection. 74 + 75 + - {b id}: a unique identifier (e.g. UUID). Stable across 76 + reprojections, never changes. 77 + 78 + Concatenating [cell ^ "-" ^ id] (see {!label_name}) gives a 79 + spatially-sortable unique name. Any sorted index (B-tree, LSM, 80 + lexicographic file listing) gets spatial clustering for free. 81 + 82 + {2 Classification is a probability distribution} 83 + 84 + A label's class is expressed through {!field-label.class_dist}, 85 + a list of [(class_name, probability)] pairs ordered by 86 + decreasing probability. A definite classification is a 87 + singleton list: [\[("Panthera leo", 1.0)\]]. An uncertain 88 + classification distributes probability across candidates. 89 + An unclassified label has an empty list. 90 + 91 + The {!primary_class} accessor extracts the highest-probability 92 + class. 
The {!fingerprint} function uses the primary class for 93 + coarse deduplication matching. 94 + 95 + {2 Temporal data follows Darwin Core} 96 + 97 + The [event_date] field follows the 98 + {{: https://techdocs.gbif.org/en/data-processing/temporal-interpretation} 99 + Darwin Core temporal interpretation} convention (ISO 8601-1:2019). 100 + It records when the observation was made, not when the label was 101 + imported into the system. Supported formats: 102 + 103 + - Precise dates: ["2023-09-18"] 104 + - Imprecise dates: ["2023-09"] or ["2023"] 105 + - Date-times: ["2023-09-18T13:27:00Z"] 106 + - Intervals: ["2023-09-05/2023-09-18"] 107 + 108 + {2 Deduplication is a derivation} 109 + 110 + Labels imported from multiple sources may refer to the same 111 + real-world feature. Deduplication is modelled as a derivation: 112 + find candidate matches (same Hilbert cell, class agreement, 113 + temporal overlap), let an expert decide, then merge via 114 + [Derived { sources = \[a; b\]; method_ = "manual-merge" }]. 115 + Both originals are kept in the document for full provenance. 116 + The {!fingerprint} function produces coarse spatial+class keys 117 + for efficient candidate matching. *) 118 + 119 + (** {1 Coordinate Reference Systems} *) 120 + 121 + (** A coordinate reference system identifier. 122 + 123 + Any string that {{: https://proj.org/} PROJ} can resolve: 124 + EPSG codes (["EPSG:4326"]), WKT2 strings, or PROJ pipeline 125 + definitions. The CRS determines the units and meaning of 126 + all {!point} coordinates in the document. *) 127 + type crs = string 128 + 129 + (** WGS 84 geographic coordinates (longitude, latitude in degrees). 130 + The most common CRS for global datasets. *) 131 + val wgs84 : crs 132 + 133 + (** Web Mercator (metres). Used by web mapping tile services. 
*) 134 + val web_mercator : crs 135 + 136 + (** {1 Temporal} *) 137 + 138 + (** A temporal extent following the Darwin Core 139 + {{: https://techdocs.gbif.org/en/data-processing/temporal-interpretation} 140 + [eventDate]} convention (ISO 8601-1:2019). 141 + 142 + This type is abstract — construct values with 143 + {!event_date_of_string} and inspect with 144 + {!string_of_event_date}. Valid forms: 145 + 146 + - Precise dates: ["2023-09-18"] 147 + - Imprecise dates: ["2023-09"], ["2023"] 148 + - Date-times: ["2023-09-18T13:27:00Z"] 149 + - Intervals: ["2023-09-05/2023-09-18"] 150 + 151 + The abstraction boundary allows a future implementation to 152 + parse and validate these forms, or to provide temporal 153 + comparison and overlap queries. *) 154 + type event_date 155 + 156 + (** Construct an {!event_date} from an ISO 8601 string. 157 + 158 + Currently accepts any string — validation is deferred to the 159 + storage layer or a future version of this module. *) 160 + val event_date_of_string : string -> event_date 161 + 162 + (** Convert an {!event_date} back to its ISO 8601 string 163 + representation. *) 164 + val string_of_event_date : event_date -> string 165 + 166 + (** {1 Spatial Indexing} *) 167 + 168 + (** A Hilbert curve cell index — a hex-encoded spatial locality 169 + key computed from a label's {!centroid} at the document's 170 + Hilbert level. 171 + 172 + This type is abstract — construct values with 173 + {!cell_of_string} and inspect with {!string_of_cell}. 174 + 175 + The Hilbert curve maps 2D coordinates to a 1D index that 176 + preserves spatial locality. Nearby points in CRS space 177 + get nearby cell values. See {!section:storagelayer} for 178 + the level-to-resolution table. 179 + 180 + The abstraction boundary allows a future implementation to 181 + enforce hex format, validate level-appropriate lengths, or 182 + provide cell arithmetic (parent, children, neighbours). *) 183 + type cell 184 + 185 + (** Construct a {!cell} from a hex string. 
*) 186 + val cell_of_string : string -> cell 187 + 188 + (** Convert a {!cell} back to its hex string representation. *) 189 + val string_of_cell : cell -> string 190 + 191 + (** {1 Geometry} *) 192 + 193 + (** A point in the document's native CRS. 194 + 195 + For [EPSG:4326]: [x] is longitude (degrees east), [y] is 196 + latitude (degrees north). For projected CRS (UTM, Web 197 + Mercator): [x] is easting (metres), [y] is northing (metres). *) 198 + type point = { x : float; y : float } 199 + 200 + (** A geometry in the document's native CRS. 201 + 202 + Follows {{: https://www.ogc.org/standard/sfa/} OGC Simple 203 + Features / ISO 19125} with the subset needed for labelling: 204 + 205 + - {b Point}: a single coordinate pair. 206 + - {b Polygon}: a closed exterior ring (the last point must 207 + equal the first). Interior rings (holes) are not supported. 208 + - {b Multi}: a heterogeneous collection of geometries, 209 + corresponding to OGC GeometryCollection and the Multi* 210 + types (MultiPoint, MultiPolygon). *) 211 + type geometry = 212 + | Point of point 213 + | Polygon of point list 214 + | Multi of geometry list 215 + 216 + (** Compute the representative point (centroid) of a geometry. 217 + 218 + - {b Point}: the point itself. 219 + - {b Polygon}: arithmetic mean of the ring vertices. 220 + - {b Multi}: centroid of the centroids of its members 221 + (unweighted — sufficient for spatial indexing, not for 222 + area-weighted geometric analysis). 223 + 224 + Used by the storage layer to compute Hilbert cell indices 225 + (see {!section:storagelayer}). *) 226 + val centroid : geometry -> point 227 + 228 + (** {1 Origin} 229 + 230 + How a label was produced. This is the only dimension along 231 + which label metadata varies — confidence and classification 232 + are universal properties independent of origin. *) 233 + 234 + (** The origin of a label. 
235 + 236 + {b Measured.} A direct observation by an observer (sensor, 237 + human, or institution), possibly imported via an external 238 + registry. Fields: 239 + 240 + - [observer]: URI identifying who or what made the observation. 241 + Required for direct observations; optional for registry 242 + imports where the original observer may be unknown. 243 + - [via]: URI of the registry record, if imported from an 244 + external platform. [None] for direct observations. 245 + - [license]: SPDX license identifier (e.g. ["CC-BY-4.0"], 246 + ["ODbL-1.0"]) for the imported data. [None] for direct 247 + observations or when the license is unspecified. 248 + - [accuracy_m]: positional uncertainty radius in metres. 249 + Interpretation: the true position lies within [accuracy_m] 250 + metres of the stated coordinates with high probability. 251 + Maps to GBIF [coordinateUncertaintyInMeters] for imports. 252 + 253 + {b Derived.} A label computed from other labels. Fields: 254 + 255 + - [sources]: list of label IDs that were inputs to the 256 + derivation. These must be valid label IDs within the same 257 + document. 258 + - [method_]: a short identifier for the derivation algorithm 259 + (e.g. ["convex-hull"], ["buffer-10m"], ["manual-merge"]). 260 + 261 + Derived labels do not carry independent positional accuracy — 262 + it propagates from the source labels via the derivation 263 + method. The {!activity} record provides the full audit trail 264 + (who ran the derivation, when, etc.). 265 + 266 + {b Simulated.} A label produced by a theoretical simulation 267 + (agent-based models, population dynamics, climate projections, 268 + etc.). Simulated labels augment training data or explore 269 + counterfactual scenarios but {b must} remain identifiable as 270 + synthetic — they do not represent real-world observations. 271 + 272 + - [model]: URI of the simulation model (typically a Fairground 273 + notebook, e.g. ["fairground:notebook/lotka-volterra:v4"]). 
274 + - [run_id]: unique identifier for the simulation run, linking 275 + all labels from the same execution. Combined with the 276 + [activity] record, this gives full reproducibility: model 277 + version, parameters, random seed. 278 + 279 + Simulated labels carry [confidence] reflecting the model's 280 + estimated reliability, typically lower than measured data. 281 + Derivation pipelines that consume simulated labels should 282 + record the synthetic fraction in their [properties] for 283 + transparency. *) 284 + type origin = 285 + | Measured of { 286 + observer : string option; 287 + via : string option; 288 + license : string option; 289 + accuracy_m : float option; 290 + } 291 + | Derived of { 292 + sources : string list; 293 + method_ : string; 294 + } 295 + | Simulated of { 296 + model : string; 297 + run_id : string; 298 + } 299 + 300 + (** {1 Provenance} *) 301 + 302 + (** An audit record for how labels were produced. 303 + 304 + An [activity] captures the "who" and "when" of label creation 305 + or derivation. It complements {!origin}, which captures the 306 + structural "what" and "from what". 307 + 308 + - [activity_id]: unique identifier for this activity. 309 + - [agent]: URI or free-text identifier for the person, team, 310 + or software that performed the activity. 311 + - [date]: when the activity occurred, ISO 8601. 312 + - [description]: optional free-text note on what was done. 313 + 314 + Labels reference activities via {!field-label.activity}. 315 + Multiple labels may share the same activity (e.g. a batch 316 + import or a bulk derivation). *) 317 + type activity = { 318 + activity_id : string; 319 + agent : string; 320 + date : string; 321 + description : string option; 322 + } 323 + 324 + (** {1 Labels} *) 325 + 326 + (** A geospatial label: a geometry in CRS space with classification, 327 + origin, confidence, temporal extent, and extensible metadata. 
328 + 329 + See the module-level documentation for the full design rationale 330 + covering identity, spatial indexing, temporal conventions, and 331 + deduplication. 332 + 333 + {b Fields.} 334 + 335 + - [cell]: Hilbert curve cell index (see {!cell}), computed from 336 + {!centroid} at the document's {!field-document.level}. Encodes 337 + spatial locality: labels in the same cell are geographically 338 + close. Recomputed on reprojection; not part of the stable 339 + identity. 340 + 341 + - [id]: unique identifier (e.g. UUID) within the document, since [sources], [anchors] and [members] refer to labels by [id] alone. Stable 342 + across reprojections. Two independent imports of the same 343 + real-world feature get different IDs; this is correct until 344 + an expert merges them via derivation. 345 + 346 + - [geometry]: the label's spatial extent in the document's CRS. 347 + 348 + - [origin]: how the label was produced (see {!origin}). 349 + 350 + - [event_date]: when the observation was made (not when it was 351 + imported). See {!event_date} for the Darwin Core temporal 352 + convention and supported formats. 353 + 354 + - [confidence]: semantic confidence in the overall label, 355 + in the range [\[0, 1\]]. Independent of positional accuracy. 356 + 357 + - [class_dist]: per-class probability distribution. A list of 358 + [(class_name, probability)] pairs ordered by decreasing 359 + probability. For a classified label the probabilities should sum to 1.0. A definite classification 360 + is a singleton: [\[("Panthera leo", 1.0)\]]. An unclassified 361 + label has an empty list. See {!primary_class}. 362 + 363 + - [activity]: optional ID of the {!activity} record that 364 + produced this label. Provides the full audit trail. 365 + 366 + - [properties]: extensible key-value metadata. Use for 367 + domain-specific attributes that don't warrant a dedicated 368 + field (e.g. [("gbif_dataset", "uk-nbn-atlas")], 369 + [("observer_name", "Alice Smith")]).
*) 370 + type label = { 371 + cell : cell; 372 + id : string; 373 + geometry : geometry; 374 + origin : origin; 375 + event_date : event_date option; 376 + confidence : float option; 377 + class_dist : (string * float) list; 378 + activity : string option; 379 + properties : (string * string) list; 380 + } 381 + 382 + (** The full spatially-sortable name of a label. 383 + 384 + Returns [cell ^ "-" ^ id]. Sorting a collection of labels 385 + by {!label_name} groups spatially nearby labels together, 386 + because the Hilbert cell prefix preserves spatial locality. *) 387 + val label_name : label -> string 388 + 389 + (** A free-text annotation anchored to one or more labels. 390 + 391 + Annotations provide commentary, corrections, or contextual 392 + notes without modifying the labels themselves. 393 + 394 + - [id]: unique identifier for this annotation. 395 + - [text]: the annotation content (free text). 396 + - [anchors]: list of label IDs that this annotation refers to. 397 + An annotation may span multiple labels (e.g. "these three 398 + points are the same tree observed in different years"). *) 399 + type annotation = { 400 + id : string; 401 + text : string; 402 + anchors : string list; 403 + } 404 + 405 + (** A named group of labels. 406 + 407 + Groups organise labels into logical collections — a field 408 + campaign, a seasonal survey, a thematic subset. They are 409 + purely organisational and do not affect label semantics. 410 + 411 + - [id]: unique identifier for this group. 412 + - [activity]: optional ID of the {!activity} that created or 413 + curated this group. 414 + - [members]: list of label IDs belonging to this group. 415 + A label may belong to multiple groups. *) 416 + type group = { 417 + id : string; 418 + activity : string option; 419 + members : string list; 420 + } 421 + 422 + (** {1 Document} 423 + 424 + A document is the top-level container: a set of labels in a 425 + common CRS, with provenance records, annotations, and groups. 
*) 426 + 427 + (** A label store document. 428 + 429 + - [crs]: the coordinate reference system for all geometries. 430 + - [level]: the Hilbert curve level used to compute 431 + {!field-label.cell} values. All labels in a document use 432 + the same level for consistent spatial resolution. See 433 + {!section:storagelayer} for the level-to-resolution table. 434 + - [provenance]: the list of {!activity} records referenced by 435 + labels and groups. 436 + - [labels]: the label collection. 437 + - [annotations]: free-text annotations anchored to labels. 438 + - [groups]: named subsets of labels. *) 439 + type document = { 440 + crs : crs; 441 + level : int; 442 + provenance : activity list; 443 + labels : label list; 444 + annotations : annotation list; 445 + groups : group list; 446 + } 447 + 448 + (** {1 Constructors} 449 + 450 + Convenience functions that enforce common patterns. Direct 451 + observations ([make_point], [make_polygon]) require an observer 452 + URI. Registry imports ([make_imported]) require a registry URI 453 + and accept an optional observer. Derivations ([make_derived]) 454 + require source label IDs and a method. Simulations ([make_simulated]) require a model URI and a run ID. 455 + 456 + All constructors require [~cell] (Hilbert cell, computed by 457 + the storage layer) and [~id] (unique identifier). 458 + 459 + Classification is always via [~class_dist]. For a definite 460 + class, pass [~class_dist:\[("Panthera leo", 1.0)\]]. *) 461 + 462 + (** Construct a measured point label.
463 + 464 + Example: 465 + {[ 466 + make_point 467 + ~cell:(cell_of_string "a3f2") ~id:"7b1c9d" 468 + ~x:0.1 ~y:52.2 469 + ~observer:"urn:sensor:gps:trimble-r12-0042" 470 + ~accuracy_m:0.02 ~confidence:0.99 471 + ~class_dist:[("Quercus robur", 1.0)] 472 + ~event_date:(event_date_of_string "2023-09-18T13:27:00Z") () 473 + ]} *) 474 + val make_point : 475 + cell:cell -> id:string -> 476 + x:float -> y:float -> 477 + observer:string -> 478 + ?accuracy_m:float -> 479 + ?event_date:event_date -> ?confidence:float -> 480 + ?class_dist:(string * float) list -> 481 + ?activity:string -> 482 + ?properties:(string * string) list -> 483 + unit -> label 484 + 485 + (** Construct a measured polygon label. 486 + 487 + Example: 488 + {[ 489 + make_polygon 490 + ~cell:(cell_of_string "a3f2") ~id:"e4a821" 491 + ~ring:[{x=0.0;y=52.0}; {x=0.1;y=52.0}; 492 + {x=0.1;y=52.1}; {x=0.0;y=52.1}; 493 + {x=0.0;y=52.0}] 494 + ~observer:"orcid:0000-0001-2345-6789" 495 + ~class_dist:[("cropland", 0.9); ("grassland", 0.1)] 496 + ~confidence:0.9 497 + ~event_date:(event_date_of_string "2023-09") () 498 + ]} *) 499 + val make_polygon : 500 + cell:cell -> id:string -> 501 + ring:point list -> 502 + observer:string -> 503 + ?accuracy_m:float -> 504 + ?event_date:event_date -> ?confidence:float -> 505 + ?class_dist:(string * float) list -> 506 + ?activity:string -> 507 + ?properties:(string * string) list -> 508 + unit -> label 509 + 510 + (** Construct a label imported from an external registry. 511 + 512 + The [via] URI identifies the registry record. The [observer] 513 + is optional — many registry records do not expose the original 514 + collector. The [license] is the SPDX identifier for the 515 + imported data. 
516 + 517 + Example: 518 + {[ 519 + make_imported 520 + ~cell:(cell_of_string "a3f2") ~id:"8c1d3e" 521 + ~geometry:(Point { x = 0.12; y = 52.21 }) 522 + ~via:"gbif:4023589127" 523 + ~license:"CC-BY-4.0" 524 + ~accuracy_m:100.0 525 + ~class_dist:[("Quercus robur", 1.0)] 526 + ~event_date:(event_date_of_string "2023") 527 + ~properties:[("gbif_dataset", "uk-nbn-atlas")] () 528 + ]} *) 529 + val make_imported : 530 + cell:cell -> id:string -> 531 + geometry:geometry -> 532 + via:string -> 533 + ?observer:string -> ?license:string -> 534 + ?accuracy_m:float -> 535 + ?event_date:event_date -> ?confidence:float -> 536 + ?class_dist:(string * float) list -> 537 + ?activity:string -> 538 + ?properties:(string * string) list -> 539 + unit -> label 540 + 541 + (** Construct a derived label. 542 + 543 + {b Deduplication merges} are a special case of derivation: 544 + 545 + {[ 546 + make_derived 547 + ~cell:(cell_of_string "a3f2") ~id:"merged01" 548 + ~geometry:(Point { x = 0.11; y = 52.205 }) 549 + ~sources:["7b1c9d"; "8c1d3e"] 550 + ~method_:"manual-merge" 551 + ~class_dist:[("Quercus robur", 1.0)] 552 + ~confidence:0.95 () 553 + ]} *) 554 + val make_derived : 555 + cell:cell -> id:string -> 556 + geometry:geometry -> 557 + sources:string list -> 558 + method_:string -> 559 + ?event_date:event_date -> ?confidence:float -> 560 + ?class_dist:(string * float) list -> 561 + ?activity:string -> 562 + ?properties:(string * string) list -> 563 + unit -> label 564 + 565 + (** Construct a simulated label. 566 + 567 + A label produced by a theoretical simulation (population 568 + dynamics, agent-based model, climate projection, etc.). 569 + The [model] URI identifies the simulation code; [run_id] 570 + links all labels from the same execution. 
571 + 572 + Example — Lotka-Volterra population model: 573 + {[ 574 + make_simulated 575 + ~cell:(cell_of_string "b7d") ~id:"sim-001" 576 + ~geometry:(Point { x = 35.20; y = -2.50 }) 577 + ~model:"fairground:notebook/lotka-volterra-serengeti:v4" 578 + ~run_id:"lv-run-42" 579 + ~class_dist:[("Panthera leo", 1.0)] 580 + ~confidence:0.60 581 + ~event_date:(event_date_of_string "2024-06-15") 582 + ~properties:[("scenario", "baseline"); ("seed", "42")] () 583 + ]} *) 584 + val make_simulated : 585 + cell:cell -> id:string -> 586 + geometry:geometry -> 587 + model:string -> 588 + run_id:string -> 589 + ?event_date:event_date -> ?confidence:float -> 590 + ?class_dist:(string * float) list -> 591 + ?activity:string -> 592 + ?properties:(string * string) list -> 593 + unit -> label 594 + 595 + (** {1 Accessors} *) 596 + 597 + (** The most likely class from {!field-label.class_dist}. 598 + 599 + Returns [Some class_name] for the highest-probability entry, 600 + or [None] if [class_dist] is empty (unclassified label). *) 601 + val primary_class : label -> string option 602 + 603 + (** Positional accuracy in metres, if this is a measured label. 604 + 605 + Returns [Some metres] for measured labels with a stated 606 + accuracy, [None] for derived and simulated labels. *) 607 + val accuracy_of : label -> float option 608 + 609 + (** Source label IDs, if this is a derived label. 610 + 611 + Returns the list of label IDs that were inputs to the 612 + derivation. Returns [\[\]] for measured and simulated 613 + labels. *) 614 + val sources_of : label -> string list 615 + 616 + (** Registry URI, if this label was imported via a registry. 617 + 618 + Returns [Some uri] for labels imported from GBIF, iNaturalist, 619 + OSM, etc. Returns [None] for direct observations, derived 620 + labels, and simulated labels. *) 621 + val via_of : label -> string option 622 + 623 + (** Is this label synthetic (produced by a simulation)? 
624 + 625 + Returns [true] for simulated labels, [false] for measured 626 + and derived labels. Use this to filter synthetic data out 627 + of analyses that must reflect real-world observations only 628 + (e.g. species range computation). *) 629 + val is_simulated : label -> bool 630 + 631 + (** {1 Fingerprinting} 632 + 633 + A fingerprint is a coarse key for finding deduplication 634 + candidates. It combines the Hilbert cell (spatial locality) 635 + with the primary class from {!field-label.class_dist}. 636 + 637 + Two labels with the same fingerprint are worth comparing for 638 + potential deduplication. Labels with different fingerprints 639 + lie in different cells or have different primary classes; they can still be close together if they sit on opposite sides of a cell boundary. 640 + 641 + The event date is deliberately excluded: the same real-world 642 + feature observed at different times should still match as a 643 + candidate, so a human reviewer can decide whether they are 644 + the same feature. *) 645 + 646 + (** Compute the fingerprint of a label. 647 + 648 + Returns [cell ^ "|" ^ primary_class], where [primary_class] 649 + defaults to ["_"] if [class_dist] is empty. *) 650 + val fingerprint : label -> string 651 + 652 + (** {1 Document Construction} *) 653 + 654 + (** Create an empty document in the given CRS. 655 + 656 + @param level Hilbert curve level for spatial cell computation. 657 + Defaults to [12], which for EPSG:4326 gives cells about 0.088° wide in longitude (roughly 9.8 km at the equator). 658 + See {!section:storagelayer} for the full level-to-resolution 659 + table. *) 660 + val empty_document : crs:crs -> ?level:int -> unit -> document 661 + 662 + (** {1:storagelayer Storage Layer} 663 + 664 + The data model above is independent of how labels are stored 665 + and indexed. This section specifies the contract between the 666 + core types and a storage backend. 667 + 668 + {2 Hilbert Cell Computation} 669 + 670 + The {!field-label.cell} field on each label is a hex-encoded 671 + Hilbert curve cell index.
The storage layer computes it from 672 + the label's {!centroid} at the document's 673 + {!field-document.level}. 674 + 675 + The Hilbert curve maps 2D coordinates to a 1D index that 676 + preserves spatial locality: nearby points in 2D space tend to map to 677 + nearby positions on the curve. This is the key property that 678 + makes sorted-key spatial clustering work. 679 + 680 + Level [n] divides each CRS axis into [2{^n}] cells, so the full index has [2n] bits ([n/2] hex digits). For 681 + EPSG:4326 (degrees), where the longitude axis spans 360°: 682 + 683 + {v 684 + Level Cell width (longitude) At the equator Hex chars 685 + ───── ────────────────────── ────────────── ───────── 686 + 8 ~1.4° ~157 km 4 687 + 12 ~0.088° ~9.8 km 6 688 + 16 ~0.0055° ~610 m 8 689 + 20 ~0.00034° ~38 m 10 690 + v} 691 + 692 + The storage layer must provide a function with this signature: 693 + 694 + {[ 695 + val hilbert_cell : level:int -> crs:crs -> point -> cell 696 + ]} 697 + 698 + The {!centroid} function provides the representative point for 699 + any geometry. 700 + 701 + {2 Why Hilbert, not Geohash} 702 + 703 + Geohash uses a Z-order (Morton) curve. Z-order curves have 704 + discontinuities at certain cell boundaries: two points that 705 + are close in 2D space can receive very different hash values 706 + when they fall on opposite sides of a major subdivision. 707 + 708 + The Hilbert curve reduces this: consecutive cells on the curve 709 + are {i always} spatially adjacent. This gives more uniform 710 + spatial clustering and fewer edge-case misses in proximity 711 + queries. 712 + 713 + {2 Reprojection} 714 + 715 + When a document's CRS changes, all {!field-label.cell} values 716 + must be recomputed from the (reprojected) geometries. The 717 + {!field-label.id} fields remain stable; identity is 718 + independent of coordinate system. 719 + 720 + {2 Sorted Keys} 721 + 722 + Concatenating [cell ^ "-" ^ id] (see {!label_name}) produces 723 + a key that sorts spatially.
Any system that maintains sorted 724 + order (B-tree, LSM tree, lexicographic file listing) gets 725 + spatial clustering for free: a prefix scan on a cell value 726 + retrieves all labels in that spatial neighbourhood. *)
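The storage-layer contract above asks for a `hilbert_cell` function but leaves its implementation open. The following is a minimal sketch of one way to satisfy it, not the repository's implementation. It assumes EPSG:4326 only (longitude in [-180, 180], latitude in [-90, 90]), returns a plain hex string rather than the abstract `cell` type, and zero-pads the index to `(level + 1) / 2` hex digits so that keys sort lexicographically. The names `hilbert_index`, `quantize`, `hilbert_cell_wgs84` and `sorted_key` are hypothetical and do not appear in the interface.

```ocaml
(* Sketch only: every name below is illustrative, not part of the Terradots API. *)

type point = { x : float; y : float }

(* Map grid coordinates (each in [0, 2^level - 1]) to a position along the
   Hilbert curve, using the standard iterative xy-to-d conversion. *)
let hilbert_index ~level gx gy =
  let n = 1 lsl level in
  let rec go s x y d =
    if s = 0 then d
    else
      let rx = if x land s > 0 then 1 else 0 in
      let ry = if y land s > 0 then 1 else 0 in
      let d = d + (s * s * ((3 * rx) lxor ry)) in
      (* Flip and transpose the quadrant so the sub-curve is oriented correctly. *)
      let x, y =
        if ry = 0 then
          let x, y = if rx = 1 then (n - 1 - x, n - 1 - y) else (x, y) in
          (y, x)
        else (x, y)
      in
      go (s / 2) x y d
  in
  go (n / 2) gx gy 0

(* Scale a coordinate in [lo, hi] to an integer grid index in [0, n - 1],
   clamping values that fall on or outside the upper bound. *)
let quantize ~n ~lo ~hi v =
  let t = (v -. lo) /. (hi -. lo) in
  let i = int_of_float (t *. float_of_int n) in
  max 0 (min (n - 1) i)

(* Assumption: the CRS is EPSG:4326, so the axis extents are fixed. *)
let hilbert_cell_wgs84 ~level { x; y } =
  let n = 1 lsl level in
  let gx = quantize ~n ~lo:(-180.0) ~hi:180.0 x in
  let gy = quantize ~n ~lo:(-90.0) ~hi:90.0 y in
  (* 2 * level bits need level / 2 hex digits, rounded up. *)
  let width = (level + 1) / 2 in
  Printf.sprintf "%0*x" width (hilbert_index ~level gx gy)

(* Same shape as the documented label name: cell, a dash, then the id. *)
let sorted_key ~cell ~id = cell ^ "-" ^ id

let () =
  let cell = hilbert_cell_wgs84 ~level:12 { x = 0.1; y = 52.2 } in
  print_endline (sorted_key ~cell ~id:"7b1c9d")
```

A real backend would need the axis extents of whichever CRS the document uses, which is why the documented signature takes `~crs`. Fixed-width padding matters here: without it, keys of different lengths would not sort in curve order.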
+28
terradots.opam
# This file is generated by dune, edit dune-project instead
opam-version: "2.0"
synopsis: "Geospatial label store for planetary observation data"
description: """
A data model for geospatial labels — human observations, registry
imports, simulation outputs, and derived annotations used to train
geospatial foundation models. Supports full provenance tracking,
Hilbert curve spatial indexing, and Darwin Core temporal conventions."""
license: "ISC"
depends: [
  "dune" {>= "3.16"}
  "ocaml" {>= "5.2"}
  "odoc" {with-doc}
]
build: [
  ["dune" "subst"] {dev}
  [
    "dune"
    "build"
    "-p"
    name
    "-j"
    jobs
    "@install"
    "@runtest" {with-test}
    "@doc" {with-doc}
  ]
]