A playground repository where I experiment with label schemas.
Not for any third-party usage just yet!
docs/plans/gbif-mapping.md
# GBIF to Terradots Mapping Plan

## 1. GBIF and Darwin Core: Primary Sources

- Darwin Core terms: https://dwc.tdwg.org/terms/
- Darwin Core text guide (DwC-A format): https://dwc.tdwg.org/text/
- GBIF occurrence API: https://techdocs.gbif.org/en/openapi/v1/occurrence
- GBIF species/taxonomy API: https://techdocs.gbif.org/en/openapi/v1/species
- GBIF registry API: https://techdocs.gbif.org/en/openapi/v1/registry
- GBIF download formats: https://techdocs.gbif.org/en/data-use/download-formats
- GBIF download API: https://techdocs.gbif.org/en/data-use/api-downloads
- GBIF occurrence issues and flags: https://techdocs.gbif.org/en/data-use/occurrence-issues-and-flags
- GBIF taxonomy interpretation: https://techdocs.gbif.org/en/data-processing/taxonomy-interpretation
- GBIF temporal interpretation: https://techdocs.gbif.org/en/data-processing/temporal-interpretation
- DwC-A guide (IPT manual): https://ipt.gbif.org/manual/en/ipt/latest/dwca-guide
- OccurrenceSearchParameter Java enum: https://gbif.github.io/gbif-api/apidocs/org/gbif/api/model/occurrence/search/OccurrenceSearchParameter.html
- GBIF multimedia extension: https://rs.gbif.org/extension/gbif/1.0/multimedia.xml

---

## 2. Darwin Core Standard

Darwin Core (DwC) is the vocabulary standard maintained by Biodiversity Information Standards (TDWG) that GBIF uses as its primary exchange format. It defines a set of terms for biodiversity data. Each term has a stable URI (e.g., `http://rs.tdwg.org/dwc/terms/occurrenceID`) and a simple local name used in practice (e.g., `occurrenceID`).

### 2.1 Record-level Terms

| Term | Description |
|------|-------------|
| `type` | Dublin Core type (typically `PhysicalObject` or `Event`) |
| `modified` | ISO 8601 date-time the record was last changed by the publisher |
| `language` | Language of the record (`en`, etc.) |
| `license` | URI of the license document (e.g. `http://creativecommons.org/licenses/by/4.0/legalcode`) |
| `rightsHolder` | Person or organisation owning or managing rights over the resource |
| `accessRights` | Information about access restrictions |
| `bibliographicCitation` | Citation for the occurrence record itself |
| `references` | URL to a web page or further information about this record |
| `institutionID` | URI identifying the institution that holds the object |
| `collectionID` | URI identifying the collection |
| `datasetID` | Identifier for the dataset within the publisher's system |
| `institutionCode` | Acronym or name of the institution (e.g. `NHM`, `MO`) |
| `collectionCode` | Acronym or name of the collection within the institution |
| `datasetName` | Name of the dataset from which the record was derived |
| `ownerInstitutionCode` | Institution that owns the specimen when it differs from the holding institution |
| `basisOfRecord` | Nature of the record (see §2.6) |
| `informationWithheld` | Free-text note on withheld information |
| `dataGeneralizations` | Actions taken to make the data less specific (e.g. coordinate generalisation) |
| `dynamicProperties` | Additional data as JSON or similar key-value pairs |

### 2.2 Occurrence Terms

| Term | Description |
|------|-------------|
| `occurrenceID` | Globally unique persistent identifier for the occurrence; publisher-assigned |
| `catalogNumber` | Identifier within the collection (e.g. museum accession number) |
| `recordNumber` | Field number given by the collector at time of recording |
| `recordedBy` | Names of the people who made the observation/collection; Darwin Core recommends separating multiple values with a vertical bar surrounded by spaces |
| `recordedByID` | Identifiers (e.g. ORCID URIs) for the `recordedBy` persons, separated the same way |
| `individualCount` | Number of individuals present at the time of the occurrence |
| `organismQuantity` | Quantity of organisms (e.g. `5%` cover) |
| `organismQuantityType` | Units for `organismQuantity` (e.g. `percentageCover`) |
| `sex` | Sex of the biological individual |
| `lifeStage` | Age or life stage of the organism |
| `reproductiveCondition` | Reproductive condition (e.g. `flowering`, `pregnant`) |
| `caste` | Caste (for eusocial organisms) |
| `behavior` | Behaviour exhibited by the organism at the time of occurrence |
| `vitality` | Indication of survival of the organism (`alive`, `dead`) |
| `establishmentMeans` | How the organism came to be at the location (native, introduced, etc.) |
| `degreeOfEstablishment` | Degree to which the organism has established in a given place |
| `pathway` | Process by which an organism arrived at a given place |
| `occurrenceStatus` | `present` or `absent` |
| `preparations` | List of preparations and preservation methods for specimens |
| `disposition` | Current disposition of specimen (e.g. `in collection`, `missing`) |
| `associatedMedia` | List of media associated with the occurrence |
| `associatedOccurrences` | List of identifiers of other occurrences associated with this one |
| `associatedReferences` | List of publication identifiers associated with this occurrence |
| `associatedSequences` | List of genetic sequence identifiers |
| `associatedTaxa` | List of taxa associated with the occurrence |
| `otherCatalogNumbers` | List of other catalog numbers for the same item |
| `occurrenceRemarks` | Comments or notes about the occurrence |

### 2.3 Organism Terms

| Term | Description |
|------|-------------|
| `organismID` | Identifier for the organism (persistent across occurrences of the same individual) |
| `organismName` | Text name for the organism (e.g. band number for a bird) |
| `organismScope` | Description of the kind of organism instance |

### 2.4 Event Terms

| Term | Description |
|------|-------------|
| `eventID` | Identifier for the event (a sampling event or visit) |
| `parentEventID` | Identifier of the parent event (e.g. a campaign containing this visit) |
| `eventType` | Type of the event (e.g. `Survey`, `Transect`) |
| `fieldNumber` | Identifier given to the event in the field |
| `eventDate` | ISO 8601 date or interval when the event occurred (see §2.7) |
| `eventTime` | Time of day of the event |
| `startDayOfYear` | Earliest ordinal day of year (1–366) |
| `endDayOfYear` | Latest ordinal day of year (1–366) |
| `year` | Four-digit year |
| `month` | Integer month (1–12) |
| `day` | Integer day of month |
| `verbatimEventDate` | Original verbatim text of the event date |
| `habitat` | Category or description of the habitat |
| `samplingProtocol` | Names, references, or descriptions of methods used to sample the occurrence |
| `sampleSizeValue` | Numeric value of the sample size |
| `sampleSizeUnit` | Units for the sample size value |
| `samplingEffort` | Amount of effort expended during the event |
| `fieldNotes` | Transcription of field notes or reference to their location |
| `eventRemarks` | Comments or notes about the event |

### 2.5 Location Terms

| Term | Description |
|------|-------------|
| `locationID` | Identifier for the location |
| `higherGeographyID` | Identifier for the broader geographic region |
| `higherGeography` | Combination of geographic place names (broader to more specific) |
| `continent` | Name of the continent |
| `waterBody` | Name of the water body |
| `islandGroup` | Name of the island group |
| `island` | Name of the island |
| `country` | Name of the country |
| `countryCode` | ISO 3166-1 alpha-2 country code |
| `stateProvince` | Name of the state, province, or region |
| `county` | Name of the county, shire, or department |
| `municipality` | Name of the city, town, or municipality |
| `locality` | Specific textual description of the place |
| `verbatimLocality` | Original textual description of the place |
| `verbatimElevation` | Original description of the elevation |
| `minimumElevationInMeters` | Lower limit of elevation range in metres |
| `maximumElevationInMeters` | Upper limit of elevation range in metres |
| `verbatimDepth` | Original description of the depth |
| `minimumDepthInMeters` | Minimum depth below the water surface, in metres |
| `maximumDepthInMeters` | Maximum depth below the water surface, in metres |
| `decimalLatitude` | Latitude in decimal degrees (WGS84); range −90 to +90 |
| `decimalLongitude` | Longitude in decimal degrees (WGS84); range −180 to +180 |
| `geodeticDatum` | Ellipsoid, geodetic datum, or SRS of the coordinates |
| `coordinateUncertaintyInMeters` | Horizontal radius (in metres) of the smallest circle containing the whole location |
| `coordinatePrecision` | Decimal precision of the coordinates (0.0001 = ~11 m precision) |
| `pointRadiusSpatialFit` | Ratio of the area of the supplied point-radius to the true footprint |
| `verbatimCoordinates` | Original verbatim coordinates |
| `verbatimCoordinateSystem` | Coordinate format of the verbatim coordinates (e.g. `decimal degrees`) |
| `verbatimSRS` | Spatial reference system of the verbatim coordinates |
| `footprintWKT` | Well-Known Text representation of the full footprint of the location |
| `footprintSRS` | SRS for the WKT footprint |
| `footprintSpatialFit` | Ratio of the footprint WKT area to the true footprint |
| `georeferencedBy` | Name(s) of the person(s) who georeferenced the location |
| `georeferencedDate` | Date when the location was georeferenced |
| `georeferenceProtocol` | Description of the method used to determine coordinates |
| `georeferenceSources` | Resources used in the georeference |
| `georeferenceRemarks` | Notes on the georeference |

### 2.6 basisOfRecord Values

The `basisOfRecord` term takes its value from a controlled vocabulary indicating the nature of the evidence:

| Value | Meaning |
|-------|---------|
| `HumanObservation` | Observation recorded by a human in the field |
| `MachineObservation` | Observation made by a machine (camera trap, acoustic sensor) |
| `PreservedSpecimen` | A preserved specimen in a collection (herbarium, museum) |
| `FossilSpecimen` | A fossil specimen |
| `LivingSpecimen` | A living specimen in cultivation or captivity |
| `MaterialSample` | A sample (e.g. DNA extract, soil sample) |
| `MaterialCitation` | A citation of an occurrence in the literature |
| `Occurrence` | Unclassified occurrence record |
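
The example record in §3.6 spells this value `HUMAN_OBSERVATION`, whereas the table above uses the Darwin Core camel-case form. A minimal sketch of converting one to the other, assuming every value follows the same camel-case to upper-snake pattern (the helper name `to_gbif_enum` is hypothetical and not part of any GBIF library):

```python
import re


def to_gbif_enum(value: str) -> str:
    """Convert a camel-case vocabulary value to UPPER_SNAKE_CASE."""
    # Insert "_" at every boundary between a lowercase letter/digit and a capital.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", value).upper()


for term in ("HumanObservation", "PreservedSpecimen", "Occurrence"):
    print(term, "->", to_gbif_enum(term))
```

Values that do not follow this pattern would need an explicit lookup table instead.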

### 2.7 eventDate Format

GBIF follows Darwin Core's ISO 8601-1:2019 convention for `eventDate`. Terradots already adopts this same convention for `event_date`. Supported forms:

| Format | Example |
|--------|---------|
| Year only | `2023` |
| Year-Month | `2023-09` |
| Date | `2023-09-18` |
| Date-time (UTC) | `2023-09-18T13:27:00Z` |
| Date-time with offset | `2023-09-18T13:27:00+05:30` |
| Interval | `2023-09-05/2023-09-18` |

The `eventDate` records when the observation took place, not when it was entered into a database.
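
As a rough illustration of the forms in the table, the sketch below splits an `eventDate` string into a start and an optional end. It only checks the shape of the string with a regular expression; it does not validate calendar values, and the function name `split_event_date` is a hypothetical helper rather than anything GBIF or Terradots defines.

```python
import re

# Optional time part: THH:MM[:SS[.fff]] followed by "Z" or a +HH:MM / -HH:MM offset.
_TIME = r"T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:Z|[+-]\d{2}:\d{2})?"
# Year, optionally month, optionally day, optionally a time on a full date.
_SINGLE = re.compile(rf"\d{{4}}(?:-\d{{2}}(?:-\d{{2}}(?:{_TIME})?)?)?")


def split_event_date(value: str):
    """Return (start, end); end is None when the value is not an interval."""
    parts = value.strip().split("/")
    if len(parts) > 2 or not all(_SINGLE.fullmatch(p) for p in parts):
        raise ValueError(f"unsupported eventDate: {value!r}")
    return parts[0], (parts[1] if len(parts) == 2 else None)


for example in ("2023", "2023-09-18T13:27:00+05:30", "2023-09-05/2023-09-18"):
    print(example, "->", split_event_date(example))
```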

### 2.8 Identification Terms

| Term | Description |
|------|-------------|
| `identificationID` | Identifier for the determination of the taxon |
| `verbatimIdentification` | Taxon identification as originally given by the identifier |
| `identificationQualifier` | Brief phrase to express uncertainty about the identification (e.g. `cf.`, `aff.`) |
| `identifiedBy` | Name(s) of the person(s) who assigned the taxon to the specimen/observation |
| `identifiedByID` | Identifier(s) for `identifiedBy` persons (e.g. ORCID URIs) |
| `dateIdentified` | Date the taxonomic determination was made |
| `identificationReferences` | References used in the determination |
| `identificationVerificationStatus` | Categorical assessment of the quality of the identification |
| `identificationRemarks` | Comments or notes about the identification |
| `typeStatus` | Nomenclatural type status of the specimen |

### 2.9 Taxon Terms

| Term | Description |
|------|-------------|
| `taxonID` | Identifier for the taxon concept in the publisher's system |
| `scientificNameID` | Identifier for the nomenclatural details of the name |
| `acceptedNameUsageID` | Identifier for the accepted name, when the record is a synonym |
| `parentNameUsageID` | Identifier for the direct parent in the classification |
| `taxonConceptID` | Identifier for the taxon concept (as opposed to name) |
| `scientificName` | Full scientific name including authorship |
| `acceptedNameUsage` | The accepted name of the taxon if this record is a synonym |
| `parentNameUsage` | The name of the immediate parent taxon |
| `originalNameUsage` | Protonym or basionym |
| `verbatimTaxonRank` | The taxon rank as it appeared in the original record |
| `taxonRank` | The rank of the most specific name in the scientificName (SPECIES, GENUS, etc.) |
| `kingdom` | Kingdom in the classification |
| `phylum` | Phylum (Division) in the classification |
| `class` | Class in the classification |
| `order` | Order in the classification |
| `superfamily` | Superfamily |
| `family` | Family in the classification |
| `subfamily` | Subfamily |
| `tribe` | Tribe |
| `subtribe` | Subtribe |
| `genus` | Genus in the classification |
| `genericName` | Genus portion of the scientific name |
| `subgenus` | Subgenus in the classification |
| `infragenericEpithet` | Infrageneric epithet |
| `specificEpithet` | Species epithet |
| `infraspecificEpithet` | Infraspecific epithet |
| `cultivarEpithet` | Cultivar name |
| `vernacularName` | Common name |
| `nomenclaturalCode` | Nomenclatural code governing the name (ICNafp, ICZN, etc.) |
| `taxonomicStatus` | Whether the name is accepted or a synonym |
| `nomenclaturalStatus` | Status of the name per the relevant nomenclatural code |
| `taxonRemarks` | Comments or notes about the taxon |

---

## 3. GBIF Occurrence Record Structure

GBIF interprets Darwin Core records from publishers and adds its own enrichment fields. A GBIF occurrence record (as returned by the Occurrence API or in a DwC-A download) contains the Darwin Core fields above plus the following GBIF-specific additions.

### 3.1 GBIF-added Identifiers

| Field | Description |
|-------|-------------|
| `gbifID` (or `key`) | GBIF's own integer identifier for this occurrence record; globally unique |
| `datasetKey` | UUID of the dataset in the GBIF registry |
| `publishingOrgKey` | UUID of the publishing organisation |
| `installationKey` | UUID of the IPT/installation that hosts the dataset |
| `hostingOrganizationKey` | UUID of the hosting organisation |
| `networkKeys` | Array of UUIDs of networks the dataset belongs to |
| `protocol` | Protocol used to harvest the data (`DWC_ARCHIVE`, `EML`, etc.) |

### 3.2 GBIF Taxonomy Backbone Fields

GBIF matches every occurrence to its taxonomic backbone and adds integer keys at each taxonomic rank:

| Field | Description |
|-------|-------------|
| `taxonKey` | GBIF backbone key for the matched taxon at its stated rank |
| `acceptedTaxonKey` | Key of the accepted name (equals `taxonKey` if already accepted) |
| `kingdomKey` | Backbone key for kingdom |
| `phylumKey` | Backbone key for phylum |
| `classKey` | Backbone key for class |
| `orderKey` | Backbone key for order |
| `superfamilyKey` | Backbone key for superfamily |
| `familyKey` | Backbone key for family |
| `subfamilyKey` | Backbone key for subfamily |
| `tribeKey` | Backbone key for tribe |
| `subtribeKey` | Backbone key for subtribe |
| `genusKey` | Backbone key for genus |
| `subgenusKey` | Backbone key for subgenus |
| `speciesKey` | Backbone key for the species (even if record is a subspecies) |
| `acceptedScientificName` | Scientific name of the accepted taxon on the backbone |
| `verbatimScientificName` | Scientific name as originally supplied by publisher |
| `verbatimScientificNameAuthorship` | Authorship as originally supplied |
| `taxonomicStatus` | `ACCEPTED`, `SYNONYM`, `DOUBTFUL`, etc. as determined by GBIF |
| `iucnRedListCategory` | IUCN Red List category if available (LC, NT, VU, EN, CR, EW, EX) |

### 3.3 GBIF Geospatial Fields

| Field | Description |
|-------|-------------|
| `hasCoordinate` | Boolean; whether the record has non-null decimal lat/lon |
| `hasGeospatialIssues` | Boolean; whether any geospatial issue flags are present |
| `distanceFromCentroidInMeters` | Distance from the nearest country/region centroid (centroid-mismatch indicator) |
| `repatriated` | Whether the publishing country differs from the country in which the occurrence was recorded |
| `gbifRegion` | GBIF region of the country in which the occurrence was recorded |
| `publishedByGbifRegion` | GBIF region of the publishing organisation's country |
| `level0Gid`, `level0Name` | GADM level 0 (country) identifier and name |
| `level1Gid`, `level1Name` | GADM level 1 (state/province) identifier and name |
| `level2Gid`, `level2Name` | GADM level 2 (county/district) identifier and name |
| `level3Gid`, `level3Name` | GADM level 3 (municipality) identifier and name |
| `continent` | Interpreted continent name |
| `publishingCountry` | ISO country code of the publishing organisation |

### 3.4 GBIF Processing Metadata

| Field | Description |
|-------|-------------|
| `lastCrawled` | ISO 8601 timestamp when GBIF last crawled the dataset |
| `lastParsed` | ISO 8601 timestamp when GBIF last parsed this record |
| `lastInterpreted` | ISO 8601 timestamp when GBIF last interpreted this record |
| `crawlId` | Internal crawl batch identifier |
| `isInCluster` | Whether the record has been matched to a cluster of duplicate/related records |
| `isSequenced` | Whether a DNA sequence is associated with this record |
| `isInvasive` | Whether the species is listed as invasive in any source |
| `relativeOrganismQuantity` | Normalised organism quantity across the dataset |
| `projectId` | Identifier for the project associated with the occurrence |

### 3.5 GBIF Issue Fields (Download-specific)

In DwC-A and CSV downloads, issue flags are reported in the following columns:

| Field | Description |
|-------|-------------|
| `issue` | Comma-separated list of all issue flags present on the record |
| `taxonomicIssue` | Comma-separated list of taxonomy-related issue flags only |
| `nonTaxonomicIssue` | Comma-separated list of all other issue flags |

In the JSON API, `issues` is an array of string enum values.

### 3.6 Complete Field Example (JSON API)

A real GBIF occurrence record (gbifID: 3034438331) from Xeno-canto via the Netherlands Biodiversity Data Centre:

```json
{
  "key": 3034438331,
  "datasetKey": "...",
  "occurrenceID": "https://data.biodiversitydata.nl/xeno-canto/observation/XC165501",
  "basisOfRecord": "HUMAN_OBSERVATION",
  "occurrenceStatus": "PRESENT",
  "scientificName": "Tephrodornis pondicerianus pondicerianus (Gmelin, 1789)",
  "acceptedScientificName": "Tephrodornis pondicerianus pondicerianus",
  "taxonRank": "SUBSPECIES",
  "taxonomicStatus": "ACCEPTED",
  "kingdom": "Animalia", "kingdomKey": 1,
  "class": "Aves", "classKey": 212,
  "order": "Passeriformes", "orderKey": 729,
  "family": "Tephrodornithidae",
  "genus": "Tephrodornis",
  "species": "Tephrodornis pondicerianus",
  "speciesKey": 2489935,
  "decimalLatitude": 18.3669,
  "decimalLongitude": 73.7512,
  "continent": "ASIA",
  "country": "India",
  "countryCode": "IN",
  "stateProvince": "Maharashtra",
  "eventDate": "2026-01-14",
  "recordedBy": "Sarthak Awhad",
  "behavior": "call",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "issues": ["CONTINENT_DERIVED_FROM_COORDINATES"],
  "media": [
    {
      "type": "StillImage",
      "format": "image/png",
      "identifier": "https://...",
      "title": "Spectrogram ...",
      "created": "2026-01-14",
      "creator": "Sarthak Awhad",
      "license": "http://creativecommons.org/licenses/by-nc-sa/3.0/"
    },
    {
      "type": "Sound",
      "format": "audio/mpeg",
      "identifier": "https://...",
      "title": "Bird call recording"
    }
  ],
  "lastInterpreted": "2026-03-03T11:18:18.520Z",
  "isInCluster": false,
  "isSequenced": false
}
```
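
To show how a consumer might read the `media` array of a record like the one above, here is a small sketch that collects the `identifier` URLs for one media type. The helper name `media_identifiers` is hypothetical, and the URLs in the sample are placeholders, since the real identifiers are elided in the example.

```python
def media_identifiers(record: dict, media_type: str) -> list:
    """Return the `identifier` URLs of all media items of the given type."""
    return [
        item["identifier"]
        for item in record.get("media", [])
        if item.get("type") == media_type and "identifier" in item
    ]


# Abridged record with the same shape as the example; URLs are placeholders.
record = {
    "key": 3034438331,
    "media": [
        {"type": "StillImage", "format": "image/png",
         "identifier": "https://example.org/spectrogram.png"},
        {"type": "Sound", "format": "audio/mpeg",
         "identifier": "https://example.org/call.mp3"},
    ],
}

print(media_identifiers(record, "Sound"))
```

Records without a `media` key simply yield an empty list.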

---

## 4. GBIF Taxonomic Backbone

### 4.1 Overview

The GBIF Backbone Taxonomy (also called the GBIF Taxonomic Backbone or GBIF Checklist) is a single synthetic checklist derived from approximately 100 authoritative taxonomic sources. It provides a stable reference for resolving taxonomic names used in occurrence records.

**Key URL:** `https://api.gbif.org/v1/species/{nubKey}` returns a backbone taxon record.

### 4.2 Taxon Key Fields

Every taxon in the backbone has a unique integer key (`nubKey` or `usageKey`). When GBIF matches an occurrence to the backbone, it populates keys at each taxonomic rank:

| API field | Meaning |
|-----------|---------|
| `nubKey` | The backbone key (also called `usageKey` in the species API) |
| `taxonKey` | On occurrences: the key for the matched rank; equals `nubKey` |
| `speciesKey` | Key for the species-level ancestor (set even for subspecies records) |
| `genusKey`, `familyKey`, `orderKey`, `classKey`, `phylumKey`, `kingdomKey` | Keys for the ancestor at each higher rank |
| `acceptedTaxonKey` | Key of the accepted name if the matched name is a synonym |
| `parentKey` | Key of the immediate taxonomic parent in the backbone |
| `basionymKey` | Key of the original name on which this name is based |
| `nameKey` | Key for the name itself (distinct from the usage in the taxonomy) |

### 4.3 Taxonomic Status Values

| Value | Meaning |
|-------|---------|
| `ACCEPTED` | This is the current accepted name |
| `SYNONYM` | The name is a synonym; use `acceptedTaxonKey` for the accepted name |
| `DOUBTFUL` | Uncertain status; may or may not be accepted |
| `HETEROTYPIC_SYNONYM` | Synonym based on a different type |
| `HOMOTYPIC_SYNONYM` | Synonym based on the same type (objective synonym) |
| `PROPARTE_SYNONYM` | Only a part of the concept is synonymised |
| `MISAPPLIED` | The name has been misapplied to a different taxon |

### 4.4 Taxonomy Matching

GBIF's taxonomy interpretation process:

1. Tries to match using the identifier field if present (`taxonID`, `scientificNameID`, `taxonConceptID`); this is preferred over name-matching.
2. Falls back to matching the `scientificName` string (with authorship stripped if necessary).
3. If a full species match fails, tries genus, family, and higher ranks.
4. Records the quality of the match in `issues` flags (e.g., `TAXON_MATCH_FUZZY`, `TAXON_MATCH_HIGHERRANK`).

GBIF now supports two taxonomies: the legacy GBIF Backbone and the Catalogue of Life Extended Release (COL XR), the latter integrated through ChecklistBank.
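
A consumer of occurrence records may want to know which key to trust and whether the match was clean. The sketch below prefers `acceptedTaxonKey`, falls back to `taxonKey`, and reports the two match-quality flags named in step 4. It inspects an already-interpreted record and does not reproduce GBIF's matching itself; the function name and the returned dictionary shape are assumptions made for illustration.

```python
# The two match-quality flags mentioned in the list above.
MATCH_QUALITY_FLAGS = {"TAXON_MATCH_FUZZY", "TAXON_MATCH_HIGHERRANK"}


def summarize_taxon_match(occurrence: dict) -> dict:
    """Pick the backbone key to use and list any match-quality flags."""
    flags = sorted(MATCH_QUALITY_FLAGS & set(occurrence.get("issues", [])))
    # Prefer the accepted name's key; fall back to the matched key.
    key = occurrence.get("acceptedTaxonKey") or occurrence.get("taxonKey")
    return {"key": key, "flags": flags, "clean": not flags}


print(summarize_taxon_match(
    {"taxonKey": 5231190, "issues": ["TAXON_MATCH_FUZZY"]}
))
```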

### 4.5 Example Backbone Record (Passer domesticus)

```json
{
  "nubKey": 5231190,
  "taxonID": "gbif:5231190",
  "nameKey": 8290258,
  "parentKey": 2492321,
  "kingdom": "Animalia", "kingdomKey": 1,
  "phylum": "Chordata", "phylumKey": 44,
  "class": "Aves", "classKey": 212,
  "order": "Passeriformes", "orderKey": 729,
  "family": "Passeridae", "familyKey": 5264,
  "genus": "Passer", "genusKey": 2492321,
  "species": "Passer domesticus", "speciesKey": 5231190,
  "scientificName": "Passer domesticus (Linnaeus, 1758)",
  "canonicalName": "Passer domesticus",
  "rank": "SPECIES",
  "taxonomicStatus": "ACCEPTED",
  "numDescendants": 15,
  "datasetKey": "d7dddbf4-2cf0-4f39-9b2a-bb099caae36c"
}
```
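
The paired name and key fields in a record like this can be read as a lineage. A minimal sketch, assuming the seven major ranks shown above and a hypothetical helper name `lineage`:

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage(taxon: dict) -> list:
    """Return (rank, name, key) tuples for each rank present in the record."""
    return [
        (rank, taxon[rank], taxon[f"{rank}Key"])
        for rank in RANKS
        if rank in taxon and f"{rank}Key" in taxon
    ]


# Subset of the example record above.
passer = {
    "kingdom": "Animalia", "kingdomKey": 1,
    "phylum": "Chordata", "phylumKey": 44,
    "class": "Aves", "classKey": 212,
    "order": "Passeriformes", "orderKey": 729,
    "family": "Passeridae", "familyKey": 5264,
    "genus": "Passer", "genusKey": 2492321,
    "species": "Passer domesticus", "speciesKey": 5231190,
}

for rank, name, key in lineage(passer):
    print(f"{rank:8} {name} ({key})")
```

Ranks missing from a record (for example `species` on a genus-level taxon) are skipped rather than reported as empty.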

---

## 5. GBIF Dataset Structure

### 5.1 Dataset Record Fields

A GBIF dataset is published by an organisation and registered in the GBIF Registry. Key fields:

| Field | Description |
|-------|-------------|
| `key` (UUID) | The `datasetKey` referenced on every occurrence in the dataset |
| `doi` | Dataset DOI (e.g. `10.15468/rxbp4w`); the citable identifier |
| `title` | Human-readable dataset name |
| `type` | `OCCURRENCE`, `CHECKLIST`, `SAMPLING_EVENT`, or `METADATA` |
| `subtype` | Further specialisation (e.g. `SPECIMEN`, `OBSERVATION`) |
| `publishingOrganizationKey` | UUID of the organisation that published the dataset |
| `license` | License string (e.g. `http://creativecommons.org/licenses/by/4.0/legalcode`) |
| `description` | Free-text description |
| `language` | Primary language of the metadata |
| `version` | Dataset version |
| `modified` | Date of last modification |
| `pubDate` | Publication date |
| `endpoints` | Array of URLs where the data is available (DwC-A endpoint, EML endpoint) |
| `contacts` | List of people with roles (creator, metadata author, point of contact) |
| `identifiers` | Additional identifiers (DOI, UUID, etc.) |
477477+478478+### 5.2 Publishing Organisation
479479+480480+The `publishingOrganizationKey` links to the GBIF Registry entry for the institution. Example keys:
481481+- `90fd6680-349f-11d8-aa2d-b8a03c50a862` — Missouri Botanical Garden
482482+- ROR identifiers can sometimes be found in organisation records
483483+484484+**API:** `GET https://api.gbif.org/v1/organization/{key}` returns organisation details.
485485+486486+### 5.3 Dataset Networks
487487+488488+Datasets may belong to one or more GBIF networks (e.g., eBird, Ocean Biodiversity Information System, Xeno-canto). The `networkKeys` array on occurrence records links to these.

---

## 6. GBIF API

### 6.1 Base URL

All GBIF REST API calls use: `https://api.gbif.org/v1/`

### 6.2 Occurrence Search API

**Endpoint:** `GET https://api.gbif.org/v1/occurrence/search`

**Key parameters:**

| Parameter | Description | Example |
|-----------|-------------|---------|
| `taxonKey` | Backbone taxon key; includes all descendants and synonyms | `taxonKey=5231190` |
| `scientificName` | Fuzzy name search | `scientificName=Passer+domesticus` |
| `datasetKey` | Filter to a specific dataset UUID | |
| `country` | ISO 3166-1 alpha-2 country code | `country=GB` |
| `continent` | One of AFRICA, ANTARCTICA, ASIA, EUROPE, NORTH_AMERICA, OCEANIA, SOUTH_AMERICA | |
| `geometry` | WKT geometry (POLYGON, MULTIPOLYGON, LINESTRING, POINT) for spatial filter | |
| `geoDistance` | Distance filter: `lat,lon,distance` | `geoDistance=51.5,-0.1,10km` |
| `decimalLatitude` | Latitude range | `decimalLatitude=50,60` |
| `decimalLongitude` | Longitude range | |
| `hasCoordinate` | `true` to require coordinates | `hasCoordinate=true` |
| `hasGeospatialIssue` | `false` to exclude records with coordinate problems | `hasGeospatialIssue=false` |
| `basisOfRecord` | Filter by basis of record | `basisOfRecord=HUMAN_OBSERVATION` |
| `occurrenceStatus` | `PRESENT` or `ABSENT` | |
| `mediaType` | `StillImage`, `MovingImage`, or `Sound` | |
| `year` | Year range | `year=2020,2025` |
| `month` | Month (1–12) | |
| `eventDate` | Date or range in ISO 8601 | `eventDate=2023-01-01,2023-12-31` |
| `institutionCode` | Institution code abbreviation | |
| `collectionCode` | Collection code abbreviation | |
| `catalogNumber` | Catalog number in the collection | |
| `recordedBy` | Name of the recorder | |
| `identifiedBy` | Name of the identifier | |
| `typeStatus` | Type status of the specimen | |
| `occurrenceId` | The publisher's `occurrenceID` value | |
| `gadmGid` | GADM geographic unit identifier | |
| `gadmLevel0Gid` | GADM country-level code | |
| `iucnRedListCategory` | IUCN category (LC, NT, VU, EN, CR, EW, EX) | |
| `isSequenced` | `true` for records with DNA sequences | |
| `isInCluster` | `true` for records matched to a cluster | |
| `establishmentMeans` | How the organism arrived (NATIVE, INTRODUCED, etc.) | |
| `degreeOfEstablishment` | Level of establishment | |
| `pathway` | Pathway of introduction | |
| `kingdomKey`, `phylumKey`, `classKey`, `orderKey`, `familyKey`, `genusKey` | Higher taxon filters by backbone key | |
| `acceptedTaxonKey` | Filter by accepted backbone taxon | |
| `verbatimScientificName` | Match on the verbatim name before GBIF interpretation | |
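
The parameters in the table combine into an ordinary query string. A minimal sketch using only the standard library; `search_url` is a hypothetical helper, booleans are lowercased to match the `true`/`false` examples above, and no request is actually sent:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.gbif.org/v1/occurrence/search"


def search_url(**params) -> str:
    """Build an occurrence search URL, skipping parameters set to None."""
    cleaned = {}
    for name, value in params.items():
        if value is None:
            continue
        # Python's True/False become the lowercase strings used in the table.
        cleaned[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{SEARCH_ENDPOINT}?{urlencode(cleaned, doseq=True)}"


print(search_url(taxonKey=5231190, country="GB", hasCoordinate=True))
```

`urlencode` percent-encodes the comma in range values such as `year=2020,2025` as `%2C`, which is a standard encoding of the same query value.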

**Pagination:**

| Parameter | Description | Default / Max |
|-----------|-------------|---------------|
| `offset` | Number of results to skip | 0 |
| `limit` | Maximum results per page | 20 / 300 |

The response `count` field gives the total matching records. The `endOfRecords` boolean signals the last page. Pagination is capped at `offset + limit ≤ 100,000` for the search API; use the download API for larger extracts.

**Response structure:**

```json
{
  "offset": 0,
  "limit": 20,
  "endOfRecords": false,
  "count": 1234567,
  "results": [ { /* occurrence record */ }, ... ]
}
```
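
The paging rules above can be sketched as a generator that walks pages until `endOfRecords` is true or the 100,000-record cap is reached. The `fetch_page` argument is an assumed callable that takes `(offset, limit)` and returns a parsed response in the shape shown above; in real use it would issue the HTTP request, while here a fake in-memory version stands in so the sketch runs offline.

```python
def iter_occurrences(fetch_page, limit=300, max_records=100_000):
    """Yield occurrence records page by page, respecting the search cap."""
    offset = 0
    while offset < max_records:
        # Never request past the offset + limit ceiling.
        page_limit = min(limit, max_records - offset)
        page = fetch_page(offset, page_limit)
        yield from page["results"]
        if page["endOfRecords"] or not page["results"]:
            break
        offset += page_limit


# Fake backend with 650 "records" for demonstration only.
_DATA = list(range(650))


def fake_fetch(offset, limit):
    chunk = _DATA[offset:offset + limit]
    return {
        "offset": offset,
        "limit": limit,
        "endOfRecords": offset + limit >= len(_DATA),
        "count": len(_DATA),
        "results": chunk,
    }


print(sum(1 for _ in iter_occurrences(fake_fetch)))
```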

### 6.3 Single Occurrence Lookup

`GET https://api.gbif.org/v1/occurrence/{gbifID}`: returns full occurrence JSON.

`GET https://api.gbif.org/v1/occurrence/{gbifID}/verbatim`: returns the uninterpreted Darwin Core record as received from the publisher.

`GET https://api.gbif.org/v1/occurrence/{gbifID}/fragment`: returns the raw fragment as harvested.

### 6.4 Species (Taxonomy) API

**Endpoint:** `GET https://api.gbif.org/v1/species/{nubKey}`: returns a backbone taxon record.

**Name matching:** `GET https://api.gbif.org/v1/species/match?name=Passer+domesticus`: fuzzy match a name to the backbone.

**Children:** `GET https://api.gbif.org/v1/species/{nubKey}/children`: direct children in the taxonomy.

**Vernacular names:** `GET https://api.gbif.org/v1/species/{nubKey}/vernacularNames`

**Distributions:** `GET https://api.gbif.org/v1/species/{nubKey}/distributions`

### 6.5 Dataset API

`GET https://api.gbif.org/v1/dataset/{datasetKey}`: dataset metadata.

`GET https://api.gbif.org/v1/dataset/search?q=birds`: search datasets.

`GET https://api.gbif.org/v1/organization/{orgKey}`: organisation metadata.
590590+591591+### 6.6 Download API
592592+593593+**Create a download request** (authenticated, POST):
594594+595595+```
596596+POST https://api.gbif.org/v1/occurrence/download/request
597597+Content-Type: application/json
598598+Authorization: Basic {base64(user:password)}
599599+```
600600+601601+```json
602602+{
603603+ "creator": "username",
604604+ "notificationAddresses": ["user@example.com"],
605605+ "format": "DWCA",
606606+ "predicate": {
607607+ "type": "and",
608608+ "predicates": [
609609+ { "type": "equals", "key": "TAXON_KEY", "value": "212" },
610610+ { "type": "equals", "key": "HAS_COORDINATE", "value": "true" },
611611+ { "type": "equals", "key": "HAS_GEOSPATIAL_ISSUE", "value": "false" }
612612+ ]
613613+ }
614614+}
615615+```
616616+617617+**Available formats:** `DWCA`, `SIMPLE_CSV`, `SIMPLE_PARQUET`, `SPECIES_LIST`
618618+619619+**Poll status:** `GET https://api.gbif.org/v1/occurrence/download/{downloadKey}`
620620+621621+When `status = "SUCCEEDED"`, the `downloadLink` field provides the download URL and `doi` provides the citable DOI.
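The request body shown earlier in this section can be assembled programmatically. The sketch below builds it as a plain string with `Printf`; the helper names (`equals`, `and_`, `download_request`) are hypothetical, and it assumes keys and values contain no characters that need JSON escaping. A real importer would more likely use a JSON library.

```ocaml
(* One "equals" predicate as a JSON object.
   Assumption: [key] and [value] need no JSON escaping. *)
let equals key value =
  Printf.sprintf "{ \"type\": \"equals\", \"key\": \"%s\", \"value\": \"%s\" }"
    key value

(* Combine predicates under a logical "and". *)
let and_ predicates =
  Printf.sprintf "{ \"type\": \"and\", \"predicates\": [%s] }"
    (String.concat ", " predicates)

(* Full request body for a DwC-A download of one taxon, using the
   same three predicates as the example request in this section. *)
let download_request ~creator ~email ~taxon_key =
  let predicate =
    and_
      [ equals "TAXON_KEY" (string_of_int taxon_key);
        equals "HAS_COORDINATE" "true";
        equals "HAS_GEOSPATIAL_ISSUE" "false" ]
  in
  Printf.sprintf
    "{ \"creator\": \"%s\", \"notificationAddresses\": [\"%s\"], \
     \"format\": \"DWCA\", \"predicate\": %s }"
    creator email predicate
```

Calling `download_request ~creator:"username" ~email:"user@example.com" ~taxon_key:212` produces a body equivalent to the JSON example in this section.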
622622+623623+---
624624+625625+## 7. GBIF Data Quality Flags (Issues)
626626+627627+GBIF assigns issue flags to occurrence records during interpretation. These indicate data quality problems. More than 60 flags exist, grouped into the following categories.
628628+629629+### 7.1 Geospatial Issues
630630+631631+| Flag | Severity | Meaning |
632632+|------|----------|---------|
633633+| `ZERO_COORDINATE` | Error | Coordinates are exactly 0°N 0°E — likely a null placeholder |
634634+| `COORDINATE_OUT_OF_RANGE` | Error | Latitude outside −90..90 or longitude outside −180..180 |
635635+| `COORDINATE_INVALID` | Error | Coordinate value cannot be interpreted at all |
636636+| `COORDINATE_ROUNDED` | Info | Original coordinates were rounded to 5 decimal places (~1 m precision) |
637637+| `COORDINATE_REPROJECTED` | Info | Coordinates successfully converted from a non-WGS84 datum to WGS84 |
638638+| `COORDINATE_REPROJECTION_SUSPICIOUS` | Warning | Reprojection succeeded but caused a shift of more than 0.1° |
639639+| `COORDINATE_REPROJECTION_FAILED` | Error | Cannot reproject from the stated datum to WGS84 |
640640+| `GEODETIC_DATUM_ASSUMED_WGS84` | Info | No datum supplied; WGS84 assumed |
641641+| `GEODETIC_DATUM_INVALID` | Warning | Stated datum cannot be matched to a known SRS |
642642+| `FOOTPRINT_SRS_INVALID` | Warning | SRS for footprint WKT cannot be matched |
643643+| `FOOTPRINT_WKT_MISMATCH` | Warning | Footprint WKT conflicts with given decimal coordinates |
644644+| `FOOTPRINT_WKT_INVALID` | Warning | Footprint WKT cannot be parsed |
645645+| `COORDINATE_UNCERTAINTY_METERS_INVALID` | Warning | Uncertainty value is non-numeric or implausible |
646646+| `COORDINATE_PRECISION_INVALID` | Warning | Precision value is non-numeric or unreasonably extreme |
647647+| `PRESUMED_NEGATED_LONGITUDE` | Warning | Negating longitude would resolve country mismatch |
648648+| `PRESUMED_NEGATED_LATITUDE` | Warning | Negating latitude would resolve country mismatch |
649649+| `PRESUMED_SWAPPED_COORDINATE` | Warning | Latitude and longitude appear transposed |
650650+| `COUNTRY_COORDINATE_MISMATCH` | Error | Coordinates do not fall within the stated country |
651651+| `COUNTRY_MISMATCH` | Warning | Country name and country code are contradictory |
652652+| `COUNTRY_DERIVED_FROM_COORDINATES` | Info | Country name was determined from coordinates, not supplied |
653653+| `COUNTRY_INVALID` | Warning | Country name/code does not match known vocabulary |
654654+| `CONTINENT_COORDINATE_MISMATCH` | Warning | Coordinates outside the stated continent |
655655+| `CONTINENT_DERIVED_FROM_COORDINATES` | Info | Continent determined from coordinates |
656656+| `CONTINENT_DERIVED_FROM_COUNTRY` | Info | Continent determined from country |
657657+| `CONTINENT_INVALID` | Warning | Continent does not match known vocabulary |
658658+| `CONTINENT_COUNTRY_MISMATCH` | Warning | Interpreted continent and country do not correspond |
659659+| `DEPTH_MIN_MAX_SWAPPED` | Warning | Minimum depth greater than maximum depth |
660660+| `DEPTH_NON_NUMERIC` | Warning | Depth cannot be interpreted as a number |
661661+| `DEPTH_UNLIKELY` | Warning | Depth outside plausible range (below Mariana Trench) |
662662+| `DEPTH_NOT_METRIC` | Warning | Depth appears to be in feet, not metres |
663663+| `ELEVATION_MIN_MAX_SWAPPED` | Warning | Minimum elevation greater than maximum |
664664+| `ELEVATION_NON_NUMERIC` | Warning | Elevation cannot be interpreted as a number |
665665+| `ELEVATION_UNLIKELY` | Warning | Elevation above 17,000 m or below −11,000 m |
666666+| `ELEVATION_NOT_METRIC` | Warning | Elevation appears to be in feet, not metres |
667667+668668+### 7.2 Taxonomic Issues
669669+670670+| Flag | Meaning |
671671+|------|---------|
672672+| `TAXON_MATCH_HIGHERRANK` | Only a higher-rank match was found on the backbone (e.g. genus-level match for a species record) |
673673+| `TAXON_MATCH_NONE` | No match found on the backbone |
674674+| `TAXON_MATCH_FUZZY` | Only an imprecise, non-exact match was found |
675675+| `TAXON_MATCH_AGGREGATE` | Match only possible at species aggregate/complex level |
676676+| `SCIENTIFIC_NAME_AND_ID_INCONSISTENT` | Scientific name does not match the name for the supplied identifier |
677677+| `TAXON_MATCH_NAME_AND_ID_AMBIGUOUS` | Backbone ID and name-based lookup give different results |
678678+| `SCIENTIFIC_NAME_ID_NOT_FOUND` | Supplied scientific name ID could not be found in any checklist |
679679+| `TAXON_CONCEPT_ID_NOT_FOUND` | Taxon concept ID not found |
680680+| `TAXON_ID_NOT_FOUND` | Taxon ID not found |
681681+| `TAXON_MATCH_SCIENTIFIC_NAME_ID_IGNORED` | Scientific name ID was not used in matching |
682682+| `TAXON_MATCH_TAXON_CONCEPT_ID_IGNORED` | Taxon concept ID was not used in matching |
683683+| `TAXON_MATCH_TAXON_ID_IGNORED` | Taxon ID was not used in matching |
684684+685685+### 7.3 Date Issues
686686+687687+| Flag | Meaning |
688688+|------|---------|
689689+| `RECORDED_DATE_INVALID` | Event date cannot be interpreted (invalid date, wrong format, missing parts) |
690690+| `RECORDED_DATE_MISMATCH` | Event date string contradicts individual year/month/day fields |
691691+| `RECORDED_DATE_UNLIKELY` | Date is in the future or predates 1600 |
692692+| `IDENTIFIED_DATE_UNLIKELY` | Identification date is in the future or predates 1700 |
693693+| `IDENTIFIED_DATE_INVALID` | Identification date cannot be interpreted |
694694+| `MULTIMEDIA_DATE_INVALID` | Media creation date is invalid |
695695+| `MODIFIED_DATE_INVALID` | Modified date is invalid |
696696+| `MODIFIED_DATE_UNLIKELY` | Modified date is in the future or predates Unix epoch |
697697+| `GEOREFERENCED_DATE_INVALID` | Georeference date is invalid |
698698+| `GEOREFERENCED_DATE_UNLIKELY` | Georeference date is in the future or predates 1700 |
699699+700700+### 7.4 Vocabulary Issues
701701+702702+| Flag | Meaning |
703703+|------|---------|
704704+| `BASIS_OF_RECORD_INVALID` | Basis of record value not in the controlled vocabulary |
705705+| `TYPE_STATUS_INVALID` | Type status value not in the controlled vocabulary |
706706+| `OCCURRENCE_STATUS_UNPARSABLE` | Occurrence status cannot be matched to `present` or `absent` |
707707+| `OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT` | `occurrenceStatus` was inferred from `individualCount` rather than explicitly supplied |
708708+709709+### 7.5 Collection/Institution Issues
710710+711711+| Flag | Meaning |
712712+|------|---------|
713713+| `COLLECTION_MATCH_NONE` | Institution and collection codes could not be matched to GRSciColl |
714714+| `COLLECTION_MATCH_FUZZY` | Institution and collection codes matched GRSciColl only fuzzily |
715715+| `INSTITUTION_MATCH_NONE` | Institution code alone could not be matched to GRSciColl |
716716+| `INSTITUTION_MATCH_FUZZY` | Institution code matched GRSciColl only fuzzily |
717717+718718+### 7.6 Recommended Quality Filters
719719+720720+For most ML training purposes, apply these filters in the occurrence search API:
721721+722722+```
723723+hasCoordinate=true
724724+hasGeospatialIssue=false
725725+occurrenceStatus=PRESENT
726726+```
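727727+728728+In the download predicate API, the first two filters become the predicates below; add an `OCCURRENCE_STATUS` predicate with value `PRESENT` to match the third: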
729729+730730+```json
731731+{ "type": "equals", "key": "HAS_COORDINATE", "value": "true" },
732732+{ "type": "equals", "key": "HAS_GEOSPATIAL_ISSUE", "value": "false" }
733733+```
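As a rough illustration, the three search filters can be baked into a URL builder so that every request applies them. The function name `search_url` is hypothetical; `taxonKey`, `offset` and `limit` are the search parameters described in section 6, and the values here are assumed to need no URL escaping.

```ocaml
(* The recommended quality filters from this section. *)
let quality_filters =
  [ ("hasCoordinate", "true");
    ("hasGeospatialIssue", "false");
    ("occurrenceStatus", "PRESENT") ]

(* Occurrence search URL for one backbone taxon with the quality
   filters applied. Values are assumed to need no URL escaping. *)
let search_url ?(offset = 0) ?(limit = 300) ~taxon_key () =
  let params =
    (("taxonKey", string_of_int taxon_key) :: quality_filters)
    @ [ ("offset", string_of_int offset); ("limit", string_of_int limit) ]
  in
  "https://api.gbif.org/v1/occurrence/search?"
  ^ String.concat "&" (List.map (fun (k, v) -> k ^ "=" ^ v) params)
```

Keeping the filters in one list means the search path and any later reporting can share a single definition of the quality filters.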
734734+735735+---
736736+737737+## 8. Licensing
738738+739739+### 8.1 License Options
740740+741741+GBIF occurrence datasets use one of three Creative Commons options: the CC0 public domain dedication, CC BY, or CC BY-NC. The license applies to the occurrence data in the dataset; individual multimedia assets may carry their own separate license.
742742+743743+| License | SPDX Identifier | Restrictions |
744744+|---------|----------------|--------------|
745745+| CC0 1.0 Universal | `CC0-1.0` | No restrictions; public domain dedication |
746746+| CC BY 4.0 | `CC-BY-4.0` | Attribution required |
747747+| CC BY-NC 4.0 | `CC-BY-NC-4.0` | Attribution + non-commercial use only |
748748+749749+GBIF stores the license as a full URI in occurrence records:
750750+751751+| URI | SPDX |
752752+|-----|------|
753753+| `http://creativecommons.org/publicdomain/zero/1.0/legalcode` | `CC0-1.0` |
754754+| `http://creativecommons.org/licenses/by/4.0/legalcode` | `CC-BY-4.0` |
755755+| `http://creativecommons.org/licenses/by-nc/4.0/legalcode` | `CC-BY-NC-4.0` |
756756+757757+Older datasets may use CC BY-NC-SA or CC BY-SA; these appear as `http://creativecommons.org/licenses/by-nc-sa/3.0/` etc.
758758+759759+### 8.2 License per Record
760760+761761+In a GBIF download, every occurrence row carries the dataset's license in the `license` field. The license is inherited from the dataset, not set per-occurrence.
762762+763763+### 8.3 Media Licensing
764764+765765+Individual media objects (images, sounds) carried in the multimedia extension may have a different license from the occurrence itself. For example, a CC0 dataset might include CC BY-NC-SA spectrograms produced by Xeno-canto. The `license` and `rightsHolder` fields in the multimedia extension apply specifically to each media item.
766766+767767+### 8.4 Implications for Training Data
768768+769769+- **CC0**: Can be used freely including for commercial ML training.
770770+- **CC-BY**: Can be used for commercial ML training; output model or downstream products must attribute the data source.
771771+- **CC-BY-NC**: Cannot be used for commercial ML training. Suitable for academic/research use only.
772772+773773+A mixed-license Terradots document must propagate the most restrictive license of its constituent labels to the document as a whole, or track per-label licensing.
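The "most restrictive license" rule can be sketched as a fold over an ordered variant. The type and function names below are hypothetical and cover only the three licenses in the table in 8.1; anything else would need per-label tracking instead.

```ocaml
(* The three GBIF dataset licenses, ordered from least to most
   restrictive by [rank]. *)
type license = CC0 | CC_BY | CC_BY_NC

let rank = function CC0 -> 0 | CC_BY -> 1 | CC_BY_NC -> 2

let spdx = function
  | CC0 -> "CC0-1.0"
  | CC_BY -> "CC-BY-4.0"
  | CC_BY_NC -> "CC-BY-NC-4.0"

(* License to put on a document built from labels with the given
   licenses; [None] when there are no labels. *)
let most_restrictive = function
  | [] -> None
  | first :: rest ->
    Some
      (List.fold_left
         (fun acc l -> if rank l > rank acc then l else acc)
         first rest)
```

A document mixing CC0 and CC BY-NC labels would therefore be marked `CC-BY-NC-4.0` as a whole.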
774774+775775+---
776776+777777+## 9. Multimedia Extensions
778778+779779+### 9.1 Overview
780780+781781+GBIF uses the GBIF Multimedia Extension to attach images, sounds, and videos to occurrence records. In the JSON API, media items are in the `media` array of each occurrence. In DwC-A downloads, they are in the `multimedia.txt` extension file.
782782+783783+### 9.2 Multimedia Fields
784784+785785+| Field | Description |
786786+|-------|-------------|
787787+| `type` | Media type from Dublin Core: `StillImage`, `MovingImage`, `Sound` |
788788+| `format` | MIME type (e.g. `image/jpeg`, `audio/mpeg`, `video/mp4`) |
789789+| `identifier` | Direct URL to the media file |
790790+| `references` | URL to a web page about the media item |
791791+| `title` | Title of the media item |
792792+| `description` | Description of the item (e.g. "Dorsal view of specimen") |
793793+| `source` | Original source of the item |
794794+| `audience` | Target audience |
795795+| `created` | ISO 8601 date-time when the media was created |
796796+| `creator` | Person or organisation who created the media |
797797+| `contributor` | Person who contributed the media |
798798+| `publisher` | Publisher of the media |
799799+| `license` | License URI for this specific media item |
800800+| `rightsHolder` | Person or organisation holding rights over the media |
801801+802802+### 9.3 Example Media Object
803803+804804+```json
805805+{
806806+ "type": "StillImage",
807807+ "format": "image/jpeg",
808808+ "identifier": "https://data.nhm.ac.uk/media/BMNHE_1900813.jpg",
809809+ "title": "BMNHE_1900813_465514",
810810+ "created": "2016-03-02",
811811+ "license": "http://creativecommons.org/licenses/by/4.0/legalcode",
812812+ "rightsHolder": "The Trustees of the Natural History Museum, London"
813813+}
814814+```
815815+816816+### 9.4 Media Types for Biodiversity
817817+818818+The three GBIF media types correspond to biodiversity use cases:
819819+820820+| Type | Example |
821821+|------|---------|
822822+| `StillImage` | Specimen photograph, field photo, drone image, camera trap frame |
823823+| `MovingImage` | Camera trap video |
824824+| `Sound` | Xeno-canto bird call recording, bat echolocation recording |
825825+826826+---
827827+828828+## 10. Darwin Core Archive (DwC-A) Format
829829+830830+### 10.1 Overview
831831+832832+A Darwin Core Archive is a ZIP file containing tab-separated text files and XML metadata. It implements a **star schema**: one core file (e.g., `occurrence.txt`) at the centre, with optional extension files (e.g., `multimedia.txt`) linked by a shared identifier.
833833+834834+### 10.2 ZIP Contents (GBIF Download)
835835+836836+| File | Required | Description |
837837+|------|----------|-------------|
838838+| `occurrence.txt` | Yes | Interpreted occurrence records (after GBIF processing) |
839839+| `verbatim.txt` | No | Original unmodified records as received from publishers |
840840+| `multimedia.txt` | No | Media extension — images, audio, video |
841841+| `meta.xml` | Yes | Archive descriptor: file formats, column names, term URIs |
842842+| `metadata.xml` | Yes | Dataset-level metadata in Ecological Metadata Language (EML) |
843843+| `rights.txt` | No | License and rights information |
844844+| `citations.txt` | No | Citation text for each dataset included |
845845+| `dataset/*.xml` | No | Individual EML files for each contributing dataset |
846846+847847+### 10.3 meta.xml Structure
848848+849849+The `meta.xml` file describes the files in the archive using XML. Simplified structure:
850850+851851+```xml
852852+<archive xmlns="http://rs.tdwg.org/dwc/text/"
853853+ metadata="metadata.xml">
854854+ <core encoding="UTF-8" fieldsTerminatedBy="\t"
855855+ linesTerminatedBy="\n" fieldsEnclosedBy=""
856856+ ignoreHeaderLines="1"
857857+ rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
858858+ <files>
859859+ <location>occurrence.txt</location>
860860+ </files>
861861+ <id index="0"/> <!-- gbifID is the core record identifier -->
862862+ <field index="1" term="http://rs.tdwg.org/dwc/terms/datasetKey"/>
863863+ <field index="2" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
864864+ <field index="3" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
865865+ <!-- ... one <field> per column ... -->
866866+ </core>
867867+ <extension encoding="UTF-8" fieldsTerminatedBy="\t"
868868+ linesTerminatedBy="\n" fieldsEnclosedBy=""
869869+ ignoreHeaderLines="1"
870870+ rowType="http://rs.gbif.org/terms/1.0/Multimedia">
871871+ <files>
872872+ <location>multimedia.txt</location>
873873+ </files>
874874+ <coreid index="0"/> <!-- links back to occurrence.txt gbifID -->
875875+ <field index="1" term="http://purl.org/dc/terms/type"/>
876876+ <field index="2" term="http://purl.org/dc/terms/format"/>
877877+ <field index="3" term="http://purl.org/dc/terms/identifier"/>
878878+ <!-- ... -->
879879+ </extension>
880880+</archive>
881881+```
882882+883883+### 10.4 occurrence.txt Simple CSV Columns
884884+885885+The Simple CSV download contains these columns in order; the DwC-A `occurrence.txt` core is a superset of them:
886886+887887+`gbifID`, `datasetKey`, `occurrenceID`, `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`, `infraspecificEpithet`, `taxonRank`, `scientificName`, `verbatimScientificName`, `verbatimScientificNameAuthorship`, `countryCode`, `locality`, `stateProvince`, `occurrenceStatus`, `individualCount`, `publishingOrgKey`, `decimalLatitude`, `decimalLongitude`, `coordinateUncertaintyInMeters`, `coordinatePrecision`, `elevation`, `elevationAccuracy`, `depth`, `depthAccuracy`, `eventDate`, `day`, `month`, `year`, `taxonKey`, `speciesKey`, `basisOfRecord`, `institutionCode`, `collectionCode`, `catalogNumber`, `recordNumber`, `identifiedBy`, `dateIdentified`, `license`, `rightsHolder`, `recordedBy`, `typeStatus`, `establishmentMeans`, `lastInterpreted`, `mediaType`, `issue`
888888+889889+The full DwC-A `occurrence.txt` contains all the above plus approximately 150 additional columns including verbatim fields, georeference fields, geological fields, and all GBIF-added backbone keys.
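Because both file variants are tab-separated with a header line, a row can be read by pairing header names with values. This is a minimal sketch with hypothetical helper names (`parse_row`, `field`); it assumes fields are not quoted, matching the empty `fieldsEnclosedBy` in the `meta.xml` example above.

```ocaml
(* Pair each header name with the value in the same column.
   Returns [None] if the row has a different number of columns. *)
let parse_row ~header line =
  let names = String.split_on_char '\t' header in
  let values = String.split_on_char '\t' line in
  if List.length names <> List.length values then None
  else Some (List.combine names values)

(* Look up a column, treating an empty cell as absent. *)
let field name row =
  match List.assoc_opt name row with
  | Some "" | None -> None
  | Some v -> Some v
```

Looking columns up by name rather than by position keeps the same code usable for the 50-column Simple CSV and the wider DwC-A core file.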
890890+891891+### 10.5 multimedia.txt Columns
892892+893893+`gbifID`, `type`, `format`, `identifier`, `references`, `title`, `description`, `source`, `audience`, `created`, `creator`, `contributor`, `publisher`, `license`, `rightsHolder`
894894+895895+The `gbifID` in multimedia.txt is a foreign key referencing `gbifID` in occurrence.txt.
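The foreign-key relationship amounts to grouping media rows by `gbifID`. The sketch below uses a `Hashtbl` keyed on the identifier as a string; `group_media` and `media_for` are hypothetical names, and each row is reduced to a `(gbifID, identifier)` pair for brevity.

```ocaml
(* Index media URLs by the gbifID of the occurrence they belong to.
   [rows] is a list of (gbifID, identifier) pairs from multimedia.txt. *)
let group_media rows =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun (gbif_id, url) ->
      let existing =
        Option.value (Hashtbl.find_opt tbl gbif_id) ~default:[]
      in
      Hashtbl.replace tbl gbif_id (url :: existing))
    rows;
  tbl

(* Media URLs for one occurrence, in file order; empty if none. *)
let media_for tbl gbif_id =
  List.rev (Option.value (Hashtbl.find_opt tbl gbif_id) ~default:[])
```

Keeping `gbifID` as a string avoids any concern about integer width when the same code is compiled for 32-bit or JavaScript targets.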
896896+897897+### 10.6 Downloading and Citing DwC-A
898898+899899+GBIF generates a DOI for each download (e.g., `10.15468/dl.xxxxx`). This DOI should be cited in any publication using the data. The `citations.txt` file in the archive contains the suggested citation text for each dataset.
900900+901901+---
902902+903903+## 11. Terradots Mapping Plan
904904+905905+This section maps every GBIF/Darwin Core concept to the Terradots type system defined in `terradots.mli`.
906906+907907+### 11.1 Origin Type
908908+909909+Every GBIF occurrence is a **direct empirical observation or specimen** — not computed or simulated. It therefore maps to `Measured`.
910910+911911+```ocaml
912912+Measured {
913913+ observer : string option; (* recordedBy / recordedByID *)
914914+ via : string option; (* "gbif:{gbifID}" *)
915915+ license : string option; (* SPDX from license URI *)
916916+ accuracy_m : float option; (* coordinateUncertaintyInMeters *)
917917+}
918918+```
919919+920920+### 11.2 Geometry
921921+922922+| GBIF field | Terradots | Notes |
923923+|------------|-----------|-------|
924924+| `decimalLongitude` | `Point { x = lon; ... }` | x is longitude in EPSG:4326 |
925925+| `decimalLatitude` | `Point { ...; y = lat }` | y is latitude in EPSG:4326 |
926926+| `footprintWKT` | `Polygon` or `Multi` | If present, use footprint geometry instead of point |
927927+928928+GBIF coordinates are always WGS84 (EPSG:4326) after interpretation. Set `document.crs = "EPSG:4326"`.
929929+930930+Most GBIF occurrence records are point observations. The `footprintWKT` field, when populated, describes the full spatial extent and should be preferred when it is valid (not flagged with `FOOTPRINT_WKT_INVALID` or `FOOTPRINT_WKT_MISMATCH`).
931931+932932+```ocaml
933933+(* Typical point occurrence *)
934934+let geom = Point { x = decimalLongitude; y = decimalLatitude }
935935+936936+(* Occurrence with valid footprint *)
937937+let geom = parse_wkt footprintWKT (* → Polygon or Multi *)
938938+```
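The preference rule for footprints can be written as a single decision function. This sketch only chooses which source to use and leaves WKT parsing to `parse_wkt`; the `geometry_source` type and `choose_geometry` name are hypothetical and not part of `terradots.mli`.

```ocaml
(* Which input the importer should build the geometry from. *)
type geometry_source =
  | Footprint of string          (* WKT text to hand to the parser *)
  | Point_coords of float * float (* (longitude, latitude) *)

(* Issue flags that make the footprint untrustworthy. *)
let footprint_blockers = [ "FOOTPRINT_WKT_INVALID"; "FOOTPRINT_WKT_MISMATCH" ]

(* Prefer a non-empty footprint unless GBIF flagged it. *)
let choose_geometry ~footprint_wkt ~issues ~lon ~lat =
  let footprint_ok =
    not (List.exists (fun flag -> List.mem flag issues) footprint_blockers)
  in
  match footprint_wkt with
  | Some wkt when wkt <> "" && footprint_ok -> Footprint wkt
  | _ -> Point_coords (lon, lat)
```

Records without a footprint, or with a flagged one, fall back to the point coordinates.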
939939+940940+### 11.3 Identity: label.id and origin.via
941941+942942+| GBIF field | Terradots field | Value |
943943+|------------|----------------|-------|
944944+| `gbifID` | `origin.via` | `"gbif:{gbifID}"` (matches the URI scheme in the mli doc) |
945945+| `occurrenceID` | `properties` | `("dwc:occurrenceID", value)` — publisher's ID; not stable as a Terradots id |
946946+| — | `label.id` | Generate UUID at import time for stable local identity |
947947+948948+The `via` URI format `gbif:4023589127` is already listed as a recognised scheme in the Terradots `.mli` documentation comment:
949949+950950+```
951951+gbif:4023589127 GBIF occurrence
952952+```
953953+954954+The `label.id` should be a UUID generated at import time — do not use `gbifID` directly as the Terradots id, because the gbifID is an integer and UUIDs are recommended for the id field.
955955+956956+Alternatively, use `"gbif-" ^ string_of_int gbifID` as the Terradots id if a stable id across re-imports matters more than following the UUID recommendation.
957957+958958+### 11.4 Scientific Name → class_dist
959959+960960+| GBIF field | Terradots field | Value |
961961+|------------|----------------|-------|
962962+| `species` | `class_dist` | `[("Passer domesticus", 1.0)]` — canonical binomial at species level |
963963+| `acceptedScientificName` | `class_dist` | Use when `species` is empty (subspecies/genus-level records) |
964964+| `scientificName` | `class_dist` | Fallback when above are empty |
965965+966966+**Recommendation:** Prefer the species-level name from the `species` column/field for the class, so that subspecies records and species records can be compared and deduplicated. If the match is only at genus or higher (`TAXON_MATCH_HIGHERRANK` flag), use the accepted scientific name at the matched rank.
967967+968968+```ocaml
969969+let class_name =
970970+ if species <> "" then species (* "Passer domesticus" *)
971971+ else if acceptedScientificName <> "" then acceptedScientificName
972972+ else scientificName
973973+in
974974+let class_dist = [(class_name, 1.0)] (* definite classification *)
975975+```
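976976+977977+Each GBIF record carries a single asserted identification rather than a probability distribution, so the class is recorded with probability 1.0. This reflects the publisher's assertion; it is not an independent measure of identification accuracy.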
978978+979979+### 11.5 eventDate → event_date
980980+981981+Both Darwin Core `eventDate` and Terradots `event_date` use the same ISO 8601 convention. Pass through verbatim.
982982+983983+| GBIF field | Terradots field | Notes |
984984+|------------|----------------|-------|
985985+| `eventDate` | `event_date` | Direct passthrough; supports `2023`, `2023-09`, `2023-09-18`, intervals |
986986+| `year`/`month`/`day` | (reconstruct if `eventDate` is absent) | Combine as `"{year}-{month:02}-{day:02}"` |
987987+988988+```ocaml
989989+let event_date =
990990+ if eventDate <> "" then Some (event_date_of_string eventDate)
991991+ else match year, month, day with
992992+ | Some y, Some m, Some d ->
993993+ Some (event_date_of_string (Printf.sprintf "%04d-%02d-%02d" y m d))
994994+ | Some y, Some m, None ->
995995+ Some (event_date_of_string (Printf.sprintf "%04d-%02d" y m))
996996+ | Some y, None, None ->
997997+ Some (event_date_of_string (string_of_int y))
998998+ | _ -> None
999999+```
10001000+10011001+### 11.6 coordinateUncertaintyInMeters → accuracy_m
10021002+10031003+| GBIF field | Terradots field | Notes |
10041004+|------------|----------------|-------|
10051005+| `coordinateUncertaintyInMeters` | `origin.accuracy_m` | Direct; metres |
10061006+10071007+When this field is absent (many older records lack it), use `None`. Do not default to zero — zero would be misleading. The `COORDINATE_UNCERTAINTY_METERS_INVALID` issue flag indicates the value was unparseable.
10081008+10091009+```ocaml
10101010+let accuracy_m =
10111011+ match coordinateUncertaintyInMeters with
10121012+ | Some v when v > 0.0 -> Some v
10131013+ | _ -> None
10141014+```
10151015+10161016+### 11.7 basisOfRecord → properties / origin distinction
10171017+10181018+`basisOfRecord` does not change the Terradots `origin` type — all GBIF records are `Measured`. However it provides important context about the nature of the measurement.
10191019+10201020+| basisOfRecord | Recommended handling |
10211021+|---------------|---------------------|
10221022+| `HumanObservation` | `observer` = `recordedByID` or free-text `recordedBy`; accuracy from GPS uncertainty |
10231023+| `MachineObservation` | `observer` = instrument URI or `institutionCode`; camera trap, acoustic sensor |
10241024+| `PreservedSpecimen` | accuracy_m typically large (georeferenced from label/locality); use `coordinateUncertaintyInMeters` |
10251025+| `FossilSpecimen` | Same as PreservedSpecimen; also note in `properties` |
10261026+| `LivingSpecimen` | Living plant/animal in a garden or zoo |
10271027+| `MaterialSample` | A physical sample (e.g. DNA extract, water sample) |
10281028+| `MaterialCitation` | A citation of an occurrence in published literature |
10291029+10301030+Store in properties for downstream inspection:
10311031+10321032+```ocaml
10331033+("gbif:basisOfRecord", basisOfRecord)
10341034+```
10351035+10361036+### 11.8 occurrenceID / gbifID → id and via URI
10371037+10381038+| GBIF field | Terradots field | Value |
10391039+|------------|----------------|-------|
10401040+| `gbifID` | `origin.via` | `"gbif:" ^ string_of_int gbifID` |
10411041+| `occurrenceID` | `properties` | `("dwc:occurrenceID", occurrenceID)` |
10421042+| (generated) | `label.id` | UUID or `"gbif-" ^ string_of_int gbifID` |
10431043+10441044+Prefer `origin.via` for the GBIF URI because it signals "imported from external registry" and is how the Terradots model is designed for registry imports. The `occurrenceID` (publisher's identifier) is stored as a property for cross-referencing.
10451045+10461046+### 11.9 recordedBy → origin.observer
10471047+10481048+| GBIF field | Terradots field | Value |
10491049+|------------|----------------|-------|
10501050+| `recordedByID` (ORCID URIs) | `origin.observer` | Use the first ORCID URI: `"orcid:0000-0001-2345-6789"` |
10511051+| `recordedBy` (free text) | `origin.observer` | Use as fallback: `"gbif:recorder/" ^ url_encode(recordedBy)` |
10521052+| `identifiedBy` | `properties` | `("dwc:identifiedBy", identifiedBy)` |
10531053+| `recordedBy` | `properties` | `("dwc:recordedBy", recordedBy)` — always store the human-readable name |
10541054+10551055+If `recordedByID` contains multiple ORCID URIs (semicolon-separated), use the first as `observer` and store the full list in `properties`.
10561056+10571057+For institutionally-collected specimens where `recordedBy` is absent, use the institution as the observer: `"https://ror.org/{ror_id}"` if a ROR ID is available, or `"gbif:institution/" ^ institutionCode`.
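The fallback order described above (ORCID, then free-text recorder, then institution) might look like the following. All helper names are hypothetical; the `orcid:` shortening and the `+` encoding for spaces follow the examples in this plan, and `recordedByID` is assumed to be a semicolon-separated string as stated above.

```ocaml
(* Drop [prefix] from the front of [s], if present. *)
let strip_prefix ~prefix s =
  let n = String.length prefix in
  if String.length s >= n && String.sub s 0 n = prefix
  then Some (String.sub s n (String.length s - n))
  else None

(* Minimal form-style encoding: space becomes '+', unreserved
   characters are kept, everything else is percent-escaped. *)
let url_encode s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (function
      | ' ' -> Buffer.add_char buf '+'
      | ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '-' | '_' | '.' | '~') as c ->
        Buffer.add_char buf c
      | c -> Buffer.add_string buf (Printf.sprintf "%%%02X" (Char.code c)))
    s;
  Buffer.contents buf

(* First non-empty entry of a semicolon-separated list. *)
let first_id s =
  String.split_on_char ';' s
  |> List.map String.trim
  |> List.find_opt (fun x -> x <> "")

let observer ~recorded_by_id ~recorded_by ~institution_code =
  match Option.bind recorded_by_id first_id with
  | Some id ->
    (match strip_prefix ~prefix:"https://orcid.org/" id with
     | Some orcid -> Some ("orcid:" ^ orcid)
     | None -> Some id)
  | None ->
    (match recorded_by, institution_code with
     | Some name, _ when String.trim name <> "" ->
       Some ("gbif:recorder/" ^ url_encode name)
     | _, Some code when code <> "" -> Some ("gbif:institution/" ^ code)
     | _ -> None)
```

The ROR lookup is left out because it needs an external mapping from institution to ROR ID.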
10581058+10591059+### 11.10 license → origin.license
10601060+10611061+Convert the GBIF license URI to an SPDX identifier:
10621062+10631063+```ocaml
10641064+let spdx_of_gbif_license = function (* String.starts_with needs OCaml >= 4.13 *)
10651065+ | s when String.starts_with ~prefix:"http://creativecommons.org/publicdomain/zero/" s -> "CC0-1.0"
10661066+ | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by/4.0/" s -> "CC-BY-4.0"
10671067+ | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by-nc/4.0/" s -> "CC-BY-NC-4.0"
10681068+ | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by-sa/4.0/" s -> "CC-BY-SA-4.0"
10691069+ | s when String.starts_with ~prefix:"http://creativecommons.org/licenses/by-nc-sa/4.0/" s -> "CC-BY-NC-SA-4.0"
10701070+ | s -> s (* store the raw URI if not recognised, including pre-4.0 license versions *)
10711071+```
10721072+10731073+Note: the license comes from the **dataset**, not the occurrence. The record-level `license` field on the occurrence is the dataset license propagated down.
10741074+10751075+### 11.11 Dataset → activity
10761076+10771077+A GBIF dataset maps to a Terradots `activity` representing the import of that dataset.
10781078+10791079+| GBIF dataset field | Terradots `activity` field | Value |
10801080+|--------------------|---------------------------|-------|
10811081+| `datasetKey` | `activity_id` | `"gbif:dataset:" ^ datasetKey` |
10821082+| Organisation/person from dataset | `agent` | `"https://ror.org/{ror_id}"` if a ROR ID is known, otherwise `"gbif:org/" ^ publishingOrgKey` |
10831083+| Download date or `pubDate` | `date` | ISO 8601 date of the GBIF download or dataset publication |
10841084+| Dataset `title` | `description` | Human-readable dataset name |
10851085+10861086+All occurrence labels from the same dataset share the same `activity_id`.
10871087+10881088+```ocaml
10891089+let act = {
10901090+ activity_id = "gbif:dataset:e053ff53-c156-4e2e-b9b5-4462e9625424";
10911091+ agent = "gbif:org/90fd6680-349f-11d8-aa2d-b8a03c50a862";
10921092+ date = "2026-03-04"; (* date of import into Terradots *)
10931093+ description = Some "Tropicos Specimens Non-MO (Missouri Botanical Garden)";
10941094+}
10951095+```
10961096+10971097+Alternatively, when doing a bulk GBIF download across many datasets, create a single activity for the whole download:
10981098+10991099+```ocaml
11001100+let act = {
11011101+ activity_id = "gbif:download:10.15468/dl.xxxxx"; (* the download DOI *)
11021102+ agent = "gbif:download";
11031103+ date = download_date;
11041104+ description = Some "GBIF occurrence download, Aves, 2020-2025";
11051105+}
11061106+```
11071107+11081108+### 11.12 Issues/Flags → annotations and properties
11091109+11101110+GBIF issue flags are stored in `label.properties` for programmatic filtering, and may optionally generate Terradots `annotation` records.
11111111+11121112+**In properties (always):**
11131113+11141114+```ocaml
11151115+("gbif:issues", String.concat "," issues_list)
11161116+```
11171117+11181118+**As annotations (optional, for important flags):**
11191119+11201120+```ocaml
11211121+(* Example: coordinate mismatch warrants a human-readable annotation *)
11221122+let country_mismatch_annotation =
11231123+ if List.mem "COUNTRY_COORDINATE_MISMATCH" issues then Some {
11241124+ id = uuid ();
11251125+ text = "GBIF flag COUNTRY_COORDINATE_MISMATCH: coordinates do not fall within the stated country";
11261126+ anchors = [label.id];
11271127+ } else None
11281128+```
11291129+11301130+**Recommended issue filters at import time:** Skip records where `hasGeospatialIssue = true`, which GBIF sets for flags such as `ZERO_COORDINATE`, `COORDINATE_OUT_OF_RANGE`, `COORDINATE_INVALID`, `COUNTRY_COORDINATE_MISMATCH`, and `COORDINATE_REPROJECTION_FAILED`, unless the use case can tolerate records with unreliable coordinates.
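When importing from a file that lacks the `hasGeospatialIssue` filter, the same check can be approximated from the issue list itself. The sketch below uses the five flags named in the recommendation above as the blocking set, which is an assumption about what the importer wants to reject, not GBIF's authoritative definition of `hasGeospatialIssue`. `issues_of_property` and `should_skip` are hypothetical names.

```ocaml
(* Flags treated as disqualifying at import time (assumed set). *)
let blocking_issues =
  [ "ZERO_COORDINATE"; "COORDINATE_OUT_OF_RANGE"; "COORDINATE_INVALID";
    "COUNTRY_COORDINATE_MISMATCH"; "COORDINATE_REPROJECTION_FAILED" ]

(* Split the comma-joined "gbif:issues" property back into flags. *)
let issues_of_property s =
  String.split_on_char ',' s
  |> List.map String.trim
  |> List.filter (fun flag -> flag <> "")

(* True if any blocking flag is present on the record. *)
let should_skip issues =
  List.exists (fun flag -> List.mem flag blocking_issues) issues
```

Informational flags such as `CONTINENT_DERIVED_FROM_COORDINATES` do not trigger a skip under this rule.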
11311131+11321132+### 11.13 Taxonomic Hierarchy → properties
11331133+11341134+The full taxonomic hierarchy should be stored in `properties` for downstream filtering and analysis:
11351135+11361136+```ocaml
11371137+let taxonomy_properties = List.filter_map (fun (k, v) ->
11381138+ if v = "" then None else Some (k, v))
11391139+[
11401140+ ("dwc:kingdom", kingdom);
11411141+ ("dwc:phylum", phylum);
11421142+ ("dwc:class", class_);
11431143+ ("dwc:order", order);
11441144+ ("dwc:family", family);
11451145+ ("dwc:genus", genus);
11461146+ ("dwc:species", species);
11471147+ ("dwc:taxonRank", taxonRank);
11481148+ ("gbif:taxonKey", string_of_int taxonKey);
11491149+ ("gbif:speciesKey", string_of_int speciesKey);
11501150+ ("gbif:kingdomKey", string_of_int kingdomKey);
11511151+ ("gbif:familyKey", string_of_int familyKey);
11521152+ ("gbif:genusKey", string_of_int genusKey);
11531153+ ("dwc:taxonomicStatus", taxonomicStatus);
11541154+ ("dwc:vernacularName", vernacularName);
11551155+ ("gbif:iucnCategory", iucnRedListCategory);
11561156+]
11571157+```
11581158+11591159+### 11.14 Complete Import Example
11601160+11611161+**GBIF occurrence (Common Woodshrike, Xeno-canto):**

```json
{
  "gbifID": 3034438331,
  "occurrenceID": "https://data.biodiversitydata.nl/xeno-canto/observation/XC165501",
  "basisOfRecord": "HUMAN_OBSERVATION",
  "species": "Tephrodornis pondicerianus",
  "scientificName": "Tephrodornis pondicerianus pondicerianus (Gmelin, 1789)",
  "decimalLatitude": 18.3669,
  "decimalLongitude": 73.7512,
  "coordinateUncertaintyInMeters": null,
  "eventDate": "2026-01-14",
  "recordedBy": "Sarthak Awhad",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "issues": ["CONTINENT_DERIVED_FROM_COORDINATES"],
  "kingdom": "Animalia", "class": "Aves",
  "order": "Passeriformes", "family": "Tephrodornithidae",
  "genus": "Tephrodornis", "taxonRank": "SUBSPECIES",
  "behavior": "call",
  "datasetKey": "..."
}
```
11821182+11831183+**Terradots output:**

```ocaml
let label =
  make_imported
    ~cell:(hilbert_cell ~level:12 ~crs:"EPSG:4326" {x=73.7512; y=18.3669})
    ~id:"<uuid>"
    ~geometry:(Point {x=73.7512; y=18.3669})
    ~via:"gbif:3034438331"
    ~observer:"gbif:recorder/Sarthak+Awhad" (* no ORCID available *)
    ~license:"CC-BY-NC-4.0"
    (* accuracy_m omitted: coordinateUncertaintyInMeters absent *)
    ~event_date:(event_date_of_string "2026-01-14")
    ~class_dist:[("Tephrodornis pondicerianus", 1.0)]
    ~confidence:1.0
    ~activity:"gbif:dataset:..."
    ~properties:[
      ("dwc:occurrenceID", "https://data.biodiversitydata.nl/xeno-canto/observation/XC165501");
      ("gbif:basisOfRecord", "HUMAN_OBSERVATION");
      ("dwc:kingdom", "Animalia");
      ("dwc:class", "Aves");
      ("dwc:order", "Passeriformes");
      ("dwc:family", "Tephrodornithidae");
      ("dwc:genus", "Tephrodornis");
      ("dwc:species", "Tephrodornis pondicerianus");
      ("dwc:taxonRank", "SUBSPECIES");
      ("dwc:recordedBy", "Sarthak Awhad");
      ("dwc:behavior", "call");
      ("gbif:issues", "CONTINENT_DERIVED_FROM_COORDINATES");
      ("gbif:datasetKey", "...");
    ]
    ()
```
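
The `~license` value in the example comes from converting the GBIF license URI to an SPDX identifier. A minimal lookup sketch, assuming only the three license URIs that GBIF assigns to occurrence datasets need handling; the function name `spdx_of_gbif_license` is hypothetical:

```ocaml
(* Map a GBIF license URI to its SPDX identifier.
   Returns None for anything unrecognised so the importer can decide
   whether to skip the record or keep the raw URI in properties. *)
let spdx_of_gbif_license (uri : string) : string option =
  match uri with
  | "http://creativecommons.org/publicdomain/zero/1.0/legalcode" -> Some "CC0-1.0"
  | "http://creativecommons.org/licenses/by/4.0/legalcode" -> Some "CC-BY-4.0"
  | "http://creativecommons.org/licenses/by-nc/4.0/legalcode" -> Some "CC-BY-NC-4.0"
  | _ -> None
```

Applied to the `license` value in the JSON above, this yields `Some "CC-BY-NC-4.0"`.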
12151215+12161216+---
12171217+12181218+## 12. Mapping Summary Table
12191219+12201220+| GBIF / Darwin Core Concept | Terradots Mapping | Status |
12211221+|---|---|---|
12221222+| GBIF occurrence | `label` with `Measured` origin | Direct |
12231223+| `decimalLongitude` | `Point { x = lon; ... }` in EPSG:4326 | Direct |
12241224+| `decimalLatitude` | `Point { ...; y = lat }` in EPSG:4326 | Direct |
12251225+| `footprintWKT` | `Polygon` or `Multi` geometry | Direct (if valid) |
12261226+| `coordinateUncertaintyInMeters` | `origin.accuracy_m` | Direct |
12271227+| `species` / `acceptedScientificName` | `class_dist = [(name, 1.0)]` | Direct |
12281228+| `eventDate` | `event_date` | Direct (same ISO 8601 convention) |
12291229+| `year`/`month`/`day` | `event_date` (reconstructed) | Direct |
12301230+| `basisOfRecord` | `properties[("gbif:basisOfRecord", ...)]` | Property |
12311231+| `gbifID` | `origin.via = "gbif:{gbifID}"` | Direct (URI scheme documented) |
12321232+| `occurrenceID` | `properties[("dwc:occurrenceID", ...)]` | Property |
12331233+| `label.id` | UUID generated at import | UUID |
12341234+| `license` URI | `origin.license` (as SPDX string) | Direct (with URI→SPDX conversion) |
12351235+| Dataset (`datasetKey`) | `activity` with `activity_id = "gbif:dataset:{key}"` | Direct |
12361236+| GBIF download DOI | `activity` with `activity_id = "gbif:download:{doi}"` | Direct |
12371237+| `recordedByID` (ORCID) | `origin.observer = "orcid:{orcid}"` | Direct |
12381238+| `recordedBy` (free text) | `origin.observer` (with `gbif:recorder/` prefix) | Workaround |
12391239+| `institutionCode` | `origin.observer` fallback or `properties` | Partial |
12401240+| `kingdom`/`phylum`/`class`/`order`/`family`/`genus` | `properties[("dwc:{rank}", name)]` | Property |
12411241+| `taxonKey`/`speciesKey`/etc. backbone keys | `properties[("gbif:{rank}Key", key)]` | Property |
12421242+| `taxonomicStatus` | `properties[("dwc:taxonomicStatus", ...)]` | Property |
12431243+| `vernacularName` | `properties[("dwc:vernacularName", ...)]` | Property |
12441244+| `iucnRedListCategory` | `properties[("gbif:iucnCategory", ...)]` | Property |
12451245+| Issue flags | `properties[("gbif:issues", comma_list)]` | Property |
12461246+| Issue flags (serious) | `annotation` records | Optional |
12471247+| `media` array items | No direct mapping | Gap (§13.1) |
12481248+| `occurrenceStatus = ABSENT` | No direct mapping | Gap (§13.2) |
12491249+| `individualCount` | `properties[("dwc:individualCount", ...)]` | Property |
12501250+| `sex` / `lifeStage` / `behavior` | `properties[("dwc:{field}", ...)]` | Property |
12511251+| `locality` / `stateProvince` / `country` | `properties[("dwc:{field}", ...)]` | Property |
12521252+| `samplingProtocol` | `properties[("dwc:samplingProtocol", ...)]` | Property |
12531253+| `identifiedBy` / `dateIdentified` | `properties[("dwc:{field}", ...)]` | Property |
12541254+| `catalogNumber` / `collectionCode` | `properties[("dwc:{field}", ...)]` | Property |
12551255+| `isInCluster` | `properties[("gbif:isInCluster", "true")]` | Property |
12561256+| `dynamicProperties` | `properties[("dwc:dynamicProperties", json_str)]` | Property |
12571257+12581258+---
12591259+12601260+## 13. Gaps and Required Extensions
12611261+12621262+### 13.1 Multimedia / Media Objects
12631263+12641264+**Problem:** GBIF occurrences carry a `media` array with URLs to specimen photographs, spectrograms, and audio recordings. These are first-class data in biodiversity ML (e.g., training image classifiers on iNaturalist photos). Terradots has no dedicated field for attached media.
12651265+12661266+**Options:**
12671267+1. **Store media URLs in properties:** `("gbif:media:0:url", "https://...")`, `("gbif:media:0:type", "StillImage")`, etc. Works but ugly and not structured.
12681268+2. **Add `media` field to `label`:** A list of `{ url: string; media_type: string; license: string option; ... }` records. Clean but requires a schema extension.
12691269+3. **Create a separate media label:** For each media item, create a second `label` at the same geometry with `class_dist = [(media_type, 1.0)]` and `via` pointing to the media URL. The two labels are linked via a `group`.
12701270+12711271+**Recommendation:** Short-term, use option 1 with a fixed property naming convention. Longer-term, add a `media : media_item list` field to `label` where `media_item` carries URL, type, format, license, and creator.
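
The short-term convention (option 1) could be implemented along these lines. This is a sketch only: the `media_item` record is a hypothetical stand-in for a decoded GBIF `media` entry, and the `license` key suffix is an assumption that extends the `url`/`type` keys named above.

```ocaml
(* Hypothetical decoded media entry; mirrors a subset of the GBIF fields. *)
type media_item = { url : string; media_type : string; license : string option }

(* Flatten media items into indexed property pairs:
   gbif:media:0:url, gbif:media:0:type, gbif:media:1:url, ... *)
let media_properties (items : media_item list) : (string * string) list =
  List.concat
    (List.mapi
       (fun i m ->
         let key field = Printf.sprintf "gbif:media:%d:%s" i field in
         let base = [ (key "url", m.url); (key "type", m.media_type) ] in
         match m.license with
         | Some l -> base @ [ (key "license", l) ]
         | None -> base)
       items)
```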
12721272+12731273+### 13.2 Absent Occurrences (occurrenceStatus = ABSENT)
12741274+12751275+**Problem:** GBIF includes absence records (`occurrenceStatus = ABSENT`) from systematic surveys where a species was looked for but not found. These are important for species distribution modelling. Terradots has no direct way to represent absence.
12761276+12771277+**Options:**
12781278+1. **Property flag:** `("gbif:occurrenceStatus", "ABSENT")`. Requires consumers to filter.
2. **Zero confidence:** Use `confidence = 0.0` to signal absence. Unconventional, since 0.0 would otherwise read as a maximally uncertain presence.
12801280+3. **Add a boolean `absent` field to `label`**, or an `OccurrenceStatus` variant to complement `origin`.
12811281+12821282+**Recommendation:** For now, store `("gbif:occurrenceStatus", status)` in properties and document the convention. Flag this as requiring a first-class extension for biodiversity SDM workflows.
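
Under this convention, consumers have to separate the two kinds of record themselves. A minimal sketch, with labels represented only by their property lists for brevity (a simplification, not the Terradots `label` type):

```ocaml
(* True when the properties carry the documented absence marker. *)
let is_absent (properties : (string * string) list) : bool =
  List.assoc_opt "gbif:occurrenceStatus" properties = Some "ABSENT"

(* Returns (presences, absences); records without the key count as presences. *)
let split_presence_absence labels =
  List.partition (fun props -> not (is_absent props)) labels
```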
12831283+12841284+### 13.3 Observer Lists (Multiple Recorders)

**Problem:** GBIF `recordedBy` is a free-text list of names. Darwin Core recommends separating the values with ` | ` (e.g. `"Alice Smith | Bob Jones"`), although publishers also use semicolons and commas. Terradots `origin.observer` is a single URI string, so there is no support for multiple co-observers.
12871287+12881288+**Options:**
12891289+1. **Use the first recorder** as observer and store the full list in properties.
12901290+2. **Concatenate** as a custom URI: `"gbif:recorders/" ^ url_encode(recordedBy)`. Ugly.
12911291+3. **Add `observers : string list` to `Measured` origin.**
12921292+12931293+**Recommendation:** Use option 1 short-term. For citizen science datasets (iNaturalist, eBird) with many multi-observer records, a list field would be valuable.
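
Option 1 might look like the sketch below. It assumes names are separated by `|` or `;` and reuses the `gbif:recorder/` prefix with spaces replaced by `+`, as in the import example; the function names are hypothetical.

```ocaml
(* Split a recordedBy string on '|' and ';', trimming blanks. *)
let split_recorders (recorded_by : string) : string list =
  String.split_on_char '|' recorded_by
  |> List.concat_map (String.split_on_char ';')
  |> List.map String.trim
  |> List.filter (fun name -> name <> "")

let observer_uri (name : string) : string =
  "gbif:recorder/" ^ String.map (fun c -> if c = ' ' then '+' else c) name

(* First recorder becomes the observer; the full list stays in properties. *)
let observer_and_properties (recorded_by : string) =
  match split_recorders recorded_by with
  | [] -> (None, [])
  | first :: _ ->
      (Some (observer_uri first), [ ("dwc:recordedBy", recorded_by) ])
```

Commas are deliberately not treated as separators here, because they also occur inside "Surname, Given name" forms.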
12941294+12951295+### 13.4 Subspecies and Infraspecific Taxa in class_dist
12961296+12971297+**Problem:** GBIF may match an occurrence to a subspecies (e.g., `Tephrodornis pondicerianus pondicerianus`). The `species` field gives the species-level name. Using species-level for `class_dist` loses subspecies precision; using the full trinomial makes deduplication harder (different subspecies authorities won't match).
12981298+12991299+**Recommendation:** Store the species-level name in `class_dist` (for matching/deduplication), store the full `scientificName` and `taxonRank` in `properties`. When subspecies identity matters, query via the `gbif:taxonKey` property.
13001300+13011301+### 13.5 Cluster Detection (isInCluster)
13021302+13031303+**Problem:** GBIF detects clusters of probable duplicate occurrences across datasets (`isInCluster = true`). These should map to Terradots deduplication via `Derived` labels, but GBIF does not expose the cluster members directly in the occurrence record — only whether a record is in a cluster.
13041304+13051305+**Recommendation:** Store `("gbif:isInCluster", "true")` as a property and flag these records as deduplication candidates. Use Terradots fingerprinting (spatial cell + class) to find the intra-document candidates for review.
13061306+13071307+### 13.6 Verbatim vs. Interpreted Fields
13081308+13091309+**Problem:** GBIF provides both the original publisher data (`verbatimScientificName`, `verbatimEventDate`, `verbatimCoordinates`) and its interpreted versions. Terradots has no parallel verbatim/interpreted distinction.
13101310+13111311+**Recommendation:** Store important verbatim fields in properties with a `verbatim:` prefix: `("verbatim:scientificName", verbatimScientificName)`, `("verbatim:eventDate", verbatimEventDate)`. The interpreted values go into the primary fields.
13121312+13131313+### 13.7 Sampling Events and Effort
13141314+13151315+**Problem:** GBIF sampling-event datasets carry structured survey effort information (`samplingProtocol`, `sampleSizeValue`, `sampleSizeUnit`, `samplingEffort`, `eventID`, `parentEventID`). These describe the survey design, not just individual records. Terradots `activity` is close but lacks these structured fields.
13161316+13171317+**Recommendation:** Store sampling metadata in `properties`. The `eventID` → Terradots `group` mapping (group all occurrences from the same survey event) would be a useful convention but requires `group` to support arbitrary membership links (currently `group.members` is a list of label IDs).
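
As an illustration of the `eventID` convention, the sketch below collects label IDs per event, which is the shape a `group.members` list would need. The input representation (pairs of label ID and `eventID`) is an assumption made for brevity.

```ocaml
(* Group (label_id, event_id) pairs into (event_id, label_ids),
   keeping events and members in first-seen order.
   Quadratic in the number of labels; fine for a sketch. *)
let group_by_event (labels : (string * string) list)
    : (string * string list) list =
  let event_ids =
    List.fold_left
      (fun seen (_, e) -> if List.mem e seen then seen else seen @ [ e ])
      [] labels
  in
  List.map
    (fun e ->
      ( e,
        List.filter_map
          (fun (id, e') -> if e' = e then Some id else None)
          labels ))
    event_ids
```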
13181318+13191319+### 13.8 GBIF Download Citation
13201320+13211321+**Problem:** GBIF requires citation of downloads (DOI). The Terradots `activity` `description` field can hold this, but there is no structured citation field.
13221322+13231323+**Recommendation:** Store the download DOI in the activity and document the convention:

```ocaml
{
  activity_id = "gbif:download:10.15468/dl.xxxxx";
  agent = "GBIF.org";
  date = "2026-03-04";
  description = Some "GBIF Occurrence Download https://doi.org/10.15468/dl.xxxxx";
}
```
13331333+13341334+---
13351335+13361336+## 14. Recommended Property Key Conventions
13371337+13381338+When storing GBIF/Darwin Core metadata in `label.properties`, use the following prefix conventions to avoid key collisions and aid downstream processing:
13391339+13401340+| Prefix | Meaning | Example |
13411341+|--------|---------|---------|
13421342+| `dwc:` | Darwin Core standard term | `dwc:kingdom`, `dwc:recordedBy` |
13431343+| `gbif:` | GBIF-added or GBIF-specific field | `gbif:taxonKey`, `gbif:basisOfRecord`, `gbif:issues` |
13441344+| `gbif:dataset` | Dataset-level information | `gbif:datasetKey`, `gbif:datasetName` |
13451345+| `verbatim:` | Verbatim (pre-interpretation) value | `verbatim:scientificName` |
| `gbif:media:N:` | N-th media item fields | `gbif:media:0:url`, `gbif:media:0:type` |
13471347+13481348+---
13491349+13501350+## 15. Recommended Priority for Extensions
13511351+13521352+Ordered by impact on faithful GBIF import:
13531353+13541354+1. **Add `media` field to `label`** (§13.1) — critical for photo-based and audio-based biodiversity ML (iNaturalist, Xeno-canto, camera traps). High priority.
13551355+2. **Add absence support to `label`** (§13.2) — required for species distribution modelling, which depends on both presence and absence records. High priority for SDM use cases.
13561356+3. **Add `observers : string list` to `Measured` origin** (§13.3) — medium priority for large citizen-science imports.
13571357+4. **Add structured sampling-event metadata to `activity`** (§13.7) — needed for systematic survey datasets. Medium priority.
13581358+5. **Add `properties` to `group`** (already identified in the OSM mapping) — needed for GBIF network and dataset-group metadata. Medium priority.
13591359+6. **Structured citation field in `activity`** (§13.8) — low priority; the description field is an adequate workaround.
+990
docs/plans/inaturalist-mapping.md
···11+# iNaturalist to Terradots Mapping Plan
22+33+## 1. iNaturalist Data Model Overview
44+55+### Primary Sources
66+77+- API v1 reference (Swagger UI): https://api.inaturalist.org/v1/docs/
88+- API recommended practices: https://www.inaturalist.org/pages/api+recommended+practices
99+- Help — quality grades: https://help.inaturalist.org/en/support/solutions/articles/151000169936-what-is-the-data-quality-assessment-and-how-do-observations-qualify-to-become-research-grade-
1010+- Help — identifications: https://help.inaturalist.org/en/support/solutions/articles/151000194901-how-do-identifications-work-
1111+- Help — community taxon: https://help.inaturalist.org/en/support/solutions/articles/151000173076-what-are-the-community-taxon-and-the-observation-taxon-
1212+- Help — geoprivacy: https://help.inaturalist.org/en/support/solutions/articles/151000169938-what-is-geoprivacy-what-does-it-mean-for-an-observation-to-be-obscured-
1313+- Help — licenses: https://help.inaturalist.org/en/support/solutions/articles/151000173511-how-do-licenses-work-on-inaturalist-should-i-change-my-licenses-
1414+- Help — annotations: https://help.inaturalist.org/en/support/solutions/articles/151000191830-what-are-the-definitions-of-inaturalist-annotations-
1515+- Annotation values: https://www.inaturalist.org/pages/annotationvalues
1616+- Help — projects: https://help.inaturalist.org/en/support/solutions/articles/151000176472-understanding-projects-on-inaturalist
1717+- Open data (S3): https://github.com/inaturalist/inaturalist-open-data
1818+- GBIF occurrence export: https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7
1919+2020+---
2121+2222+## 2. Observation Structure
2323+2424+An observation is the core record in iNaturalist: a single organism (or sign of one) observed at a place and time, with photographic or audio evidence, zero or more community identifications, and extensible metadata.
2525+2626+### 2.1 Top-level Observation Fields
2727+2828+| Field | Type | Description |
2929+|-------|------|-------------|
3030+| `id` | integer | Auto-incrementing primary key; stable, globally unique within iNaturalist. |
3131+| `uuid` | UUID string | RFC 4122 UUID; used in some export formats. The `id` field is more commonly used in API calls. |
3232+| `created_at` | ISO 8601 datetime | When the record was created on the server (not when observed). |
3333+| `updated_at` | ISO 8601 datetime | Last server-side modification time. |
3434+| `observed_on` | date string `YYYY-MM-DD` | Calendar date of observation as stated by the observer. |
3535+| `time_observed_at` | ISO 8601 datetime or null | Date+time of observation in UTC, if the observer supplied a time. |
3636+| `observed_time_zone` | IANA timezone string | Observer's local timezone (e.g. `"America/Los_Angeles"`). |
3737+| `created_time_zone` | IANA timezone string | Server's timezone at record creation. |
3838+| `species_guess` | string | Free-text identification entered by the observer; may not match any taxon. |
3939+| `taxon_id` | integer | ID of the taxon the observer believes this is (from their own identification). |
4040+| `community_taxon_id` | integer or null | ID of the taxon the community algorithm has converged on. May differ from `taxon_id`. |
4141+| `quality_grade` | enum string | One of `"casual"`, `"needs_id"`, or `"research"`. |
4242+| `description` | string or null | Free-text notes from the observer. |
4343+| `place_guess` | string or null | Human-readable location name entered by the observer. |
| `location` | string or null | `"latitude,longitude"` in WGS84; null for private observations. |
4545+| `latitude` | float or null | WGS84 decimal degrees. May be randomised if `obscured=true`. |
4646+| `longitude` | float or null | WGS84 decimal degrees. May be randomised if `obscured=true`. |
4747+| `positional_accuracy` | integer (metres) or null | Stated GPS accuracy radius. For obscured observations iNaturalist sets this to the diameter of the obscuration cell (~22,000 m at equator). |
4848+| `geoprivacy` | enum string or null | `"open"`, `"obscured"`, or `"private"`. `null` means the observer has not set it (defaults to open). |
4949+| `taxon_geoprivacy` | enum string or null | Geoprivacy applied automatically because the identified taxon has an at-risk conservation status. Same values as `geoprivacy`. |
5050+| `obscured` | boolean | `true` if coordinates have been offset (either by `geoprivacy` or `taxon_geoprivacy`). |
5151+| `coordinates_obscured` | boolean | Synonym for `obscured` in some API contexts. |
5252+| `map_scale` | integer or null | Map zoom level at which the observation was pinned; rarely used. |
5353+| `captive` | boolean | `true` if the organism is not wild (captive animal, cultivated plant). Captive/cultivated observations receive quality grade `"casual"`. |
5454+| `license_code` | string or null | CC license code for the observation record itself. Separate from photo licenses. Values: `"cc-by"`, `"cc-by-nc"`, `"cc-by-sa"`, `"cc-by-nd"`, `"cc-by-nc-sa"`, `"cc-by-nc-nd"`, `"cc0"`, or `null` for "all rights reserved". |
5555+| `out_of_range` | boolean | Community has flagged that the taxon is outside its expected range at this location. |
5656+| `spam` | boolean | Flagged as spam. |
5757+| `mappable` | boolean | Whether to show on public maps (requires non-private geoprivacy and verifiable status). |
5858+| `uri` | string | Canonical URL: `"https://www.inaturalist.org/observations/{id}"`. |
5959+| `url` | string | Same as `uri` in some contexts. |
6060+| `num_identification_agreements` | integer | Count of identifications that agree with the community taxon. |
6161+| `num_identification_disagreements` | integer | Count of identifications that disagree. |
6262+| `identifications_most_agree` | boolean | `true` if `num_agreements > num_disagreements`. |
6363+| `identifications_most_disagree` | boolean | `true` if `num_disagreements >= num_agreements`. |
6464+| `place_ids` | integer array | IDs of all places (at any admin level) that contain this observation's coordinates. |
6565+| `project_ids` | integer array | IDs of projects that include this observation. |
6666+| `reviewed_by` | integer array | User IDs of people who have "reviewed" this observation. |
6767+| `faves_count` | integer | Number of users who have faved this observation. |
6868+| `comments_count` | integer | Number of comments. |
6969+| `identifications_count` | integer | Total identification records (including withdrawn). |
7070+7171+### 2.2 Nested Objects Embedded in an Observation

- `user` — observer user object (see §7)
- `taxon` — the observer's current identified taxon (see §5)
7575+- `community_taxon` — same structure as `taxon`; the community consensus taxon (may be null)
7676+- `identifications` — array of identification objects (see §3)
- `photos` — array of photo objects (see §6.1)
- `sounds` — array of sound objects (see §6.2)
7979+- `annotations` — array of annotation objects (see §8)
8080+- `observation_field_values` — array of observation field value objects (see §9)
8181+- `ofvs` — shorthand alias for `observation_field_values`
8282+- `tags` — array of free-text tag strings
8383+- `comments` — array of comment objects (user + body + timestamp)
8484+- `faves` — array of fave objects (user + vote score)
8585+- `outlinks` — array of external links (source + url)
8686+- `votes` — array of vote objects (vote_scope, vote_flag, user)
8787+- `quality_metrics` — array of data quality votes cast via the DQA
8888+- `conservation_status` — conservation status info for the identified taxon
8989+9090+---
9191+9292+## 3. Identification System
9393+9494+Each observation can have multiple identifications from different users. Identifications are the mechanism by which iNaturalist assigns a community consensus taxon.
9595+9696+### 3.1 Identification Object Fields
9797+9898+| Field | Type | Description |
9999+|-------|------|-------------|
100100+| `id` | integer | Unique identification record ID. |
101101+| `uuid` | UUID string | |
102102+| `created_at` | ISO 8601 datetime | |
103103+| `updated_at` | ISO 8601 datetime | |
104104+| `taxon_id` | integer | Taxon this identification proposes. |
105105+| `taxon` | object | Embedded taxon object. |
106106+| `user` | object | User who made the identification. |
107107+| `user_id` | integer | |
108108+| `observation_id` | integer | |
109109+| `current` | boolean | `false` if the user has since superseded or withdrawn it. |
110110+| `current_taxon` | boolean | `true` if the proposed taxon is still the current observation taxon. |
111111+| `withdrawn` | boolean | User explicitly withdrew this identification. |
112112+| `hidden` | boolean | Hidden by moderators or the observer. |
113113+| `disagreement` | boolean | `true` if this ID explicitly disagrees with an existing broader ID (the user was prompted "does this look like X?" and answered no). |
114114+| `vision` | boolean | Identification was made with the help of iNaturalist's computer vision suggestions. |
115115+| `category` | enum string | One of: `"improving"`, `"supporting"`, `"leading"`, `"maverick"`. See §3.2. |
116116+| `body` | string or null | Optional text comment explaining the identification. |
117117+| `flags` | array | Flags (spam/inappropriate) on this identification. |
118118+| `moderator_actions` | array | Moderator actions taken on this identification. |
119119+| `taxon_change` | object or null | If the taxon was subsequently merged/split, links to the taxon change. |
120120+121121+### 3.2 Identification Categories
122122+123123+iNaturalist classifies each current (non-withdrawn) identification into one of four categories:
124124+125125+| Category | Meaning |
126126+|----------|---------|
127127+| `improving` | The first ID that moved the community taxon to a finer rank than it was before. Typically the first identification at species level when the community was previously at genus. |
128128+| `supporting` | Agrees with the current community taxon at the same or finer rank. Adds weight to the existing consensus. |
129129+| `leading` | The most specific current ID, but the community has not yet converged to that rank (not enough agreeing IDs). Essentially a proposal waiting for support. |
130130+| `maverick` | Disagrees with the community taxon at a rank that conflicts with the majority view. These "outlier" IDs do not count toward the community taxon but are preserved and visible. |
131131+132132+### 3.3 Community Taxon Algorithm
133133+134134+The community taxon is the finest-rank taxon that more than 2/3 of the current (non-withdrawn) identifications agree with. The algorithm:
135135+136136+1. Collect all current identifications.
137137+2. For each identification, the identifier implicitly agrees with all taxa ancestral to their proposed taxon (e.g., an ID of *Panthera leo* implies agreement with *Panthera*, *Felidae*, *Carnivora*, etc.).
138138+3. Starting from the finest rank proposed and working upward, find the most specific taxon where the proportion of agreeing identifiers exceeds 2/3.
139139+4. That taxon becomes the community taxon.
140140+5. If no taxon reaches the 2/3 threshold at any rank, the community taxon is null.
141141+142142+The **observation taxon** (the one displayed as the primary label) is normally set to the community taxon, but if the observer opts out of community ID, it stays as their own identification.
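
The steps above can be sketched as follows. This is a simplified model, not iNaturalist's implementation: each identification is represented by the lineage of its proposed taxon (root first), list position stands in for rank, and explicit disagreements (§3.4) are ignored.

```ocaml
(* Lineage of a proposed taxon, root first, e.g.
   ["Animalia"; "Felidae"; "Panthera"; "Panthera leo"]. *)
type lineage = string list

let community_taxon (ids : lineage list) : string option =
  let total = List.length ids in
  (* Every taxon mentioned in any lineage, tagged with its depth. *)
  let candidates =
    List.concat_map (fun l -> List.mapi (fun depth t -> (depth, t)) l) ids
    |> List.sort_uniq compare
  in
  (* An identification agrees with a taxon if the taxon is in its lineage. *)
  let agrees taxon lineage = List.mem taxon lineage in
  let supported (_, taxon) =
    let n = List.length (List.filter (agrees taxon) ids) in
    3 * n > 2 * total (* strictly more than 2/3 *)
  in
  (* Among supported taxa, keep the deepest (finest) one. *)
  candidates
  |> List.filter supported
  |> List.fold_left
       (fun best (depth, taxon) ->
         match best with
         | Some (d, _) when d >= depth -> best
         | _ -> Some (depth, taxon))
       None
  |> Option.map snd
```

With two IDs of *Panthera leo* and one of *Panthera pardus*, exactly 2/3 agree on the species, which does not exceed the threshold, so the result falls back to the genus *Panthera*.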
143143+144144+### 3.4 Disagreement Mechanics
145145+146146+When a user adds an ID that is a sibling or cousin of an existing ID (not an ancestor), iNaturalist prompts them: "This is a different {rank} from the previous ID. Is this organism not a {previous taxon}?" If they confirm, the `disagreement` flag is set to `true` on their identification, and the community taxon rolls back to the lowest common ancestor of the two conflicting IDs.
147147+148148+### 3.5 Maverick IDs
149149+150150+An identification marked `"maverick"` is one that is inconsistent with the current community taxon (i.e., the identifier has chosen a taxon that is not a descendant of, or ancestor to, the community taxon). Maverick IDs are retained and visible, and can shift the community taxon if subsequent identifiers agree with them.
151151+152152+---
153153+154154+## 4. Quality Grades
155155+156156+iNaturalist assigns one of three quality grades to every observation.
157157+158158+### 4.1 Grade Definitions
159159+160160+| Grade | Meaning |
161161+|-------|---------|
162162+| `casual` | Fails minimum verifiability criteria, or community has voted it down via the DQA. Not shared with GBIF or other data partners. |
163163+| `needs_id` | Verifiable (has date, location, media, wild organism) but community has not yet reached consensus at species level with sufficient agreement. |
164164+| `research` | Verifiable AND the community taxon is at species rank or finer with more than 2/3 agreement, OR the community has voted "as good as it can be" via the DQA. Shared with GBIF and downstream partners. |
165165+166166+### 4.2 Criteria for "Verifiable" (prerequisite for needs_id and research)
167167+168168+An observation must have all four of:
169169+1. A date
170170+2. Geographic coordinates
171171+3. At least one photo or sound
172172+4. Organism is wild or naturalised (not captive/cultivated)
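
The four criteria translate directly into a predicate. In this sketch, `observed_on` and `captive` follow the field names in §2.1, while the tuple-valued `location` and the two count fields are simplifications introduced for illustration.

```ocaml
(* Hypothetical, pared-down observation record. *)
type observation = {
  observed_on : string option;        (* YYYY-MM-DD, if supplied *)
  location : (float * float) option;  (* latitude, longitude *)
  photo_count : int;
  sound_count : int;
  captive : bool;
}

(* All four criteria must hold for the observation to be verifiable. *)
let is_verifiable (o : observation) : bool =
  o.observed_on <> None
  && o.location <> None
  && o.photo_count + o.sound_count > 0
  && not o.captive
```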
173173+174174+### 4.3 Data Quality Assessment (DQA) Votes
175175+176176+Any user can vote on a set of quality flags, each of which can push an observation toward `casual`:
177177+178178+| DQA flag | Effect if majority votes "no" |
179179+|----------|------------------------------|
180180+| Date is accurate | → casual |
181181+| Location is accurate | → casual |
182182+| Organism is wild/naturalized | → casual |
183183+| Evidence of organism (not just habitat photo) | → casual |
184184+| Evidence is recent (not > ~100 years old) | → casual |
185185+| Single subject (not multiple unrelated taxa) | → casual |
186186+| No artificial manipulation of image/sound | → casual |
187187+| ID is supported (community can ID to family or below) | → casual |

The DQA also lets the community vote that the ID is "as good as it can be" (no further identification is possible), which allows research grade when the community taxon is coarser than species, for example at genus.
190190+191191+### 4.4 Automatic Captive/Cultivated Voting

iNaturalist casts an automatic "not wild" vote when at least 10 other observations of the same taxon (genus rank or finer) fall in the smallest county-, state-, or country-equivalent place containing the observation and 80% or more of them are marked captive/cultivated. That vote pushes the observation to casual unless it is outvoted in the DQA.
194194+195195+---
196196+197197+## 5. Taxon Model
198198+199199+### 5.1 Taxon Object Fields
200200+201201+| Field | Type | Description |
202202+|-------|------|-------------|
203203+| `id` | integer | Unique taxon ID within iNaturalist's taxonomy. |
204204+| `uuid` | UUID string | |
205205+| `name` | string | Scientific name (binomial for species, uninomial for higher ranks). |
206206+| `rank` | string | Taxonomic rank: `"kingdom"`, `"phylum"`, `"class"`, `"order"`, `"family"`, `"genus"`, `"species"`, `"subspecies"`, and many intermediate ranks (`"tribe"`, `"subtribe"`, `"variety"`, etc.). |
| `rank_level` | number | Numeric rank for comparison; mostly integers, with a few half steps for intermediate ranks. Lower = finer. Species = 10, genus = 20, family = 30, order = 40, class = 50, phylum = 60, kingdom = 70. |
208208+| `ancestry` | string | Slash-separated string of ancestor taxon IDs from root to immediate parent, e.g. `"48460/1/2/355675/3"`. |
209209+| `ancestor_ids` | integer array | Same information as `ancestry` but as an array, in order from root to immediate parent. |
210210+| `parent_id` | integer | Immediate parent taxon ID. |
211211+| `iconic_taxon_id` | integer | ID of the iconic taxon (broad grouping used for display). |
212212+| `iconic_taxon_name` | string | One of: `"Animalia"`, `"Plantae"`, `"Fungi"`, `"Chromista"`, `"Protozoa"`, `"Mollusca"`, `"Reptilia"`, `"Aves"`, `"Amphibia"`, `"Actinopterygii"`, `"Mammalia"`, `"Insecta"`, `"Arachnida"`, `"unknown"`. |
213213+| `preferred_common_name` | string or null | Preferred common name in the user's locale. |
214214+| `names` | array | All common names across all locales (only included if `all_names` param requested). |
215215+| `is_active` | boolean | `false` if taxon has been synonymised, split, or otherwise inactivated. |
216216+| `extinct` | boolean | Taxon is considered extinct. |
217217+| `gbif_id` | integer or null | Corresponding GBIF backbone taxon ID, if linked. |
218218+| `wikipedia_url` | string or null | |
219219+| `wikipedia_summary` | string or null | Short excerpt from Wikipedia. |
220220+| `complete_rank` | string or null | For "complete" taxa (where all children are represented), the finest rank of completeness. |
221221+| `complete_species_count` | integer | Count of species in this taxon's subtree. |
222222+| `observations_count` | integer | Total observations of this taxon and its descendants on iNaturalist. |
223223+| `vision` | boolean | Taxon is included in iNaturalist's computer vision model. |
224224+| `default_photo` | object | A single photo representing this taxon (thumbnail). |
225225+| `taxon_photos` | array | All photos curated for this taxon page. |
226226+| `conservation_status` | object or null | Conservation status in the observer's jurisdiction (IUCN category, source, authority). |
227227+| `listed_taxa` | array | Species list associations. |
228228+| `establishment_means` | object or null | Whether the taxon is native, introduced, or endemic at the observation location. |
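
The `ancestry` string and `ancestor_ids` array carry the same information, so one can be derived from the other. A small sketch with hypothetical helper names:

```ocaml
(* Parse a slash-separated ancestry string into taxon IDs, root first.
   Non-numeric segments (including the empty string) are skipped. *)
let ancestor_ids_of_ancestry (ancestry : string) : int list =
  String.split_on_char '/' ancestry |> List.filter_map int_of_string_opt

(* True when [ancestor] appears anywhere in the taxon's ancestry. *)
let is_descendant_of ~(ancestor : int) (ancestry : string) : bool =
  List.mem ancestor (ancestor_ids_of_ancestry ancestry)
```

For the example value in the table, `ancestor_ids_of_ancestry "48460/1/2/355675/3"` gives `[48460; 1; 2; 355675; 3]`.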
229229+230230+### 5.2 Rank Level Reference

```
stateofmatter = 100
kingdom       = 70
subkingdom    = 67
phylum        = 60
subphylum     = 57
superclass    = 53
class         = 50
subclass      = 47
infraclass    = 45
subterclass   = 44
superorder    = 43
order         = 40
suborder      = 37
infraorder    = 35
parvorder     = 34.5
zoosection    = 34
zoosubsection = 33.5
superfamily   = 33
epifamily     = 32
family        = 30
subfamily     = 27
supertribe    = 26
tribe         = 25
subtribe      = 24
genus         = 20
subgenus      = 15
section       = 13
subsection    = 12
species       = 10
subspecies    = 5
variety       = 5
form          = 5
```
266266+267267+---
268268+269269+## 6. Photo and Sound Attachments
270270+271271+### 6.1 Photo Object
272272+273273+| Field | Type | Description |
274274+|-------|------|-------------|
275275+| `id` | integer | Photo record ID. |
276276+| `uuid` | UUID string | |
277277+| `url` | string | URL of the square thumbnail (75×75 px). Pattern: `"https://static.inaturalist.org/photos/{id}/square.jpg?..."`. Replace `square` with `thumb` (100px), `small` (240px), `medium` (500px), `large` (1024px), or `original` for other sizes. |
278278+| `original_dimensions` | object | `{ width: int, height: int }` of the original image. |
279279+| `license_code` | string or null | Per-photo CC license code. May differ from the observation's `license_code`. Values: `"cc-by"`, `"cc-by-nc"`, `"cc-by-sa"`, `"cc-by-nd"`, `"cc-by-nc-sa"`, `"cc-by-nc-nd"`, `"cc0"`, or `null` ("all rights reserved"). |
280280+| `attribution` | string | Human-readable attribution string, e.g. `"(c) Alice Smith, some rights reserved (CC BY-NC)"`. |
281281+| `flags` | array | Moderation flags. |
282282+| `native_photo_id` | string or null | ID in the originating service (Flickr, etc.) if synced. |
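
The size substitution described for `url` can be sketched as a plain string rewrite. This is a minimal sketch that assumes the URL contains a `/square.` path segment as in the pattern above; `photo_size`, `size_name`, and `photo_url_with_size` are hypothetical names, not part of any iNaturalist or Terradots API.

```ocaml
(* Sizes named in the table above. *)
type photo_size = Square | Thumb | Small | Medium | Large | Original

let size_name = function
  | Square -> "square" | Thumb -> "thumb" | Small -> "small"
  | Medium -> "medium" | Large -> "large" | Original -> "original"

(* Replace the first "/square." segment of [url] with the requested size.
   Returns [None] when the URL does not contain that segment. *)
let photo_url_with_size size url =
  let needle = "/square." in
  let n = String.length needle and len = String.length url in
  let rec find i =
    if i + n > len then None
    else if String.sub url i n = needle then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some i ->
    let before = String.sub url 0 i in
    let after = String.sub url (i + n) (len - i - n) in
    Some (before ^ "/" ^ size_name size ^ "." ^ after)
```

Returning an option keeps the caller honest about URLs that do not follow the expected pattern, for example photos hosted on an external service.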
283283+284284+### 6.2 Sound Object
285285+286286+| Field | Type | Description |
287287+|-------|------|-------------|
288288+| `id` | integer | Sound record ID. |
289289+| `uuid` | UUID string | |
290290+| `file_url` | string | URL to the audio file (typically MP3 or WAV). |
291291+| `file_content_type` | string | MIME type, e.g. `"audio/mpeg"`. |
292292+| `license_code` | string or null | Same CC license codes as photos. |
293293+| `attribution` | string | Human-readable attribution. |
294294+295295+### 6.3 iNaturalist Open Data (AWS S3)

iNaturalist publishes openly licensed photos to a public S3 bucket (`s3://inaturalist-open-data/`), along with a metadata snapshot that is regenerated monthly. The snapshot has four tab-separated files (the `.csv` extension notwithstanding):
298298+299299+- `observations.csv` — `observation_uuid`, `observer_id`, `latitude`, `longitude`, `positional_accuracy`, `taxon_id`, `quality_grade`, `observed_on`
300300+- `photos.csv` — `photo_uuid`, `photo_id`, `observation_uuid`, `observer_id`, `extension`, `license`, `width`, `height`, `position`
301301+- `taxa.csv` — `taxon_id`, `ancestry`, `rank_level`, `rank`, `name`, `active`
302302+- `observers.csv` — `observer_id`, `login`, `name`
303303+304304+Photo files are at `s3://inaturalist-open-data/photos/{photo_id}/{size}.{ext}` where size ∈ `{square, small, medium, large, original}`.
305305+306306+---
307307+308308+## 7. User Model
309309+310310+| Field | Type | Description |
311311+|-------|------|-------------|
312312+| `id` | integer | User ID (stable, numeric). |
313313+| `uuid` | UUID string | |
314314+| `login` | string | Username (URL-safe, changeable but rarely changed in practice). |
315315+| `name` | string or null | Display name (free text). |
316316+| `icon` | string or null | URL to profile thumbnail. |
317317+| `icon_url` | string or null | Same as `icon` in some contexts. |
318318+| `observations_count` | integer | Total non-deleted observations. |
319319+| `identifications_count` | integer | Total identifications made. |
320320+| `journal_posts_count` | integer | |
321321+| `species_count` | integer | Distinct species observed. |
322322+| `created_at` | ISO 8601 datetime | Account creation date. |
323323+| `site_id` | integer | Which iNaturalist network node (iNat.org=1, iNat.ca=3, etc.) the user belongs to. |
324324+| `roles` | array of strings | Site roles, e.g. `"curator"`, `"admin"`. |
325325+| `orcid` | string or null | ORCID iD if the user has linked one. |
326326+327327+---
328328+329329+## 8. Projects
330330+331331+iNaturalist has three project types:
332332+333333+### 8.1 Traditional Projects
334334+335335+- Observations must be manually added by a project member.
336336+- The observer must join the project to add observations.
337337+- Administrators can access private/obscured coordinates of member observations.
338338+- Can require observation fields to be filled in.
339339+- Can maintain a species checklist.
340340+341341+### 8.2 Collection Projects
342342+343343+- Automatically aggregates observations matching a saved filter (species, place, date range, etc.).
344344+- Observations are not explicitly added; membership is not required.
345345+- Administrators cannot access private/obscured coordinates.
346346+- Equivalent to a bookmarked search with a project page.
347347+348348+### 8.3 Umbrella Projects
349349+350350+- A collection of other projects (traditional or collection).
351351+- Aggregates their observations into one page.
352352+353353+### 8.4 Project Object Fields (key fields)
354354+355355+| Field | Type | Description |
356356+|-------|------|-------------|
357357+| `id` | integer | Project ID. |
358358+| `title` | string | Human-readable name. |
359359+| `slug` | string | URL-safe identifier: `inaturalist.org/projects/{slug}`. |
360360+| `project_type` | string | `"collection"`, `"umbrella"`, or `null` (traditional). |
361361+| `description` | string | |
362362+| `icon` | string | URL to project icon. |
363363+| `banner_color` | string | Hex colour. |
364364+| `location` | string | Lat/lng if pinned to a location. |
365365+| `place_id` | integer or null | Associated place. |
366366+| `user` | object | Project admin/creator. |
367367+| `members_count` | integer | |
368368+| `observations_count` | integer | |
369369+| `species_count` | integer | |
370370+371371+---
372372+373373+## 9. Annotations
374374+375375+Annotations are structured key-value tags applied to observations. Each attribute has a controlled vocabulary of allowed values. Annotations are voted on; the displayed value is the one with the most positive votes.
376376+377377+### 9.1 Current Annotation Attributes and Values
378378+379379+| Attribute | Applies to | Allowed Values |
380380+|-----------|-----------|----------------|
381381+| **Life Stage** | Animals | Adult, Egg, Juvenile, Larva, Nymph, Pupa, Subimago, Teneral |
382382+| **Sex** | Animals | Female, Male, Cannot Be Determined |
383383+| **Plant Phenology** | Plants | Flowering, Flower Budding, Fruiting, No Evidence of Flowering |
384384+| **Alive or Dead** | Animals | Alive, Dead, Cannot Be Determined |
385385+| **Evidence of Presence** | Non-plant, non-human | Bone, Construction, Feather, Egg, Gall, Hair, Leafmine, Molt, Organism, Scat, Track |
386386+| **Established** | Amphibians and Reptiles | Not Established (for escaped/vagrant animals outside established populations) |
387387+388388+### 9.2 Annotation Object Fields
389389+390390+| Field | Type | Description |
391391+|-------|------|-------------|
392392+| `uuid` | UUID string | |
393393+| `controlled_attribute_id` | integer | ID of the attribute (Life Stage=1, Sex=9, Plant Phenology=12, Alive or Dead=17, Evidence of Presence=22, Established=35, etc.). |
394394+| `controlled_attribute` | object | `{ id, label, uri }` — the attribute definition. |
395395+| `controlled_value_id` | integer | ID of the selected value. |
396396+| `controlled_value` | object | `{ id, label, uri }` — the value definition. |
397397+| `user` | object | User who added the annotation. |
398398+| `user_id` | integer | |
399399+| `votes` | array | Vote records on this annotation. Each vote has `{ user_id, vote_flag: bool }`. |
400400+| `vote_score` | integer | Net upvotes minus downvotes. |
401401+| `concatenated_attr_val` | string | Convenience string: `"{attribute_label}={value_label}"`, e.g. `"Life Stage=Adult"`. |

Annotations themselves are returned in the `annotations` array of each observation. The attribute definitions are retrieved via `GET /controlled_terms` (all attributes) and `GET /controlled_terms/for_taxon?taxon_id={id}` (only the attributes applicable to that taxon).
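
The relationship between `votes`, `vote_score`, and `concatenated_attr_val` in the table above can be sketched as follows. This assumes that `vote_flag = true` is an upvote and `false` a downvote; the `vote` record and both function names are hypothetical.

```ocaml
(* Hypothetical record mirroring the { user_id, vote_flag } vote shape. *)
type vote = { user_id : int; vote_flag : bool }

(* Net score: upvotes minus downvotes. *)
let vote_score votes =
  List.fold_left
    (fun acc v -> if v.vote_flag then acc + 1 else acc - 1)
    0 votes

(* Builds the "{attribute_label}={value_label}" convenience string. *)
let concatenated_attr_val ~attribute ~value = attribute ^ "=" ^ value
```

Recomputing the score on import, rather than trusting the stored `vote_score`, is one way to sanity-check a record before mapping it.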
404404+405405+---
406406+407407+## 10. Observation Fields

Observation fields are project-defined or community-defined custom key-value fields. Any user can define a field and apply it to any observation; a value recorded on an observation is an observation field value (OFV).
410410+411411+### 10.1 Observation Field Object
412412+413413+```json
414414+{
415415+ "id": 25,
416416+ "name": "Associated species",
417417+ "description": "A second species that was seen near this organism",
418418+ "datatype": "taxon",
419419+ "allowed_values": null,
420420+ "units": null,
421421+ "users_count": 14823,
422422+ "values_count": 34021
423423+}
424424+```
425425+426426+| Field | Type | Description |
427427+|-------|------|-------------|
428428+| `id` | integer | Field definition ID. |
429429+| `name` | string | Field name (not unique globally; users can create duplicates). |
430430+| `description` | string | What the field captures. |
431431+| `datatype` | string | One of: `"text"`, `"numeric"`, `"date"`, `"time"`, `"datetime"`, `"taxon"`, `"dna"`. |
432432+| `allowed_values` | string or null | Pipe-delimited list of allowed values for text fields acting as enumerations, e.g. `"yes\|no\|maybe"`. |
433433+| `units` | string or null | Unit label for numeric fields. |
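
An importer may want to check a value against `allowed_values` before storing it. A minimal sketch, assuming the raw API string uses unescaped `|` delimiters (the backslashes in the table are Markdown escapes) and that an absent or empty list means "unrestricted"; both function names are hypothetical.

```ocaml
(* Split the pipe-delimited string; [None] means the field is unrestricted. *)
let parse_allowed_values = function
  | None -> []
  | Some s -> List.filter (fun v -> v <> "") (String.split_on_char '|' s)

(* A value is accepted when no enumeration is defined or it is listed. *)
let value_is_allowed ~allowed_values value =
  match parse_allowed_values allowed_values with
  | [] -> true
  | allowed -> List.mem value allowed
```

The comparison is exact and case-sensitive; whether iNaturalist itself trims whitespace or ignores case is not specified here, so treat this as a conservative check.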
434434+435435+### 10.2 Observation Field Value Object (on an observation)
436436+437437+```json
438438+{
439439+ "id": 12345678,
440440+ "uuid": "...",
441441+ "field_id": 25,
442442+ "observation_field": { "id": 25, "name": "Associated species", "datatype": "taxon" },
443443+ "value": "Quercus robur",
444444+ "taxon": { ... },
445445+ "user": { ... },
446446+ "created_at": "2023-05-10T12:00:00Z",
447447+ "updated_at": "2023-05-10T12:00:00Z"
448448+}
449449+```
450450+451451+---
452452+453453+## 11. iNaturalist API
454454+455455+### 11.1 Base URL and Authentication
456456+457457+- Base URL: `https://api.inaturalist.org/v1/`
458458+- Authentication: JSON Web Token (JWT) in `Authorization` header. Obtain via `/users/api_token` using an OAuth2 access token from the v1 Rails app.
459459+- Rate limits: ~1 request/second, ~10,000 requests/day per IP. Bulk data should use the open data S3 snapshot or GBIF exports instead.
460460+461461+### 11.2 Key Endpoints
462462+463463+| Endpoint | Method | Description |
464464+|----------|--------|-------------|
465465+| `/observations` | GET | Search observations with 100+ filter parameters. Key params: `taxon_id`, `taxon_name`, `place_id`, `user_id`, `quality_grade`, `license`, `photos`, `sounds`, `geoprivacy`, `lat`, `lng`, `radius`, `d1`, `d2`, `per_page` (max 200), `page`. |
466466+| `/observations/{id}` | GET | Single observation by ID. Returns full detail including all nested objects. |
467467+| `/observations/species_counts` | GET | Aggregated species counts for a set of observations. |
468468+| `/observations/identifiers` | GET | Users who have identified, with counts. |
469469+| `/observations/observers` | GET | Observers, with counts. |
470470+| `/observations/histogram` | GET | Time series histogram. |
471471+| `/taxa` | GET | Search taxa. Params: `q`, `rank`, `iconic_taxa`, `per_page`. |
472472+| `/taxa/{id}` | GET | Taxon by ID (accepts comma-separated list). |
473473+| `/taxa/autocomplete` | GET | Autocomplete for taxon names; used for ID suggestions. |
474474+| `/identifications` | GET | Search identifications. Params: `observation_id`, `user_id`, `taxon_id`, `category`, `current`. |
475475+| `/identifications/{id}` | GET | Single identification. |
476476+| `/places/{id}` | GET | Place by ID or slug. |
477477+| `/places/nearby` | GET | Places near a bounding box. |
478478+| `/projects` | GET | Search projects. |
479479+| `/projects/{id}` | GET | Project details. |
480480+| `/controlled_terms` | GET | All annotation attribute definitions. |
481481+| `/controlled_terms/for_taxon` | GET | Annotation attributes applicable to a taxon. |
482482+| `/users/{id}` | GET | User profile. |
483483+| `/search` | GET | Site-wide search (taxa, places, projects, users). |
484484+485485+### 11.3 Pagination
486486+487487+- Default `per_page` = 30, maximum = 200.
488488+- Pagination via `page` parameter.
489489+- Hard cap: only the first 10,000 results are accessible via pagination (offset limit in Elasticsearch). For larger datasets use the open data export.
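
The limits above bound how many pages a client can usefully request. A rough sketch of that arithmetic, with hypothetical names throughout:

```ocaml
let max_per_page = 200
let max_reachable = 10_000

(* Number of pages worth requesting for a query with [total_results] hits.
   [per_page] is clamped to 1..200 and the result count to the 10,000 cap. *)
let reachable_pages ~per_page ~total_results =
  let per_page = max 1 (min per_page max_per_page) in
  let reachable = min total_results max_reachable in
  (reachable + per_page - 1) / per_page
```

With `per_page = 200`, a query reporting 50,000 results yields only 50 reachable pages. When `per_page` does not divide 10,000 evenly, the final page straddles the cap and the API may reject it, so a client should be prepared for an error on the last request.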
490490+491491+### 11.4 Response Envelope
492492+493493+```json
494494+{
495495+ "total_results": 1234,
496496+ "page": 1,
497497+ "per_page": 30,
498498+ "results": [ ... ]
499499+}
500500+```
501501+502502+---
503503+504504+## 12. Export Formats
505505+506506+### 12.1 CSV Export (inaturalist.org/observations/export)
507507+508508+Available to any logged-in user. Columns include:
509509+510510+| Column | Description |
511511+|--------|-------------|
512512+| `id` | Observation ID |
513513+| `observed_on_string` | Observation date as entered |
514514+| `observed_on` | `YYYY-MM-DD` |
515515+| `time_observed_at` | ISO 8601 UTC datetime or blank |
516516+| `user_id` | Observer user ID |
517517+| `user_login` | Observer username |
518518+| `created_at` | Record creation datetime |
519519+| `updated_at` | Last update datetime |
520520+| `quality_grade` | `casual`, `needs_id`, or `research` |
521521+| `license` | License code for the observation |
522522+| `url` | Canonical URL |
523523+| `image_url` | URL of first photo (square size) |
524524+| `sound_url` | URL of first sound |
525525+| `tag_list` | Comma-delimited free-text tags |
526526+| `description` | Observer notes |
527527+| `num_identification_agreements` | |
528528+| `num_identification_disagreements` | |
529529+| `captive_cultivated` | Boolean |
530530+| `oauth_application_id` | App used to create observation |
531531+| `place_guess` | Free-text location name |
532532+| `latitude` | Potentially obscured decimal degrees |
533533+| `longitude` | Potentially obscured decimal degrees |
534534+| `positional_accuracy` | Metres (inflated for obscured) |
535535+| `geoprivacy` | |
536536+| `taxon_geoprivacy` | |
537537+| `coordinates_obscured` | Boolean |
538538+| `positioning_method` | GPS, manual, etc. |
539539+| `positioning_device` | Device description |
540540+| `species_guess` | Free-text species name |
541541+| `scientific_name` | Scientific name of community taxon |
542542+| `common_name` | Common name of community taxon |
543543+| `iconic_taxon_name` | Broad group (Aves, Insecta, etc.) |
544544+| `taxon_id` | Community taxon ID |
545545+546546+### 12.2 Darwin Core Archive (DwC-A) to GBIF
547547+548548+iNaturalist publishes all research-grade observations with open licenses to GBIF weekly as a Darwin Core Archive. The GBIF dataset is at https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7.
549549+550550+Key Darwin Core term mappings from iNaturalist:
551551+552552+| DwC Term | iNaturalist Source |
553553+|----------|-------------------|
554554+| `occurrenceID` | `https://www.inaturalist.org/observations/{id}` |
555555+| `basisOfRecord` | `HumanObservation` |
556556+| `scientificName` | Community taxon `name` |
557557+| `taxonRank` | Community taxon `rank` |
558558+| `kingdom`, `phylum`, `class`, `order`, `family`, `genus` | Extracted from taxon ancestry |
559559+| `decimalLatitude` | `latitude` |
560560+| `decimalLongitude` | `longitude` |
561561+| `coordinateUncertaintyInMeters` | `positional_accuracy` |
562562+| `eventDate` | `observed_on` or `time_observed_at` |
563563+| `recordedBy` | Observer `login` or `name` |
564564+| `license` | Observation `license_code` |
565565+| `rightsHolder` | Observer name |
566566+| `datasetName` | `"iNaturalist Research-grade Observations"` |
567567+| `collectionCode` | `"Observations"` |
568568+| `institutionCode` | `"iNaturalist"` |
569569+| `gbifID` | Assigned by GBIF on ingestion |
570570+571571+---
572572+573573+## 13. Licensing
574574+575575+### 13.1 License Levels
576576+577577+iNaturalist applies licenses at two independent levels:
578578+1. **Observation record** (metadata: species, date, location) — controlled by `license_code` on the observation object.
579579+2. **Photos and sounds** — each photo/sound has its own `license_code`.
580580+581581+Both can be set independently by the observer.
582582+583583+### 13.2 License Values
584584+585585+| iNaturalist `license_code` | SPDX Identifier | Description |
586586+|---------------------------|----------------|-------------|
587587+| `cc0` | `CC0-1.0` | Public domain dedication; no rights reserved. |
588588+| `cc-by` | `CC-BY-4.0` | Attribution required. |
589589+| `cc-by-nc` | `CC-BY-NC-4.0` | Attribution + non-commercial use only. **Default for new accounts.** |
590590+| `cc-by-sa` | `CC-BY-SA-4.0` | Attribution + share-alike. |
591591+| `cc-by-nd` | `CC-BY-ND-4.0` | Attribution + no derivatives. |
592592+| `cc-by-nc-sa` | `CC-BY-NC-SA-4.0` | Attribution + non-commercial + share-alike. |
593593+| `cc-by-nc-nd` | `CC-BY-NC-ND-4.0` | Attribution + non-commercial + no derivatives. |
594594+| `null` | *(none)* | All rights reserved; no reuse without explicit permission. |
595595+596596+### 13.3 GBIF Eligibility
597597+598598+Only observations (and their photos) with `cc0`, `cc-by`, or `cc-by-nc` licenses are shared with GBIF.
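
The license table and the GBIF eligibility rule translate directly into two pattern matches. This is a sketch with hypothetical function names; note that it maps unrecognised codes to `None`, which conflates them with "all rights reserved".

```ocaml
(* iNaturalist license_code (None = null) to SPDX identifier. *)
let spdx_of_license_code = function
  | Some "cc0" -> Some "CC0-1.0"
  | Some "cc-by" -> Some "CC-BY-4.0"
  | Some "cc-by-nc" -> Some "CC-BY-NC-4.0"
  | Some "cc-by-sa" -> Some "CC-BY-SA-4.0"
  | Some "cc-by-nd" -> Some "CC-BY-ND-4.0"
  | Some "cc-by-nc-sa" -> Some "CC-BY-NC-SA-4.0"
  | Some "cc-by-nc-nd" -> Some "CC-BY-NC-ND-4.0"
  | Some _ | None -> None

(* Only these three licenses are shared with GBIF. *)
let gbif_eligible = function
  | Some ("cc0" | "cc-by" | "cc-by-nc") -> true
  | _ -> false
```

An importer that must distinguish "unknown code" from "no license" could return a result type instead of an option.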
599599+600600+---
601601+602602+## 14. Geoprivacy
603603+604604+### 14.1 Settings
605605+606606+| `geoprivacy` value | Behaviour |
607607+|--------------------|-----------|
608608+| `"open"` (or `null`) | Exact coordinates are public. `obscured=false`. |
609609+| `"obscured"` | Coordinates are randomised to a point within a 0.2° × 0.2° bounding box containing the true location. `obscured=true`. `positional_accuracy` is set to the diameter of this cell (~22,000 m at equator; smaller at higher latitudes). The true coordinates are stored in private fields (`private_latitude`, `private_longitude`) not visible publicly. |
610610+| `"private"` | No coordinates appear in the public API response at all. `latitude` and `longitude` are `null`. Not shared with GBIF or data partners. |
611611+612612+### 14.2 Taxon Geoprivacy
613613+614614+If the community taxon has an at-risk conservation status (e.g. IUCN Vulnerable or higher, or a national red list status), iNaturalist automatically obscures coordinates regardless of the observer's geoprivacy setting. The `taxon_geoprivacy` field records this. The effective geoprivacy is the more restrictive of `geoprivacy` and `taxon_geoprivacy`.
615615+616616+### 14.3 Trusted Access
617617+618618+The true coordinates are visible to:
619619+- The observer themselves
620620+- Users the observer has granted "trust" to
621621+- Curators of traditional projects that the observation belongs to (if the observer granted the project trust)
622622+623623+---
624624+625625+## 15. Place System
626626+627627+Places are named geographic regions in iNaturalist used for filtering and associating species checklists.
628628+629629+### 15.1 Place Types
630630+631631+Standard admin levels: `"Country"` (admin_level=0), `"State"` (1), `"County"` (2), `"Town"` (3). Also: `"Open Space"`, `"National Park"`, `"Continent"`, `"Island"`, `"Point of Interest"`, and community-created custom places.
632632+633633+### 15.2 Place Object Fields
634634+635635+| Field | Type | Description |
636636+|-------|------|-------------|
637637+| `id` | integer | Place ID. |
638638+| `name` | string | |
639639+| `display_name` | string | Full name with country/state context. |
640640+| `place_type` | string | Type name. |
641641+| `admin_level` | integer or null | 0=country, 1=state, 2=county, 3=town. |
642642+| `ancestry` | string | Slash-delimited parent place IDs. |
643643+| `parent_id` | integer or null | Immediate parent place ID. |
644644+| `bbox_area` | float | Area of bounding box in square degrees. |
645645+| `latitude` | float | Centroid latitude. |
646646+| `longitude` | float | Centroid longitude. |
647647+| `swlat`, `swlng`, `nelat`, `nelng` | float | Bounding box corners. |
648648+| `geometry_geojson` | object | Full polygon geometry as GeoJSON. |
649649+| `check_list_id` | integer | ID of the default species checklist for this place. |
650650+651651+### 15.3 Observations and Places
652652+653653+Each observation carries a `place_ids` array containing the IDs of all places (at any hierarchy level) whose polygon contains the observation's coordinates. This enables filtering by any place, including deeply nested administrative subdivisions.
654654+655655+---
656656+657657+## 16. Mapping Plan: iNaturalist → Terradots
658658+659659+This section defines the mapping from each iNaturalist concept to the Terradots label store data model (`Terradots.mli`).
660660+661661+### 16.1 Observation → Label
662662+663663+Each iNaturalist observation maps to a single Terradots `label` constructed via `make_imported`.
664664+665665+### 16.2 Field-by-Field Mapping
666666+667667+| iNaturalist Field | Terradots Field | Mapping Notes |
668668+|-------------------|----------------|---------------|
| `latitude`, `longitude` | `geometry = Point { x=longitude; y=latitude }` | WGS84 (EPSG:4326) point; no reprojection needed. x=lon, y=lat per Terradots convention. |
670670+| `positional_accuracy` | `origin.Measured.accuracy_m` | Direct mapping. Already in metres. For obscured observations this will be ~22,000 m, which faithfully captures the uncertainty introduced by obscuration. |
671671+| `observed_on` / `time_observed_at` | `event_date` | Use `time_observed_at` (ISO 8601 datetime) if available, otherwise `observed_on` (date only). Both are valid Darwin Core `eventDate` formats. |
672672+| `id` | `origin.Measured.via` = `"inaturalist:observation/{id}"` | The `via` URI uniquely identifies the source record. Matches the URI scheme already defined in `Terradots.mli`: `inaturalist:observation/12345`. |
673673+| `uuid` | `properties [("inaturalist_uuid", uuid)]` | Preserve as property for round-trip fidelity. |
674674+| `user.id` | `origin.Measured.observer` = `"inaturalist:user/{user_id}"` | Encodes the observer as a URI in the iNaturalist namespace. If the user has an ORCID (`user.orcid`), prefer `"orcid:{orcid}"` instead. |
675675+| `license_code` | `origin.Measured.license` | Map to SPDX: `cc0` → `"CC0-1.0"`, `cc-by` → `"CC-BY-4.0"`, `cc-by-nc` → `"CC-BY-NC-4.0"`, etc. `null` → `None` (no free license). |
676676+| Community taxon + identifications | `class_dist` | **The key mapping.** See §16.3 below. |
| `quality_grade` | `confidence` | Map `"research"` → `0.95`, `"needs_id"` → `0.5`, `"casual"` → `0.1` (or omit). Also store raw value in `properties`. See §16.4. |
678678+| `obscured` | `properties [("geoprivacy", "obscured")]` | Store geoprivacy status in properties. When `obscured=true` the `positional_accuracy` is already inflated, so `accuracy_m` correctly reflects the uncertainty. |
679679+| `geoprivacy` / `taxon_geoprivacy` | `properties` | Store both raw values for downstream consumers that need to know why coordinates were obscured. |
680680+| Photos | `properties [("inaturalist_photos", ...)]` | See §16.5. |
681681+| Sounds | `properties [("inaturalist_sounds", ...)]` | See §16.5. |
682682+| `place_guess` | `properties [("place_guess", place_guess)]` | Human-readable location hint. |
683683+| `captive` | `properties [("captive", "true")]` | Flag only when `true`. |
684684+| `description` | `properties [("description", description)]` | Observer notes. |
685685+| `species_guess` | `properties [("species_guess", species_guess)]` | Original free-text species guess. |
686686+| `taxon.iconic_taxon_name` | `properties [("iconic_taxon", name)]` | Broad group (Aves, Insecta, etc.). |
687687+| `taxon.rank` | `properties [("taxon_rank", rank)]` | |
688688+| `taxon.ancestry` | `properties [("taxon_ancestry", ancestry)]` | Slash-delimited ancestor IDs. |
689689+| Annotations | `properties` | See §16.6. |
690690+| Observation fields | `properties` | See §16.7. |
691691+| Project membership | `groups` | See §16.8. |
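
Three of the rows above (`id`, `user.id`, and the observation date) involve small decisions that are easy to get wrong. The sketch below shows one way to encode them; the function names are hypothetical and the real `Terradots.mli` constructors may differ.

```ocaml
(* Source URI for origin.Measured.via. *)
let via_of_observation_id id = Printf.sprintf "inaturalist:observation/%d" id

(* Prefer the ORCID when the user has linked one, else the iNaturalist user URI. *)
let observer_uri ~user_id ~orcid =
  match orcid with
  | Some o when o <> "" -> "orcid:" ^ o
  | _ -> Printf.sprintf "inaturalist:user/%d" user_id

(* Prefer the full datetime; fall back to the date-only value. *)
let event_date ~time_observed_at ~observed_on =
  match time_observed_at with
  | Some _ as t -> t
  | None -> observed_on
```

An empty-string ORCID is treated as absent here, which is an assumption about how missing values might appear in exports.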
692692+693693+### 16.3 Identifications → class_dist
694694+695695+This is the most semantically rich mapping. The Terradots `class_dist` field is a probability distribution over class names — exactly what the iNaturalist identification system computes.
696696+697697+**Algorithm:**
698698+699699+1. Collect all `current=true` (non-withdrawn) identifications.
700700+2. For each identification, the identifier's implicit weight in the consensus is 1 vote for the proposed taxon and all its ancestors.
701701+3. Compute per-taxon vote counts from all current identifications.
702702+4. Compute the total voter count N = number of distinct current identifiers.
703703+5. Derive probabilities: `p(taxon) = votes_for_taxon / N`.
704704+6. Use the **scientific name** of each taxon as the class name string.
705705+7. Only include taxa at species rank or finer in the distribution (higher ranks can be included as properties or annotations separately).
706706+707707+**Example:** 3 identifiers propose *Calochortus vestae*, 1 proposes *Calochortus venustus*. Both taxa have *Calochortus* as parent.
708708+709709+- *Calochortus vestae*: 3/4 = 0.75
710710+- *Calochortus venustus*: 1/4 = 0.25
711711+- *Calochortus* (genus): 4/4 = 1.0 (but we may omit genus-level entries in favour of species)
712712+713713+Recommended: include only the finest-rank taxa proposed by at least one identifier, normalised to sum to 1.0:
714714+715715+```
716716+class_dist = [("Calochortus vestae", 0.75); ("Calochortus venustus", 0.25)]
717717+```
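
The recommended variant can be sketched as below. It is a simplification under stated assumptions: each identification already carries its taxon name and `rank_level`, "finest rank" is taken to mean the lowest `rank_level` among current identifications, and coarser identifications are dropped rather than redistributed. The `identification` record and `class_dist` function are hypothetical, not the Terradots API.

```ocaml
(* Hypothetical, flattened view of an iNaturalist identification. *)
type identification = {
  user_id : int;
  taxon_name : string;
  rank_level : float;
  current : bool;
}

let class_dist (ids : identification list) : (string * float) list =
  (* Step 1: keep only current (non-withdrawn) identifications. *)
  match List.filter (fun i -> i.current) ids with
  | [] -> []
  | (first :: rest) as current ->
    (* Find the finest rank proposed (lowest rank_level). *)
    let finest =
      List.fold_left (fun m i -> min m i.rank_level) first.rank_level rest in
    let leaves = List.filter (fun i -> i.rank_level = finest) current in
    (* Count votes per taxon name, preserving first-seen order. *)
    let counts =
      List.fold_left
        (fun acc i ->
          if List.mem_assoc i.taxon_name acc then
            List.map
              (fun (n, c) -> if n = i.taxon_name then (n, c + 1) else (n, c))
              acc
          else acc @ [ (i.taxon_name, 1) ])
        [] leaves
    in
    (* Normalise so the probabilities sum to 1.0. *)
    let total = float_of_int (List.length leaves) in
    List.map (fun (n, c) -> (n, float_of_int c /. total)) counts
```

Three current identifications of *Calochortus vestae* and one of *Calochortus venustus* produce the distribution shown above. The association list is quadratic in the number of distinct taxa, which is acceptable for the handful of identifications a typical observation has.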
718718+719719+**Single-taxon research grade case:** All identifiers agree → `class_dist = [("Calochortus vestae", 1.0)]`.
720720+721721+**Degenerate case (community taxon at genus level):** Report the genus as the sole class:
722722+723723+```
724724+class_dist = [("Calochortus", 1.0)]
725725+```

**No identifications or community taxon only at family or above:** `class_dist = []` (unclassified) or `class_dist = [("Liliaceae", 1.0)]` depending on the use case.
728728+729729+**Alternative simpler approach** (for implementors who only want the community consensus):
730730+731731+Use `community_taxon.name` as the sole class with `confidence` derived from `quality_grade`:
732732+733733+```
734734+class_dist = [("Calochortus vestae", 1.0)]
735735+confidence = 1.0 (* research grade *)
736736+```
737737+738738+The richer approach using all identifications is recommended because it preserves uncertainty information that is lost in the simpler approach.
739739+740740+### 16.4 quality_grade → confidence
741741+742742+| `quality_grade` | Suggested `confidence` | Rationale |
743743+|-----------------|----------------------|-----------|
| `"research"` | `0.95` (or `1.0`) | Community consensus at species level with more than 2/3 agreement; high reliability. |
745745+| `"needs_id"` | `0.5` | Some identification activity but no consensus yet. |
746746+| `"casual"` | `0.1` (or omit) | May lack date, location, or evidence; or organism not wild. Very low reliability. |
747747+748748+Also store the raw `quality_grade` string in `properties [("quality_grade", quality_grade)]` so downstream systems can apply their own thresholds.
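
As a sketch, the table and the raw-value property can be produced together. The numeric values are the suggested ones from the table, not fixed constants, and both function names are hypothetical.

```ocaml
(* Suggested confidence per quality grade; [None] for unrecognised grades. *)
let confidence_of_quality_grade = function
  | "research" -> Some 0.95
  | "needs_id" -> Some 0.5
  | "casual" -> Some 0.1
  | _ -> None

(* The raw grade is always kept so consumers can apply their own thresholds. *)
let quality_grade_property grade = ("quality_grade", grade)
```

Returning an option lets the importer omit `confidence` for grades it does not recognise rather than inventing a value.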
749749+750750+### 16.5 Photos and Sounds → properties
751751+752752+Terradots has no dedicated media field. Store photo information as properties:
753753+754754+```
755755+properties = [
756756+ ("inaturalist_photo_count", "2");
757757+ ("inaturalist_photo_0_id", "61482854");
758758+ ("inaturalist_photo_0_url", "https://static.inaturalist.org/photos/61482854/medium.jpg");
759759+ ("inaturalist_photo_0_license","CC-BY-NC-4.0");
760760+ ("inaturalist_photo_0_attr", "(c) Alice Smith, some rights reserved (CC BY-NC)");
761761+ ("inaturalist_sound_count", "0");
762762+]
763763+```
764764+765765+**Limitation:** The Terradots `properties` field is `(string * string) list`, which means we linearise the photo array. A proper extension would add a `media` field typed as a list of `{ url; license; attribution; role }` records — see §17.1.
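
The linearisation shown above can be generated mechanically from a photo list. A minimal sketch, assuming a hypothetical `photo` record whose `license` has already been mapped to SPDX; the key names follow the example, and the license key is simply omitted for all-rights-reserved photos.

```ocaml
(* Hypothetical photo record; [license] is an SPDX identifier when present. *)
type photo = {
  id : int;
  url : string;
  license : string option;
  attribution : string;
}

let photo_properties (photos : photo list) : (string * string) list =
  let count = ("inaturalist_photo_count", string_of_int (List.length photos)) in
  let per_photo i p =
    (* Keys are indexed by the photo's position in the list. *)
    let key suffix = Printf.sprintf "inaturalist_photo_%d_%s" i suffix in
    [ (key "id", string_of_int p.id); (key "url", p.url) ]
    @ (match p.license with Some l -> [ (key "license", l) ] | None -> [])
    @ [ (key "attr", p.attribution) ]
  in
  count :: List.concat (List.mapi per_photo photos)
```

Sounds could be handled the same way with an `inaturalist_sound_` prefix.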
766766+767767+### 16.6 Annotations → properties

Store each annotation as a property pair keyed by the sanitised attribute label, with the value label as the property value:
770770+771771+```
772772+properties = [
773773+ ("annotation_life_stage", "Adult");
774774+ ("annotation_sex", "Female");
775775+ ("annotation_plant_phenology", "Flowering");
776776+ ("annotation_alive_or_dead", "Alive");
777777+ ("annotation_evidence", "Organism");
778778+]
779779+```
780780+781781+This is lossy in that it discards vote counts. If vote scores are needed, include them:
782782+783783+```
784784+ ("annotation_life_stage_votes", "3");
785785+```
786786+787787+**Better approach:** The Terradots `annotation` type (free-text `text` anchored to label IDs) is not a natural fit for structured controlled-vocabulary annotations. See §17.2 for the recommended extension.
788788+789789+### 16.7 Observation Fields → properties
790790+791791+Store as `("ofv_{field_name}", value)` pairs:
792792+793793+```
794794+properties = [
795795+ ("ofv_associated_species", "Quercus robur");
796796+ ("ofv_microhabitat", "bark");
797797+ ("ofv_behaviour", "foraging");
798798+]
799799+```
800800+801801+Sanitise `field_name` by lowercasing and replacing spaces/punctuation with underscores. Collisions are possible (two fields with the same name after normalisation); suffix with field ID if necessary.
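
One possible sanitisation routine is sketched below. It lowercases ASCII letters, collapses every run of other characters into a single underscore, and trims underscores at both ends; non-ASCII bytes are treated as separators, which is a simplifying assumption. `sanitise_field_name` and `property_key` are hypothetical names.

```ocaml
let sanitise_field_name name =
  let buf = Buffer.create (String.length name) in
  (* Start as if a separator was just written, so leading punctuation is dropped. *)
  let last_was_sep = ref true in
  String.iter
    (fun c ->
      match Char.lowercase_ascii c with
      | ('a' .. 'z' | '0' .. '9') as lc ->
          Buffer.add_char buf lc;
          last_was_sep := false
      | _ ->
          if not !last_was_sep then begin
            Buffer.add_char buf '_';
            last_was_sep := true
          end)
    name;
  (* Drop a trailing underscore left by closing punctuation. *)
  let s = Buffer.contents buf in
  let n = String.length s in
  if n > 0 && s.[n - 1] = '_' then String.sub s 0 (n - 1) else s

(* Pass [~field_id] to disambiguate names that collide after sanitising. *)
let property_key ?field_id name =
  let base = "ofv_" ^ sanitise_field_name name in
  match field_id with
  | Some id -> Printf.sprintf "%s_%d" base id
  | None -> base
```

For instance, `property_key "Associated species"` gives `"ofv_associated_species"`, and adding `~field_id:25` appends `_25`.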
802802+803803+### 16.8 Projects → groups
804804+805805+Each project that includes the observation (from `project_ids`) should be represented as a Terradots `group`:
806806+807807+```ocaml
808808+{
809809+ id = "inaturalist:project/4";
810810+ activity = Some activity_id;
811811+ members = [ label_id_1; label_id_2; ... ];
812812+}
813813+```
814814+815815+**Collection projects** are automatically computed by iNaturalist and are equivalent to saved searches; these may not need to be materialised as Terradots groups unless the use case requires it. **Traditional projects** with explicit membership are more meaningful to preserve as groups.
816816+817817+### 16.9 Import Activity Record
818818+819819+Create one `activity` record for the import batch:
820820+821821+```ocaml
822822+{
823823+ activity_id = "import-inaturalist-2025-09-01";
824824+ agent = "inaturalist-importer:v1";
825825+ date = "2025-09-01";
826826+ description = Some "Bulk import of iNaturalist research-grade observations for taxon 64083";
827827+}
828828+```
829829+830830+---
831831+832832+## 17. Concepts That Don't Map Cleanly
833833+834834+### 17.1 Media Attachments
835835+836836+**Problem:** Terradots has no `media` field. Photos and sounds are first-class scientific evidence in iNaturalist — the photo IS the observation evidence, not just metadata.
837837+838838+**Recommended extension:**
839839+840840+```ocaml
841841+type media_item = {
842842+ url : string;
843843+ license : string option; (* SPDX *)
844844+ attribution : string;
845845+ role : string; (* "photo" | "sound" *)
846846+ position : int; (* ordering within observation *)
847847+}
848848+```
849849+850850+Add `media : media_item list` to the `label` type, or model it as a separate document-level collection keyed by label ID.
851851+852852+### 17.2 Structured Annotations (Controlled Vocabulary)
853853+854854+**Problem:** The Terradots `annotation` type is free-text (`text : string`) intended for human commentary. iNaturalist annotations are structured (attribute ID + value ID + vote count) controlled vocabulary terms. Storing them as flat properties loses the structure.
855855+856856+**Recommended extension:** Add a `structured_annotation` type:
857857+858858+```ocaml
859859+type structured_annotation = {
860860+ label_id : string;
861861+ attribute : string; (* e.g. "Life Stage" *)
862862+ value : string; (* e.g. "Adult" *)
863863+ votes : int; (* net positive votes *)
864864+ source : string; (* e.g. "inaturalist:annotation/12345" *)
865865+}
866866+```
867867+868868+Or model annotations as a sub-`class_dist` over trait categories.
869869+870870+### 17.3 Identification History and Withdrawn IDs
871871+872872+**Problem:** Terradots only stores the current state. iNaturalist's identification history (which IDs were proposed, withdrawn, and in what order, including maverick IDs) is a rich audit trail.
873873+874874+**Partial solution:** Store the full identification JSON blob in a property:
875875+876876+```
877877+properties = [("inaturalist_identifications_json", "...")]
878878+```
879879+880880+Or represent each historical identification as a separate `Derived` label with `method_="inaturalist-identification"`, sourced from the observation label. This adds complexity but preserves full provenance.
881881+882882+### 17.4 Community Taxon vs. Observer Taxon Divergence
883883+884884+**Problem:** When an observer opts out of community ID, `taxon_id` (observer's ID) and `community_taxon_id` differ. The Terradots `class_dist` naturally represents the community view; the observer's dissent is an outlier.
885885+886886+**Solution:** Record both:
887887+- `class_dist` — derived from community identifications
- `properties [("observer_taxon", taxon.name)]` — observer's own ID
889889+- `properties [("observer_opted_out_of_community_id", "true")]` — flag when relevant
890890+891891+### 17.5 Obscured Coordinates and Downstream ML Use
892892+893893+**Problem:** When `obscured=true`, the `latitude`/`longitude` in the API response are randomised within a ~0.2° cell. The true coordinates are not available. Storing these as a Point geometry gives a false sense of precision.
894894+895895+**Solutions (in order of preference):**
896896+1. Set `accuracy_m` to `positional_accuracy` (which iNaturalist sets to the cell diameter, ~22,000 m). This correctly signals that the point location is unreliable.
897897+2. Store the 0.2° bounding box as a `Polygon` geometry (if the cell corners can be computed), making the uncertainty geometrically explicit.
898898+3. Tag with `properties [("geoprivacy", "obscured")]` and filter these out of precision-sensitive analyses.
899899+900900+**Private observations** (`geoprivacy="private"`) have no coordinates at all; they cannot be imported as Point labels. They can be imported as unlocated metadata with a null/stub geometry, or skipped entirely.
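
Option 2 above can be sketched as follows. This is only an illustration: it assumes obscuration cells are aligned to a regular 0.2° grid anchored at (0, 0), which should be checked against iNaturalist's actual gridding before use. The `point` and `geometry` types are local copies of the Terradots shapes, and `snap`, `obscuration_cell`, and `geometry_for` are hypothetical helper names.

```ocaml
(* Local copies of the Terradots geometry shapes, so the sketch stands alone. *)
type point = { x : float; y : float }
type geometry = Point of point | Polygon of point list

(* Assumed obscuration cell size in degrees. *)
let cell_deg = 0.2

(* Snap a coordinate down to the lower edge of its (assumed) grid cell. *)
let snap v = floor (v /. cell_deg) *. cell_deg

(* Build the closed exterior ring of the cell containing (lon, lat). *)
let obscuration_cell ~lon ~lat =
  let x0 = snap lon and y0 = snap lat in
  let x1 = x0 +. cell_deg and y1 = y0 +. cell_deg in
  Polygon [
    { x = x0; y = y0 }; { x = x1; y = y0 };
    { x = x1; y = y1 }; { x = x0; y = y1 };
    { x = x0; y = y0 };  (* repeat the first vertex to close the ring *)
  ]

(* Use the cell polygon for obscured observations, a plain point otherwise. *)
let geometry_for ~obscured ~lon ~lat =
  if obscured then obscuration_cell ~lon ~lat
  else Point { x = lon; y = lat }
```

The polygon makes no claim about where inside the cell the organism was seen, which is the point of preferring it over a `Point` with a large `accuracy_m`.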
901901+902902+### 17.6 Taxon Changes (Synonymisation, Splits, Lumps)
903903+904904+**Problem:** iNaturalist's taxonomy is dynamic. A taxon may be synonymised into another, split into several taxa, or lumped with others. The `taxon.is_active=false` flag signals this, and `current_synonymous_taxon_ids` points to the replacement. Labels imported at one point in time may therefore carry stale taxon names.
905905+906906+**Solution:** Record the taxon ID and name at import time:
907907+908908+```
909909+properties = [
910910+ ("inaturalist_taxon_id", "64083");
911911+ ("inaturalist_taxon_name", "Calochortus vestae");
912912+ ("inaturalist_taxon_rank", "species");
913913+ ("inaturalist_taxon_is_active", "true");
914914+]
915915+```
916916+917917+When re-importing, compare against stored taxon ID and emit a `Derived` label if the taxon has changed.
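
A minimal sketch of the re-import comparison, under the assumption that the importer keeps the stored taxon ID and receives the current taxon record from the API. The record `taxon_snapshot`, the variant `taxon_change`, and `detect_change` are hypothetical names, not part of Terradots or the iNaturalist API; only `is_active` and the synonym ID list correspond to fields mentioned above.

```ocaml
(* Hypothetical view of the taxon attached to an observation at re-import time. *)
type taxon_snapshot = {
  taxon_id : int;
  is_active : bool;
  synonym_ids : int list;  (* from current_synonymous_taxon_ids *)
}

type taxon_change =
  | Unchanged
  | Reidentified of int      (* the observation now points at a different taxon *)
  | Superseded of int list   (* same taxon, now inactive; replacement IDs *)

(* Compare the taxon ID stored at import time with the current record. *)
let detect_change ~stored_id (current : taxon_snapshot) =
  if current.taxon_id <> stored_id then Reidentified current.taxon_id
  else if not current.is_active then Superseded current.synonym_ids
  else Unchanged
```

Any result other than `Unchanged` would be the trigger for emitting the `Derived` label; how that label is built is left to the importer.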
918918+919919+### 17.7 Observation Field Namespace Collisions
920920+921921+**Problem:** iNaturalist observation fields are user-created and have no global namespace. Two fields named "Behaviour" by different users may mean different things. Field IDs are globally unique but names are not.
922922+923923+**Solution:** Key properties by field ID rather than field name:
924924+925925+```
926926+properties = [
927927+ ("ofv_id_25", "Quercus robur"); (* field 25: Associated species *)
928928+ ("ofv_id_25_name", "Associated species");
929929+]
930930+```
931931+932932+This is verbose but unambiguous.
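
The keying scheme can be generated mechanically. The sketch below assumes each observation field value arrives with a numeric field ID, a field name, and a string value; the record `ofv` and the function names are hypothetical.

```ocaml
(* Hypothetical shape of one observation field value. *)
type ofv = { field_id : int; field_name : string; value : string }

(* One field value becomes two properties: the value keyed by field ID,
   and the human-readable field name under a derived key. *)
let properties_of_ofv (o : ofv) : (string * string) list =
  let key = Printf.sprintf "ofv_id_%d" o.field_id in
  [ (key, o.value); (key ^ "_name", o.field_name) ]

(* Flatten all field values of an observation into one property list. *)
let properties_of_ofvs (ofvs : ofv list) : (string * string) list =
  List.concat_map properties_of_ofv ofvs
```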
933933+934934+### 17.8 Vote/Fave Metadata
935935+936936+iNaturalist observations carry `faves_count` and vote data. These have no Terradots equivalent and are social engagement metrics rather than scientific metadata. Store as properties if needed:
937937+938938+```
939939+properties = [("inaturalist_faves_count", "12")]
940940+```
941941+942942+### 17.9 Species Checklists and Range Data
943943+944944+iNaturalist places have associated species checklists ("this species has been recorded in this region"). These are not observations and have no direct Terradots equivalent. They could be imported as `Derived` labels (derived from the observation corpus) or handled at the document/groups level as regional metadata.
945945+946946+---
947947+948948+## 18. Summary Mapping Table
949949+950950+```
951951+iNaturalist concept → Terradots field / type
952952+───────────────────────────────────────────────────────────────────────────
953953+observation → label (via make_imported)
954954+latitude, longitude → geometry = Point { x=lon; y=lat }
955955+positional_accuracy → origin.Measured.accuracy_m
956956+observed_on / time_observed_at → event_date
957957+id → origin.Measured.via = "inaturalist:observation/{id}"
958958+uuid → properties [("inaturalist_uuid", ...)]
959959+user.id → origin.Measured.observer = "inaturalist:user/{id}"
960960+user.orcid → origin.Measured.observer = "orcid:{orcid}" (preferred)
961961+license_code → origin.Measured.license (SPDX mapped)
962962+community taxon + identifications → class_dist (probability distribution)
963963+quality_grade = "research" → confidence = 0.95
964964+quality_grade = "needs_id" → confidence = 0.5
965965+quality_grade = "casual" → confidence = 0.1
966966+quality_grade (raw) → properties [("quality_grade", ...)]
967967+geoprivacy / obscured → properties [("geoprivacy", ...)]
968968+taxon.name → class_dist primary class name string
969969+taxon.iconic_taxon_name → properties [("iconic_taxon", ...)]
970970+taxon.rank → properties [("taxon_rank", ...)]
971971+taxon.ancestry → properties [("taxon_ancestry", ...)]
972972+taxon.gbif_id → properties [("gbif_taxon_id", ...)]
973973+photos (array) → properties [("inaturalist_photo_N_...", ...)] ★ EXTENSION NEEDED
974974+sounds (array) → properties [("inaturalist_sound_N_...", ...)] ★ EXTENSION NEEDED
975975+annotations (controlled vocab) → properties [("annotation_{attr}", value)] ★ EXTENSION NEEDED
976976+observation_field_values → properties [("ofv_id_{id}", value)]
977977+project_ids (traditional) → groups
978978+description → properties [("description", ...)]
979979+place_guess → properties [("place_guess", ...)]
980980+captive=true → properties [("captive", "true")]
981981+species_guess → properties [("species_guess", ...)]
982982+import batch → activity record
983983+```
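
A few rows of the table translate directly into small functions. The sketch below is illustrative only: the function names are hypothetical, `orcid` is assumed to be a bare identifier rather than a full URL, and unknown quality grades are returned as `None` instead of being given a guessed confidence.

```ocaml
(* quality_grade -> confidence, using the values from the table above. *)
let confidence_of_quality_grade = function
  | "research" -> Some 0.95
  | "needs_id" -> Some 0.5
  | "casual" -> Some 0.1
  | _ -> None  (* unknown grade: leave the decision to the caller *)

(* id -> origin.Measured.via *)
let via_uri ~observation_id =
  Printf.sprintf "inaturalist:observation/%d" observation_id

(* user.orcid is preferred over user.id for origin.Measured.observer. *)
let observer_uri ~user_id ~orcid =
  match orcid with
  | Some o -> "orcid:" ^ o
  | None -> Printf.sprintf "inaturalist:user/%d" user_id
```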
984984+985985+### Items requiring Terradots model extensions (★)
986986+987987+1. **Media field** (`media : media_item list`) — photos and sounds as structured list with URL, license, attribution, role.
988988+2. **Structured annotations** — controlled-vocabulary attribute/value pairs with vote scores, distinct from free-text `annotation` records.
989989+3. **Polygon geometry for obscured observations** — represent the 0.2° × 0.2° obscuration cell as a Polygon rather than a misleading Point.
990990+```
···11+<!DOCTYPE html>
22+<html lang="en">
33+<head>
44+<meta charset="UTF-8">
55+<meta name="viewport" content="width=device-width, initial-scale=1.0">
66+<title>Terradots Label Store — Specification</title>
77+<style>
88+/* ═══════════════════════════════════════════════════════════
99+ CSS — self-contained, no external dependencies
1010+ ═══════════════════════════════════════════════════════════ */
1111+:root {
1212+ --fg: #1a1a2e;
1313+ --bg: #ffffff;
1414+ --bg-alt: #f8f9fc;
1515+ --muted: #6b7280;
1616+ --border: #e2e8f0;
1717+ --accent: #2563eb;
1818+ --accent-light: #eff6ff;
1919+ --code-bg: #f1f5f9;
2020+ --header-bg: #0f172a;
2121+ --header-fg: #f1f5f9;
2222+2323+ /* layer colours */
2424+ --c-camera: #ea580c;
2525+ --c-gps: #2563eb;
2626+ --c-gbif: #16a34a;
2727+ --c-inat: #0d9488;
2828+ --c-iucn: #dc2626;
2929+ --c-sim: #7c3aed;
3030+ --c-habitat: #65a30d;
3131+ --c-habitat-fill: rgba(101,163,13,0.25);
3232+ --c-range: #3b82f6;
3333+ --c-aoh: #16a34a;
3434+ --c-aoh-fill: rgba(22,163,74,0.3);
3535+3636+ --font-mono: 'SF Mono', 'Cascadia Code', 'Fira Code', 'Consolas', monospace;
3737+ --font-sans: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif;
3838+}
3939+4040+* { margin: 0; padding: 0; box-sizing: border-box; }
4141+4242+body {
4343+ font-family: var(--font-sans);
4444+ color: var(--fg);
4545+ background: var(--bg);
4646+ line-height: 1.7;
4747+ font-size: 15px;
4848+}
4949+5050+/* ── Header ────────────────────────────────────────────── */
5151+header {
5252+ background: var(--header-bg);
5353+ color: var(--header-fg);
5454+ padding: 2.5rem 2rem 2rem;
5555+}
5656+header .inner {
5757+ max-width: 1200px;
5858+ margin: 0 auto;
5959+}
6060+header h1 {
6161+ font-size: 1.8rem;
6262+ font-weight: 700;
6363+ letter-spacing: -0.02em;
6464+}
6565+header .subtitle {
6666+ color: #94a3b8;
6767+ font-size: 1rem;
6868+ margin-top: 0.4rem;
6969+}
7070+header .meta {
7171+ margin-top: 0.8rem;
7272+ font-size: 0.8rem;
7373+ color: #64748b;
7474+}
7575+7676+/* ── Layout ────────────────────────────────────────────── */
7777+main {
7878+ max-width: 1200px;
7979+ margin: 0 auto;
8080+ padding: 0 2rem 6rem;
8181+}
8282+section {
8383+ margin-top: 2.5rem;
8484+}
8585+h2 {
8686+ font-size: 1.35rem;
8787+ font-weight: 700;
8888+ margin-bottom: 0.8rem;
8989+ padding-bottom: 0.4rem;
9090+ border-bottom: 2px solid var(--border);
9191+ letter-spacing: -0.01em;
9292+}
9393+h3 {
9494+ font-size: 1.05rem;
9595+ font-weight: 600;
9696+ margin-top: 1.5rem;
9797+ margin-bottom: 0.5rem;
9898+ color: #334155;
9999+}
100100+h4 {
101101+ font-size: 0.95rem;
102102+ font-weight: 600;
103103+ margin-top: 1.2rem;
104104+ margin-bottom: 0.4rem;
105105+}
106106+p { margin-bottom: 0.9rem; }
107107+a { color: var(--accent); text-decoration: none; }
108108+a:hover { text-decoration: underline; }
109109+ul, ol { margin-bottom: 0.9rem; padding-left: 1.5rem; }
110110+li { margin-bottom: 0.3rem; }
111111+strong { font-weight: 600; }
112112+113113+/* ── Code blocks ───────────────────────────────────────── */
114114+code {
115115+ font-family: var(--font-mono);
116116+ font-size: 0.88em;
117117+ background: var(--code-bg);
118118+ padding: 0.15em 0.4em;
119119+ border-radius: 4px;
120120+}
121121+pre {
122122+ background: var(--code-bg);
123123+ border: 1px solid var(--border);
124124+ border-radius: 6px;
125125+ padding: 1rem 1.2rem;
126126+ overflow-x: auto;
127127+ font-family: var(--font-mono);
128128+ font-size: 0.82rem;
129129+ line-height: 1.6;
130130+ margin-bottom: 1rem;
131131+}
132132+pre code {
133133+ background: none;
134134+ padding: 0;
135135+}
136136+137137+/* ── Tables ────────────────────────────────────────────── */
138138+table {
139139+ width: 100%;
140140+ border-collapse: collapse;
141141+ margin-bottom: 1rem;
142142+ font-size: 0.88rem;
143143+}
144144+th, td {
145145+ text-align: left;
146146+ padding: 0.5rem 0.8rem;
147147+ border-bottom: 1px solid var(--border);
148148+}
149149+th {
150150+ font-weight: 600;
151151+ background: var(--bg-alt);
152152+}
153153+td code {
154154+ font-size: 0.85em;
155155+}
156156+157157+/* ── Map container ─────────────────────────────────────── */
158158+.map-container {
159159+ display: grid;
160160+ grid-template-columns: 1fr 280px;
161161+ gap: 0;
162162+ border: 1px solid var(--border);
163163+ border-radius: 8px;
164164+ overflow: hidden;
165165+ background: var(--bg-alt);
166166+ margin-bottom: 1.5rem;
167167+}
168168+@media (max-width: 800px) {
169169+ .map-container { grid-template-columns: 1fr; }
170170+}
171171+172172+.map-svg-wrap {
173173+ position: relative;
174174+ min-height: 520px;
175175+ background: #f0f4f8;
176176+}
177177+.map-svg-wrap svg {
178178+ width: 100%;
179179+ height: 100%;
180180+ display: block;
181181+}
182182+183183+.map-sidebar {
184184+ background: #fff;
185185+ border-left: 1px solid var(--border);
186186+ padding: 1rem;
187187+ overflow-y: auto;
188188+ max-height: 620px;
189189+ font-size: 0.82rem;
190190+}
191191+.map-sidebar h3 {
192192+ font-size: 0.95rem;
193193+ margin-top: 0;
194194+ margin-bottom: 0.6rem;
195195+ color: var(--fg);
196196+}
197197+198198+/* ── Layer toggles ─────────────────────────────────────── */
199199+.layer-controls {
200200+ display: flex;
201201+ flex-wrap: wrap;
202202+ gap: 0.3rem 0.8rem;
203203+ padding: 0.7rem 1rem;
204204+ background: #fff;
205205+ border-bottom: 1px solid var(--border);
206206+ font-size: 0.82rem;
207207+}
208208+.layer-controls label {
209209+ display: inline-flex;
210210+ align-items: center;
211211+ gap: 0.3rem;
212212+ cursor: pointer;
213213+ white-space: nowrap;
214214+}
215215+.layer-controls .swatch {
216216+ display: inline-block;
217217+ width: 12px;
218218+ height: 12px;
219219+ border-radius: 3px;
220220+ border: 1px solid rgba(0,0,0,0.15);
221221+}
222222+223223+/* ── Stats bar ─────────────────────────────────────────── */
224224+.stats-grid {
225225+ display: grid;
226226+ grid-template-columns: repeat(auto-fit, minmax(100px, 1fr));
227227+ gap: 0.5rem;
228228+ margin-bottom: 1rem;
229229+}
230230+.stat-card {
231231+ background: var(--bg-alt);
232232+ border: 1px solid var(--border);
233233+ border-radius: 6px;
234234+ padding: 0.6rem 0.7rem;
235235+ text-align: center;
236236+}
237237+.stat-card .val {
238238+ font-size: 1.3rem;
239239+ font-weight: 700;
240240+ color: var(--accent);
241241+}
242242+.stat-card .lbl {
243243+ font-size: 0.72rem;
244244+ color: var(--muted);
245245+ text-transform: uppercase;
246246+ letter-spacing: 0.04em;
247247+}
248248+249249+/* ── Detail panel ──────────────────────────────────────── */
250250+.detail-panel {
251251+ background: #fff;
252252+ border: 1px solid var(--border);
253253+ border-radius: 6px;
254254+ padding: 0.8rem;
255255+ margin-top: 0.5rem;
256256+ font-size: 0.8rem;
257257+ line-height: 1.5;
258258+ display: none;
259259+}
260260+.detail-panel.show { display: block; }
261261+.detail-panel .dp-title {
262262+ font-weight: 700;
263263+ font-size: 0.9rem;
264264+ margin-bottom: 0.4rem;
265265+ color: var(--fg);
266266+}
267267+.detail-panel .dp-row {
268268+ display: flex;
269269+ gap: 0.5rem;
270270+ padding: 0.2rem 0;
271271+ border-bottom: 1px solid #f1f5f9;
272272+}
273273+.detail-panel .dp-key {
274274+ font-weight: 600;
275275+ min-width: 90px;
276276+ color: #475569;
277277+ font-family: var(--font-mono);
278278+ font-size: 0.78rem;
279279+}
280280+.detail-panel .dp-val {
281281+ color: #334155;
282282+ word-break: break-all;
283283+}
284284+285285+/* ── SVG interactive ───────────────────────────────────── */
286286+svg .map-point { cursor: pointer; transition: r 0.15s; }
287287+svg .map-point:hover { r: 7; }
288288+svg .map-poly { cursor: pointer; }
289289+svg .map-poly:hover { opacity: 0.85; }
290290+291291+/* ── Provenance DAG ────────────────────────────────────── */
292292+.dag-container {
293293+ background: var(--bg-alt);
294294+ border: 1px solid var(--border);
295295+ border-radius: 8px;
296296+ overflow-x: auto;
297297+ padding: 1rem;
298298+ margin-bottom: 1.5rem;
299299+}
300300+.dag-container svg {
301301+ display: block;
302302+ margin: 0 auto;
303303+}
304304+305305+/* ── Training cycle diagram ────────────────────────────── */
306306+.cycle-diagram {
307307+ background: var(--bg-alt);
308308+ border: 1px solid var(--border);
309309+ border-radius: 8px;
310310+ overflow-x: auto;
311311+ padding: 1rem;
312312+ margin-bottom: 1.5rem;
313313+}
314314+315315+/* ── Badges ────────────────────────────────────────────── */
316316+.badge {
317317+ display: inline-block;
318318+ font-size: 0.7rem;
319319+ font-family: var(--font-mono);
320320+ padding: 0.15em 0.5em;
321321+ border-radius: 3px;
322322+ vertical-align: middle;
323323+}
324324+.badge-measured { background: #dbeafe; color: #1e40af; }
325325+.badge-derived { background: #fef3c7; color: #92400e; }
326326+.badge-simulated { background: #ede9fe; color: #5b21b6; }
327327+.badge-abstract { background: #f1f5f9; color: #475569; }
328328+329329+/* ── TOC ───────────────────────────────────────────────── */
330330+.toc {
331331+ background: var(--bg-alt);
332332+ border: 1px solid var(--border);
333333+ border-radius: 8px;
334334+ padding: 1.2rem 1.5rem;
335335+ margin-bottom: 2rem;
336336+}
337337+.toc h3 {
338338+ font-size: 0.9rem;
339339+ margin-top: 0;
340340+ margin-bottom: 0.6rem;
341341+ text-transform: uppercase;
342342+ letter-spacing: 0.05em;
343343+ color: var(--muted);
344344+}
345345+.toc ol {
346346+ padding-left: 1.2rem;
347347+ font-size: 0.88rem;
348348+}
349349+.toc li {
350350+ margin-bottom: 0.2rem;
351351+}
352352+.toc a {
353353+ color: var(--fg);
354354+}
355355+.toc a:hover {
356356+ color: var(--accent);
357357+}
358358+359359+/* ── Principle box ─────────────────────────────────────── */
360360+.principle {
361361+ background: var(--accent-light);
362362+ border-left: 3px solid var(--accent);
363363+ padding: 0.8rem 1rem;
364364+ margin-bottom: 1rem;
365365+ border-radius: 0 6px 6px 0;
366366+}
367367+.principle strong {
368368+ color: var(--accent);
369369+}
370370+</style>
371371+</head>
372372+<body>
373373+374374+<!-- ═══════════════════════════════════════════════════════
375375+ HEADER
376376+ ═══════════════════════════════════════════════════════ -->
377377+<header>
378378+ <div class="inner">
379379+ <h1>Terradots Label Store — Specification</h1>
380380+ <div class="subtitle">A data model for geospatial labels with full provenance, uncertainty, and spatial indexing</div>
381381+ <div class="meta">Worked example: Area of Habitat for <em>Panthera leo</em> in the Serengeti · 23 labels · 10 activities · CRS EPSG:4326 · Hilbert level 12</div>
382382+ </div>
383383+</header>
384384+385385+<main>
386386+387387+<!-- ═══════════════════════════════════════════════════════
388388+ TABLE OF CONTENTS
389389+ ═══════════════════════════════════════════════════════ -->
390390+<section>
391391+<div class="toc">
392392+ <h3>Contents</h3>
393393+ <ol>
394394+ <li><a href="#map">Interactive Map — AOH Worked Example</a></li>
395395+ <li><a href="#provenance">Provenance Graph</a></li>
396396+ <li><a href="#training-cycle">Training Cycle</a></li>
397397+ <li><a href="#design">Design Principles</a></li>
398398+ <li><a href="#types">Type Specification</a></li>
399399+ <li><a href="#constructors">Constructors</a></li>
400400+ <li><a href="#accessors">Accessors</a></li>
401401+ <li><a href="#fingerprinting">Fingerprinting</a></li>
402402+ <li><a href="#storage">Storage Layer</a></li>
403403+ </ol>
404404+</div>
405405+</section>
406406+407407+<!-- ═══════════════════════════════════════════════════════
408408+ 1. INTERACTIVE MAP
409409+ ═══════════════════════════════════════════════════════ -->
410410+<section id="map">
411411+<h2>1. Interactive Map — AOH Worked Example</h2>
412412+413413+<p>This map shows all 23 labels from the <em>Panthera leo</em> Area of Habitat example plotted at their real
414414+WGS 84 coordinates in the Serengeti ecosystem. Click any label to see its full metadata. Use the
415415+layer toggles below to show or hide each data source.</p>
416416+417417+<!-- Statistics -->
418418+<div class="stats-grid">
419419+ <div class="stat-card"><div class="val">23</div><div class="lbl">Total Labels</div></div>
420420+ <div class="stat-card"><div class="val">14</div><div class="lbl">Measured</div></div>
421421+ <div class="stat-card"><div class="val">3</div><div class="lbl">Simulated</div></div>
422422+ <div class="stat-card"><div class="val">6</div><div class="lbl">Derived</div></div>
423423+ <div class="stat-card"><div class="val">3,420</div><div class="lbl">AOH km²</div></div>
424424+ <div class="stat-card"><div class="val">70.5%</div><div class="lbl">Habitat</div></div>
425425+</div>
426426+427427+<!-- Layer toggles -->
428428+<div class="layer-controls" id="layerControls">
429429+ <label><input type="checkbox" data-layer="camera" checked><span class="swatch" style="background:var(--c-camera)"></span> Camera Traps</label>
430430+ <label><input type="checkbox" data-layer="gps" checked><span class="swatch" style="background:var(--c-gps)"></span> GPS Collars</label>
431431+ <label><input type="checkbox" data-layer="gbif" checked><span class="swatch" style="background:var(--c-gbif)"></span> GBIF</label>
432432+ <label><input type="checkbox" data-layer="inat" checked><span class="swatch" style="background:var(--c-inat)"></span> iNaturalist</label>
433433+ <label><input type="checkbox" data-layer="iucn" checked><span class="swatch" style="background:var(--c-iucn)"></span> IUCN Range</label>
434434+ <label><input type="checkbox" data-layer="iucn-hab" checked><span class="swatch" style="background:var(--c-iucn)"></span> IUCN Habitat</label>
435435+ <label><input type="checkbox" data-layer="sim" checked><span class="swatch" style="background:var(--c-sim)"></span> Simulated (LV)</label>
436436+ <label><input type="checkbox" data-layer="habitat" checked><span class="swatch" style="background:var(--c-habitat)"></span> Habitat Tiles</label>
437437+ <label><input type="checkbox" data-layer="range" checked><span class="swatch" style="background:var(--c-range)"></span> Species Range</label>
438438+ <label><input type="checkbox" data-layer="aoh" checked><span class="swatch" style="background:var(--c-aoh)"></span> AOH Patches</label>
439439+</div>
440440+441441+<!-- Map + sidebar -->
442442+<div class="map-container">
443443+ <div class="map-svg-wrap" id="mapWrap">
444444+ <svg id="mapSvg" viewBox="0 0 800 600" xmlns="http://www.w3.org/2000/svg">
445445+ </svg>
446446+ </div>
447447+ <div class="map-sidebar" id="mapSidebar">
448448+ <h3>Label Details</h3>
449449+ <p style="color:var(--muted); font-size:0.82rem;">Click a label on the map to view its metadata, origin, classification, and properties.</p>
450450+ <div class="detail-panel" id="detailPanel"></div>
451451+ </div>
452452+</div>
453453+</section>
454454+455455+<!-- ═══════════════════════════════════════════════════════
456456+ 2. PROVENANCE GRAPH
457457+ ═══════════════════════════════════════════════════════ -->
458458+<section id="provenance">
459459+<h2>2. Provenance Graph</h2>
460460+<p>The provenance DAG traces how the AOH label (<code>aoh-001</code>) was derived from upstream
461461+sources. Every arrow means “was computed from”. Measured labels (leaves) have no upstream sources.
462462+Simulated labels are marked with dashed borders. Derived labels show their method.</p>
463463+464464+<div class="dag-container" id="dagContainer">
465465+ <svg id="dagSvg" xmlns="http://www.w3.org/2000/svg"></svg>
466466+</div>
467467+468468+<p>The full provenance tree as text:</p>
469469+<pre><code>AOH polygon (aoh-001) method: aoh:iucn-2022:range-intersect-habitat
470470+ +-- species_range (range-001) method: alpha-shape:alpha-0.005
471471+ | +-- ct-001 Camera trap (34.82, -2.33) Measured
472472+ | +-- ct-002 Camera trap (34.83, -2.32) Measured
473473+ | +-- ct-003 Camera trap (35.01, -2.15) Measured
474474+ | +-- gps-001 GPS leo-007 (34.81, -2.34) Measured (via Movebank)
475475+ | +-- gps-002 GPS leo-007 (34.84, -2.31) Measured (via Movebank)
476476+ | +-- gps-003 GPS leo-007 (34.91, -2.28) Measured (via Movebank)
477477+ | +-- gps-004 GPS leo-012 (35.05, -2.10) Measured (via Movebank)
478478+ | +-- gbif-001 GBIF (34.85, -2.35) Measured (via GBIF)
479479+ | +-- gbif-002 GBIF (35.40, -2.50) Measured (via GBIF)
480480+ | +-- inat-001 iNaturalist (34.95, -2.20) Measured (via iNat)
481481+ +-- iucn-range-001 IUCN expert range Measured (via IUCN)
482482+ +-- hab-001 Habitat tile: core savanna Derived from IUCN hab prefs
483483+ | +-- iucn-hab-001 Savanna preference Measured (via IUCN)
484484+ | +-- iucn-hab-002 Shrubland preference Measured (via IUCN)
485485+ +-- hab-002 Habitat tile: savanna-shrubland Derived from IUCN hab prefs
486486+ +-- iucn-hab-001 (shared)
487487+ +-- iucn-hab-002 (shared)</code></pre>
488488+489489+<p>Note: The training set (<code>ts-001</code>) feeds into the habitat classifier but is not a direct
490490+source of the AOH label. It includes all measured observations plus 3 synthetic (Lotka-Volterra)
491491+labels for a 23% synthetic fraction.</p>
492492+</section>
493493+494494+<!-- ═══════════════════════════════════════════════════════
495495+ 3. TRAINING CYCLE
496496+ ═══════════════════════════════════════════════════════ -->
497497+<section id="training-cycle">
498498+<h2>3. Training Cycle</h2>
499499+<p>Labels flow through a <strong>training/inference cycle</strong>, not a streaming recomputation
500500+pipeline. The cycle has discrete phases, each producing labels that become inputs to the next.</p>
501501+502502+<div class="cycle-diagram" id="cycleContainer">
503503+ <svg id="cycleSvg" xmlns="http://www.w3.org/2000/svg"></svg>
504504+</div>
505505+506506+<h3>Cycle Phases</h3>
507507+<ol>
508508+ <li><strong>Observations accumulate.</strong> Camera traps trigger, GPS collars record fixes, citizen
509509+ scientists upload sightings, museum records are digitised. Each produces a <span class="badge badge-measured">Measured</span> label.
510510+ Registry imports (GBIF, Movebank, iNaturalist) carry their <code>via</code> URI for provenance.</li>
511511+512512+ <li><strong>Training set assembled.</strong> A <span class="badge badge-derived">Derived</span> label
513513+ (<code>ts-001</code>) records exactly which observations were selected, plus any
514514+ <span class="badge badge-simulated">Simulated</span> augmentation. The synthetic fraction (here 23%)
515515+ is stored in <code>properties</code> for transparency.</li>
516516+517517+ <li><strong>Model trained.</strong> The TESSERA habitat classifier trains on the assembled set.
518518+ The activity record links the training run to its notebook, parameters, and timestamp.</li>
519519+520520+ <li><strong>Habitat classified.</strong> Each landscape tile receives a suitability classification
521521+ (savanna 78%, shrubland 13%, etc.) expressed as <code>class_dist</code>. Tiles above the threshold
522522+ become suitable; those below (cropland, settlement) are excluded.</li>
523523+524524+ <li><strong>Species range computed.</strong> An alpha-shape from <em>measured-only</em> observations
525525+ (simulated labels excluded via <code>is_simulated</code>). This ensures the range reflects where
526526+ lions have actually been observed.</li>
527527+528528+ <li><strong>AOH computed.</strong> The intersection of species range with suitable habitat tiles,
529529+ validated against the IUCN expert range. Result: 3,420 km² of suitable habitat out of
530530+ 4,850 km² total range (70.5%).</li>
531531+</ol>
532532+533533+<h3>Recomputation</h3>
534534+<p>When new observations arrive or the model retrains (TESSERA v3.1 → v3.2), downstream
535535+derivations recompute. Each recomputation produces a <em>new</em> label with a new activity record;
536536+the old version is retained for comparison. This is not streaming — it is a deliberate
537537+batch cycle where each phase completes before the next begins.</p>
538538+</section>
539539+540540+<!-- ═══════════════════════════════════════════════════════
541541+ 4. DESIGN PRINCIPLES
542542+ ═══════════════════════════════════════════════════════ -->
543543+<section id="design">
544544+<h2>4. Design Principles</h2>
545545+546546+<div class="principle">
547547+ <strong>Coordinates live in CRS space.</strong>
548548+ All coordinates are in the document's native Coordinate Reference System. Pixel-space mapping
549549+ (affine transforms, SVG viewBox) is a serialisation concern, not a data model concern. The CRS
550550+ is specified per document as any string that <a href="https://proj.org/">PROJ</a> can resolve:
551551+ EPSG codes (<code>"EPSG:4326"</code>), WKT2 strings, or PROJ pipeline definitions.
552552+</div>
553553+554554+<div class="principle">
555555+ <strong>Origin distinguishes measured, derived, and simulated.</strong>
556556+ Every label records how it was produced. <em>Measured</em> labels come from direct observation or
557557+ registry import. <em>Derived</em> labels are computed from other labels (convex hulls, buffers,
558558+ merges). <em>Simulated</em> labels come from theoretical models and must remain identifiable as
559559+ synthetic — they augment training data but do not represent real-world observations.
560560+</div>
561561+562562+<div class="principle">
563563+ <strong>URIs identify observers and registries.</strong>
564564+ Observers (sensors, humans) and external registries are identified by URI. The URI scheme encodes
565565+ the kind of source. Adding a new kind requires no code changes — just use a new URI scheme.
566566+</div>
567567+568568+<table>
569569+ <thead><tr><th>URI</th><th>Meaning</th></tr></thead>
570570+ <tbody>
571571+ <tr><td><code>orcid:0000-0001-2345-6789</code></td><td>Human observer (ORCID)</td></tr>
572572+ <tr><td><code>https://ror.org/035dkdb55</code></td><td>Institution (ROR)</td></tr>
573573+ <tr><td><code>urn:sensor:gps:trimble-r12-0042</code></td><td>GPS receiver</td></tr>
574574+ <tr><td><code>urn:sensor:camera-trap:ct-0042</code></td><td>Camera trap</td></tr>
575575+ <tr><td><code>gbif:4023589127</code></td><td>GBIF occurrence record</td></tr>
576576+ <tr><td><code>inaturalist:observation/12345</code></td><td>iNaturalist observation</td></tr>
577577+ <tr><td><code>osm:node/123456</code></td><td>OpenStreetMap node</td></tr>
578578+ <tr><td><code>movebank:study/1234/individual/leo-007/event/98001</code></td><td>Movebank GPS event</td></tr>
579579+ <tr><td><code>iucn:redlist:22/Panthera-leo:range:2024.1</code></td><td>IUCN range polygon</td></tr>
580580+ <tr><td><code>fairground:notebook/lotka-volterra-serengeti:v4</code></td><td>Simulation model</td></tr>
581581+ </tbody>
582582+</table>
583583+584584+<div class="principle">
585585+ <strong>Identity and spatial indexing are separate.</strong>
586586+ A label has two name components: <code>cell</code> (Hilbert curve cell, recomputed on reprojection)
587587+ and <code>id</code> (UUID, stable forever). Concatenating <code>cell-id</code> gives a spatially-sortable
588588+ unique name. Any sorted index gets spatial clustering for free.
589589+</div>
590590+591591+<div class="principle">
592592+ <strong>Classification is a probability distribution.</strong>
593593+ A label's class is expressed through <code>class_dist</code>, a list of
594594+ <code>(class_name, probability)</code> pairs ordered by decreasing probability.
595595+ A definite classification is <code>[("Panthera leo", 1.0)]</code>.
596596+ An uncertain classification distributes probability across candidates.
597597+ An unclassified label has an empty list.
598598+</div>
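
<p>The invariants of <code>class_dist</code> can be checked with a few lines of OCaml. This is a sketch
under assumptions: the type alias is a local stand-in, <code>is_well_formed</code> and
<code>primary_class</code> are hypothetical names, and the rule that probabilities sum to at most 1
(within a small tolerance) is an assumption rather than something this specification states.</p>
<pre><code>type class_dist = (string * float) list

(* True when every probability lies in [0, 1], the list is ordered by
   decreasing probability, and the total does not exceed 1 (assumed rule). *)
let is_well_formed (d : class_dist) =
  let probs = List.map snd d in
  let tolerance = 1e-9 in
  List.for_all Fun.id [
    List.for_all (fun p -> p >= 0.0) probs;
    List.for_all (fun p -> 1.0 >= p) probs;
    List.sort (fun a b -> compare b a) probs = probs;
    1.0 +. tolerance >= List.fold_left ( +. ) 0.0 probs;
  ]

(* The most probable class, or None for an unclassified label. *)
let primary_class (d : class_dist) =
  match d with
  | [] -> None
  | (name, _) :: _ -> Some name</code></pre>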
599599+600600+<div class="principle">
601601+ <strong>Temporal data follows Darwin Core.</strong>
602602+ The <code>event_date</code> field follows the Darwin Core temporal interpretation convention
603603+ (ISO 8601-1:2019). It records when the observation was made, not when the label was imported.
604604+ Supported formats: precise dates (<code>"2023-09-18"</code>), imprecise dates (<code>"2023-09"</code>
605605+ or <code>"2023"</code>), date-times (<code>"2023-09-18T13:27:00Z"</code>), and intervals
606606+ (<code>"2023-09-05/2023-09-18"</code>).
607607+</div>
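
<p>The supported forms can be told apart by shape alone. The sketch below is a shape check, not an
ISO 8601 validator: it does not verify that months or days are in range, and the type
<code>date_form</code> and the function <code>form_of_string</code> are hypothetical names that are
not part of the <code>Terradots</code> module.</p>
<pre><code>type date_form = Year | Year_month | Day | Date_time | Interval | Unknown

(* Classify an event_date string by its shape only. *)
let form_of_string s =
  if String.contains s '/' then Interval       (* "2023-09-05/2023-09-18" *)
  else if String.contains s 'T' then Date_time (* "2023-09-18T13:27:00Z" *)
  else
    match String.length s with
    | 4 -> Year                                (* "2023" *)
    | 7 -> Year_month                          (* "2023-09" *)
    | 10 -> Day                                (* "2023-09-18" *)
    | _ -> Unknown</code></pre>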
608608+609609+<div class="principle">
610610+ <strong>Deduplication is a derivation.</strong>
611611+ Labels imported from multiple sources may refer to the same real-world feature. Dedup is modelled
612612+ as a derivation: find candidate matches (same Hilbert cell, class agreement, temporal overlap),
613613+ let an expert decide, then merge via <code>Derived { sources = [a; b]; method_ = "manual-merge" }</code>.
614614+ Both originals are kept for full provenance.
615615+</div>
616616+617617+<div class="principle">
618618+ <strong>Training cycle, not streaming recomputation.</strong>
619619+ Downstream derivations (habitat classification, species range, AOH) are batch operations that
620620+ recompute when inputs change. Each run gets a new activity record. Old and new versions coexist
621621+ for comparison. This is deliberate: ecological models need stable training windows, not
622622+ continuous flux.
623623+</div>
624624+</section>
625625+626626+<!-- ═══════════════════════════════════════════════════════
627627+ 5. TYPE SPECIFICATION
628628+ ═══════════════════════════════════════════════════════ -->
629629+<section id="types">
630630+<h2>5. Type Specification</h2>
631631+<p>All types are defined in the <code>Terradots</code> OCaml module. The data model is independent of
632632+any serialisation format (SVG, GeoJSON, GeoParquet, etc.).</p>
633633+634634+<h3 id="type-crs">Coordinate Reference Systems</h3>
635635+<pre><code>(** Any string that PROJ can resolve: EPSG codes, WKT2, PROJ pipelines. *)
636636+type crs = string
637637+638638+val wgs84 : crs (* "EPSG:4326" -- lon/lat in degrees *)
639639+val web_mercator : crs (* "EPSG:3857" -- metres, for web tiles *)</code></pre>
640640+<p>The CRS determines the units and meaning of all <code>point</code> coordinates in the document.
641641+For EPSG:4326, <code>x</code> is longitude (degrees east), <code>y</code> is latitude (degrees north).
642642+For projected CRS (UTM, Web Mercator), <code>x</code> is easting (metres), <code>y</code> is
643643+northing (metres).</p>
644644+645645+<h3 id="type-temporal">Temporal</h3>
646646+<pre><code>(** Abstract type -- construct with event_date_of_string,
647647+ inspect with string_of_event_date. *)
648648+type event_date
649649+650650+val event_date_of_string : string -> event_date
651651+val string_of_event_date : event_date -> string</code></pre>
652652+<p>Valid forms: precise dates (<code>"2023-09-18"</code>), imprecise dates (<code>"2023-09"</code>,
653653+<code>"2023"</code>), date-times (<code>"2023-09-18T13:27:00Z"</code>), intervals
654654+(<code>"2023-09-05/2023-09-18"</code>). Keeping the type abstract means that parsing, validation,
655655+and temporal overlap queries can be added later without changing callers. <span class="badge badge-abstract">abstract</span></p>
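<p>As a rough illustration of the forms listed above, the sketch below classifies an event-date string by shape only. It is written in JavaScript to match this page's script. <code>classifyEventDate</code> is a hypothetical helper, not part of the <code>Terradots</code> module, and it does not check calendar validity (month 13 would pass).</p>

```javascript
// Hypothetical helper: classify an event_date string by its textual form.
// Returns "interval", "date-time", "day", "month", "year", or null.
function classifyEventDate(s) {
  // A date at year, month, or day precision.
  var date = "\\d{4}(?:-\\d{2}(?:-\\d{2})?)?";
  if (new RegExp("^" + date + "/" + date + "$").test(s)) return "interval";
  // Date-time with optional seconds and a "Z" or numeric UTC offset.
  if (/^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(?::\d{2})?(?:Z|[+-]\d{2}:\d{2})$/.test(s)) {
    return "date-time";
  }
  if (/^\d{4}-\d{2}-\d{2}$/.test(s)) return "day";
  if (/^\d{4}-\d{2}$/.test(s)) return "month";
  if (/^\d{4}$/.test(s)) return "year";
  return null; // unrecognised form
}
```

<p>A real <code>event_date_of_string</code> could reject the <code>null</code> case at construction time, which is one reason to keep the type abstract.</p>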
656656+657657+<h3 id="type-cell">Spatial Indexing (Hilbert Cell)</h3>
658658+<pre><code>(** Abstract type -- a hex-encoded Hilbert curve cell index. *)
659659+type cell
660660+661661+val cell_of_string : string -> cell
662662+val string_of_cell : cell -> string</code></pre>
663663+<p>The Hilbert curve maps 2D coordinates to a 1D index that largely preserves spatial locality. Nearby points
664664+in CRS space usually get nearby cell values. <span class="badge badge-abstract">abstract</span></p>
665665+666666+<h4>Hilbert Level Table (EPSG:4326)</h4>
667667+<table>
668668+ <thead><tr><th>Level</th><th>Cell size</th><th>Hex chars</th><th>Use case</th></tr></thead>
669669+ <tbody>
670670+ <tr><td>8</td><td>~1.4 km</td><td>2</td><td>Coarse regional indexing</td></tr>
671671+ <tr><td>12</td><td>~88 m</td><td>3</td><td>Standard (this example)</td></tr>
672672+ <tr><td>16</td><td>~5.5 m</td><td>4</td><td>High-resolution surveys</td></tr>
673673+ <tr><td>20</td><td>~0.3 m</td><td>5</td><td>Sub-metre precision</td></tr>
674674+ </tbody>
675675+</table>
676676+677677+<h3 id="type-geometry">Geometry</h3>
678678+<pre><code>(** A point in the document's native CRS. *)
679679+type point = { x : float; y : float }
680680+681681+(** Follows OGC Simple Features / ISO 19125. *)
682682+type geometry =
683683+ | Point of point
684684+ | Polygon of point list (* exterior ring, closed *)
685685+ | Multi of geometry list (* GeometryCollection / Multi* *)
686686+687687+(** Representative point for spatial indexing. *)
688688+val centroid : geometry -> point</code></pre>
689689+<p>Centroid computation: <strong>Point</strong> returns itself. <strong>Polygon</strong> returns the
690690+arithmetic mean of ring vertices. <strong>Multi</strong> returns the centroid of centroids
691691+(unweighted — sufficient for indexing, not for area-weighted analysis).</p>
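<p>The three centroid rules can be sketched as follows. This is JavaScript rather than the OCaml of the real module, and the tagged-object shape (<code>type</code>, <code>ring</code>, <code>parts</code>) is an assumption made for the example. The polygon case averages the ring exactly as stored, closing vertex included, as this page's label-placement code does; whether the OCaml implementation drops the duplicate closing vertex is not stated here.</p>

```javascript
// Arithmetic mean of a non-empty list of {x, y} points.
function meanOf(points) {
  var sx = 0, sy = 0;
  points.forEach(function (p) { sx += p.x; sy += p.y; });
  return { x: sx / points.length, y: sy / points.length };
}

// Hypothetical geometry encoding:
//   {type: "point", x, y}
//   {type: "polygon", ring: [[x, y], ...]}   (closed exterior ring)
//   {type: "multi", parts: [geometry, ...]}
function centroid(geom) {
  if (geom.type === "point") return { x: geom.x, y: geom.y };
  if (geom.type === "polygon") {
    // Mean of ring vertices as stored (the closing vertex counts twice).
    return meanOf(geom.ring.map(function (c) { return { x: c[0], y: c[1] }; }));
  }
  if (geom.type === "multi") {
    // Unweighted centroid of the parts' centroids.
    return meanOf(geom.parts.map(centroid));
  }
  throw new Error("unknown geometry type: " + geom.type);
}
```

<p>Because the result is only used to pick a Hilbert cell, the small bias from the repeated closing vertex rarely matters, but it is worth knowing about before reusing the value for anything area-weighted.</p>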
692692+693693+<h3 id="type-origin">Origin</h3>
694694+<pre><code>type origin =
695695+ | Measured of {
696696+ observer : string option; (* URI of observer *)
697697+ via : string option; (* URI of registry record *)
698698+ license : string option; (* SPDX identifier *)
699699+ accuracy_m : float option; (* positional uncertainty, metres *)
700700+ }
701701+ | Derived of {
702702+ sources : string list; (* IDs of source labels *)
703703+ method_ : string; (* algorithm identifier *)
704704+ }
705705+ | Simulated of {
706706+ model : string; (* URI of simulation model *)
707707+ run_id : string; (* unique run identifier *)
708708+ }</code></pre>
709709+710710+<table>
711711+ <thead><tr><th>Variant</th><th>Fields</th><th>Description</th></tr></thead>
712712+ <tbody>
713713+ <tr>
714714+ <td><span class="badge badge-measured">Measured</span></td>
715715+ <td><code>observer</code>, <code>via</code>, <code>license</code>, <code>accuracy_m</code></td>
716716+ <td>Direct observation or registry import. <code>observer</code> is required for direct observations and optional for imports.
717717+ <code>via</code> is the URI of the registry record (GBIF, Movebank, iNaturalist). <code>accuracy_m</code> is positional uncertainty in metres.</td>
718718+ </tr>
719719+ <tr>
720720+ <td><span class="badge badge-derived">Derived</span></td>
721721+ <td><code>sources</code>, <code>method_</code></td>
722722+ <td>Computed from other labels. <code>sources</code> are label IDs within the same document.
723723+ <code>method_</code> identifies the algorithm (e.g. <code>"convex-hull"</code>, <code>"manual-merge"</code>,
724724+ <code>"alpha-shape:alpha-0.005"</code>).</td>
725725+ </tr>
726726+ <tr>
727727+ <td><span class="badge badge-simulated">Simulated</span></td>
728728+ <td><code>model</code>, <code>run_id</code></td>
729729+ <td>Produced by a theoretical model. <code>model</code> URI identifies the code (e.g. a Fairground notebook).
730730+ <code>run_id</code> links all labels from the same execution.</td>
731731+ </tr>
732732+ </tbody>
733733+</table>
734734+735735+<h3 id="type-activity">Activity (Provenance Audit Record)</h3>
736736+<pre><code>type activity = {
737737+ activity_id : string;
738738+ agent : string; (* who/what: URI, email, tool *)
739739+ date : string; (* ISO 8601 *)
740740+ description : string option; (* free-text note *)
741741+}</code></pre>
742742+<p>An activity captures the “who” and “when” of label creation or derivation. Multiple labels may
743743+share the same activity (e.g. a batch import). Labels reference activities via their
744744+<code>activity</code> field.</p>
745745+746746+<h3 id="type-label">Label</h3>
747747+<pre><code>type label = {
748748+ cell : cell; (* Hilbert cell index *)
749749+ id : string; (* unique identifier *)
750750+ geometry : geometry; (* spatial extent *)
751751+ origin : origin; (* how produced *)
752752+ event_date : event_date option; (* when observed *)
753753+ confidence : float option; (* semantic confidence in [0,1] *)
754754+ class_dist : (string * float) list; (* probability distribution *)
755755+ activity : string option; (* activity ID *)
756756+ properties : (string * string) list; (* extensible metadata *)
757757+}
758758+759759+val label_name : label -> string (* cell ^ "-" ^ id *)</code></pre>
760760+<p>The <code>label</code> type is the central data structure. All fields except <code>cell</code>,
761761+<code>id</code>, <code>geometry</code>, and <code>origin</code> are optional or may be empty.</p>
762762+763763+<h3 id="type-annotation">Annotation</h3>
764764+<pre><code>type annotation = {
765765+ id : string;
766766+ text : string; (* free-text content *)
767767+ anchors : string list; (* label IDs this annotates *)
768768+}</code></pre>
769769+<p>Annotations provide commentary, corrections, or contextual notes without modifying labels.
770770+An annotation may span multiple labels.</p>
771771+772772+<h3 id="type-group">Group</h3>
773773+<pre><code>type group = {
774774+ id : string;
775775+ activity : string option; (* activity that created this group *)
776776+ members : string list; (* label IDs *)
777777+}</code></pre>
778778+<p>Groups organise labels into logical collections (field campaigns, seasonal surveys, thematic subsets).
779779+Purely organisational — they do not affect label semantics. A label may belong to multiple groups.</p>
780780+781781+<h3 id="type-document">Document</h3>
782782+<pre><code>type document = {
783783+ crs : crs;
784784+ level : int; (* Hilbert curve level *)
785785+ provenance : activity list;
786786+ labels : label list;
787787+ annotations : annotation list;
788788+ groups : group list;
789789+}
790790+791791+val empty_document : crs:crs -> ?level:int -> unit -> document</code></pre>
792792+<p>The top-level container: a set of labels in a common CRS, with provenance records, annotations,
793793+and groups. The <code>level</code> parameter defaults to 12 (~88 m cells for EPSG:4326).</p>
794794+</section>
795795+796796+<!-- ═══════════════════════════════════════════════════════
797797+ 6. CONSTRUCTORS
798798+ ═══════════════════════════════════════════════════════ -->
799799+<section id="constructors">
800800+<h2>6. Constructors</h2>
801801+<p>Convenience functions that enforce common patterns. All require <code>~cell</code> and <code>~id</code>.
802802+Classification is always via <code>~class_dist</code>.</p>
803803+804804+<h3>make_point</h3>
805805+<pre><code>val make_point :
806806+ cell:cell -> id:string ->
807807+ x:float -> y:float ->
808808+ observer:string ->
809809+ ?accuracy_m:float ->
810810+ ?event_date:event_date -> ?confidence:float ->
811811+ ?class_dist:(string * float) list ->
812812+ ?activity:string ->
813813+ ?properties:(string * string) list ->
814814+ unit -> label</code></pre>
815815+<p>Construct a measured point label from a direct observation. Requires an <code>observer</code> URI.</p>
816816+817817+<h3>make_polygon</h3>
818818+<pre><code>val make_polygon :
819819+ cell:cell -> id:string ->
820820+ ring:point list ->
821821+ observer:string ->
822822+ ?accuracy_m:float ->
823823+ ?event_date:event_date -> ?confidence:float ->
824824+ ?class_dist:(string * float) list ->
825825+ ?activity:string ->
826826+ ?properties:(string * string) list ->
827827+ unit -> label</code></pre>
828828+<p>Construct a measured polygon label. The ring must be closed (last point = first point).</p>
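<p>A closure check of the kind <code>make_polygon</code> would need might look like the sketch below. <code>isClosedRing</code> is a hypothetical name, and the minimum of four points follows the usual linear-ring convention (as in GeoJSON) rather than anything stated in this specification.</p>

```javascript
// Hypothetical validator: a ring of {x, y} points is closed when it has
// at least four points and the last point repeats the first exactly.
function isClosedRing(ring) {
  if (ring.length < 4) return false;
  var first = ring[0];
  var last = ring[ring.length - 1];
  return first.x === last.x && first.y === last.y;
}
```

<p>Exact floating-point equality is intentional here: a closed ring should reuse the first vertex verbatim rather than a recomputed approximation of it.</p>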
829829+830830+<h3>make_imported</h3>
831831+<pre><code>val make_imported :
832832+ cell:cell -> id:string ->
833833+ geometry:geometry ->
834834+ via:string ->
835835+ ?observer:string -> ?license:string ->
836836+ ?accuracy_m:float ->
837837+ ?event_date:event_date -> ?confidence:float ->
838838+ ?class_dist:(string * float) list ->
839839+ ?activity:string ->
840840+ ?properties:(string * string) list ->
841841+ unit -> label</code></pre>
842842+<p>Construct a label imported from an external registry. The <code>via</code> URI identifies the
843843+registry record. Observer is optional (many registries do not expose the original collector).</p>
844844+845845+<h3>make_derived</h3>
846846+<pre><code>val make_derived :
847847+ cell:cell -> id:string ->
848848+ geometry:geometry ->
849849+ sources:string list ->
850850+ method_:string ->
851851+ ?event_date:event_date -> ?confidence:float ->
852852+ ?class_dist:(string * float) list ->
853853+ ?activity:string ->
854854+ ?properties:(string * string) list ->
855855+ unit -> label</code></pre>
856856+<p>Construct a derived label. Deduplication merges are a special case:
857857+<code>make_derived ~sources:["a";"b"] ~method_:"manual-merge" ...</code></p>
858858+859859+<h3>make_simulated</h3>
860860+<pre><code>val make_simulated :
861861+ cell:cell -> id:string ->
862862+ geometry:geometry ->
863863+ model:string ->
864864+ run_id:string ->
865865+ ?event_date:event_date -> ?confidence:float ->
866866+ ?class_dist:(string * float) list ->
867867+ ?activity:string ->
868868+ ?properties:(string * string) list ->
869869+ unit -> label</code></pre>
870870+<p>Construct a simulated label. The <code>model</code> URI identifies the simulation code;
871871+<code>run_id</code> links all labels from the same execution.</p>
872872+</section>
873873+874874+<!-- ═══════════════════════════════════════════════════════
875875+ 7. ACCESSORS
876876+ ═══════════════════════════════════════════════════════ -->
877877+<section id="accessors">
878878+<h2>7. Accessors</h2>
879879+<pre><code>(** Most likely class from class_dist, or None if empty. *)
880880+val primary_class : label -> string option
881881+882882+(** Positional accuracy in metres, if Measured. *)
883883+val accuracy_of : label -> float option
884884+885885+(** Source label IDs, if Derived. Empty otherwise. *)
886886+val sources_of : label -> string list
887887+888888+(** Registry URI, if imported via a registry. *)
889889+val via_of : label -> string option
890890+891891+(** True for Simulated labels. *)
892892+val is_simulated : label -> bool</code></pre>
893893+<p>Apart from <code>primary_class</code>, which reads <code>class_dist</code>, these accessors pattern-match on the <code>origin</code> variant so that
894894+callers need not destructure it themselves. <code>is_simulated</code> is used by the species-range pipeline to
895895+exclude synthetic observations.</p>
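<p>For illustration, here are two of these accessors written against the label objects used in this page's script (<code>class_dist</code> as an array of <code>[name, probability]</code> pairs, <code>origin</code> as a string tag). The tie-breaking rule, where the first entry wins, is an assumption; the specification does not say how ties are resolved.</p>

```javascript
// Most likely class from class_dist, or null if the distribution is empty.
// On ties the earliest entry wins (an assumption of this sketch).
function primaryClass(label) {
  var best = null;
  label.class_dist.forEach(function (entry) {
    if (best === null || entry[1] > best[1]) best = entry;
  });
  return best === null ? null : best[0];
}

// True for labels whose origin tag is "Simulated".
function isSimulated(label) {
  return label.origin === "Simulated";
}
```

<p>A range pipeline could then drop synthetic observations with <code>labels.filter(function (l) { return !isSimulated(l); })</code>.</p>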
896896+</section>
897897+898898+<!-- ═══════════════════════════════════════════════════════
899899+ 8. FINGERPRINTING
900900+ ═══════════════════════════════════════════════════════ -->
901901+<section id="fingerprinting">
902902+<h2>8. Fingerprinting</h2>
903903+<pre><code>(** Coarse key for deduplication candidates.
904904+ Returns cell ^ "|" ^ primary_class (or "_" if unclassified). *)
905905+val fingerprint : label -> string</code></pre>
906906+<p>A fingerprint combines the Hilbert cell (spatial locality) with the primary class. Two labels
907907+with the same fingerprint are worth comparing for potential deduplication. Different fingerprints
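908908+mean the labels fall in different cells or have different primary classes. Labels on either side of a cell boundary can still be close together, so a thorough deduplication pass may also need to compare neighbouring cells.</p>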
909909+<p>The <code>event_date</code> is deliberately excluded: the same real-world feature observed at
910910+different times should still match as a candidate, so a human reviewer can decide whether they
911911+represent the same feature.</p>
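<p>A minimal sketch of fingerprint-based candidate grouping, again over the label objects from this page's script. <code>dedupCandidates</code> is a hypothetical helper that returns only fingerprints shared by two or more labels; the final decision on merging stays with a reviewer.</p>

```javascript
// Most likely class, or null if class_dist is empty (first entry wins ties).
function primaryClass(label) {
  var best = null;
  label.class_dist.forEach(function (entry) {
    if (best === null || entry[1] > best[1]) best = entry;
  });
  return best === null ? null : best[0];
}

// cell + "|" + primary class, with "_" standing in for unclassified labels.
function fingerprint(label) {
  var pc = primaryClass(label);
  return label.cell + "|" + (pc === null ? "_" : pc);
}

// Group label IDs by fingerprint and keep groups with more than one member.
function dedupCandidates(labels) {
  var groups = {};
  labels.forEach(function (lb) {
    var key = fingerprint(lb);
    (groups[key] = groups[key] || []).push(lb.id);
  });
  var out = {};
  Object.keys(groups).forEach(function (key) {
    if (groups[key].length > 1) out[key] = groups[key];
  });
  return out;
}
```

<p>Note that <code>event_date</code> never enters the key, so a camera-trap record and a GPS fix taken days apart in the same cell still surface as a candidate pair.</p>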
912912+</section>
913913+914914+<!-- ═══════════════════════════════════════════════════════
915915+ 9. STORAGE LAYER
916916+ ═══════════════════════════════════════════════════════ -->
917917+<section id="storage">
918918+<h2>9. Storage Layer</h2>
919919+<p>The data model is independent of how labels are stored and indexed. This section specifies the
920920+contract between the core types and a storage backend.</p>
921921+922922+<h3>Hilbert Cell Computation</h3>
923923+<p>The <code>cell</code> field on each label is a hex-encoded Hilbert curve cell index, computed
924924+from the label's <code>centroid</code> at the document's <code>level</code>. The storage layer
925925+must provide:</p>
926926+<pre><code>val hilbert_cell : level:int -> crs:crs -> point -> cell</code></pre>
927927+928928+<h3>Why Hilbert, not Geohash</h3>
929929+<p>Geohash uses a Z-order (Morton) curve. Z-order curves have discontinuities at certain cell
930930+boundaries: two points close in 2D space can receive very different hash values when they fall
931931+on opposite sides of a major subdivision. The Hilbert curve reduces this effect: consecutive cells on
932932+the curve are <em>always</em> spatially adjacent, although spatially adjacent cells can still be far apart on the curve. This gives more uniform spatial clustering and
933933+fewer edge-case misses in proximity queries.</p>
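<p>The adjacency property can be checked numerically. The sketch below decodes curve positions to grid coordinates for both curves, using the standard iterative Hilbert decoding and plain bit de-interleaving for Morton, and measures the largest step between consecutive positions. Here <code>order</code> means bits per axis; how that corresponds to the document's <code>level</code> is not specified in this section, so treat the parameter as an assumption.</p>

```javascript
// Decode position d on a Hilbert curve over a 2^order x 2^order grid.
function hilbertD2xy(order, d) {
  var n = 1 << order, x = 0, y = 0, t = d;
  for (var s = 1; s < n; s *= 2) {
    var rx = 1 & (t >> 1);
    var ry = 1 & (t ^ rx);
    if (ry === 0) {               // rotate/flip the quadrant
      if (rx === 1) { x = s - 1 - x; y = s - 1 - y; }
      var tmp = x; x = y; y = tmp;
    }
    x += s * rx;
    y += s * ry;
    t = t >> 2;
  }
  return [x, y];
}

// Decode position d on a Z-order (Morton) curve by de-interleaving bits.
function mortonD2xy(order, d) {
  var x = 0, y = 0;
  for (var i = 0; i < order; i++) {
    x |= ((d >> (2 * i)) & 1) << i;
    y |= ((d >> (2 * i + 1)) & 1) << i;
  }
  return [x, y];
}

// Largest Manhattan distance between consecutive curve positions.
function maxStep(decode, order) {
  var total = 1 << (2 * order), worst = 0, prev = decode(order, 0);
  for (var d = 1; d < total; d++) {
    var cur = decode(order, d);
    var step = Math.abs(cur[0] - prev[0]) + Math.abs(cur[1] - prev[1]);
    if (step > worst) worst = step;
    prev = cur;
  }
  return worst;
}
```

<p>On a 16 x 16 grid, <code>maxStep(hilbertD2xy, 4)</code> is 1, while <code>maxStep(mortonD2xy, 4)</code> is larger than 1 because the Z-order curve jumps across the grid at major subdivisions.</p>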
934934+935935+<h3>Reprojection</h3>
936936+<p>When a document's CRS changes, all <code>cell</code> values must be recomputed from the
937937+(reprojected) geometries. The <code>id</code> fields remain stable — identity is independent
938938+of coordinate system.</p>
939939+940940+<h3>Sorted Keys</h3>
941941+<p>Concatenating <code>cell ^ "-" ^ id</code> (via <code>label_name</code>) produces a key that
942942+sorts spatially. Any system that maintains sorted order (B-tree, LSM tree, lexicographic file
943943+listing) gets spatial clustering for free: a prefix scan on a cell value retrieves all labels
944944+in that spatial neighbourhood.</p>
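<p>As a sketch of the prefix scan described above, the code below builds keys with a JavaScript stand-in for <code>label_name</code>, sorts them, and uses a binary search to find where a cell's keys begin. <code>prefixScan</code> and <code>lowerBound</code> are hypothetical helpers; a real backend would use its own range-scan primitive.</p>

```javascript
// JavaScript stand-in for label_name: cell, then "-", then id.
function labelName(label) {
  return label.cell + "-" + label.id;
}

// Index of the first key that is >= target in a sorted array of strings.
function lowerBound(sortedKeys, target) {
  var lo = 0, hi = sortedKeys.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (sortedKeys[mid] < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// All keys that start with the given prefix, e.g. "b7a-" for one cell.
function prefixScan(sortedKeys, prefix) {
  var out = [];
  for (var i = lowerBound(sortedKeys, prefix);
       i < sortedKeys.length && sortedKeys[i].indexOf(prefix) === 0; i++) {
    out.push(sortedKeys[i]);
  }
  return out;
}
```

<p>Including the <code>"-"</code> separator in the prefix limits the scan to exactly one cell; dropping it (for example <code>"b7"</code>) widens the scan to every cell whose encoding shares that prefix.</p>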
945945+</section>
946946+947947+</main>
948948+949949+<!-- ═══════════════════════════════════════════════════════
950950+ JAVASCRIPT -- all interactive behaviour
951951+ ═══════════════════════════════════════════════════════ -->
952952+<script>
953953+(function() {
954954+"use strict";
955955+956956+// ═══════════════════════════════════════════════════════
957957+// LABEL DATA -- all 23 labels from the AOH example
958958+// ═══════════════════════════════════════════════════════
959959+960960+var labels = [
961961+ // Camera traps
962962+ {id:"ct-001", layer:"camera", cell:"b7a", type:"point", x:34.82, y:-2.33,
963963+ origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-17",
964964+ accuracy_m:5.0, confidence:0.97,
965965+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-12T05:42:00Z",
966966+ activity:"act-field-2024",
967967+ props:{image_uri:"s3://slp/ct17/IMG_4821.jpg", individual_count:"3", behaviour:"resting"}},
968968+ {id:"ct-002", layer:"camera", cell:"b7a", type:"point", x:34.83, y:-2.32,
969969+ origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-17",
970970+ accuracy_m:5.0, confidence:0.92,
971971+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-14T19:15:00Z",
972972+ activity:"act-field-2024",
973973+ props:{image_uri:"s3://slp/ct17/IMG_4903.jpg", individual_count:"1", behaviour:"walking"}},
974974+ {id:"ct-003", layer:"camera", cell:"b7c", type:"point", x:35.01, y:-2.15,
975975+ origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-42",
976976+ accuracy_m:5.0, confidence:0.88,
977977+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-18T03:22:00Z",
978978+ activity:"act-field-2024",
979979+ props:{image_uri:"s3://slp/ct42/IMG_1207.jpg", individual_count:"2"}},
980980+ {id:"ct-004", layer:"camera", cell:"b7d", type:"point", x:35.22, y:-2.45,
981981+ origin:"Measured", observer:"urn:sensor:camera-trap:serengeti-node-55",
982982+ accuracy_m:null, confidence:null,
983983+ class_dist:[], event_date:"2024-06-20T22:10:00Z",
984984+ activity:"act-field-2024",
985985+ props:{image_uri:"s3://slp/ct55/IMG_0891.jpg", trigger:"motion", species_detected:"none"}},
986986+987987+ // GPS collars -- leo-007
988988+ {id:"gps-001", layer:"gps", cell:"b7a", type:"point", x:34.81, y:-2.34,
989989+ origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-007",
990990+ via:"movebank:study/1234/individual/leo-007/event/98001", license:"CC-BY-NC-4.0",
991991+ accuracy_m:3.5, confidence:null,
992992+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-10T06:00:00Z",
993993+ activity:"act-movebank-import",
994994+ props:{individual_id:"leo-007", fix_type:"3D", hdop:"0.9"}},
995995+ {id:"gps-002", layer:"gps", cell:"b7a", type:"point", x:34.84, y:-2.31,
996996+ origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-007",
997997+ via:"movebank:study/1234/individual/leo-007/event/98002", license:"CC-BY-NC-4.0",
998998+ accuracy_m:4.2, confidence:null,
999999+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-10T12:00:00Z",
10001000+ activity:"act-movebank-import",
10011001+ props:{individual_id:"leo-007", fix_type:"3D", hdop:"1.1"}},
10021002+ {id:"gps-003", layer:"gps", cell:"b7b", type:"point", x:34.91, y:-2.28,
10031003+ origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-007",
10041004+ via:"movebank:study/1234/individual/leo-007/event/98003", license:"CC-BY-NC-4.0",
10051005+ accuracy_m:5.1, confidence:null,
10061006+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-11T06:00:00Z",
10071007+ activity:"act-movebank-import",
10081008+ props:{individual_id:"leo-007", fix_type:"3D", hdop:"1.4"}},
10091009+10101010+ // GPS collar -- leo-012
10111011+ {id:"gps-004", layer:"gps", cell:"b7c", type:"point", x:35.05, y:-2.10,
10121012+ origin:"Measured", observer:"urn:sensor:gps:vectronic-vertex-plus-012",
10131013+ via:"movebank:study/1234/individual/leo-012/event/98501", license:"CC-BY-NC-4.0",
10141014+ accuracy_m:3.0, confidence:null,
10151015+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-12T06:00:00Z",
10161016+ activity:"act-movebank-import",
10171017+ props:{individual_id:"leo-012"}},
10181018+10191019+ // GBIF
10201020+ {id:"gbif-001", layer:"gbif", cell:"b7a", type:"point", x:34.85, y:-2.35,
10211021+ origin:"Measured", via:"gbif:4023589127", license:"CC-BY-4.0",
10221022+ accuracy_m:100.0, confidence:null,
10231023+ class_dist:[["Panthera leo",1.0]], event_date:"2022-08-14",
10241024+ activity:"act-gbif-import",
10251025+ props:{gbif_dataset:"serengeti-biodiversity-survey", basis_of_record:"HUMAN_OBSERVATION",
10261026+ recorded_by:"Tanzania Wildlife Research Institute"}},
10271027+ {id:"gbif-002", layer:"gbif", cell:"b7e", type:"point", x:35.40, y:-2.50,
10281028+ origin:"Measured", via:"gbif:4023589999", license:"CC-BY-4.0",
10291029+ accuracy_m:500.0, confidence:null,
10301030+ class_dist:[["Panthera leo",1.0]], event_date:"2021",
10311031+ activity:"act-gbif-import",
10321032+ props:{gbif_dataset:"ngorongoro-mammal-survey", basis_of_record:"HUMAN_OBSERVATION"}},
10331033+10341034+ // iNaturalist
10351035+ {id:"inat-001", layer:"inat", cell:"b7b", type:"point", x:34.95, y:-2.20,
10361036+ origin:"Measured", observer:"inaturalist:user/safari_dave",
10371037+ via:"inaturalist:observation/182345678", license:"CC-BY-NC-4.0",
10381038+ accuracy_m:50.0, confidence:0.95,
10391039+ class_dist:[["Panthera leo",1.0]], event_date:"2023-07-22T16:30:00Z",
10401040+ activity:"act-inat-import",
10411041+ props:{quality_grade:"research", num_identifications:"5"}},
10421042+10431043+ // IUCN range
10441044+ {id:"iucn-range-001", layer:"iucn", cell:"b70", type:"polygon",
10451045+ ring:[[34,-3],[36,-3],[36,-1],[34,-1],[34,-3]],
10461046+ origin:"Measured", via:"iucn:redlist:22/Panthera-leo:range:2024.1", license:"CC-BY-NC-4.0",
10471047+ accuracy_m:null, confidence:null,
10481048+ class_dist:[["Panthera leo",1.0]], event_date:"2024",
10491049+ activity:"act-iucn-import",
10501050+ props:{iucn_status:"VU", iucn_criteria:"A2abcd", population_trend:"decreasing",
10511051+ range_type:"extant:resident", habitat_codes:"1.5;1.6;2;3;14.1"}},
10521052+10531053+ // IUCN habitat preferences (point markers at 35,-2)
10541054+ {id:"iucn-hab-001", layer:"iucn-hab", cell:"b70", type:"point", x:35.0, y:-2.0,
10551055+ origin:"Measured", via:"iucn:redlist:22/Panthera-leo:habitat:2", license:"CC-BY-NC-4.0",
10561056+ accuracy_m:null, confidence:0.95,
10571057+ class_dist:[["habitat-preference:savanna",1.0]], event_date:null,
10581058+ activity:"act-iucn-import",
10591059+ props:{iucn_habitat_code:"2", suitability:"Suitable", major_importance:"Yes"}},
10601060+ {id:"iucn-hab-002", layer:"iucn-hab", cell:"b70", type:"point", x:35.0, y:-2.0,
10611061+ origin:"Measured", via:"iucn:redlist:22/Panthera-leo:habitat:3", license:"CC-BY-NC-4.0",
10621062+ accuracy_m:null, confidence:0.70,
10631063+ class_dist:[["habitat-preference:shrubland",1.0]], event_date:null,
10641064+ activity:"act-iucn-import",
10651065+ props:{iucn_habitat_code:"3", suitability:"Suitable", major_importance:"No"}},
10661066+10671067+ // Simulated (Lotka-Volterra)
10681068+ {id:"sim-001", layer:"sim", cell:"b7d", type:"point", x:35.20, y:-2.50,
10691069+ origin:"Simulated", model:"fairground:notebook/lotka-volterra-serengeti:v4",
10701070+ run_id:"lv-run-42",
10711071+ accuracy_m:null, confidence:0.60,
10721072+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-15T00:00:00Z",
10731073+ activity:"act-sim-lv-001",
10741074+ props:{scenario:"baseline-2024", time_step:"150", prey_density_km2:"45.2", seed:"42"}},
10751075+ {id:"sim-002", layer:"sim", cell:"b7d", type:"point", x:35.18, y:-2.48,
10761076+ origin:"Simulated", model:"fairground:notebook/lotka-volterra-serengeti:v4",
10771077+ run_id:"lv-run-42",
10781078+ accuracy_m:null, confidence:0.60,
10791079+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-15T06:00:00Z",
10801080+ activity:"act-sim-lv-001",
10811081+ props:{scenario:"baseline-2024", time_step:"151", prey_density_km2:"44.8", seed:"42"}},
10821082+ {id:"sim-003", layer:"sim", cell:"b7e", type:"point", x:35.45, y:-2.55,
10831083+ origin:"Simulated", model:"fairground:notebook/lotka-volterra-serengeti:v4",
10841084+ run_id:"lv-run-42",
10851085+ accuracy_m:null, confidence:0.55,
10861086+ class_dist:[["Panthera leo",1.0]], event_date:"2024-06-16T00:00:00Z",
10871087+ activity:"act-sim-lv-001",
10881088+ props:{scenario:"drought-2024", time_step:"152", prey_density_km2:"28.1", seed:"42"}},
10891089+10901090+ // Habitat tiles (derived)
10911091+ {id:"hab-001", layer:"habitat", cell:"b7a", type:"polygon",
10921092+ ring:[[34.80,-2.40],[34.90,-2.40],[34.90,-2.30],[34.80,-2.30],[34.80,-2.40]],
10931093+ origin:"Derived", sources:["iucn-hab-001","iucn-hab-002"],
10941094+ method_:"habitat-classify:tessera-v3.1:threshold-0.6",
10951095+ accuracy_m:null, confidence:0.91,
10961096+ class_dist:[["savanna",0.78],["shrubland",0.13],["other",0.09]], event_date:null,
10971097+ activity:"act-habitat-2024",
10981098+ props:{tessera_tile:"b7a:034.80:-002.40", dominant_landcover:"savanna"}},
10991099+ {id:"hab-002", layer:"habitat", cell:"b7d", type:"polygon",
11001100+ ring:[[35.10,-2.60],[35.20,-2.60],[35.20,-2.50],[35.10,-2.50],[35.10,-2.60]],
11011101+ origin:"Derived", sources:["iucn-hab-001","iucn-hab-002"],
11021102+ method_:"habitat-classify:tessera-v3.1:threshold-0.6",
11031103+ accuracy_m:null, confidence:0.68,
11041104+ class_dist:[["savanna",0.45],["shrubland",0.30],["cropland",0.25]], event_date:null,
11051105+ activity:"act-habitat-2024",
11061106+ props:{tessera_tile:"b7d:035.10:-002.60", dominant_landcover:"savanna-shrubland-mosaic"}},
11071107+ {id:"hab-003", layer:"habitat", cell:"b7f", type:"polygon",
11081108+ ring:[[35.80,-1.20],[35.90,-1.20],[35.90,-1.10],[35.80,-1.10],[35.80,-1.20]],
11091109+ origin:"Derived", sources:["iucn-hab-001","iucn-hab-002"],
11101110+ method_:"habitat-classify:tessera-v3.1:threshold-0.6",
11111111+ accuracy_m:null, confidence:0.12,
11121112+ class_dist:[["cropland",0.72],["settlement",0.18],["savanna",0.10]], event_date:null,
11131113+ activity:"act-habitat-2024",
11141114+ props:{tessera_tile:"b7f:035.80:-001.20", dominant_landcover:"cropland"}},
11151115+11161116+ // Species range (derived)
11171117+ {id:"range-001", layer:"range", cell:"b70", type:"polygon",
11181118+ ring:[[34.75,-2.60],[35.50,-2.60],[35.50,-2.00],[35.10,-1.90],[34.75,-2.10],[34.75,-2.60]],
11191119+ origin:"Derived",
11201120+ sources:["ct-001","ct-002","ct-003","gps-001","gps-002","gps-003","gps-004","gbif-001","gbif-002","inat-001"],
11211121+ method_:"alpha-shape:alpha-0.005",
11221122+ accuracy_m:null, confidence:null,
11231123+ class_dist:[["range:Panthera leo",1.0]], event_date:null,
11241124+ activity:"act-range-2024",
11251125+ props:{range_km2:"4850", n_occurrences:"10", excludes_synthetic:"true"}},
11261126+11271127+ // AOH (derived, multi-polygon)
11281128+ {id:"aoh-001", layer:"aoh", cell:"b70", type:"multi",
11291129+ patches:[
11301130+ [[34.80,-2.40],[35.20,-2.40],[35.20,-2.10],[34.80,-2.10],[34.80,-2.40]],
11311131+ [[35.10,-2.60],[35.40,-2.60],[35.40,-2.40],[35.10,-2.40],[35.10,-2.60]]
11321132+ ],
11331133+ origin:"Derived",
11341134+ sources:["range-001","iucn-range-001","hab-001","hab-002"],
11351135+ method_:"aoh:iucn-2022:range-intersect-habitat",
11361136+ accuracy_m:null, confidence:null,
11371137+ class_dist:[["aoh:Panthera leo",1.0]], event_date:null,
11381138+ activity:"act-aoh-2024",
11391139+ props:{aoh_km2:"3420", range_km2:"4850", habitat_proportion:"0.705",
11401140+ unsuitable_excluded_km2:"1430", dominant_exclusion:"cropland",
11411141+ iucn_status:"VU", iucn_criteria:"A2abcd", population_trend:"decreasing",
11421142+ tessera_model:"tessera:v3.1:east-africa",
11431143+ synthetic_in_sdm_training:"true", synthetic_fraction_in_training:"0.23"}},
11441144+11451145+ // Training set (derived) -- not shown on map but in data
11461146+ {id:"ts-001", layer:"_none", cell:"b70", type:"polygon",
11471147+ ring:[[34,-3],[36,-3],[36,-1],[34,-1],[34,-3]],
11481148+ origin:"Derived",
11491149+ sources:["ct-001","ct-002","ct-003","gps-001","gps-002","gps-003","gps-004","gbif-001","gbif-002","inat-001","sim-001","sim-002","sim-003"],
11501150+ method_:"training-set:balanced-spatial-sample",
11511151+ accuracy_m:null, confidence:null,
11521152+ class_dist:[["training-set:Panthera-leo:sdm-2024",1.0]], event_date:null,
11531153+ activity:"act-training-2024",
11541154+ props:{n_measured:"10", n_synthetic:"3", synthetic_fraction:"0.23",
11551155+ spatial_extent:"34.0,-3.0,36.0,-1.0", temporal_window:"2021/2024",
11561156+ tessera_model:"tessera:v3.1:east-africa"}}
11571157+];
11581158+11591159+// ═══════════════════════════════════════════════════════
11601160+// MAP RENDERING
11611161+// ═══════════════════════════════════════════════════════
11621162+11631163+// Map extent (WGS84)
11641164+var mapExtent = { minLon: 33.8, maxLon: 36.2, minLat: -3.2, maxLat: -0.8 };
11651165+var svgW = 800, svgH = 600;
11661166+var pad = 30;
11671167+11681168+function lonToX(lon) {
11691169+ return pad + (lon - mapExtent.minLon) / (mapExtent.maxLon - mapExtent.minLon) * (svgW - 2*pad);
11701170+}
11711171+function latToY(lat) {
11721172+ return pad + (mapExtent.maxLat - lat) / (mapExtent.maxLat - mapExtent.minLat) * (svgH - 2*pad);
11731173+}
11741174+function ringToPoints(ring) {
11751175+ return ring.map(function(c) { return lonToX(c[0]) + "," + latToY(c[1]); }).join(" ");
11761176+}
11771177+11781178+function buildMap() {
11791179+ var svg = document.getElementById("mapSvg");
11801180+ svg.innerHTML = "";
11811181+11821182+ function el(tag, attrs) {
11831183+ var e = document.createElementNS("http://www.w3.org/2000/svg", tag);
11841184+ for (var k in attrs) e.setAttribute(k, attrs[k]);
11851185+ return e;
11861186+ }
11871187+11881188+ // Background
11891189+ svg.appendChild(el("rect", {x:0, y:0, width:svgW, height:svgH, fill:"#e8edf3"}));
11901190+11911191+ // Water hint
11921192+ svg.appendChild(el("ellipse", {cx: lonToX(33.9), cy: latToY(-2.0), rx: 30, ry: 50, fill:"#b3d4f0", opacity:"0.5"}));
11931193+11941194+ // Graticule
11951195+ var grat = el("g", {stroke:"#c8d0dc", "stroke-width":"0.5", fill:"none", opacity:"0.6"});
11961196+ for (var lon = 34; lon <= 36; lon += 0.5) {
11971197+ grat.appendChild(el("line", {x1:lonToX(lon), y1:latToY(mapExtent.minLat), x2:lonToX(lon), y2:latToY(mapExtent.maxLat)}));
11981198+ }
11991199+ for (var lat = -3; lat <= -1; lat += 0.5) {
12001200+ grat.appendChild(el("line", {x1:lonToX(mapExtent.minLon), y1:latToY(lat), x2:lonToX(mapExtent.maxLon), y2:latToY(lat)}));
12011201+ }
12021202+ svg.appendChild(grat);
12031203+12041204+ // Axis labels
12051205+ var axG = el("g", {"font-size":"9", fill:"#6b7280", "font-family":"var(--font-mono)"});
12061206+ for (var alon = 34; alon <= 36; alon += 0.5) {
12071207+ var t = el("text", {x:lonToX(alon), y:svgH-8, "text-anchor":"middle"});
12081208+ t.textContent = alon.toFixed(1) + "\u00B0E";
12091209+ axG.appendChild(t);
12101210+ }
12111211+ for (var alat = -3; alat <= -1; alat += 0.5) {
12121212+ var t2 = el("text", {x:12, y:latToY(alat)+3, "text-anchor":"middle"});
12131213+ t2.textContent = Math.abs(alat).toFixed(1) + "\u00B0S";
12141214+ axG.appendChild(t2);
12151215+ }
12161216+ svg.appendChild(axG);
12171217+12181218+ // Title on map
12191219+ var title = el("text", {x:svgW/2, y:20, "text-anchor":"middle", "font-size":"13",
12201220+ "font-weight":"600", fill:"#334155", "font-family":"var(--font-sans)"});
12211221+ title.textContent = "Serengeti Ecosystem \u2014 Panthera leo AOH";
12221222+ svg.appendChild(title);
12231223+12241224+ // Layer groups (render order: back to front)
12251225+ var layerGroups = {};
12261226+ var layerOrder = ["iucn","aoh","range","habitat","iucn-hab","sim","gbif","inat","gps","camera"];
12271227+ layerOrder.forEach(function(ly) {
12281228+ layerGroups[ly] = el("g", {"data-layer": ly});
12291229+ });
12301230+12311231+ labels.forEach(function(lb) {
12321232+ if (lb.layer === "_none") return;
12331233+ var g = layerGroups[lb.layer];
12341234+ if (!g) return;
12351235+ var col = {camera:"#ea580c", gps:"#2563eb", gbif:"#16a34a", inat:"#0d9488",
12361236+ iucn:"#dc2626", "iucn-hab":"#dc2626", sim:"#7c3aed",
12371237+ habitat:"#65a30d", range:"#3b82f6", aoh:"#16a34a"}[lb.layer] || "#888";
12381238+12391239+ if (lb.type === "polygon" && lb.ring) {
12401240+ var fillOp = "0.15", sw = "1.5", sd = "";
12411241+ if (lb.layer === "iucn") { fillOp = "0.06"; sw = "2"; sd = "6,3"; }
12421242+ if (lb.layer === "range") { fillOp = "0.08"; sw = "2"; sd = "4,2"; }
12431243+ if (lb.layer === "habitat") { fillOp = "0.3"; }
12441244+12451245+ var poly = el("polygon", {
12461246+ points: ringToPoints(lb.ring),
12471247+ fill: col, "fill-opacity": fillOp,
12481248+ stroke: col, "stroke-width": sw, "stroke-opacity": "0.8",
12491249+ "class": "map-poly", "data-id": lb.id
12501250+ });
12511251+ if (sd) poly.setAttribute("stroke-dasharray", sd);
12521252+ (function(label) {
12531253+ poly.addEventListener("click", function() { showDetail(label); });
12541254+ })(lb);
12551255+ g.appendChild(poly);
12561256+12571257+ // Habitat tile label
12581258+ if (lb.layer === "habitat") {
12591259+ var cx = lb.ring.reduce(function(a,c){return a+c[0];},0)/lb.ring.length;
12601260+ var cy = lb.ring.reduce(function(a,c){return a+c[1];},0)/lb.ring.length;
12611261+ var txt = el("text", {x:lonToX(cx), y:latToY(cy)+3, "text-anchor":"middle",
12621262+ "font-size":"8", fill:"#3f6212", "font-weight":"600",
12631263+ "pointer-events":"none"});
12641264+ txt.textContent = lb.id;
12651265+ g.appendChild(txt);
12661266+ }
12671267+ }
12681268+ else if (lb.type === "multi" && lb.patches) {
12691269+ lb.patches.forEach(function(patch, pi) {
12701270+ var poly = el("polygon", {
12711271+ points: ringToPoints(patch),
12721272+ fill: col, "fill-opacity": "0.35",
12731273+ stroke: col, "stroke-width": "2.5", "stroke-opacity": "0.9",
12741274+ "class": "map-poly", "data-id": lb.id
12751275+ });
12761276+ (function(label) {
12771277+ poly.addEventListener("click", function() { showDetail(label); });
12781278+ })(lb);
12791279+ g.appendChild(poly);
12801280+12811281+ var cx2 = patch.reduce(function(a,c){return a+c[0];},0)/patch.length;
12821282+ var cy2 = patch.reduce(function(a,c){return a+c[1];},0)/patch.length;
12831283+ var txt2 = el("text", {x:lonToX(cx2), y:latToY(cy2)+3, "text-anchor":"middle",
12841284+ "font-size":"9", fill:"#065f46", "font-weight":"700",
12851285+ "pointer-events":"none"});
12861286+ txt2.textContent = "AOH P" + (pi+1);
12871287+ g.appendChild(txt2);
12881288+ });
12891289+ }
12901290+ else if (lb.type === "point") {
12911291+ var r = 5, sd2 = "";
12921292+ if (lb.layer === "sim") { sd2 = "2,2"; }
12931293+ if (lb.layer === "iucn-hab") { r = 7; }
12941294+12951295+ var circ = el("circle", {
12961296+ cx: lonToX(lb.x), cy: latToY(lb.y), r: r,
12971297+ fill: col, "fill-opacity": "0.85",
12981298+ stroke: "#fff", "stroke-width": "1.5",
12991299+ "class": "map-point", "data-id": lb.id
13001300+ });
13011301+ if (sd2) {
13021302+ circ.setAttribute("stroke", col);
13031303+ circ.setAttribute("stroke-dasharray", sd2);
13041304+ circ.setAttribute("fill-opacity", "0.5");
13051305+ circ.setAttribute("fill", "#ede9fe");
13061306+ }
13071307+ if (lb.layer === "iucn-hab") {
13081308+ circ.setAttribute("fill", "#fecdd3");
13091309+ circ.setAttribute("stroke", "#dc2626");
13101310+ circ.setAttribute("stroke-width", "2");
13111311+ }
13121312+ (function(label) {
13131313+ circ.addEventListener("click", function() { showDetail(label); });
13141314+ })(lb);
13151315+ g.appendChild(circ);
13161316+13171317+ var ptLabel = el("text", {
13181318+ x: lonToX(lb.x) + (lb.layer === "iucn-hab" ? 10 : 8),
13191319+ y: latToY(lb.y) + 3,
13201320+ "font-size": "8", fill: "#475569",
13211321+ "font-family": "var(--font-mono)",
13221322+ "pointer-events": "none"
13231323+ });
13241324+ ptLabel.textContent = lb.id;
13251325+ g.appendChild(ptLabel);
13261326+ }
13271327+ });
13281328+13291329+ // GPS track line (leo-007)
13301330+ var gpsTrack = el("polyline", {
13311331+ points: [lonToX(34.81)+","+latToY(-2.34),
13321332+ lonToX(34.84)+","+latToY(-2.31),
13331333+ lonToX(34.91)+","+latToY(-2.28)].join(" "),
13341334+ fill: "none", stroke: "#2563eb", "stroke-width": "1.5",
13351335+ "stroke-dasharray": "4,3", opacity: "0.6"
13361336+ });
13371337+ layerGroups["gps"].insertBefore(gpsTrack, layerGroups["gps"].firstChild);
13381338+13391339+ // Append layer groups in order
13401340+ layerOrder.forEach(function(ly) { svg.appendChild(layerGroups[ly]); });
13411341+13421342+ // Scale bar
13431343+ var scaleG = el("g", {transform: "translate(" + (svgW - 120) + "," + (svgH - 30) + ")"});
13441344+ var degPx = (svgW - 2*pad) / (mapExtent.maxLon - mapExtent.minLon);
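13451345+ var km50 = (50/111.32) * degPx; // 1 degree of longitude is about 111.32 km only near the equator; the data here sits around 2 degrees S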
13461346+ scaleG.appendChild(el("line", {x1:0, y1:0, x2:km50, y2:0, stroke:"#334155", "stroke-width":"2"}));
13471347+ scaleG.appendChild(el("line", {x1:0, y1:-3, x2:0, y2:3, stroke:"#334155", "stroke-width":"2"}));
13481348+ scaleG.appendChild(el("line", {x1:km50, y1:-3, x2:km50, y2:3, stroke:"#334155", "stroke-width":"2"}));
13491349+ var scaleText = el("text", {x:km50/2, y:-5, "text-anchor":"middle", "font-size":"9", fill:"#334155"});
13501350+ scaleText.textContent = "50 km";
13511351+ scaleG.appendChild(scaleText);
13521352+ svg.appendChild(scaleG);
13531353+}
13541354+13551355+// ═══════════════════════════════════════════════════════
13561356+// DETAIL PANEL
13571357+// ═══════════════════════════════════════════════════════
13581358+13591359+function showDetail(lb) {
13601360+ var panel = document.getElementById("detailPanel");
13611361+ panel.className = "detail-panel show";
13621362+13631363+ var badge = lb.origin === "Simulated" ? "badge-simulated" :
13641364+ lb.origin === "Derived" ? "badge-derived" : "badge-measured";
13651365+13661366+ var html = '<div class="dp-title">' + escHtml(lb.id) + ' <span class="badge ' + badge + '">' + escHtml(lb.origin) + '</span></div>';
13671367+13681368+ function row(k, v) {
13691369+ if (v === null || v === undefined || v === "") return "";
13701370+ return '<div class="dp-row"><span class="dp-key">' + escHtml(k) + '</span><span class="dp-val">' + escHtml(String(v)) + '</span></div>';
13711371+ }
13721372+13731373+ html += row("cell", lb.cell);
13741374+ html += row("label_name", lb.cell + "-" + lb.id);
13751375+13761376+ if (lb.type === "point") {
13771377+ html += row("geometry", "Point(" + lb.x + ", " + lb.y + ")");
13781378+ } else if (lb.type === "polygon" && lb.ring) {
13791379+ html += row("geometry", "Polygon [" + lb.ring.length + " vertices]");
13801380+ } else if (lb.type === "multi" && lb.patches) {
13811381+ html += row("geometry", "Multi [" + lb.patches.length + " patches]");
13821382+ }
13831383+13841384+ if (lb.observer) html += row("observer", lb.observer);
13851385+ if (lb.via) html += row("via", lb.via);
13861386+ if (lb.license) html += row("license", lb.license);
13871387+ if (lb.model) html += row("model", lb.model);
13881388+ if (lb.run_id) html += row("run_id", lb.run_id);
13891389+ if (lb.sources) html += row("sources", lb.sources.join(", "));
13901390+ if (lb.method_) html += row("method", lb.method_);
13911391+13921392+ if (lb.class_dist && lb.class_dist.length > 0) {
13931393+ var cdStr = lb.class_dist.map(function(cd) { return cd[0] + ": " + cd[1].toFixed(2); }).join("; ");
13941394+ html += row("class_dist", cdStr);
13951395+ } else {
13961396+ html += row("class_dist", "(empty -- unclassified)");
13971397+ }
13991399+ html += row("confidence", lb.confidence);
14001400+ html += row("accuracy_m", lb.accuracy_m != null ? lb.accuracy_m + " m" : null);
14011401+ html += row("event_date", lb.event_date);
14021402+ html += row("activity", lb.activity);
14031403+14041404+ if (lb.props && Object.keys(lb.props).length > 0) {
14051405+ html += '<div style="margin-top:0.5rem;font-weight:600;font-size:0.78rem;color:#475569;">Properties</div>';
14061406+ Object.keys(lb.props).forEach(function(pk) {
14071407+ html += row(pk, lb.props[pk]);
14081408+ });
14091409+ }
14101410+14111411+ panel.innerHTML = html;
14121412+}
14131413+14141414+function escHtml(s) {
14151415+ var d = document.createElement("div");
14161416+ d.appendChild(document.createTextNode(s));
14171417+ return d.innerHTML;
14181418+}
14191419+14201420+// ═══════════════════════════════════════════════════════
14211421+// LAYER TOGGLES
14221422+// ═══════════════════════════════════════════════════════
14231423+14241424+function setupToggles() {
14251425+ var controls = document.querySelectorAll("#layerControls input[data-layer]");
14261426+ controls.forEach(function(cb) {
14271427+ cb.addEventListener("change", function() {
14281428+ var layer = this.getAttribute("data-layer");
14291429+ var groups = document.querySelectorAll("#mapSvg g[data-layer='" + layer + "']");
14301430+ groups.forEach(function(g) {
14311431+ g.style.display = cb.checked ? "" : "none";
14321432+ });
14331433+ });
14341434+ });
14351435+}
14361436+14371437+// ═══════════════════════════════════════════════════════
14381438+// PROVENANCE DAG
14391439+// ═══════════════════════════════════════════════════════
14401440+14411441+function buildDAG() {
14421442+ var dagSvg = document.getElementById("dagSvg");
14431443+ var W = 940, H = 520;
14441444+ dagSvg.setAttribute("viewBox", "0 0 " + W + " " + H);
14451445+ dagSvg.setAttribute("width", "100%");
14461446+ dagSvg.setAttribute("height", H);
14471447+ dagSvg.innerHTML = "";
14481448+14491449+ function el(tag, attrs) {
14501450+ var e = document.createElementNS("http://www.w3.org/2000/svg", tag);
14511451+ for (var k in attrs) e.setAttribute(k, attrs[k]);
14521452+ return e;
14531453+ }
14541454+14551455+ // Node positions
14561456+ var nodes = {
14571457+ "aoh-001": {x:370, y:20, w:200, h:32, color:"#16a34a", label:"aoh-001 (AOH)", dash:false},
14581458+ "range-001": {x:130, y:100, w:200, h:28, color:"#3b82f6", label:"range-001 (Species Range)", dash:false},
14591459+ "iucn-range-001":{x:400, y:100, w:200, h:28, color:"#dc2626", label:"iucn-range-001 (IUCN Range)", dash:false},
14601460+ "hab-001": {x:650, y:100, w:180, h:28, color:"#65a30d", label:"hab-001 (Savanna tile)", dash:false},
14611461+ "hab-002": {x:650, y:150, w:200, h:28, color:"#65a30d", label:"hab-002 (Mosaic tile)", dash:false},
14621462+ "ts-001": {x:370, y:280, w:200, h:28, color:"#a16207", label:"ts-001 (Training Set)", dash:false},
14631463+ "iucn-hab-001": {x:740, y:230, w:150, h:26, color:"#dc2626", label:"iucn-hab-001", dash:false},
14641464+ "iucn-hab-002": {x:740, y:270, w:150, h:26, color:"#dc2626", label:"iucn-hab-002", dash:false},
14651465+ "ct-001": {x:10, y:210, w:82, h:24, color:"#ea580c", label:"ct-001", dash:false},
14661466+ "ct-002": {x:100, y:210, w:82, h:24, color:"#ea580c", label:"ct-002", dash:false},
14671467+ "ct-003": {x:190, y:210, w:82, h:24, color:"#ea580c", label:"ct-003", dash:false},
14681468+ "gps-001": {x:10, y:250, w:82, h:24, color:"#2563eb", label:"gps-001", dash:false},
14691469+ "gps-002": {x:100, y:250, w:82, h:24, color:"#2563eb", label:"gps-002", dash:false},
14701470+ "gps-003": {x:190, y:250, w:82, h:24, color:"#2563eb", label:"gps-003", dash:false},
14711471+ "gps-004": {x:10, y:290, w:82, h:24, color:"#2563eb", label:"gps-004", dash:false},
14721472+ "gbif-001": {x:100, y:290, w:82, h:24, color:"#16a34a", label:"gbif-001", dash:false},
14731473+ "gbif-002": {x:190, y:290, w:82, h:24, color:"#16a34a", label:"gbif-002", dash:false},
14741474+ "inat-001": {x:100, y:330, w:82, h:24, color:"#0d9488", label:"inat-001", dash:false},
14751475+ "sim-001": {x:330, y:390, w:82, h:24, color:"#7c3aed", label:"sim-001", dash:true},
14761476+ "sim-002": {x:420, y:390, w:82, h:24, color:"#7c3aed", label:"sim-002", dash:true},
14771477+ "sim-003": {x:510, y:390, w:82, h:24, color:"#7c3aed", label:"sim-003", dash:true}
14781478+ };
14791479+14801480+ // Edges
14811481+ var edges = [
14821482+ ["range-001", "aoh-001"],
14831483+ ["iucn-range-001", "aoh-001"],
14841484+ ["hab-001", "aoh-001"],
14851485+ ["hab-002", "aoh-001"],
14861486+ ["ct-001","range-001"], ["ct-002","range-001"], ["ct-003","range-001"],
14871487+ ["gps-001","range-001"], ["gps-002","range-001"], ["gps-003","range-001"],
14881488+ ["gps-004","range-001"],
14891489+ ["gbif-001","range-001"], ["gbif-002","range-001"],
14901490+ ["inat-001","range-001"],
14911491+ ["iucn-hab-001","hab-001"], ["iucn-hab-002","hab-001"],
14921492+ ["iucn-hab-001","hab-002"], ["iucn-hab-002","hab-002"],
14931493+ ["ct-001","ts-001"], ["ct-002","ts-001"], ["ct-003","ts-001"],
14941494+ ["gps-001","ts-001"], ["gps-002","ts-001"], ["gps-003","ts-001"], ["gps-004","ts-001"],
14951495+ ["gbif-001","ts-001"], ["gbif-002","ts-001"], ["inat-001","ts-001"],
14961496+ ["sim-001","ts-001"], ["sim-002","ts-001"], ["sim-003","ts-001"],
14971497+ ["iucn-hab-001","ts-001"], ["iucn-hab-002","ts-001"]
14981498+ ];
14991499+15001500+ // Arrowhead defs
15011501+ var defs = el("defs", {});
15021502+ var marker = el("marker", {id:"arrowhead", markerWidth:"8", markerHeight:"6",
15031503+ refX:"8", refY:"3", orient:"auto"});
15041504+ marker.appendChild(el("polygon", {points:"0 0, 8 3, 0 6", fill:"#94a3b8"}));
15051505+ defs.appendChild(marker);
15061506+ var marker2 = el("marker", {id:"arrowhead-light", markerWidth:"6", markerHeight:"5",
15071507+ refX:"6", refY:"2.5", orient:"auto"});
15081508+ marker2.appendChild(el("polygon", {points:"0 0, 6 2.5, 0 5", fill:"#cbd5e1"}));
15091509+ defs.appendChild(marker2);
15101510+ dagSvg.appendChild(defs);
15111511+15121512+ // Draw edges
15131513+ var edgeG = el("g", {});
15141514+ edges.forEach(function(e) {
15151515+ var from = nodes[e[0]], to = nodes[e[1]];
15161516+ if (!from || !to) return;
15171517+ var x1 = from.x + from.w/2, y1 = from.y;
15181518+ var x2 = to.x + to.w/2, y2 = to.y + to.h;
15191519+ var isTS = e[1] === "ts-001";
15201520+ var line = el("line", {
15211521+ x1:x1, y1:y1, x2:x2, y2:y2,
15221522+ stroke: isTS ? "#e2e8f0" : "#94a3b8",
15231523+ "stroke-width": isTS ? "1" : "1.5",
15241524+ "marker-end": isTS ? "url(#arrowhead-light)" : "url(#arrowhead)"
15251525+ });
15261526+ if (isTS) line.setAttribute("stroke-dasharray", "3,3");
15271527+ edgeG.appendChild(line);
15281528+ });
15291529+ dagSvg.appendChild(edgeG);
15301530+15311531+ // Draw nodes
15321532+ var nodeG = el("g", {});
15331533+ for (var nid in nodes) {
15341534+ var n = nodes[nid];
15351535+ var rect = el("rect", {
15361536+ x: n.x, y: n.y, width: n.w, height: n.h,
15371537+ rx: "4", ry: "4",
15381538+ fill: "#fff", stroke: n.color, "stroke-width": n.dash ? "2" : "1.5"
15391539+ });
15401540+ if (n.dash) rect.setAttribute("stroke-dasharray", "4,3");
15411541+ nodeG.appendChild(rect);
15421542+15431543+ nodeG.appendChild(el("rect", {
15441544+ x: n.x, y: n.y, width: "4", height: n.h,
15451545+ rx: "2", fill: n.color
15461546+ }));
15471547+15481548+ var txt = el("text", {
15491549+ x: n.x + n.w/2, y: n.y + n.h/2 + 4,
15501550+ "text-anchor": "middle", "font-size": n.h > 26 ? "10" : "9",
15511551+ fill: "#334155", "font-family": "var(--font-mono)"
15521552+ });
15531553+ txt.textContent = n.label;
15541554+ nodeG.appendChild(txt);
15551555+ }
15561556+ dagSvg.appendChild(nodeG);
15571557+15581558+ // Legend
15591559+ var legG = el("g", {transform: "translate(20," + (H - 100) + ")"});
15601560+ legG.appendChild(el("rect", {x:0, y:0, width:200, height:90, rx:6, fill:"#fff", stroke:"#e2e8f0"}));
15611561+ var legTitle = el("text", {x:10, y:16, "font-size":"10", "font-weight":"600", fill:"#334155"});
15621562+ legTitle.textContent = "Legend";
15631563+ legG.appendChild(legTitle);
15641564+15651565+ function legItem(y, col, text, dash) {
15661566+ var r = el("rect", {x:10, y:y, width:20, height:12, rx:2, fill:"#fff", stroke:col, "stroke-width":"1.5"});
15671567+ if (dash) r.setAttribute("stroke-dasharray", "3,2");
15681568+ legG.appendChild(r);
15691569+ var t = el("text", {x:36, y:y+10, "font-size":"9", fill:"#475569"});
15701570+ t.textContent = text;
15711571+ legG.appendChild(t);
15721572+ }
15731573+ legItem(26, "#ea580c", "Camera trap (Measured)", false);
15741574+ legItem(42, "#2563eb", "GPS collar (Measured)", false);
15751575+ legItem(58, "#16a34a", "Registry import (Measured)", false);
15761576+ legItem(74, "#7c3aed", "Simulated (Lotka-Volterra)", true);
15771577+ dagSvg.appendChild(legG);
15781578+15791579+ // Method annotations
15801580+ var methG = el("g", {"font-size":"8", fill:"#6b7280", "font-style":"italic"});
15811581+ function methodLabel(x, y, text) {
15821582+ var t = el("text", {x:x, y:y, "text-anchor":"middle"});
15831583+ t.textContent = text;
15841584+ methG.appendChild(t);
15851585+ }
15861586+ methodLabel(470, 68, "aoh:iucn-2022:range-intersect-habitat");
15871587+ methodLabel(180, 190, "alpha-shape:alpha-0.005");
15881588+ methodLabel(700, 142, "habitat-classify:tessera-v3.1");
15891589+ methodLabel(470, 268, "training-set:balanced-spatial-sample");
15901590+ dagSvg.appendChild(methG);
15911591+}
15921592+15931593+// ═══════════════════════════════════════════════════════
15941594+// TRAINING CYCLE DIAGRAM
15951595+// ═══════════════════════════════════════════════════════
15961596+15971597+function buildCycleDiagram() {
15981598+ var svg = document.getElementById("cycleSvg");
15991599+ var W = 880, H = 310;
16001600+ svg.setAttribute("viewBox", "0 0 " + W + " " + H);
16011601+ svg.setAttribute("width", "100%");
16021602+ svg.setAttribute("height", H);
16031603+ svg.innerHTML = "";
16041604+16051605+ function el(tag, attrs) {
16061606+ var e = document.createElementNS("http://www.w3.org/2000/svg", tag);
16071607+ for (var k in attrs) e.setAttribute(k, attrs[k]);
16081608+ return e;
16091609+ }
16101610+16111611+ // Arrow defs
16121612+ var defs = el("defs", {});
16131613+ var m = el("marker", {id:"cycle-arrow", markerWidth:"10", markerHeight:"7",
16141614+ refX:"10", refY:"3.5", orient:"auto"});
16151615+ m.appendChild(el("polygon", {points:"0 0, 10 3.5, 0 7", fill:"#64748b"}));
16161616+ defs.appendChild(m);
16171617+ var m2 = el("marker", {id:"cycle-arrow-red", markerWidth:"10", markerHeight:"7",
16181618+ refX:"10", refY:"3.5", orient:"auto"});
16191619+ m2.appendChild(el("polygon", {points:"0 0, 10 3.5, 0 7", fill:"#dc2626"}));
16201620+ defs.appendChild(m2);
16211621+ svg.appendChild(defs);
16221622+16231623+ // Phases
16241624+ var phases = [
16251625+ {x:20, y:80, w:120, h:80, color:"#ea580c", title:"Observations", sub:"Accumulate",
16261626+ detail:["Camera traps, GPS,","GBIF, iNat, IUCN"]},
16271627+ {x:170, y:80, w:120, h:80, color:"#7c3aed", title:"Synthetic", sub:"Augment",
16281628+ detail:["Lotka-Volterra","simulation"]},
16291629+ {x:320, y:80, w:120, h:80, color:"#a16207", title:"Training Set", sub:"Assemble",
16301630+ detail:["Balanced sample","23% synthetic"]},
16311631+ {x:470, y:80, w:120, h:80, color:"#0891b2", title:"Train Model", sub:"TESSERA v3.1",
16321632+ detail:["Habitat classifier","from embeddings"]},
16331633+ {x:620, y:80, w:120, h:80, color:"#65a30d", title:"Classify", sub:"Habitat tiles",
16341634+ detail:["Savanna, shrubland,","cropland tiles"]},
16351635+ {x:770, y:80, w:90, h:80, color:"#16a34a", title:"AOH", sub:"Compute",
16361636+ detail:["Range \u2229 suitable","habitat"]}
16371637+ ];
16381638+16391639+ phases.forEach(function(p, i) {
16401640+ svg.appendChild(el("rect", {
16411641+ x:p.x+2, y:p.y+2, width:p.w, height:p.h, rx:8, fill:"#e2e8f0"
16421642+ }));
16431643+ svg.appendChild(el("rect", {
16441644+ x:p.x, y:p.y, width:p.w, height:p.h, rx:8,
16451645+ fill:"#fff", stroke:p.color, "stroke-width":"2"
16461646+ }));
16471647+ svg.appendChild(el("rect", {
16481648+ x:p.x, y:p.y, width:p.w, height:6, rx:3, fill:p.color
16491649+ }));
16501650+16511651+ var t = el("text", {x:p.x+p.w/2, y:p.y+28, "text-anchor":"middle",
16521652+ "font-size":"12", "font-weight":"700", fill:p.color});
16531653+ t.textContent = p.title;
16541654+ svg.appendChild(t);
16551655+16561656+ var s = el("text", {x:p.x+p.w/2, y:p.y+42, "text-anchor":"middle",
16571657+ "font-size":"9", fill:"#64748b"});
16581658+ s.textContent = p.sub;
16591659+ svg.appendChild(s);
16601660+16611661+ p.detail.forEach(function(line, li) {
16621662+ var d = el("text", {x:p.x+p.w/2, y:p.y+56+li*12, "text-anchor":"middle",
16631663+ "font-size":"8", fill:"#94a3b8"});
16641664+ d.textContent = line;
16651665+ svg.appendChild(d);
16661666+ });
16671667+16681668+ // Phase number
16691669+ var numC = el("circle", {cx:p.x+12, cy:p.y-10, r:10, fill:p.color});
16701670+ svg.appendChild(numC);
16711671+ var numT = el("text", {x:p.x+12, y:p.y-6, "text-anchor":"middle",
16721672+ "font-size":"10", "font-weight":"700", fill:"#fff"});
16731673+ numT.textContent = String(i+1);
16741674+ svg.appendChild(numT);
16751675+ });
16761676+16771677+ // Forward arrows
16781678+ [[0,2],[1,2],[2,3],[3,4],[4,5]].forEach(function(pair) {
16791679+ var from = phases[pair[0]], to = phases[pair[1]];
16801680+ svg.appendChild(el("line", {
16811681+ x1: from.x + from.w + 4, y1: from.y + from.h/2,
16821682+ x2: to.x - 4, y2: to.y + to.h/2,
16831683+ stroke:"#64748b", "stroke-width":"2",
16841684+ "marker-end":"url(#cycle-arrow)"
16851685+ }));
16861686+ });
16871687+16881688+ // Species range bypass arc
16891689+ var obs = phases[0], aoh = phases[5];
16901690+ svg.appendChild(el("path", {
16911691+ d: "M " + (obs.x + obs.w/2) + " " + (obs.y + obs.h) +
16921692+ " Q " + (obs.x + obs.w/2) + " " + (obs.y + obs.h + 65) +
16931693+ " " + 620 + " " + (obs.y + obs.h + 65) +
16941694+ " Q " + (aoh.x + aoh.w/2) + " " + (obs.y + obs.h + 65) +
16951695+ " " + (aoh.x + aoh.w/2) + " " + (aoh.y + aoh.h),
16961696+ fill:"none", stroke:"#3b82f6", "stroke-width":"1.5", "stroke-dasharray":"5,3",
16971697+ "marker-end":"url(#cycle-arrow)"
16981698+ }));
16991699+ var rangeLabel = el("text", {x:400, y:obs.y + obs.h + 60, "text-anchor":"middle",
17001700+ "font-size":"9", fill:"#3b82f6", "font-weight":"600"});
17011701+ rangeLabel.textContent = "Species Range (measured-only, alpha-shape)";
17021702+ svg.appendChild(rangeLabel);
17031703+17041704+ // Recomputation feedback
17051705+ svg.appendChild(el("path", {
17061706+ d: "M " + (aoh.x + aoh.w) + " " + (aoh.y + 20) +
17071707+ " C " + (aoh.x + aoh.w + 30) + " " + (aoh.y + 20) +
17081708+ " " + (aoh.x + aoh.w + 30) + " 22 " +
17091709+ " " + (phases[2].x + phases[2].w/2) + " 22" +
17101710+ " L " + (phases[2].x + phases[2].w/2) + " " + (phases[2].y - 2),
17111711+ fill:"none", stroke:"#dc2626", "stroke-width":"1.5", "stroke-dasharray":"4,3",
17121712+ "marker-end":"url(#cycle-arrow-red)"
17131713+ }));
17141714+ var recompLabel = el("text", {x:590, y:16, "text-anchor":"middle",
17151715+ "font-size":"9", fill:"#dc2626", "font-weight":"600"});
17161716+ recompLabel.textContent = "Recompute when new observations arrive or model retrains";
17171717+ svg.appendChild(recompLabel);
17181718+17191719+ // Origin type labels
17201720+ [
17211721+ {x:80, y:175, text:"Measured", color:"#ea580c"},
17221722+ {x:230, y:175, text:"Simulated", color:"#7c3aed"},
17231723+ {x:380, y:175, text:"Derived", color:"#a16207"}
17241724+ ].forEach(function(ol) {
17251725+ var t = el("text", {x:ol.x, y:ol.y, "text-anchor":"middle",
17261726+ "font-size":"8", fill:ol.color, "font-weight":"600"});
17271727+ t.textContent = ol.text;
17281728+ svg.appendChild(t);
17291729+ });
17301730+17311731+ // Bottom note
17321732+ var note = el("text", {x:W/2, y:H - 15, "text-anchor":"middle",
17331733+ "font-size":"10", fill:"#64748b", "font-style":"italic"});
17341734+ note.textContent = "Each phase produces labels that become inputs to the next. Old versions are retained for comparison.";
17351735+ svg.appendChild(note);
17361736+}
17371737+17381738+// ═══════════════════════════════════════════════════════
17391739+// INITIALISE
17401740+// ═══════════════════════════════════════════════════════
17411741+17421742+document.addEventListener("DOMContentLoaded", function() {
17431743+ buildMap();
17441744+ setupToggles();
17451745+ buildDAG();
17461746+ buildCycleDiagram();
17471747+});
17481748+17491749+})();
17501750+</script>
17511751+17521752+</body>
17531753+</html>
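Note on the provenance DAG in the page script above: `buildDAG` silently skips any edge whose endpoint is missing from `nodes` (`if (!from || !to) return;`), so a typo in a node id simply drops an arrow. A minimal sketch of a sanity check follows, assuming the same `[from, to]` edge shape. `checkDag` is a hypothetical helper that does not exist in the page, and `gps-999` is a deliberately bogus id used only to show the dangling-edge report. It uses Kahn's algorithm to confirm the graph is acyclic.

```javascript
// Hypothetical helper, not part of the page script: validates an edge list
// shaped like the `edges` array in buildDAG ([from, to] pairs of node ids).
function checkDag(nodeIds, edges) {
  // Prototype-free maps so ids such as "constructor" cannot collide.
  var known = Object.create(null);
  var indegree = Object.create(null);
  var out = Object.create(null);
  nodeIds.forEach(function (id) {
    known[id] = true;
    indegree[id] = 0;
    out[id] = [];
  });

  // Edges that name a node missing from the node table.
  var dangling = edges.filter(function (e) {
    return !known[e[0]] || !known[e[1]];
  });

  // Build adjacency and in-degree counts from the valid edges only.
  edges.forEach(function (e) {
    if (!known[e[0]] || !known[e[1]]) return;
    out[e[0]].push(e[1]);
    indegree[e[1]] += 1;
  });

  // Kahn's algorithm: repeatedly remove nodes with no remaining inputs.
  var queue = nodeIds.filter(function (id) { return indegree[id] === 0; });
  var order = [];
  while (queue.length > 0) {
    var id = queue.shift();
    order.push(id);
    out[id].forEach(function (next) {
      indegree[next] -= 1;
      if (indegree[next] === 0) queue.push(next);
    });
  }

  // If a cycle exists, its nodes never reach in-degree 0 and are left out.
  return {
    dangling: dangling,
    acyclic: order.length === nodeIds.length,
    order: order
  };
}

// Small subset of the graph, plus one deliberately bad edge.
var result = checkDag(
  ["ct-001", "range-001", "hab-001", "aoh-001"],
  [["ct-001", "range-001"], ["range-001", "aoh-001"],
   ["hab-001", "aoh-001"], ["gps-999", "range-001"]]
);
console.log(result.acyclic, result.dangling.length, result.order.join(" > "));
```

In the page itself this could be called as `checkDag(Object.keys(nodes), edges)` before drawing. The returned `order` lists inputs before the labels derived from them, which matches the direction of the arrows in the diagram.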
+15
dune-project
···11+(lang dune 3.16)
22+(name terradots)
33+(generate_opam_files true)
44+55+(package
66+ (name terradots)
77+ (synopsis "Geospatial label store for planetary observation data")
88+ (description
99+ "A data model for geospatial labels — human observations, registry
1010+ imports, simulation outputs, and derived annotations used to train
1111+ geospatial foundation models. Supports full provenance tracking,
1212+ Hilbert curve spatial indexing, and Darwin Core temporal conventions.")
1313+ (license ISC)
1414+ (depends
1515+ (ocaml (>= 5.2))))
+846
example/aoh_example.ml
···11+(** AOH worked example: {i Panthera leo} in the Serengeti ecosystem.
22+33+ Demonstrates the full label pipeline from raw observations
44+ through synthetic simulation to Area of Habitat calculation,
55+ integrating data from:
66+77+ - Camera traps (Serengeti Lion Project grid)
88+ - GPS collars (Movebank study 1234)
99+ - GBIF occurrence records
1010+ - iNaturalist citizen science observations
1111+ - IUCN Red List expert range and habitat preferences
1212+ - Lotka-Volterra population simulation (synthetic)
1313+ - TESSERA v3.1 habitat classification
1414+1515+ The provenance graph:
1616+ {v
1717+ AOH polygon
1818+ ├── species_range (alpha-shape, measured-only)
1919+ │ ├── camera trap detections
2020+ │ ├── GPS collar fixes (Movebank)
2121+ │ ├── GBIF occurrences
2222+ │ └── iNaturalist observations
2323+ ├── IUCN expert range (validation)
2424+ └── habitat suitability tiles (TESSERA)
2525+ └── training set
2626+ ├── all measured occurrences
2727+ ├── IUCN habitat preferences
2828+ └── synthetic augmentation (Lotka-Volterra)
2929+ v} *)
3030+3131+open Terradots
3232+3333+let ed = event_date_of_string
3434+let c = cell_of_string
3535+3636+(* ══════════════════════════════════════════════════════════
3737+ 1. Activities — the audit trail
3838+3939+ Each activity links a batch of labels to who/what produced
4040+ them and when. The [agent] field points to Fairground
4141+ notebook URIs where applicable.
4242+ ══════════════════════════════════════════════════════════ *)
4343+4444+let act_field_survey =
4545+ { activity_id = "act-field-2024";
4646+ agent = "orcid:0000-0002-1234-5678";
4747+ date = "2024-06-15T08:00:00Z";
4848+ description = Some "Serengeti Lion Project 2024 dry-season \
4949+ camera trap survey" }
5050+5151+let act_movebank_import =
5252+ { activity_id = "act-movebank-import";
5353+ agent = "fairground:notebook/movebank-ingest:v2";
5454+ date = "2024-07-01T12:00:00Z";
5555+ description = Some "Bulk import of GPS collar data from \
5656+ Movebank study 1234, individuals leo-007 \
5757+ and leo-012" }
5858+5959+let act_gbif_import =
6060+ { activity_id = "act-gbif-import";
6161+ agent = "fairground:notebook/gbif-ingest:v3";
6262+ date = "2024-07-02T10:00:00Z";
6363+ description = Some "GBIF Panthera leo occurrences, East Africa, \
6464+ 2020-2024" }
6565+6666+let act_inat_import =
6767+ { activity_id = "act-inat-import";
6868+ agent = "fairground:notebook/inat-ingest:v1";
6969+ date = "2024-07-02T14:00:00Z";
7070+ description = Some "iNaturalist research-grade P. leo observations, \
7171+ Serengeti-Mara ecosystem" }
7272+7373+let act_iucn_import =
7474+ { activity_id = "act-iucn-import";
7575+ agent = "fairground:notebook/iucn-ingest:v1";
7676+ date = "2024-07-03T09:00:00Z";
7777+ description = Some "IUCN Red List Panthera leo assessment: expert \
7878+ range polygon and habitat preference codes" }
7979+8080+let act_simulation =
8181+ { activity_id = "act-sim-lv-001";
8282+ agent = "fairground:notebook/lotka-volterra-serengeti:v4@cell-7";
8383+ date = "2024-07-10T16:00:00Z";
8484+ description = Some "Lotka-Volterra predator-prey simulation, \
8585+ lion-zebra-wildebeest, Serengeti parameterisation, \
8686+ 100-year projection, seed=42" }
8787+8888+let act_training_set =
8989+ { activity_id = "act-training-2024";
9090+ agent = "fairground:notebook/sdm-training:v2";
9191+ date = "2024-07-15T10:00:00Z";
9292+ description = Some "Assemble training set for P. leo SDM: \
9393+ balanced spatial sample with synthetic \
9494+ augmentation from Lotka-Volterra run" }
9595+9696+let act_habitat =
9797+ { activity_id = "act-habitat-2024";
9898+ agent = "fairground:notebook/habitat-classify:v3";
9999+ date = "2024-07-16T09:00:00Z";
100100+ description = Some "Habitat suitability classification from \
101101+ TESSERA v3.1 land-cover embeddings, \
102102+ thresholded against IUCN habitat codes" }
103103+104104+let act_range =
105105+ { activity_id = "act-range-2024";
106106+ agent = "fairground:notebook/species-range:v2";
107107+ date = "2024-07-16T11:00:00Z";
108108+ description = Some "Alpha-shape species range from all verified \
109109+ occurrences (measured-only, no synthetic)" }
110110+111111+let act_aoh =
112112+ { activity_id = "act-aoh-2024";
113113+ agent = "fairground:notebook/aoh-iucn:v3";
114114+ date = "2024-07-16T14:00:00Z";
115115+ description = Some "IUCN Area of Habitat: species range intersected \
116116+ with suitable habitat tiles" }
117117+118118+(* ══════════════════════════════════════════════════════════
119119+ 2. Camera trap observations — Serengeti Lion Project
120120+121121+ Fixed sensors in the Serengeti NP grid. Each trigger
122122+ produces a Point at the trap's surveyed coordinates.
123123+ Hilbert cells b7a–b7f cover the Serengeti at level 12.
124124+ ══════════════════════════════════════════════════════════ *)
125125+126126+let trap_01 =
127127+ make_point
128128+ ~cell:(c "b7a") ~id:"ct-001"
129129+ ~x:34.82 ~y:(-2.33)
130130+ ~observer:"urn:sensor:camera-trap:serengeti-node-17"
131131+ ~class_dist:[("Panthera leo", 1.0)] ~accuracy_m:5.0 ~confidence:0.97
132132+ ~event_date:(ed "2024-06-12T05:42:00Z")
133133+ ~activity:"act-field-2024"
134134+ ~properties:[
135135+ ("image_uri", "s3://slp/ct17/IMG_4821.jpg");
136136+ ("individual_count", "3");
137137+ ("behaviour", "resting")]
138138+ ()
139139+140140+let trap_02 =
141141+ make_point
142142+ ~cell:(c "b7a") ~id:"ct-002"
143143+ ~x:34.83 ~y:(-2.32)
144144+ ~observer:"urn:sensor:camera-trap:serengeti-node-17"
145145+ ~class_dist:[("Panthera leo", 1.0)] ~accuracy_m:5.0 ~confidence:0.92
146146+ ~event_date:(ed "2024-06-14T19:15:00Z")
147147+ ~activity:"act-field-2024"
148148+ ~properties:[
149149+ ("image_uri", "s3://slp/ct17/IMG_4903.jpg");
150150+ ("individual_count", "1");
151151+ ("behaviour", "walking")]
152152+ ()
153153+154154+let trap_03 =
155155+ make_point
156156+ ~cell:(c "b7c") ~id:"ct-003"
157157+ ~x:35.01 ~y:(-2.15)
158158+ ~observer:"urn:sensor:camera-trap:serengeti-node-42"
159159+ ~class_dist:[("Panthera leo", 1.0)] ~accuracy_m:5.0 ~confidence:0.88
160160+ ~event_date:(ed "2024-06-18T03:22:00Z")
161161+ ~activity:"act-field-2024"
162162+ ~properties:[
163163+ ("image_uri", "s3://slp/ct42/IMG_1207.jpg");
164164+ ("individual_count", "2")]
165165+ ()
166166+167167+(** Non-detection: trap triggered by motion but no lion present.
168168+ These matter for occupancy models — absence data is data. *)
169169+let trap_04 =
170170+ make_point
171171+ ~cell:(c "b7d") ~id:"ct-004"
172172+ ~x:35.22 ~y:(-2.45)
173173+ ~observer:"urn:sensor:camera-trap:serengeti-node-55"
174174+ ~event_date:(ed "2024-06-20T22:10:00Z")
175175+ ~activity:"act-field-2024"
176176+ ~properties:[
177177+ ("image_uri", "s3://slp/ct55/IMG_0891.jpg");
178178+ ("trigger", "motion");
179179+ ("species_detected", "none")]
180180+ ()
181181+182182+(* ══════════════════════════════════════════════════════════
183183+ 3. GPS collar tracks — Movebank study 1234
184184+185185+ Imported via the Movebank registry. Each fix is a Point
186186+ with the collar as observer and the Movebank event URI as
187187+ the registry record ([via]).
188188+ ══════════════════════════════════════════════════════════ *)
189189+190190+(** Individual leo-007: three fixes showing movement NE. *)
191191+let gps_01 =
192192+ make_imported
193193+ ~cell:(c "b7a") ~id:"gps-001"
194194+ ~geometry:(Point { x = 34.81; y = -2.34 })
195195+ ~via:"movebank:study/1234/individual/leo-007/event/98001"
196196+ ~observer:"urn:sensor:gps:vectronic-vertex-plus-007"
197197+ ~license:"CC-BY-NC-4.0"
198198+ ~accuracy_m:3.5
199199+ ~class_dist:[("Panthera leo", 1.0)]
200200+ ~event_date:(ed "2024-06-10T06:00:00Z")
201201+ ~activity:"act-movebank-import"
202202+ ~properties:[
203203+ ("individual_id", "leo-007");
204204+ ("fix_type", "3D"); ("hdop", "0.9")]
205205+ ()
206206+207207+let gps_02 =
208208+ make_imported
209209+ ~cell:(c "b7a") ~id:"gps-002"
210210+ ~geometry:(Point { x = 34.84; y = -2.31 })
211211+ ~via:"movebank:study/1234/individual/leo-007/event/98002"
212212+ ~observer:"urn:sensor:gps:vectronic-vertex-plus-007"
213213+ ~license:"CC-BY-NC-4.0"
214214+ ~accuracy_m:4.2
215215+ ~class_dist:[("Panthera leo", 1.0)]
216216+ ~event_date:(ed "2024-06-10T12:00:00Z")
217217+ ~activity:"act-movebank-import"
218218+ ~properties:[
219219+ ("individual_id", "leo-007");
220220+ ("fix_type", "3D"); ("hdop", "1.1")]
221221+ ()
222222+223223+let gps_03 =
224224+ make_imported
225225+ ~cell:(c "b7b") ~id:"gps-003"
226226+ ~geometry:(Point { x = 34.91; y = -2.28 })
227227+ ~via:"movebank:study/1234/individual/leo-007/event/98003"
228228+ ~observer:"urn:sensor:gps:vectronic-vertex-plus-007"
229229+ ~license:"CC-BY-NC-4.0"
230230+ ~accuracy_m:5.1
231231+ ~class_dist:[("Panthera leo", 1.0)]
232232+ ~event_date:(ed "2024-06-11T06:00:00Z")
233233+ ~activity:"act-movebank-import"
234234+ ~properties:[
235235+ ("individual_id", "leo-007");
236236+ ("fix_type", "3D"); ("hdop", "1.4")]
237237+ ()
238238+239239+(** Individual leo-012: separate pride, further east. *)
240240+let gps_04 =
241241+ make_imported
242242+ ~cell:(c "b7c") ~id:"gps-004"
243243+ ~geometry:(Point { x = 35.05; y = -2.10 })
244244+ ~via:"movebank:study/1234/individual/leo-012/event/98501"
245245+ ~observer:"urn:sensor:gps:vectronic-vertex-plus-012"
246246+ ~license:"CC-BY-NC-4.0"
247247+ ~accuracy_m:3.0
248248+ ~class_dist:[("Panthera leo", 1.0)]
249249+ ~event_date:(ed "2024-06-12T06:00:00Z")
250250+ ~activity:"act-movebank-import"
251251+ ~properties:[("individual_id", "leo-012")]
252252+ ()
253253+254254+(* ══════════════════════════════════════════════════════════
255255+ 4. GBIF occurrence records

   Field-survey observations aggregated through GBIF (both
   records have basis_of_record HUMAN_OBSERVATION). Accuracy
   varies: the 2021 record has 500 m uncertainty and is
   flagged for review in the annotations.
260260+ ══════════════════════════════════════════════════════════ *)
261261+262262+let gbif_01 =
263263+ make_imported
264264+ ~cell:(c "b7a") ~id:"gbif-001"
265265+ ~geometry:(Point { x = 34.85; y = -2.35 })
266266+ ~via:"gbif:4023589127"
267267+ ~license:"CC-BY-4.0"
268268+ ~accuracy_m:100.0
269269+ ~class_dist:[("Panthera leo", 1.0)]
270270+ ~event_date:(ed "2022-08-14")
271271+ ~activity:"act-gbif-import"
272272+ ~properties:[
273273+ ("gbif_dataset", "serengeti-biodiversity-survey");
274274+ ("basis_of_record", "HUMAN_OBSERVATION");
275275+ ("recorded_by", "Tanzania Wildlife Research Institute")]
276276+ ()
277277+278278+let gbif_02 =
279279+ make_imported
280280+ ~cell:(c "b7e") ~id:"gbif-002"
281281+ ~geometry:(Point { x = 35.40; y = -2.50 })
282282+ ~via:"gbif:4023589999"
283283+ ~license:"CC-BY-4.0"
284284+ ~accuracy_m:500.0
285285+ ~class_dist:[("Panthera leo", 1.0)]
286286+ ~event_date:(ed "2021")
287287+ ~activity:"act-gbif-import"
288288+ ~properties:[
289289+ ("gbif_dataset", "ngorongoro-mammal-survey");
290290+ ("basis_of_record", "HUMAN_OBSERVATION")]
291291+ ()
292292+293293+(* ══════════════════════════════════════════════════════════
294294+ 5. iNaturalist citizen science
295295+296296+ Research-grade observations from the iNaturalist platform.
297297+ The observer is a user URI; the record is the observation URI.
298298+ ══════════════════════════════════════════════════════════ *)
299299+300300+let inat_01 =
301301+ make_imported
302302+ ~cell:(c "b7b") ~id:"inat-001"
303303+ ~geometry:(Point { x = 34.95; y = -2.20 })
304304+ ~via:"inaturalist:observation/182345678"
305305+ ~observer:"inaturalist:user/safari_dave"
306306+ ~license:"CC-BY-NC-4.0"
307307+ ~accuracy_m:50.0
308308+ ~class_dist:[("Panthera leo", 1.0)]
309309+ ~confidence:0.95
310310+ ~event_date:(ed "2023-07-22T16:30:00Z")
311311+ ~activity:"act-inat-import"
312312+ ~properties:[
313313+ ("quality_grade", "research");
314314+ ("num_identifications", "5")]
315315+ ()
316316+317317+(* ══════════════════════════════════════════════════════════
318318+ 6. IUCN Red List — expert range and habitat preferences
319319+320320+ The IUCN assessment provides two things:
321321+ (a) An expert-drawn range polygon for the species.
322322+ (b) Habitat preference codes (IUCN Habitats Classification
323323+ Scheme) with suitability ratings.
324324+325325+ The range polygon validates the data-driven range; the
326326+ habitat codes drive the suitability classification.
327327+ ══════════════════════════════════════════════════════════ *)
328328+329329+(** Expert-drawn range polygon (simplified to bounding extent). *)
330330+let iucn_range =
331331+ make_imported
332332+ ~cell:(c "b70") ~id:"iucn-range-001"
333333+ ~geometry:(Polygon [
334334+ { x = 34.0; y = -3.0 };
335335+ { x = 36.0; y = -3.0 };
336336+ { x = 36.0; y = -1.0 };
337337+ { x = 34.0; y = -1.0 };
338338+ { x = 34.0; y = -3.0 };
339339+ ])
340340+ ~via:"iucn:redlist:22/Panthera-leo:range:2024.1"
341341+ ~license:"CC-BY-NC-4.0"
342342+ ~class_dist:[("Panthera leo", 1.0)]
343343+ ~event_date:(ed "2024")
344344+ ~activity:"act-iucn-import"
345345+ ~properties:[
346346+ ("iucn_status", "VU");
347347+ ("iucn_criteria", "A2abcd");
348348+ ("population_trend", "decreasing");
349349+ ("range_type", "extant:resident");
350350+ ("habitat_codes", "1.5;1.6;2;3;14.1")]
351351+ ()
352352+353353+(** Habitat preference: savanna (IUCN code 2) — major habitat. *)
354354+let iucn_hab_savanna =
355355+ make_imported
356356+ ~cell:(c "b70") ~id:"iucn-hab-001"
357357+ ~geometry:(Point { x = 35.0; y = -2.0 })
358358+ ~via:"iucn:redlist:22/Panthera-leo:habitat:2"
359359+ ~license:"CC-BY-NC-4.0"
360360+ ~class_dist:[("habitat-preference:savanna", 1.0)]
361361+ ~confidence:0.95
362362+ ~activity:"act-iucn-import"
363363+ ~properties:[
364364+ ("iucn_habitat_code", "2");
365365+ ("suitability", "Suitable");
366366+ ("major_importance", "Yes")]
367367+ ()
368368+369369+(** Habitat preference: shrubland (IUCN code 3) — minor habitat. *)
370370+let iucn_hab_shrubland =
371371+ make_imported
372372+ ~cell:(c "b70") ~id:"iucn-hab-002"
373373+ ~geometry:(Point { x = 35.0; y = -2.0 })
374374+ ~via:"iucn:redlist:22/Panthera-leo:habitat:3"
375375+ ~license:"CC-BY-NC-4.0"
376376+ ~class_dist:[("habitat-preference:shrubland", 1.0)]
377377+ ~confidence:0.70
378378+ ~activity:"act-iucn-import"
379379+ ~properties:[
380380+ ("iucn_habitat_code", "3");
381381+ ("suitability", "Suitable");
382382+ ("major_importance", "No")]
383383+ ()
384384+385385+(* ══════════════════════════════════════════════════════════
386386+ 7. Synthetic simulation — Lotka-Volterra population dynamics
387387+388388+ Agent-based Lotka-Volterra model producing simulated lion
389389+ positions in under-sampled areas (Ngorongoro corridor).
390390+ These augment the SDM training set but are NEVER included
391391+ in the measured species range.
392392+393393+ The [Simulated] origin keeps them type-level distinct from
394394+ real observations. Properties carry the scenario parameters
395395+ for reproducibility.
396396+ ══════════════════════════════════════════════════════════ *)
397397+398398+let sim_01 =
399399+ make_simulated
400400+ ~cell:(c "b7d") ~id:"sim-001"
401401+ ~geometry:(Point { x = 35.20; y = -2.50 })
402402+ ~model:"fairground:notebook/lotka-volterra-serengeti:v4"
403403+ ~run_id:"lv-run-42"
404404+ ~class_dist:[("Panthera leo", 1.0)]
405405+ ~event_date:(ed "2024-06-15T00:00:00Z")
406406+ ~confidence:0.60
407407+ ~activity:"act-sim-lv-001"
408408+ ~properties:[
409409+ ("scenario", "baseline-2024");
410410+ ("time_step", "150");
411411+ ("prey_density_km2", "45.2");
412412+ ("seed", "42")]
413413+ ()
414414+415415+let sim_02 =
416416+ make_simulated
417417+ ~cell:(c "b7d") ~id:"sim-002"
418418+ ~geometry:(Point { x = 35.18; y = -2.48 })
419419+ ~model:"fairground:notebook/lotka-volterra-serengeti:v4"
420420+ ~run_id:"lv-run-42"
421421+ ~class_dist:[("Panthera leo", 1.0)]
422422+ ~event_date:(ed "2024-06-15T06:00:00Z")
423423+ ~confidence:0.60
424424+ ~activity:"act-sim-lv-001"
425425+ ~properties:[
426426+ ("scenario", "baseline-2024");
427427+ ("time_step", "151");
428428+ ("prey_density_km2", "44.8");
429429+ ("seed", "42")]
430430+ ()
431431+432432+(** Drought scenario — prey density drops, lion shifts south. *)
433433+let sim_03 =
434434+ make_simulated
435435+ ~cell:(c "b7e") ~id:"sim-003"
436436+ ~geometry:(Point { x = 35.45; y = -2.55 })
437437+ ~model:"fairground:notebook/lotka-volterra-serengeti:v4"
438438+ ~run_id:"lv-run-42"
439439+ ~class_dist:[("Panthera leo", 1.0)]
440440+ ~event_date:(ed "2024-06-16T00:00:00Z")
441441+ ~confidence:0.55
442442+ ~activity:"act-sim-lv-001"
443443+ ~properties:[
444444+ ("scenario", "drought-2024");
445445+ ("time_step", "152");
446446+ ("prey_density_km2", "28.1");
447447+ ("seed", "42")]
448448+ ()
449449+450450+(* ══════════════════════════════════════════════════════════
451451+ 8. Derivation: training set assembly
452452+453453+ The training set is itself a derived label — it records
454454+ exactly which labels (measured + synthetic) were selected
455455+ for model training, and the synthetic fraction.
456456+457457+ This is the provenance anchor for the SDM: you can always
458458+ ask "which observations trained this model?"
459459+ ══════════════════════════════════════════════════════════ *)
460460+461461+let all_measured_ids =
462462+ [ "ct-001"; "ct-002"; "ct-003";
463463+ "gps-001"; "gps-002"; "gps-003"; "gps-004";
464464+ "gbif-001"; "gbif-002";
465465+ "inat-001" ]
466466+467467+let all_synthetic_ids =
468468+ [ "sim-001"; "sim-002"; "sim-003" ]
469469+470470+let training_set =
471471+ make_derived
472472+ ~cell:(c "b70") ~id:"ts-001"
473473+ ~geometry:(Polygon [
474474+ { x = 34.0; y = -3.0 };
475475+ { x = 36.0; y = -3.0 };
476476+ { x = 36.0; y = -1.0 };
477477+ { x = 34.0; y = -1.0 };
478478+ { x = 34.0; y = -3.0 };
479479+ ])
480480+ ~sources:(all_measured_ids @ all_synthetic_ids)
481481+ ~method_:"training-set:balanced-spatial-sample"
482482+ ~class_dist:[("training-set:Panthera-leo:sdm-2024", 1.0)]
483483+ ~activity:"act-training-2024"
484484+ ~properties:[
485485+ ("n_measured", string_of_int (List.length all_measured_ids));
486486+ ("n_synthetic", string_of_int (List.length all_synthetic_ids));
487487+ ("synthetic_fraction", "0.23");
488488+ ("spatial_extent", "34.0,-3.0,36.0,-1.0");
489489+ ("temporal_window", "2021/2024");
490490+ ("tessera_model", "tessera:v3.1:east-africa")]
491491+ ()
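
(* A minimal sketch, not part of the original example: the
   "synthetic_fraction" property above is a hand-written literal.
   Assuming the fraction is n_synthetic / (n_measured + n_synthetic),
   a hypothetical helper such as [synthetic_fraction_string] could
   derive it from the ID lists so the two cannot drift apart. *)
let synthetic_fraction_string ~n_measured ~n_synthetic =
  let total = n_measured + n_synthetic in
  if total = 0 then "0.00"
  else
    Printf.sprintf "%.2f"
      (Float.of_int n_synthetic /. Float.of_int total)

(* With the lists above (10 measured, 3 synthetic) this evaluates to
   "0.23", matching the literal in [training_set]. *)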
492492+493493+(* ══════════════════════════════════════════════════════════
494494+ 9. Derivation: habitat suitability from TESSERA
495495+496496+ Each TESSERA tile is classified as suitable or unsuitable
497497+ based on its land-cover embedding and the IUCN habitat
498498+ preference codes. The [sources] link back to the IUCN
499499+ habitat labels.
500500+ ══════════════════════════════════════════════════════════ *)
501501+502502+let hab_sources = ["iucn-hab-001"; "iucn-hab-002"]
503503+504504+(** Core Serengeti savanna — highly suitable. *)
505505+let hab_01 =
506506+ make_derived
507507+ ~cell:(c "b7a") ~id:"hab-001"
508508+ ~geometry:(Polygon [
509509+ { x = 34.80; y = -2.40 };
510510+ { x = 34.90; y = -2.40 };
511511+ { x = 34.90; y = -2.30 };
512512+ { x = 34.80; y = -2.30 };
513513+ { x = 34.80; y = -2.40 };
514514+ ])
515515+ ~sources:hab_sources
516516+ ~method_:"habitat-classify:tessera-v3.1:threshold-0.6"
517517+ ~confidence:0.91
518518+ ~class_dist:[("savanna", 0.78); ("shrubland", 0.13); ("other", 0.09)]
519519+ ~activity:"act-habitat-2024"
520520+ ~properties:[
521521+ ("tessera_tile", "b7a:034.80:-002.40");
522522+ ("dominant_landcover", "savanna")]
523523+ ()
524524+525525+(** Savanna-shrubland mosaic — moderate suitability. *)
526526+let hab_02 =
527527+ make_derived
528528+ ~cell:(c "b7d") ~id:"hab-002"
529529+ ~geometry:(Polygon [
530530+ { x = 35.10; y = -2.60 };
531531+ { x = 35.20; y = -2.60 };
532532+ { x = 35.20; y = -2.50 };
533533+ { x = 35.10; y = -2.50 };
534534+ { x = 35.10; y = -2.60 };
535535+ ])
536536+ ~sources:hab_sources
537537+ ~method_:"habitat-classify:tessera-v3.1:threshold-0.6"
538538+ ~confidence:0.68
539539+ ~class_dist:[("savanna", 0.45); ("shrubland", 0.30); ("cropland", 0.25)]
540540+ ~activity:"act-habitat-2024"
541541+ ~properties:[
542542+ ("tessera_tile", "b7d:035.10:-002.60");
543543+ ("dominant_landcover", "savanna-shrubland-mosaic")]
544544+ ()
545545+546546+(** Agricultural land — unsuitable, excluded from AOH. *)
547547+let hab_03 =
548548+ make_derived
549549+ ~cell:(c "b7f") ~id:"hab-003"
550550+ ~geometry:(Polygon [
551551+ { x = 35.80; y = -1.20 };
552552+ { x = 35.90; y = -1.20 };
553553+ { x = 35.90; y = -1.10 };
554554+ { x = 35.80; y = -1.10 };
555555+ { x = 35.80; y = -1.20 };
556556+ ])
557557+ ~sources:hab_sources
558558+ ~method_:"habitat-classify:tessera-v3.1:threshold-0.6"
559559+ ~confidence:0.12
560560+ ~class_dist:[("cropland", 0.72); ("settlement", 0.18); ("savanna", 0.10)]
561561+ ~activity:"act-habitat-2024"
562562+ ~properties:[
563563+ ("tessera_tile", "b7f:035.80:-001.20");
564564+ ("dominant_landcover", "cropland")]
565565+ ()
566566+567567+(* ══════════════════════════════════════════════════════════
568568+ 10. Derivation: species range from occurrences
569569+570570+ Alpha-shape computed from measured-only data. Synthetic
571571+ labels are explicitly excluded — the range must reflect
572572+ where lions have actually been observed.
573573+574574+ The [is_simulated] accessor is used by the range pipeline
575575+ to filter out synthetic augmentation.
576576+ ══════════════════════════════════════════════════════════ *)
577577+578578+let species_range =
579579+ make_derived
580580+ ~cell:(c "b70") ~id:"range-001"
581581+ ~geometry:(Polygon [
582582+ { x = 34.75; y = -2.60 };
583583+ { x = 35.50; y = -2.60 };
584584+ { x = 35.50; y = -2.00 };
585585+ { x = 35.10; y = -1.90 };
586586+ { x = 34.75; y = -2.10 };
587587+ { x = 34.75; y = -2.60 };
588588+ ])
589589+ ~sources:all_measured_ids (* no sim-* labels *)
590590+ ~method_:"alpha-shape:alpha-0.005"
591591+ ~class_dist:[("range:Panthera leo", 1.0)]
592592+ ~activity:"act-range-2024"
593593+ ~properties:[
594594+ ("range_km2", "4850");
595595+ ("n_occurrences", string_of_int (List.length all_measured_ids));
596596+ ("excludes_synthetic", "true")]
597597+ ()
598598+599599+(* ══════════════════════════════════════════════════════════
600600+ 11. Derivation: Area of Habitat
601601+602602+ The final AOH is the intersection of:
603603+ - the data-driven species range (measured-only)
604604+ - the TESSERA habitat suitability tiles (suitable only)
605605+ - validated against the IUCN expert range
606606+607607+ The result is a Multi polygon — disconnected habitat
608608+ patches within the range. Properties carry the IUCN
609609+ assessment metadata and the key metrics.
610610+611611+ When TESSERA is retrained (v3.1 → v3.2), the habitat
612612+ tiles change, so AOH recomputes. The new AOH label gets
613613+ a new activity; both versions coexist for comparison.
614614+ ══════════════════════════════════════════════════════════ *)
615615+616616+let aoh =
617617+ make_derived
618618+ ~cell:(c "b70") ~id:"aoh-001"
619619+ ~geometry:(Multi [
620620+ (* Patch 1: core Serengeti savanna *)
621621+ Polygon [
622622+ { x = 34.80; y = -2.40 };
623623+ { x = 35.20; y = -2.40 };
624624+ { x = 35.20; y = -2.10 };
625625+ { x = 34.80; y = -2.10 };
626626+ { x = 34.80; y = -2.40 };
627627+ ];
628628+ (* Patch 2: southern extension into Ngorongoro *)
629629+ Polygon [
630630+ { x = 35.10; y = -2.60 };
631631+ { x = 35.40; y = -2.60 };
632632+ { x = 35.40; y = -2.40 };
633633+ { x = 35.10; y = -2.40 };
634634+ { x = 35.10; y = -2.60 };
635635+ ];
636636+ ])
637637+ ~sources:[
638638+ "range-001"; (* data-driven species range *)
639639+ "iucn-range-001"; (* IUCN expert range — validation *)
640640+ "hab-001"; "hab-002"; (* suitable habitat tiles *)
641641+ (* hab-003 excluded: unsuitable cropland *)
642642+ ]
643643+ ~method_:"aoh:iucn-2022:range-intersect-habitat"
644644+ ~class_dist:[("aoh:Panthera leo", 1.0)]
645645+ ~activity:"act-aoh-2024"
646646+ ~properties:[
647647+ (* AOH metrics *)
648648+ ("aoh_km2", "3420");
649649+ ("range_km2", "4850");
650650+ ("habitat_proportion", "0.705");
651651+ ("unsuitable_excluded_km2", "1430");
652652+ ("dominant_exclusion", "cropland");
653653+ (* IUCN assessment context *)
654654+ ("iucn_status", "VU");
655655+ ("iucn_criteria", "A2abcd");
656656+ ("population_trend", "decreasing");
657657+ (* Model provenance *)
658658+ ("tessera_model", "tessera:v3.1:east-africa");
659659+ ("synthetic_in_sdm_training", "true");
660660+ ("synthetic_fraction_in_training", "0.23")]
661661+ ()
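
(* A minimal sketch, not part of the original example: the AOH
   metrics above are literals. Assuming habitat_proportion is
   aoh_km2 / range_km2 and unsuitable_excluded_km2 is their
   difference, a hypothetical helper [aoh_metrics] reproduces the
   same property values from the two areas. *)
let aoh_metrics ~aoh_km2 ~range_km2 =
  if range_km2 <= 0 then invalid_arg "aoh_metrics: range_km2 must be positive";
  let proportion = Float.of_int aoh_km2 /. Float.of_int range_km2 in
  [ ("aoh_km2", string_of_int aoh_km2);
    ("range_km2", string_of_int range_km2);
    ("habitat_proportion", Printf.sprintf "%.3f" proportion);
    ("unsuitable_excluded_km2", string_of_int (range_km2 - aoh_km2)) ]

(* For aoh_km2 = 3420 and range_km2 = 4850 this yields
   habitat_proportion "0.705" and unsuitable_excluded_km2 "1430",
   the values recorded on [aoh]. *)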
662662+663663+(* ══════════════════════════════════════════════════════════
664664+ 12. Document assembly
665665+ ══════════════════════════════════════════════════════════ *)
666666+667667+let doc =
668668+ { crs = wgs84;
669669+ level = 12;
670670+ provenance = [
671671+ act_field_survey;
672672+ act_movebank_import;
673673+ act_gbif_import;
674674+ act_inat_import;
675675+ act_iucn_import;
676676+ act_simulation;
677677+ act_training_set;
678678+ act_habitat;
679679+ act_range;
680680+ act_aoh;
681681+ ];
682682+ labels = [
683683+ (* Camera traps *)
684684+ trap_01; trap_02; trap_03; trap_04;
685685+ (* GPS collars — Movebank *)
686686+ gps_01; gps_02; gps_03; gps_04;
687687+ (* GBIF *)
688688+ gbif_01; gbif_02;
689689+ (* iNaturalist *)
690690+ inat_01;
691691+ (* IUCN Red List *)
692692+ iucn_range; iucn_hab_savanna; iucn_hab_shrubland;
693693+ (* Synthetic — Lotka-Volterra *)
694694+ sim_01; sim_02; sim_03;
695695+ (* Derivations *)
696696+ training_set;
697697+ hab_01; hab_02; hab_03;
698698+ species_range;
699699+ aoh;
700700+ ];
701701+ annotations = [
702702+ { id = "ann-001";
703703+ text = "Camera trap ct-001 and GPS fix gps-001 are 1.4 km \
704704+ apart on the same day — likely same pride. Consider \
705705+ merge after dry-season survey completes.";
706706+ anchors = ["ct-001"; "gps-001"] };
707707+ { id = "ann-002";
708708+ text = "GBIF gbif-002 has 500 m uncertainty and only year-level \
709709+ temporal precision. Flag for review before including \
710710+ in high-resolution analyses.";
711711+ anchors = ["gbif-002"] };
712712+ { id = "ann-003";
713713+ text = "Synthetic labels sim-001..sim-003 augment the under-sampled \
714714+ Ngorongoro corridor. Weight reduced to 0.5x in training \
715715+ set assembly. Not included in species range computation.";
716716+ anchors = ["sim-001"; "sim-002"; "sim-003"] };
717717+ { id = "ann-004";
718718+ text = "AOH shows 70.5% of range is suitable habitat. Main \
719719+ exclusion is cropland encroachment on the eastern boundary. \
720720+ Compare with IUCN 2019 assessment (was 78%).";
721721+ anchors = ["aoh-001"] };
722722+ ];
723723+ groups = [
724724+ { id = "grp-field-2024";
725725+ activity = Some "act-field-2024";
726726+ members = ["ct-001"; "ct-002"; "ct-003"; "ct-004"] };
727727+ { id = "grp-leo-007-track";
728728+ activity = Some "act-movebank-import";
729729+ members = ["gps-001"; "gps-002"; "gps-003"] };
730730+ { id = "grp-leo-012-track";
731731+ activity = Some "act-movebank-import";
732732+ members = ["gps-004"] };
733733+ { id = "grp-synthetic-lv42";
734734+ activity = Some "act-sim-lv-001";
735735+ members = ["sim-001"; "sim-002"; "sim-003"] };
736736+ { id = "grp-iucn-habitat-prefs";
737737+ activity = Some "act-iucn-import";
738738+ members = ["iucn-hab-001"; "iucn-hab-002"] };
739739+ ];
740740+ }
741741+742742+(* ══════════════════════════════════════════════════════════
743743+ 13. Queries — demonstrating the provenance graph
744744+745745+ These functions show how a wiki renderer or analysis
746746+ pipeline would traverse the label graph.
747747+ ══════════════════════════════════════════════════════════ *)
748748+749749+(** Find a label by ID. *)
750750+let find id =
751751+ List.find (fun (l : label) -> l.id = id) doc.labels

(** All labels in a Hilbert cell. *)
let in_cell cell =
  List.filter (fun (l : label) -> l.cell = cell) doc.labels
756756+757757+(** All measured (non-synthetic, non-derived) labels. *)
758758+let measured_only () =
759759+ List.filter (fun (l : label) ->
760760+ match l.origin with Measured _ -> true | _ -> false)
761761+ doc.labels
762762+763763+(** All simulated labels. *)
764764+let synthetic_only () =
765765+ List.filter is_simulated doc.labels

(** Immediate sources of a derived label. Source IDs with no
    matching label in [doc] are skipped. *)
let sources_of_label l =
  List.filter_map
    (fun src_id ->
      List.find_opt (fun (src : label) -> src.id = src_id) doc.labels)
    (sources_of l)
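
(** Transitive closure: all labels reachable through [sources].
    Each label is listed once, even when it is reachable along
    several paths (e.g. both habitat tiles share the same IUCN
    sources), so counts are not inflated and cycles terminate. *)
let all_ancestors l =
  let rec go seen = function
    | [] -> List.rev seen
    | (x : label) :: rest ->
      if List.exists (fun (s : label) -> s.id = x.id) seen then go seen rest
      else go (x :: seen) (sources_of_label x @ rest)
  in
  go [] (sources_of_label l)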
781781+782782+(** How many synthetic labels influenced this derivation? *)
783783+let synthetic_ancestor_count l =
784784+ all_ancestors l
785785+ |> List.filter is_simulated
786786+ |> List.length
787787+788788+(** Activity record for a label. *)
789789+let activity_of (l : label) =
790790+ match l.activity with
791791+ | None -> None
792792+ | Some aid ->
793793+ List.find_opt (fun a -> a.activity_id = aid) doc.provenance
794794+795795+(* ══════════════════════════════════════════════════════════
796796+ 14. Main — exercise the provenance queries
797797+ ══════════════════════════════════════════════════════════ *)
798798+799799+let () =
800800+ let n_labels = List.length doc.labels in
801801+ let n_measured = List.length (measured_only ()) in
802802+ let n_synthetic = List.length (synthetic_only ()) in
803803+ let n_derived = n_labels - n_measured - n_synthetic in
804804+ Printf.printf "Terradots AOH Example: Panthera leo, Serengeti\n";
805805+ Printf.printf "══════════════════════════════════════════════\n";
806806+ Printf.printf "CRS: %s Hilbert level: %d\n" doc.crs doc.level;
807807+ Printf.printf "Labels: %d total (%d measured, %d synthetic, %d derived)\n"
808808+ n_labels n_measured n_synthetic n_derived;
809809+ Printf.printf "Activities: %d\n" (List.length doc.provenance);
810810+ Printf.printf "Annotations: %d\n" (List.length doc.annotations);
811811+ Printf.printf "Groups: %d\n\n" (List.length doc.groups);
812812+813813+ (* AOH provenance *)
814814+ let aoh_label = find "aoh-001" in
815815+ Printf.printf "AOH label: %s\n" (label_name aoh_label);
816816+ let props key =
817817+ List.assoc_opt key aoh_label.properties
818818+ |> Option.value ~default:"?" in
819819+ Printf.printf " AOH: %s km² / %s km² range = %s suitable\n"
820820+ (props "aoh_km2") (props "range_km2") (props "habitat_proportion");
821821+ Printf.printf " IUCN status: %s (%s), trend: %s\n"
822822+ (props "iucn_status") (props "iucn_criteria")
823823+ (props "population_trend");
824824+ Printf.printf " TESSERA model: %s\n" (props "tessera_model");
825825+ Printf.printf " Synthetic in training: %s (fraction: %s)\n\n"
826826+ (props "synthetic_in_sdm_training")
827827+ (props "synthetic_fraction_in_training");
828828+829829+ (* Provenance depth *)
830830+ let ancestors = all_ancestors aoh_label in
831831+ let n_syn_ancestors = synthetic_ancestor_count aoh_label in
832832+ Printf.printf "Provenance graph from AOH:\n";
833833+ Printf.printf " Reachable labels: %d\n" (List.length ancestors);
834834+ Printf.printf " Of which synthetic: %d\n" n_syn_ancestors;
835835+836836+ (* Activity for AOH *)
837837+ (match activity_of aoh_label with
838838+ | Some a ->
839839+ Printf.printf " Activity: %s\n" a.activity_id;
840840+ Printf.printf " Agent: %s\n" a.agent;
841841+ Printf.printf " Date: %s\n" a.date
842842+ | None -> ());
843843+844844+ (* Spatial query *)
845845+ Printf.printf "\nLabels in cell b7a: %d\n"
846846+ (List.length (in_cell (c "b7a")))
···11+(** Terradots Label Store — core data model.
22+33+ Coordinates are always in the document's native CRS (e.g. lon/lat
44+ for EPSG:4326, metres for UTM). Pixel-space mapping (affine
55+ transforms, viewBox) is a serialisation concern handled by format
66+ encoders/decoders, not by this module. *)
77+88+(** {1 Coordinate Reference Systems} *)
99+1010+(** Any string that PROJ can resolve: ["EPSG:4326"], WKT2, etc. *)
1111+type crs = string
1212+1313+let wgs84 = "EPSG:4326"
1414+let web_mercator = "EPSG:3857"
1515+1616+(** {1 Temporal} *)
1717+1818+type event_date = string
1919+let event_date_of_string s = s
2020+let string_of_event_date s = s
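
(* A minimal sketch, not part of the original module: [event_date]
   is an unvalidated string. A hypothetical [event_date_precision]
   could classify the ISO 8601 shapes named in the label
   documentation by length and separators alone. It does not check
   that the digits form a real calendar date. *)
type date_precision = Year | Year_month | Day | Date_time | Interval

let event_date_precision (s : string) : date_precision option =
  if String.contains s '/' then Some Interval
  else if String.contains s 'T' then Some Date_time
  else
    match String.length s with
    | 4 -> Some Year          (* e.g. "2023" *)
    | 7 -> Some Year_month    (* e.g. "2023-09" *)
    | 10 -> Some Day          (* e.g. "2023-09-18" *)
    | _ -> None

(* A pipeline could use this to hold back year-only records from
   high-resolution analyses. *)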
2121+2222+(** {1 Spatial indexing} *)
2323+2424+type cell = string
2525+let cell_of_string s = s
2626+let string_of_cell s = s
2727+2828+(** {1 Geometry}
2929+3030+ Points and closed polygons, following OGC Simple Features (ISO 19125).
3131+ Coordinates are in the document's native CRS units. *)
3232+3333+type point = { x : float; y : float }
3434+3535+type geometry =
3636+ | Point of point
3737+ | Polygon of point list (** Exterior ring, closed. *)
3838+ | Multi of geometry list (** GeometryCollection / Multi* *)
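
(** Representative point for spatial indexing: the unweighted mean of
    the ring's vertices (a closed ring counts its first vertex twice),
    not an area-weighted centroid. For [Multi], the mean of the parts'
    representative points. The result is NaN for an empty ring. *)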
4141+let rec centroid = function
4242+ | Point p -> p
4343+ | Polygon ring ->
4444+ let n = Float.of_int (List.length ring) in
4545+ let sx = List.fold_left (fun acc p -> acc +. p.x) 0.0 ring in
4646+ let sy = List.fold_left (fun acc p -> acc +. p.y) 0.0 ring in
4747+ { x = sx /. n; y = sy /. n }
4848+ | Multi gs ->
4949+ let cs = List.map centroid gs in
5050+ centroid (Polygon cs)
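
(* A minimal sketch, not part of the original module: [centroid]
   averages ring vertices, which is cheap but biased towards the
   repeated closing vertex. Where an area-weighted centroid is
   wanted instead, the shoelace formula is one standard option.
   The hypothetical [area_centroid] takes a closed ring (first
   vertex repeated at the end) as (x, y) pairs and returns [None]
   for rings with fewer than four vertices or zero area. *)
let area_centroid (ring : (float * float) list) : (float * float) option =
  (* Accumulate twice the signed area and the weighted coordinate sums
     over consecutive vertex pairs. *)
  let rec go a cx cy = function
    | (x0, y0) :: ((x1, y1) :: _ as rest) ->
      let cross = (x0 *. y1) -. (x1 *. y0) in
      go (a +. cross)
        (cx +. ((x0 +. x1) *. cross))
        (cy +. ((y0 +. y1) *. cross))
        rest
    | _ -> (a, cx, cy)
  in
  if List.length ring < 4 then None
  else
    let twice_area, cx, cy = go 0.0 0.0 0.0 ring in
    if twice_area = 0.0 then None
    else Some (cx /. (3.0 *. twice_area), cy /. (3.0 *. twice_area))

(* For the closed unit square the vertex mean is (0.4, 0.4), while
   [area_centroid] gives (0.5, 0.5). *)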
5151+5252+(** {1 Origin}
5353+5454+ How a label was produced. This is the only part that varies
5555+ by kind — confidence and classification are universal.
5656+5757+ Observers and registries are identified by URI. The scheme
5858+ tells you what kind of source it is:
5959+6060+ {v
6161+ URI meaning
6262+ ──────────────────────────────────── ────────────────────
6363+ orcid:0000-0001-2345-6789 human (ORCID)
6464+ https://ror.org/035dkdb55 institution (ROR)
6565+ urn:sensor:gps:trimble-r12-0042 GPS receiver
6666+ urn:sensor:camera-trap:ct-0042 camera trap
6767+ gbif:4023589127 GBIF occurrence
6868+ inaturalist:observation/12345 iNaturalist record
6969+ osm:node/123456 OpenStreetMap node
7070+ v}
7171+7272+ - {b Measured}: observation by someone/something ([observer]),
7373+ possibly imported via a registry ([via]).
7474+ - {b Derived}: computed from other labels (convex hull, buffer).
7575+ Positional accuracy propagates from sources.
7676+ - {b Simulated}: produced by a theoretical model (population
7777+ dynamics, agent-based simulation). Must remain identifiable
7878+ as synthetic — never mixed with measured observations in
7979+ analyses that require ground truth. *)
8080+8181+type origin =
8282+ | Measured of {
8383+ observer : string option; (** URI of the observer *)
8484+ via : string option; (** URI of the registry record *)
8585+ license : string option; (** SPDX identifier, e.g. ["CC-BY-4.0"] *)
8686+ accuracy_m : float option; (** Positional uncertainty radius (m) *)
8787+ }
8888+ | Derived of {
8989+ sources : string list; (** IDs of source labels *)
9090+ method_ : string; (** e.g. ["convex-hull"], ["buffer-10m"] *)
9191+ }
9292+ | Simulated of {
9393+ model : string; (** URI of the simulation model/notebook *)
9494+ run_id : string; (** Unique simulation run identifier *)
9595+ }
9696+9797+(** {1 Provenance}
9898+9999+ An [activity] is the audit record for how labels were produced:
100100+ who, when, and (for derivations) which inputs and method.
101101+ A label's {!field-label.origin} is the structural summary;
102102+ the optional {!field-label.activity} links to the full record. *)
103103+104104+type activity = {
105105+ activity_id : string;
106106+ agent : string; (** Who or what: email, tool/version, etc. *)
107107+ date : string; (** ISO 8601 *)
108108+ description : string option; (** Free-text note on what was done *)
109109+}
110110+111111+(** {1 Labels}
112112+113113+ {b Identity and spatial indexing.} A label has two name
114114+ components: [cell] and [id].
115115+116116+ - [cell] is a Hilbert curve cell index computed from the
117117+ label's {!centroid} at the document's {!field-document.level}.
118118+ It encodes area and resolution — you can read off where a
119119+ label is from its cell. Recomputed on reprojection.
120120+121121+ - [id] is a unique identifier (e.g. UUID) within the cell.
122122+ Stable, never changes.
123123+124124+ Together, [cell ^ "-" ^ id] gives a spatially-sortable unique
125125+ name. Sorting by this composite groups nearby labels.
126126+127127+ {b Classification.} A label's class is expressed through
128128+ [class_dist] — a probability distribution over class names.
129129+ A definite classification is [class_dist = \[("Panthera leo", 1.0)\]].
130130+ An uncertain classification distributes probability across
131131+ candidates. An unclassified label has [class_dist = \[\]].
132132+ The {!primary_class} accessor returns the most likely class.
133133+134134+ {b Deduplication.} Dedup across sources is a derivation: find
135135+ candidate matches (same [cell] + class agreement + temporal
136136+ overlap), let an expert decide, and merge via
137137+ [Derived { sources = \[a; b\]; method_ = "manual-merge" }].
138138+ Both originals are kept for provenance.
139139+140140+ {b Temporal.} [event_date] follows the Darwin Core
141141+ {{: https://techdocs.gbif.org/en/data-processing/temporal-interpretation}
142142+ temporal interpretation} convention (ISO 8601-1:2019):
143143+ precise dates (["2023-09-18"]), imprecise dates (["2023-09"],
144144+ ["2023"]), date-times (["2023-09-18T13:27:00Z"]), or intervals
145145+ (["2023-09-05/2023-09-18"]). This records when the observation
146146+ was made, not when the label was imported. *)
147147+type label = {
148148+ cell : cell; (** Hilbert cell — spatial index hint *)
149149+ id : string; (** Unique identifier (e.g. UUID) *)
150150+ geometry : geometry;
151151+ origin : origin;
152152+ event_date : event_date option; (** Darwin Core eventDate, ISO 8601 *)
153153+ confidence : float option; (** Semantic confidence ∈ \[0, 1\] *)
154154+ class_dist : (string * float) list; (** Per-class probability distribution *)
155155+ activity : string option; (** Activity ID for full provenance *)
156156+ properties : (string * string) list; (** Extensible key-value metadata *)
157157+}
158158+159159+(** The full spatially-sortable name of a label. *)
160160+let label_name l = string_of_cell l.cell ^ "-" ^ l.id
161161+162162+(** A free-text annotation anchored to one or more labels. *)
163163+type annotation = {
164164+ id : string;
165165+ text : string;
166166+ anchors : string list; (** IDs of labels this annotates *)
167167+}
168168+169169+(** A named group of labels (e.g. a field campaign). *)
170170+type group = {
171171+ id : string;
172172+ activity : string option; (** Activity ID for this group's provenance *)
173173+ members : string list; (** IDs of labels in this group *)
174174+}
175175+176176+(** {1 Document} *)
177177+178178+type document = {
179179+ crs : crs;
180180+ level : int; (** Hilbert curve level for {!field-label.cell} *)
181181+ provenance : activity list;
182182+ labels : label list;
183183+ annotations : annotation list;
184184+ groups : group list;
185185+}
186186+187187+(** {1 Constructors} *)
188188+189189+let make_point ~cell ~id ~x ~y ~observer ?accuracy_m
190190+ ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () =
191191+ { cell; id; geometry = Point { x; y };
192192+ origin = Measured { observer = Some observer; via = None;
193193+ license = None; accuracy_m };
194194+ event_date; confidence; class_dist; activity; properties }
195195+196196+let make_polygon ~cell ~id ~ring ~observer ?accuracy_m
197197+ ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () =
198198+ { cell; id; geometry = Polygon ring;
199199+ origin = Measured { observer = Some observer; via = None;
200200+ license = None; accuracy_m };
201201+ event_date; confidence; class_dist; activity; properties }
202202+203203+let make_imported ~cell ~id ~geometry ~via ?observer ?license
204204+ ?accuracy_m ?event_date ?confidence ?(class_dist = [])
205205+ ?activity ?(properties = []) () =
206206+ { cell; id; geometry;
207207+ origin = Measured { observer; via = Some via; license; accuracy_m };
208208+ event_date; confidence; class_dist; activity; properties }
209209+210210+let make_derived ~cell ~id ~geometry ~sources ~method_
211211+ ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () =
212212+ { cell; id; geometry;
213213+ origin = Derived { sources; method_ };
214214+ event_date; confidence; class_dist; activity; properties }
215215+216216+let make_simulated ~cell ~id ~geometry ~model ~run_id
217217+ ?event_date ?confidence ?(class_dist = []) ?activity ?(properties = []) () =
218218+ { cell; id; geometry;
219219+ origin = Simulated { model; run_id };
220220+ event_date; confidence; class_dist; activity; properties }
221221+222222+(** {1 Accessors} *)
223223+224224+let primary_class l =
225225+ match l.class_dist with
226226+ | [] -> None
227227+ | (c, _) :: _ ->
228228+ Some (List.fold_left
229229+ (fun (bc, bp) (c, p) -> if p > bp then (c, p) else (bc, bp))
230230+ (c, 0.0) l.class_dist
231231+ |> fst)
232232+233233+let accuracy_of l =
234234+ match l.origin with
235235+ | Measured { accuracy_m; _ } -> accuracy_m
236236+ | Derived _ | Simulated _ -> None
237237+238238+let sources_of l =
239239+ match l.origin with
240240+ | Measured _ | Simulated _ -> []
241241+ | Derived { sources; _ } -> sources
242242+243243+(** Registry URI, if this label was imported via a registry. *)
244244+let via_of l =
245245+ match l.origin with
246246+ | Measured { via; _ } -> via
247247+ | Derived _ | Simulated _ -> None
248248+249249+(** Is this label synthetic (from a simulation)? *)
250250+let is_simulated l =
251251+ match l.origin with
252252+ | Simulated _ -> true
253253+ | Measured _ | Derived _ -> false
254254+255255+(** {1 Fingerprinting}
256256+257257+ A fingerprint is [cell + primary class] — a coarse key for
258258+ finding dedup candidates. Event date is deliberately excluded:
259259+ the same feature observed at different times should still match
260260+ as a candidate for human review. *)
261261+262262+let fingerprint l =
263263+ let class_str = Option.value ~default:"_" (primary_class l) in
264264+ string_of_cell l.cell ^ "|" ^ class_str
265265+266266+let empty_document ~crs ?(level = 12) () =
267267+ { crs; level; provenance = []; labels = [];
268268+ annotations = []; groups = [] }
269269+270270+(** {1 Storage Layer}
271271+272272+ The data model above is independent of how labels are stored
273273+ and indexed. This section defines the interface between the
274274+ core types and a storage backend.
275275+276276+ {b Hilbert cell computation.} The [cell] field on each label
277277+ is a hex-encoded Hilbert curve cell index, computed by the
278278+ storage layer from the label's {!centroid} at the document's
279279+ {!field-document.level}. The Hilbert curve maps 2D coordinates
280280+ to a 1D index that preserves spatial locality — nearby points
281281+ get nearby indices.
282282+283283+ Level [n] divides each axis into [2{^n}] cells:
284284+285285+ {v
286286+ Level   EPSG:4326 cell width (longitude)   Hex chars
287287+ 8       ~1.4° (~157 km at the equator)     4
288288+ 12      ~0.088° (~9.8 km)                  6
289289+ 16      ~0.0055° (~610 m)                  8
290290+ 20      ~0.00034° (~38 m)                  10
291291+ v}
292292+293293+ The storage layer must provide:
294294+295295+ {v
296296+ val hilbert_cell : level:int -> crs:crs -> point -> cell
297297+ v}
298298+299299+ which computes the cell for a point in the document's CRS.
300300+ The {!centroid} function gives the representative point for
301301+ any geometry.
302302+303303+ {b Why Hilbert, not Geohash.} Geohash uses a Z-order (Morton)
304304+ curve which has discontinuities at cell boundaries — two
305305+ points close in space can have very different hashes. The
306306+ Hilbert curve has better locality: adjacent cells on the curve
307307+ are always spatially adjacent.
308308+309309+ {b Reprojection.} When a document's CRS changes, all [cell]
310310+ values must be recomputed. The [id] fields remain stable.
311311+312312+ {b Sorted keys.} Concatenating [cell ^ "-" ^ id] (see
313313+ {!label_name}) gives a key that sorts spatially. Any system
314314+ that maintains sorted order (B-tree, LSM, lexicographic file
315315+ listing) gets spatial clustering for free. *)
+726
lib/terradots.mli
···11+(** {0 Terradots Label Store}
22+33+ A data model for geospatial labels — human observations and
44+ derived annotations used to train geospatial foundation models.
55+66+ This module defines the core types for representing labelled
77+ geographic features with full provenance, uncertainty, and
88+ spatial indexing. It is independent of any serialisation format
99+ (SVG, GeoJSON, GeoParquet, etc.); format encoders and decoders
1010+ operate over these types.
1111+1212+ {1 Design Principles}
1313+1414+ {2 Coordinates live in CRS space}
1515+1616+ All coordinates are in the document's native Coordinate Reference
1717+ System (CRS). Pixel-space mapping (affine transforms, SVG
1818+ viewBox) is a serialisation concern, not a data model concern.
1919+ The CRS is specified per document as any string that
2020+ {{: https://proj.org/} PROJ} can resolve: EPSG codes
2121+ (["EPSG:4326"]), WKT2 strings, or PROJ pipeline definitions.
2222+2323+ {2 Origin distinguishes measured, derived, and simulated}
2424+2525+ Every label records how it was produced. {b Measured} labels
2626+ come from direct observation — a GPS receiver, a human expert
2727+ digitising on imagery, or an import from an external registry
2828+ (GBIF, iNaturalist, OpenStreetMap). {b Derived} labels are
2929+ computed from other labels — convex hulls, buffers, manual
3030+ merges during deduplication. {b Simulated} labels are
3131+ produced by theoretical models (population dynamics,
3232+ agent-based simulations, climate projections) and must remain
3333+ identifiable as synthetic — they augment training data or
3434+ explore counterfactual scenarios but do not represent
3535+ real-world observations.
3636+3737+ Measured labels carry positional accuracy (metres). Derived
3838+ labels do not — their accuracy propagates from the source
3939+ labels via the derivation method. Simulated labels carry
4040+ confidence reflecting model reliability, typically lower
4141+ than measured data.
4242+4343+ Confidence and classification are universal: you can be
4444+ confident (or uncertain) about any label regardless of its
4545+ origin.
4646+4747+ {2 URIs identify observers and registries}
4848+4949+ Observers (sensors, humans) and external registries are
5050+ identified by URI. The URI scheme encodes the kind of source:
5151+5252+ {v
5353+ URI meaning
5454+ ──────────────────────────────────── ────────────────────
5555+ orcid:0000-0001-2345-6789 human (ORCID)
5656+ https://ror.org/035dkdb55 institution (ROR)
5757+ urn:sensor:gps:trimble-r12-0042 GPS receiver
5858+ urn:sensor:camera-trap:ct-0042 camera trap
5959+ gbif:4023589127 GBIF occurrence
6060+ inaturalist:observation/12345 iNaturalist record
6161+ osm:node/123456 OpenStreetMap node
6262+ v}
6363+6464+ Adding a new kind of observer or registry requires no code
6565+ changes — just use a new URI scheme.
6666+6767+ {2 Identity and spatial indexing are separate}
6868+6969+ A label has two name components:
7070+7171+ - {b cell}: a Hilbert curve cell index encoding spatial
7272+ locality. Computed from the label's centroid at the
7373+ document's Hilbert level. Recomputed on reprojection.
7474+7575+ - {b id}: a unique identifier (e.g. UUID). Stable across
7676+ reprojections, never changes.
7777+7878+ Concatenating [cell ^ "-" ^ id] (see {!label_name}) gives a
7979+ spatially-sortable unique name. Any sorted index (B-tree, LSM,
8080+ lexicographic file listing) gets spatial clustering for free.
8181+8282+ {2 Classification is a probability distribution}
8383+8484+ A label's class is expressed through {!field-label.class_dist},
8585+ a list of [(class_name, probability)] pairs ordered by
8686+ decreasing probability. A definite classification is a
8787+ singleton list: [\[("Panthera leo", 1.0)\]]. An uncertain
8888+ classification distributes probability across candidates.
8989+ An unclassified label has an empty list.
9090+9191+ The {!primary_class} accessor extracts the highest-probability
9292+ class. The {!fingerprint} function uses the primary class for
9393+ coarse deduplication matching.
9494+9595+ {2 Temporal data follows Darwin Core}
9696+9797+ The [event_date] field follows the
9898+ {{: https://techdocs.gbif.org/en/data-processing/temporal-interpretation}
9999+ Darwin Core temporal interpretation} convention (ISO 8601-1:2019).
100100+ It records when the observation was made, not when the label was
101101+ imported into the system. Supported formats:
102102+103103+ - Precise dates: ["2023-09-18"]
104104+ - Imprecise dates: ["2023-09"] or ["2023"]
105105+ - Date-times: ["2023-09-18T13:27:00Z"]
106106+ - Intervals: ["2023-09-05/2023-09-18"]
107107+108108+ {2 Deduplication is a derivation}
109109+110110+ Labels imported from multiple sources may refer to the same
111111+ real-world feature. Deduplication is modelled as a derivation:
112112+ find candidate matches (same Hilbert cell, class agreement,
113113+ temporal overlap), let an expert decide, then merge via
114114+ [Derived { sources = \[a; b\]; method_ = "manual-merge" }].
115115+ Both originals are kept in the document for full provenance.
116116+ The {!fingerprint} function produces coarse spatial+class keys
117117+ for efficient candidate matching. *)
118118+119119+(** {1 Coordinate Reference Systems} *)
120120+121121+(** A coordinate reference system identifier.
122122+123123+ Any string that {{: https://proj.org/} PROJ} can resolve:
124124+ EPSG codes (["EPSG:4326"]), WKT2 strings, or PROJ pipeline
125125+ definitions. The CRS determines the units and meaning of
126126+ all {!point} coordinates in the document. *)
127127+type crs = string
128128+129129+(** WGS 84 geographic coordinates (longitude, latitude in degrees).
130130+ The most common CRS for global datasets. *)
131131+val wgs84 : crs
132132+133133+(** Web Mercator (metres). Used by web mapping tile services. *)
134134+val web_mercator : crs
135135+136136+(** {1 Temporal} *)
137137+138138+(** A temporal extent following the Darwin Core
139139+ {{: https://techdocs.gbif.org/en/data-processing/temporal-interpretation}
140140+ [eventDate]} convention (ISO 8601-1:2019).
141141+142142+ This type is abstract — construct values with
143143+ {!event_date_of_string} and inspect with
144144+ {!string_of_event_date}. Valid forms:
145145+146146+ - Precise dates: ["2023-09-18"]
147147+ - Imprecise dates: ["2023-09"], ["2023"]
148148+ - Date-times: ["2023-09-18T13:27:00Z"]
149149+ - Intervals: ["2023-09-05/2023-09-18"]
150150+151151+ The abstraction boundary allows a future implementation to
152152+ parse and validate these forms, or to provide temporal
153153+ comparison and overlap queries. *)
154154+type event_date
155155+156156+(** Construct an {!event_date} from an ISO 8601 string.
157157+158158+ Currently accepts any string — validation is deferred to the
159159+ storage layer or a future version of this module. *)
160160+val event_date_of_string : string -> event_date
161161+162162+(** Convert an {!event_date} back to its ISO 8601 string
163163+ representation. *)
164164+val string_of_event_date : event_date -> string
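Since [event_date_of_string] currently accepts any string, a caller may want a cheap shape check before constructing a value. The following is a minimal sketch, not part of the module: the names [is_event_date], [is_instant], [is_date] and [is_time] are hypothetical, and it only covers the four documented forms with a trailing "Z" (numeric zone offsets and calendar validity such as days per month are deliberately ignored).

```ocaml
(* Hypothetical shape check for the documented eventDate forms.
   Assumption: only "YYYY", "YYYY-MM", "YYYY-MM-DD", an optional
   "THH:MM[:SS][Z]" suffix, and "start/end" intervals are handled. *)

let is_digits s =
  s <> "" && String.for_all (fun c -> c >= '0' && c <= '9') s

(* A fixed-width decimal field whose value lies in [lo, hi]. *)
let num ~len ~lo ~hi s =
  String.length s = len && is_digits s
  && (let v = int_of_string s in lo <= v && v <= hi)

let is_date s =
  match String.split_on_char '-' s with
  | [ y ] -> num ~len:4 ~lo:0 ~hi:9999 y
  | [ y; m ] -> num ~len:4 ~lo:0 ~hi:9999 y && num ~len:2 ~lo:1 ~hi:12 m
  | [ y; m; d ] ->
    num ~len:4 ~lo:0 ~hi:9999 y
    && num ~len:2 ~lo:1 ~hi:12 m
    && num ~len:2 ~lo:1 ~hi:31 d
  | _ -> false

let is_time s =
  (* Strip a trailing UTC designator if present. *)
  let s =
    if String.ends_with ~suffix:"Z" s
    then String.sub s 0 (String.length s - 1)
    else s
  in
  match String.split_on_char ':' s with
  | [ h; m ] -> num ~len:2 ~lo:0 ~hi:23 h && num ~len:2 ~lo:0 ~hi:59 m
  | [ h; m; sec ] ->
    num ~len:2 ~lo:0 ~hi:23 h
    && num ~len:2 ~lo:0 ~hi:59 m
    && num ~len:2 ~lo:0 ~hi:60 sec
  | _ -> false

(* A date, or a full date followed by a time of day. *)
let is_instant s =
  match String.split_on_char 'T' s with
  | [ d ] -> is_date d
  | [ d; t ] -> String.length d = 10 && is_date d && is_time t
  | _ -> false

(* A single instant or a "start/end" interval. *)
let is_event_date s =
  match String.split_on_char '/' s with
  | [ a ] -> is_instant a
  | [ a; b ] -> is_instant a && is_instant b
  | _ -> false
```

A caller could guard construction with [if is_event_date s then Some (event_date_of_string s) else None]; a real implementation would more likely sit behind the abstract type itself.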
165165+166166+(** {1 Spatial Indexing} *)
167167+168168+(** A Hilbert curve cell index — a hex-encoded spatial locality
169169+ key computed from a label's {!centroid} at the document's
170170+ Hilbert level.
171171+172172+ This type is abstract — construct values with
173173+ {!cell_of_string} and inspect with {!string_of_cell}.
174174+175175+ The Hilbert curve maps 2D coordinates to a 1D index that
176176+ preserves spatial locality. Nearby points in CRS space
177177+ get nearby cell values. See {!section:storagelayer} for
178178+ the level-to-resolution table.
179179+180180+ The abstraction boundary allows a future implementation to
181181+ enforce hex format, validate level-appropriate lengths, or
182182+ provide cell arithmetic (parent, children, neighbours). *)
183183+type cell
184184+185185+(** Construct a {!cell} from a hex string. *)
186186+val cell_of_string : string -> cell
187187+188188+(** Convert a {!cell} back to its hex string representation. *)
189189+val string_of_cell : cell -> string
190190+191191+(** {1 Geometry} *)
192192+193193+(** A point in the document's native CRS.
194194+195195+ For [EPSG:4326]: [x] is longitude (degrees east), [y] is
196196+ latitude (degrees north). For projected CRS (UTM, Web
197197+ Mercator): [x] is easting (metres), [y] is northing (metres). *)
198198+type point = { x : float; y : float }
199199+200200+(** A geometry in the document's native CRS.
201201+202202+ Follows {{: https://www.ogc.org/standard/sfa/} OGC Simple
203203+ Features / ISO 19125} with the subset needed for labelling:
204204+205205+ - {b Point}: a single coordinate pair.
206206+ - {b Polygon}: a closed exterior ring (the last point must
207207+ equal the first). Interior rings (holes) are not supported.
208208+ - {b Multi}: a heterogeneous collection of geometries,
209209+ corresponding to OGC GeometryCollection and the Multi*
210210+ types (MultiPoint, MultiPolygon). *)
211211+type geometry =
212212+ | Point of point
213213+ | Polygon of point list
214214+ | Multi of geometry list
215215+216216+(** Compute the representative point (centroid) of a geometry.
217217+218218+ - {b Point}: the point itself.
219219+ - {b Polygon}: arithmetic mean of the ring vertices.
220220+ - {b Multi}: centroid of the centroids of its members
221221+ (unweighted — sufficient for spatial indexing, not for
222222+ area-weighted geometric analysis).
223223+224224+ Used by the storage layer to compute Hilbert cell indices
225225+ (see {!section:storagelayer}). *)
226226+val centroid : geometry -> point
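The three centroid rules above can be sketched as follows. This is an illustration under stated assumptions rather than the module's actual implementation: it repeats the [point] and [geometry] types so the block stands alone, it assumes a closed ring repeats its first vertex and drops that duplicate before averaging, and it raises on empty rings and empty collections (the interface does not say what should happen there).

```ocaml
type point = { x : float; y : float }

type geometry =
  | Point of point
  | Polygon of point list
  | Multi of geometry list

(* Arithmetic mean of a non-empty list of points. *)
let mean_point pts =
  let n = float_of_int (List.length pts) in
  let sx, sy =
    List.fold_left (fun (sx, sy) p -> (sx +. p.x, sy +. p.y)) (0.0, 0.0) pts
  in
  { x = sx /. n; y = sy /. n }

(* Assumption: a closed ring ends with a copy of its first vertex;
   drop it so that vertex is not counted twice. *)
let open_ring ring =
  match ring, List.rev ring with
  | first :: _ :: _, last :: rest when first = last -> List.rev rest
  | _ -> ring

let rec centroid = function
  | Point p -> p
  | Polygon ring ->
    (match open_ring ring with
     | [] -> invalid_arg "centroid: empty ring"
     | pts -> mean_point pts)
  | Multi [] -> invalid_arg "centroid: empty collection"
  | Multi gs ->
    (* Unweighted mean of member centroids, as documented. *)
    mean_point (List.map centroid gs)
```

The vertex mean is not the area centroid of a polygon, which matches the documented intent: it only needs to pick a stable representative point for indexing.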
227227+228228+(** {1 Origin}
229229+230230+ How a label was produced. This is the only dimension along
231231+ which label metadata varies — confidence and classification
232232+ are universal properties independent of origin. *)
233233+234234+(** The origin of a label.
235235+236236+ {b Measured.} A direct observation by an observer (sensor,
237237+ human, or institution), possibly imported via an external
238238+ registry. Fields:
239239+240240+ - [observer]: URI identifying who or what made the observation.
241241+ Required for direct observations; optional for registry
242242+ imports where the original observer may be unknown.
243243+ - [via]: URI of the registry record, if imported from an
244244+ external platform. [None] for direct observations.
245245+ - [license]: SPDX license identifier (e.g. ["CC-BY-4.0"],
246246+ ["ODbL-1.0"]) for the imported data. [None] for direct
247247+ observations or when the license is unspecified.
248248+ - [accuracy_m]: positional uncertainty radius in metres.
249249+ Interpretation: the true position lies within [accuracy_m]
250250+ metres of the stated coordinates with high probability.
251251+ Maps to GBIF [coordinateUncertaintyInMeters] for imports.
252252+253253+ {b Derived.} A label computed from other labels. Fields:
254254+255255+ - [sources]: list of label IDs that were inputs to the
256256+ derivation. These must be valid label IDs within the same
257257+ document.
258258+ - [method_]: a short identifier for the derivation algorithm
259259+ (e.g. ["convex-hull"], ["buffer-10m"], ["manual-merge"]).
260260+261261+ Derived labels do not carry independent positional accuracy —
262262+ it propagates from the source labels via the derivation
263263+ method. The {!activity} record provides the full audit trail
264264+ (who ran the derivation, when, etc.).
265265+266266+ {b Simulated.} A label produced by a theoretical simulation
267267+ (agent-based models, population dynamics, climate projections,
268268+ etc.). Simulated labels augment training data or explore
269269+ counterfactual scenarios but {b must} remain identifiable as
270270+ synthetic — they do not represent real-world observations.
271271+272272+ - [model]: URI of the simulation model (typically a Fairground
273273+ notebook, e.g. ["fairground:notebook/lotka-volterra:v4"]).
274274+ - [run_id]: unique identifier for the simulation run, linking
275275+ all labels from the same execution. Combined with the
276276+ [activity] record, this gives full reproducibility: model
277277+ version, parameters, random seed.
278278+279279+ Simulated labels carry [confidence] reflecting the model's
280280+ estimated reliability, typically lower than measured data.
281281+ Derivation pipelines that consume simulated labels should
282282+ record the synthetic fraction in their [properties] for
283283+ transparency. *)
284284+type origin =
285285+ | Measured of {
286286+ observer : string option;
287287+ via : string option;
288288+ license : string option;
289289+ accuracy_m : float option;
290290+ }
291291+ | Derived of {
292292+ sources : string list;
293293+ method_ : string;
294294+ }
295295+ | Simulated of {
296296+ model : string;
297297+ run_id : string;
298298+ }
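The requirement that [sources] refer to label IDs in the same document is not enforced by the type. A minimal sketch of a check follows; [dangling_sources] and the cut-down [label] record are hypothetical and exist only to keep the example self-contained.

```ocaml
type origin =
  | Measured of {
      observer : string option;
      via : string option;
      license : string option;
      accuracy_m : float option;
    }
  | Derived of { sources : string list; method_ : string }
  | Simulated of { model : string; run_id : string }

(* Hypothetical cut-down label: only the fields this check needs. *)
type label = { id : string; origin : origin }

(* Returns (derived label ID, missing source ID) pairs for every source
   reference that no label in the list carries. *)
let dangling_sources labels =
  let known = Hashtbl.create 16 in
  List.iter (fun l -> Hashtbl.replace known l.id ()) labels;
  List.concat_map
    (fun l ->
      match l.origin with
      | Derived { sources; _ } ->
        List.filter_map
          (fun s -> if Hashtbl.mem known s then None else Some (l.id, s))
          sources
      | Measured _ | Simulated _ -> [])
    labels
```

An empty result means every derivation can be traced back to labels that are still present, which is what the "both originals are kept" rule for deduplication relies on.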
299299+300300+(** {1 Provenance} *)
301301+302302+(** An audit record for how labels were produced.
303303+304304+ An [activity] captures the "who" and "when" of label creation
305305+ or derivation. It complements {!origin}, which captures the
306306+ structural "what" and "from what".
307307+308308+ - [activity_id]: unique identifier for this activity.
309309+ - [agent]: URI or free-text identifier for the person, team,
310310+ or software that performed the activity.
311311+ - [date]: when the activity occurred, ISO 8601.
312312+ - [description]: optional free-text note on what was done.
313313+314314+ Labels reference activities via {!field-label.activity}.
315315+ Multiple labels may share the same activity (e.g. a batch
316316+ import or a bulk derivation). *)
317317+type activity = {
318318+ activity_id : string;
319319+ agent : string;
320320+ date : string;
321321+ description : string option;
322322+}
323323+324324+(** {1 Labels} *)
325325+326326+(** A geospatial label: a geometry in CRS space with classification,
327327+ origin, confidence, temporal extent, and extensible metadata.
328328+329329+ See the module-level documentation for the full design rationale
330330+ covering identity, spatial indexing, temporal conventions, and
331331+ deduplication.
332332+333333+ {b Fields.}
334334+335335+ - [cell]: Hilbert curve cell index (see {!cell}), computed from
336336+ {!centroid} at the document's {!field-document.level}. Encodes
337337+ spatial locality — labels in the same cell are geographically
338338+ close. Recomputed on reprojection; not part of the stable
339339+ identity.
340340+341341+ - [id]: unique identifier (e.g. UUID) within the document. Stable
342342+ across reprojections. Two independent imports of the same
343343+ real-world feature get different IDs; this is correct until
344344+ an expert merges them via derivation.
345345+346346+ - [geometry]: the label's spatial extent in the document's CRS.
347347+348348+ - [origin]: how the label was produced (see {!origin}).
349349+350350+ - [event_date]: when the observation was made (not when it was
351351+ imported). See {!event_date} for the Darwin Core temporal
352352+ convention and supported formats.
353353+354354+ - [confidence]: semantic confidence in the overall label,
355355+ in the range [\[0, 1\]]. Independent of positional accuracy.
356356+357357+ - [class_dist]: per-class probability distribution. A list of
358358+ [(class_name, probability)] pairs ordered by decreasing
359359+ probability. Should sum to 1.0. A definite classification
360360+ is a singleton: [\[("Panthera leo", 1.0)\]]. An unclassified
361361+ label has an empty list. See {!primary_class}.
362362+363363+ - [activity]: optional ID of the {!activity} record that
364364+ produced this label. Provides the full audit trail.
365365+366366+ - [properties]: extensible key-value metadata. Use for
367367+ domain-specific attributes that don't warrant a dedicated
368368+ field (e.g. [("gbif_dataset", "uk-nbn-atlas")],
369369+ [("observer_name", "Alice Smith")]). *)
370370+type label = {
371371+ cell : cell;
372372+ id : string;
373373+ geometry : geometry;
374374+ origin : origin;
375375+ event_date : event_date option;
376376+ confidence : float option;
377377+ class_dist : (string * float) list;
378378+ activity : string option;
379379+ properties : (string * string) list;
380380+}
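The [class_dist] conventions (probabilities in [0, 1], decreasing order, sum of 1.0, empty list for unclassified) are documented but not checked by the type. A possible validation helper is sketched below; [check_class_dist] and its [eps] tolerance are assumptions for illustration, not part of the interface.

```ocaml
(* Hypothetical validator for the documented class_dist conventions.
   Assumption: sums are compared to 1.0 within a tolerance [eps]. *)
let check_class_dist ?(eps = 1e-6) (dist : (string * float) list) =
  let probs = List.map snd dist in
  let in_range = List.for_all (fun p -> p >= 0.0 && p <= 1.0) probs in
  (* True when each probability is >= the one that follows it. *)
  let rec sorted = function
    | a :: (b :: _ as rest) -> a >= b && sorted rest
    | _ -> true
  in
  let total = List.fold_left ( +. ) 0.0 probs in
  if dist = [] then Ok () (* unclassified labels are valid *)
  else if not in_range then Error "probability outside [0, 1]"
  else if not (sorted probs) then Error "not ordered by decreasing probability"
  else if Float.abs (total -. 1.0) > eps then
    Error "probabilities do not sum to 1"
  else Ok ()
```

For example, [check_class_dist [("cropland", 0.9); ("grassland", 0.1)]] returns [Ok ()], while swapping the two entries reports the ordering problem.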
381381+382382+(** The full spatially-sortable name of a label.
383383+384384+ Returns [cell ^ "-" ^ id]. Sorting a collection of labels
385385+ by {!label_name} groups spatially nearby labels together,
386386+ because the Hilbert cell prefix preserves spatial locality. *)
387387+val label_name : label -> string
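To illustrate how the name behaves as a sort key, here is a small sketch using a cut-down record in which the cell is already a hex string. The record and the helpers [sorted_names] and [in_cell] are hypothetical. Note that lexicographic order only follows curve order when all cells have the same length, that is, when they were computed at the same level and zero-padded.

```ocaml
(* Hypothetical cut-down label: the cell is already rendered as hex. *)
type label = { cell : string; id : string }

let label_name l = l.cell ^ "-" ^ l.id

(* Plain lexicographic sort; labels sharing a cell end up contiguous. *)
let sorted_names labels =
  List.sort String.compare (List.map label_name labels)

(* Prefix scan: every name belonging to one cell. *)
let in_cell cell names =
  List.filter (String.starts_with ~prefix:(cell ^ "-")) names
```

On a sorted store the filter in [in_cell] would become a range scan rather than a full pass over the names.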
388388+389389+(** A free-text annotation anchored to one or more labels.
390390+391391+ Annotations provide commentary, corrections, or contextual
392392+ notes without modifying the labels themselves.
393393+394394+ - [id]: unique identifier for this annotation.
395395+ - [text]: the annotation content (free text).
396396+ - [anchors]: list of label IDs that this annotation refers to.
397397+ An annotation may span multiple labels (e.g. "these three
398398+ points are the same tree observed in different years"). *)
399399+type annotation = {
400400+ id : string;
401401+ text : string;
402402+ anchors : string list;
403403+}
404404+405405+(** A named group of labels.
406406+407407+ Groups organise labels into logical collections — a field
408408+ campaign, a seasonal survey, a thematic subset. They are
409409+ purely organisational and do not affect label semantics.
410410+411411+ - [id]: unique identifier for this group.
412412+ - [activity]: optional ID of the {!activity} that created or
413413+ curated this group.
414414+ - [members]: list of label IDs belonging to this group.
415415+ A label may belong to multiple groups. *)
416416+type group = {
417417+ id : string;
418418+ activity : string option;
419419+ members : string list;
420420+}
421421+422422+(** {1 Document}
423423+424424+ A document is the top-level container: a set of labels in a
425425+ common CRS, with provenance records, annotations, and groups. *)
426426+427427+(** A label store document.
428428+429429+ - [crs]: the coordinate reference system for all geometries.
430430+ - [level]: the Hilbert curve level used to compute
431431+ {!field-label.cell} values. All labels in a document use
432432+ the same level for consistent spatial resolution. See
433433+ {!section:storagelayer} for the level-to-resolution table.
434434+ - [provenance]: the list of {!activity} records referenced by
435435+ labels and groups.
436436+ - [labels]: the label collection.
437437+ - [annotations]: free-text annotations anchored to labels.
438438+ - [groups]: named subsets of labels. *)
439439+type document = {
440440+ crs : crs;
441441+ level : int;
442442+ provenance : activity list;
443443+ labels : label list;
444444+ annotations : annotation list;
445445+ groups : group list;
446446+}
447447+448448+(** {1 Constructors}
449449+450450+ Convenience functions that enforce common patterns. Direct
451451+ observations ([make_point], [make_polygon]) require an observer
452452+ URI. Registry imports ([make_imported]) require a registry URI
453453+ and accept an optional observer. Derivations ([make_derived])
454454+ require source label IDs and a method. Simulations ([make_simulated]) require a model URI and a run ID.
455455+456456+ All constructors require [~cell] (Hilbert cell, computed by
457457+ the storage layer) and [~id] (unique identifier).
458458+459459+ Classification is always via [~class_dist]. For a definite
460460+ class, pass [~class_dist:\[("Panthera leo", 1.0)\]]. *)
461461+462462+(** Construct a measured point label.
463463+464464+ Example:
465465+ {[
466466+ make_point
467467+ ~cell:(cell_of_string "a3f2") ~id:"7b1c9d"
468468+ ~x:0.1 ~y:52.2
469469+ ~observer:"urn:sensor:gps:trimble-r12-0042"
470470+ ~accuracy_m:0.02 ~confidence:0.99
471471+ ~class_dist:[("Quercus robur", 1.0)]
472472+ ~event_date:(event_date_of_string "2023-09-18T13:27:00Z") ()
473473+ ]} *)
474474+val make_point :
475475+ cell:cell -> id:string ->
476476+ x:float -> y:float ->
477477+ observer:string ->
478478+ ?accuracy_m:float ->
479479+ ?event_date:event_date -> ?confidence:float ->
480480+ ?class_dist:(string * float) list ->
481481+ ?activity:string ->
482482+ ?properties:(string * string) list ->
483483+ unit -> label
484484+485485+(** Construct a measured polygon label.
486486+487487+ Example:
488488+ {[
489489+ make_polygon
490490+ ~cell:(cell_of_string "a3f2") ~id:"e4a821"
491491+ ~ring:[{x=0.0;y=52.0}; {x=0.1;y=52.0};
492492+ {x=0.1;y=52.1}; {x=0.0;y=52.1};
493493+ {x=0.0;y=52.0}]
494494+ ~observer:"orcid:0000-0001-2345-6789"
495495+ ~class_dist:[("cropland", 0.9); ("grassland", 0.1)]
496496+ ~confidence:0.9
497497+ ~event_date:(event_date_of_string "2023-09") ()
498498+ ]} *)
499499+val make_polygon :
500500+ cell:cell -> id:string ->
501501+ ring:point list ->
502502+ observer:string ->
503503+ ?accuracy_m:float ->
504504+ ?event_date:event_date -> ?confidence:float ->
505505+ ?class_dist:(string * float) list ->
506506+ ?activity:string ->
507507+ ?properties:(string * string) list ->
508508+ unit -> label
509509+510510+(** Construct a label imported from an external registry.
511511+512512+ The [via] URI identifies the registry record. The [observer]
513513+ is optional — many registry records do not expose the original
514514+ collector. The [license] is the SPDX identifier for the
515515+ imported data.
516516+517517+ Example:
518518+ {[
519519+ make_imported
520520+ ~cell:(cell_of_string "a3f2") ~id:"8c1d3e"
521521+ ~geometry:(Point { x = 0.12; y = 52.21 })
522522+ ~via:"gbif:4023589127"
523523+ ~license:"CC-BY-4.0"
524524+ ~accuracy_m:100.0
525525+ ~class_dist:[("Quercus robur", 1.0)]
526526+ ~event_date:(event_date_of_string "2023")
527527+ ~properties:[("gbif_dataset", "uk-nbn-atlas")] ()
528528+ ]} *)
529529+val make_imported :
530530+ cell:cell -> id:string ->
531531+ geometry:geometry ->
532532+ via:string ->
533533+ ?observer:string -> ?license:string ->
534534+ ?accuracy_m:float ->
535535+ ?event_date:event_date -> ?confidence:float ->
536536+ ?class_dist:(string * float) list ->
537537+ ?activity:string ->
538538+ ?properties:(string * string) list ->
539539+ unit -> label
540540+541541+(** Construct a derived label.
542542+543543+ {b Deduplication merges} are a special case of derivation:
544544+545545+ {[
546546+ make_derived
547547+ ~cell:(cell_of_string "a3f2") ~id:"merged01"
548548+ ~geometry:(Point { x = 0.11; y = 52.205 })
549549+ ~sources:["7b1c9d"; "8c1d3e"]
550550+ ~method_:"manual-merge"
551551+ ~class_dist:[("Quercus robur", 1.0)]
552552+ ~confidence:0.95 ()
553553+ ]} *)
554554+val make_derived :
555555+ cell:cell -> id:string ->
556556+ geometry:geometry ->
557557+ sources:string list ->
558558+ method_:string ->
559559+ ?event_date:event_date -> ?confidence:float ->
560560+ ?class_dist:(string * float) list ->
561561+ ?activity:string ->
562562+ ?properties:(string * string) list ->
563563+ unit -> label
564564+565565+(** Construct a simulated label.
566566+567567+ A label produced by a theoretical simulation (population
568568+ dynamics, agent-based model, climate projection, etc.).
569569+ The [model] URI identifies the simulation code; [run_id]
570570+ links all labels from the same execution.
571571+572572+ Example — Lotka-Volterra population model:
573573+ {[
574574+ make_simulated
575575+ ~cell:(cell_of_string "b7d") ~id:"sim-001"
576576+ ~geometry:(Point { x = 35.20; y = -2.50 })
577577+ ~model:"fairground:notebook/lotka-volterra-serengeti:v4"
578578+ ~run_id:"lv-run-42"
579579+ ~class_dist:[("Panthera leo", 1.0)]
580580+ ~confidence:0.60
581581+ ~event_date:(event_date_of_string "2024-06-15")
582582+ ~properties:[("scenario", "baseline"); ("seed", "42")] ()
583583+ ]} *)
584584+val make_simulated :
585585+ cell:cell -> id:string ->
586586+ geometry:geometry ->
587587+ model:string ->
588588+ run_id:string ->
589589+ ?event_date:event_date -> ?confidence:float ->
590590+ ?class_dist:(string * float) list ->
591591+ ?activity:string ->
592592+ ?properties:(string * string) list ->
593593+ unit -> label
594594+595595+(** {1 Accessors} *)
596596+597597+(** The most likely class from {!field-label.class_dist}.
598598+599599+ Returns [Some class_name] for the highest-probability entry,
600600+ or [None] if [class_dist] is empty (unclassified label). *)
601601+val primary_class : label -> string option
602602+603603+(** Positional accuracy in metres, if this is a measured label.
604604+605605+ Returns [Some metres] for measured labels with a stated
606606+ accuracy, [None] for measured labels without one and for derived and simulated labels. *)
607607+val accuracy_of : label -> float option
608608+609609+(** Source label IDs, if this is a derived label.
610610+611611+ Returns the list of label IDs that were inputs to the
612612+ derivation. Returns [\[\]] for measured and simulated
613613+ labels. *)
614614+val sources_of : label -> string list
615615+616616+(** Registry URI, if this label was imported via a registry.
617617+618618+ Returns [Some uri] for labels imported from GBIF, iNaturalist,
619619+ OSM, etc. Returns [None] for direct observations, derived
620620+ labels, and simulated labels. *)
621621+val via_of : label -> string option
622622+623623+(** Is this label synthetic (produced by a simulation)?
624624+625625+ Returns [true] for simulated labels, [false] for measured
626626+ and derived labels. Use this to filter synthetic data out
627627+ of analyses that must reflect real-world observations only
628628+ (e.g. species range computation). *)
629629+val is_simulated : label -> bool
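The origin documentation suggests that pipelines consuming simulated labels record the synthetic fraction in [properties]. One way that might look is sketched below. The simplified [origin] (payloads omitted), the helper names, and the ["synthetic_fraction"] property key are all assumptions made for the example.

```ocaml
(* Simplified origin: constructor payloads are omitted for brevity. *)
type origin = Measured | Derived | Simulated

(* Hypothetical cut-down label. *)
type label = { id : string; origin : origin }

let is_simulated l =
  match l.origin with
  | Simulated -> true
  | Measured | Derived -> false

(* Share of the input labels that are synthetic; 0.0 for no inputs. *)
let synthetic_fraction labels =
  match labels with
  | [] -> 0.0
  | _ ->
    let n_sim = List.length (List.filter is_simulated labels) in
    float_of_int n_sim /. float_of_int (List.length labels)

(* Prepend the fraction to a properties list under a hypothetical key. *)
let with_synthetic_fraction labels properties =
  ("synthetic_fraction", Printf.sprintf "%.3f" (synthetic_fraction labels))
  :: properties

(* Keep only labels that reflect real-world observations. *)
let observed_only labels = List.filter (fun l -> not (is_simulated l)) labels
```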
630630+631631+(** {1 Fingerprinting}
632632+633633+ A fingerprint is a coarse key for finding deduplication
634634+ candidates. It combines the Hilbert cell (spatial locality)
635635+ with the primary class from {!field-label.class_dist}.
636636+637637+ Two labels with the same fingerprint are worth comparing for
638638+ potential deduplication. Different fingerprints only show that the labels
639639+ lie in different cells or differ in primary class; neighbours across a cell boundary can still be duplicates.
640640+641641+ The event date is deliberately excluded: the same real-world
642642+ feature observed at different times should still match as a
643643+ candidate, so a human reviewer can decide whether they are
644644+ the same feature. *)
645645+646646+(** Compute the fingerprint of a label.
647647+648648+ Returns [cell ^ "|" ^ primary_class], where [primary_class]
649649+ defaults to ["_"] if [class_dist] is empty. *)
650650+val fingerprint : label -> string
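A sketch of how fingerprints might be used to shortlist deduplication candidates follows. It re-declares a cut-down label (cell as a plain string) so the block stands alone; [dedup_candidates] is a hypothetical helper, and the bucketing via [Hashtbl] is just one straightforward choice.

```ocaml
(* Hypothetical cut-down label: the cell is already rendered as hex. *)
type label = {
  cell : string;
  id : string;
  class_dist : (string * float) list;
}

(* Highest-probability class, or None for an unclassified label. *)
let primary_class l =
  match l.class_dist with
  | [] -> None
  | first :: rest ->
    Some
      (fst
         (List.fold_left
            (fun (bc, bp) (c, p) -> if p > bp then (c, p) else (bc, bp))
            first rest))

let fingerprint l =
  l.cell ^ "|" ^ Option.value ~default:"_" (primary_class l)

(* Group label IDs by fingerprint and keep buckets with more than one
   member: these are the pairs worth showing to a reviewer. *)
let dedup_candidates labels =
  let buckets = Hashtbl.create 16 in
  List.iter
    (fun l ->
      let key = fingerprint l in
      let prev = Option.value ~default:[] (Hashtbl.find_opt buckets key) in
      Hashtbl.replace buckets key (l.id :: prev))
    labels;
  Hashtbl.fold
    (fun key ids acc ->
      if List.length ids > 1 then (key, List.rev ids) :: acc else acc)
    buckets []
```

Because the key is exact, near-duplicates that straddle a cell boundary land in different buckets; a fuller candidate search would also compare against neighbouring cells.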
651651+652652+(** {1 Document Construction} *)
653653+654654+(** Create an empty document in the given CRS.
655655+656656+ @param level Hilbert curve level for spatial cell computation.
657657+ Defaults to [12], which gives cells about 0.088° wide (~9.8 km at the equator) for EPSG:4326.
658658+ See {!section:storagelayer} for the full level-to-resolution
659659+ table. *)
660660+val empty_document : crs:crs -> ?level:int -> unit -> document
661661+662662+(** {1:storagelayer Storage Layer}
663663+664664+ The data model above is independent of how labels are stored
665665+ and indexed. This section specifies the contract between the
666666+ core types and a storage backend.
667667+668668+ {2 Hilbert Cell Computation}
669669+670670+ The {!field-label.cell} field on each label is a hex-encoded
671671+ Hilbert curve cell index. The storage layer computes it from
672672+ the label's {!centroid} at the document's
673673+ {!field-document.level}.
674674+675675+ The Hilbert curve maps 2D coordinates to a 1D index that
676676+ preserves spatial locality — nearby points in 2D space map to
677677+ nearby positions on the curve. This is the key property that
678678+ makes sorted-key spatial clustering work.
679679+680680+ Level [n] divides each CRS axis into [2{^n}] cells. For
681681+ EPSG:4326 (degrees):
682682+683683+ {v
684684+ Level   Cell width (longitude)         Hex chars
685685+ ─────   ────────────────────────────   ─────────
686686+ 8       ~1.4° (~157 km at equator)     4
687687+ 12      ~0.088° (~9.8 km)              6
688688+ 16      ~0.0055° (~610 m)              8
689689+ 20      ~0.00034° (~38 m)              10
690690+ v}
691691+692692+ The storage layer must provide a function with this signature:
693693+694694+ {[
695695+ val hilbert_cell : level:int -> crs:crs -> point -> cell
696696+ ]}
697697+698698+ The {!centroid} function provides the representative point for
699699+ any geometry.
700700+701701+ {2 Why Hilbert, not Geohash}
702702+703703+ Geohash uses a Z-order (Morton) curve. Z-order curves have
704704+ discontinuities at certain cell boundaries: two points that
705705+ are close in 2D space can receive very different hash values
706706+ when they fall on opposite sides of a major subdivision.
707707+708708+ The Hilbert curve has no such jumps: consecutive cells on the
709709+ curve are {i always} spatially adjacent. The converse does not hold
710710+ (spatial neighbours can still be far apart on the curve), but clustering
711711+ is more uniform and proximity queries miss fewer edge cases.
712712+713713+ {2 Reprojection}
714714+715715+ When a document's CRS changes, all {!field-label.cell} values
716716+ must be recomputed from the (reprojected) geometries. The
717717+ {!field-label.id} fields remain stable — identity is
718718+ independent of coordinate system.
719719+720720+ {2 Sorted Keys}
721721+722722+ Concatenating [cell ^ "-" ^ id] (see {!label_name}) produces
723723+ a key that sorts spatially. Any system that maintains sorted
724724+ order (B-tree, LSM tree, lexicographic file listing) gets
725725+ spatial clustering for free: a prefix scan on a cell value
726726+ retrieves all labels in that spatial neighbourhood. *)
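The storage-layer contract leaves [hilbert_cell] to the backend. The block below is a rough sketch of what such a function could look like for EPSG:4326 only, using the standard iterative xy-to-distance Hilbert conversion. Several things here are assumptions: the names [hilbert_index], [grid_coord] and [hilbert_cell_wgs84], the grid extent (longitude -180..180, latitude -90..90), returning a plain string instead of the abstract [cell], and the zero-padding. It takes "level n divides each axis into 2^n cells" literally, so the index has 2n bits.

```ocaml
type point = { x : float; y : float }

(* Distance along a Hilbert curve of the given level for grid
   coordinates (x, y), each in [0, 2^level). *)
let hilbert_index ~level x y =
  let n = 1 lsl level in
  let rec go s x y d =
    if s = 0 then d
    else
      let rx = if x land s > 0 then 1 else 0 in
      let ry = if y land s > 0 then 1 else 0 in
      let d = d + (s * s * ((3 * rx) lxor ry)) in
      (* Rotate/flip the quadrant so the sub-curve has a standard
         orientation before descending one level. *)
      let x, y =
        if ry = 0 then
          let x, y = if rx = 1 then (n - 1 - x, n - 1 - y) else (x, y) in
          (y, x)
        else (x, y)
      in
      go (s / 2) x y d
  in
  go (n / 2) x y 0

(* Map a coordinate in [lo, hi] to a grid column/row, clamped. *)
let grid_coord ~level ~lo ~hi v =
  let n = 1 lsl level in
  let t = (v -. lo) /. (hi -. lo) in
  max 0 (min (n - 1) (int_of_float (t *. float_of_int n)))

(* Assumption: EPSG:4326 with x = longitude, y = latitude. *)
let hilbert_cell_wgs84 ~level p =
  let gx = grid_coord ~level ~lo:(-180.0) ~hi:180.0 p.x in
  let gy = grid_coord ~level ~lo:(-90.0) ~hi:90.0 p.y in
  let hex = Printf.sprintf "%x" (hilbert_index ~level gx gy) in
  (* 2 * level bits fit in ceil(level / 2) hex digits; pad with zeros
     so lexicographic order matches numeric order. *)
  let width = (level + 1) / 2 in
  String.make (max 0 (width - String.length hex)) '0' ^ hex
```

A backend for projected CRSs would need its own extent, and reprojection would mean calling the function again on each label's reprojected centroid.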
+28
terradots.opam
···11+# This file is generated by dune, edit dune-project instead
22+opam-version: "2.0"
33+synopsis: "Geospatial label store for planetary observation data"
44+description: """
55+A data model for geospatial labels — human observations, registry
66+ imports, simulation outputs, and derived annotations used to train
77+ geospatial foundation models. Supports full provenance tracking,
88+ Hilbert curve spatial indexing, and Darwin Core temporal conventions."""
99+license: "ISC"
1010+depends: [
1111+ "dune" {>= "3.16"}
1212+ "ocaml" {>= "5.2"}
1313+ "odoc" {with-doc}
1414+]
1515+build: [
1616+ ["dune" "subst"] {dev}
1717+ [
1818+ "dune"
1919+ "build"
2020+ "-p"
2121+ name
2222+ "-j"
2323+ jobs
2424+ "@install"
2525+ "@runtest" {with-test}
2626+ "@doc" {with-doc}
2727+ ]
2828+]