···5757 - Rewarding dataset creators according to the usefulness of their datasets. E.g: [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for its GPT LLMs. Not sure how well CommonCrawl was compensated.
5858- **Open Source and Decentralized**. Datasets should be stored in multiple places.
5959 - Don't create yet another standard. Provide a way for people to integrate current indexers. Work on _adapters_ for different dataset sources. Similar to:
6060- - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
6161- - [Trustfall](https://github.com/obi1kenobi/trustfall).
6262- - Open source data integration projects like [Airbyte](https://airbyte.com/). They can be used to build open data connectors, making it possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
6363- - Adapters are created by the community so data becomes connected.
6060+ - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
6161+ - [Trustfall](https://github.com/obi1kenobi/trustfall).
6262+ - Open source data integration projects like [Airbyte](https://airbyte.com/). They can be used to build open data connectors, making it possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
6363+ - Adapters are created by the community so data becomes connected.
6464 - Integrate with the modern data stack to avoid reinventing the wheel.
6565 - Decentralize the computation (run it where the data lives) and then cache copies of the results (or aggregations) in CDNs. Most queries only need to read a small amount of data and are going to be similar to each other.
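
The adapter idea above can be sketched roughly as follows. This is a minimal illustration, not an existing API: `Adapter`, `CsvTextAdapter`, `RecordsAdapter` and `replicate` are hypothetical names, and a plain Python list stands in for "any destination".

```python
import csv
import io
from typing import Iterable, Iterator, Protocol


class Adapter(Protocol):
    """Hypothetical interface: anything that can yield rows as dicts."""

    def rows(self) -> Iterator[dict]: ...


class CsvTextAdapter:
    """Adapter over CSV text (stand-in for a spreadsheet or a URL body)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[dict]:
        # DictReader uses the first line as column names.
        yield from csv.DictReader(io.StringIO(self.text))


class RecordsAdapter:
    """Adapter over in-memory records (stand-in for an API or an index)."""

    def __init__(self, records: Iterable[dict]) -> None:
        self.records = list(records)

    def rows(self) -> Iterator[dict]:
        yield from self.records


def replicate(source: Adapter, destination: list) -> int:
    """Copy every row from the source into the destination; return the row count."""
    count = 0
    for row in source.rows():
        destination.append(dict(row))
        count += 1
    return count


destination: list = []
replicate(CsvTextAdapter("id,name\n1,alpha\n2,beta\n"), destination)
replicate(RecordsAdapter([{"id": "3", "name": "gamma"}]), destination)
print(destination)
```

The point of the shared `rows()` method is that a community-written adapter for a new source only has to implement that one method to plug into every existing destination.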
6666···7777 - Datasets could be linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with description, default visualizations, WASM linked code...
7878 - One repository, one dataset or catalog/hub.
7979 - To avoid yet another open dataset portal, build adapters to integrate with other indexes.
8080- - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
8181- - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
8080+ - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
8181+ - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
8282- **Formatting**. Datasets should be saved and exposed in multiple formats (CSV, Parquet, ...). Could be done via WASM transformations or on the fly when pulling data. The package manager should be **format and storage agnostic**.
8383- **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
8484 - Importing datasets. Making it possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR).
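
The JSON-LD idea mentioned above could look something like the sketch below. It assumes the `datapackage.yml` has already been parsed into a dict with Frictionless-style fields (`name`, `description`, `resources` with `path` and `format`); that layout is an assumption, not something these notes define. `to_jsonld` is a hypothetical helper, and the URL is a placeholder. The output uses the schema.org `Dataset` and `DataDownload` types that Google Dataset Search reads.

```python
import json


def to_jsonld(package: dict) -> dict:
    """Map a datapackage-style metadata dict to a schema.org Dataset record."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": package["name"],
        "description": package.get("description", ""),
    }
    # Each resource (e.g. an existing Parquet file) becomes one download entry.
    distributions = [
        {
            "@type": "DataDownload",
            "encodingFormat": resource["format"],
            "contentUrl": resource["path"],
        }
        for resource in package.get("resources", [])
    ]
    if distributions:
        record["distribution"] = distributions
    return record


# Placeholder metadata, for illustration only.
package = {
    "name": "example-dataset",
    "description": "Example dataset reusing an already published Parquet file.",
    "resources": [
        {"path": "https://example.org/data.parquet", "format": "parquet"},
    ],
}

print(json.dumps(to_jsonld(package), indent=2))
```

The resulting JSON would be embedded in the dataset's page inside a `<script type="application/ld+json">` tag so crawlers can pick it up.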
···224224- [Datahub](https://datahub.io/)
225225- [Open Data Services](https://opendataservices.coop)
226226- [Catalyst Cooperative](https://catalyst.coop/)
227227+- [Carbon Plan](https://github.com/carbonplan)
227228- [Data is Plural](https://github.com/data-is-plural)
228229- [Data Liberation Project](https://github.com/data-liberation-project)
229230