:art: · davidgasquez.com/handbook@7b9b33a

+7 -6

1 changed file

Open Data.md

···
57 57 - Rewarding the datasets creators according to the usefulness. E.g: [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for their GPTs LLMs. Not sure how well CommonCrawl was compensated.
58 58 - **Open Source and Decentralized**. Datasets should be stored in multiple places.
59 59 - Don't create yet another standard. Provide a way for people to integrate current indexers. Work on _adapters_ for different datasets sources. Similar to:
60 - - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
61 - - [Trustfall](https://github.com/obi1kenobi/trustfall).
62 - - Open source data integration projects like [Airbyte](https://airbyte.com/). They can used to build open data connectors making possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
63 - - Adapters are created by the community so data becomes connected.
60 + - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
61 + - [Trustfall](https://github.com/obi1kenobi/trustfall).
62 + - Open source data integration projects like [Airbyte](https://airbyte.com/). They can used to build open data connectors making possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
63 + - Adapters are created by the community so data becomes connected.
64 64 - Integrate with the modern data stack to avoid reinventing the wheel.
65 65 - Decentralized the computation (where data lives) and then cache copies of the results (or aggregations) in CDNs. Most queries require only reading a small amount of data and going to be similar.
66 66
···
77 77 - Datasets could be linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with description, default visualizations, WASM linked code...
78 78 - One repository, one dataset or catalog/hub.
79 79 - To avoid yet another open dataset portal, build adapters to integrate with other indexes.
80 - - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
81 - - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
80 + - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
81 + - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
82 82 - **Formatting**. Datasets should be saved and exposed in multiple formats (CSV, Parquet, ...). Could be done via WASM transformations or in the fly when pulling data. The package manager should be **format and storage agnostic**.
83 83 - **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
84 84 - Importing datasets. Making possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR).
···
224 224 - [Datahub](https://datahub.io/)
225 225 - [Open Data Services](https://opendataservices.coop)
226 226 - [Catalyst Cooperative](https://catalyst.coop/)
227 + - [Carbon Plan](https://github.com/carbonplan)
227 228 - [Data is Plural](https://github.com/data-is-plural)
228 229 - [Data Liberation Project](https://github.com/data-liberation-project)
229 230
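The re-indented bullet at line 81 of Open Data.md suggests exposing JSON-LD so Google Dataset Search can index a dataset. As a rough sketch of what such an adapter might look like, the snippet below maps a descriptor to a schema.org `Dataset` object. The handbook does not specify the layout of `datapackage.yml`, so the input fields (`name`, `title`, `description`, `resources[].path`, `resources[].format`) are an assumption borrowed from the Data Package convention, and the function name, dataset name, and URL are hypothetical.

```python
import json


def datapackage_to_jsonld(package: dict) -> dict:
    """Map a Data Package-style descriptor to a schema.org Dataset dict.

    The input field names are an assumption, not the handbook's spec.
    """
    distribution = []
    for resource in package.get("resources", []):
        download = {"@type": "DataDownload", "contentUrl": resource["path"]}
        # Only emit the format when the descriptor declares one.
        if "format" in resource:
            download["encodingFormat"] = resource["format"]
        distribution.append(download)

    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": package.get("title", package["name"]),
        "description": package.get("description", ""),
        "distribution": distribution,
    }


# Hypothetical descriptor; the name and URL are placeholders.
example_package = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "description": "A placeholder dataset used to illustrate the mapping.",
    "resources": [
        {"path": "https://example.org/data/example.parquet", "format": "parquet"}
    ],
}

jsonld = datapackage_to_jsonld(example_package)
# Serialized form, ready to embed in a JSON-LD script tag on the dataset page.
print(json.dumps(jsonld, indent=2))
```

Keeping the mapping as a pure function over a plain dict means the same adapter could be reused whether the descriptor is parsed from YAML, JSON, or generated on the fly.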
