feat: 📝 Elaborate on principles and ecosystem in data and open data docs

+6 -3

2 changed files



+2 -2

Data/Data Engineering.md

···
  ### Basic Principles

- - **Simplicity**: Each steps is easy to understand and modify. Rely on immutable data. Write only. No deletes. No updates.
+ - **Simplicity**: Each step is easy to understand and modify. Rely on immutable data. Write only. No deletes. No updates. Avoid having too much "state". Hosting static files on S3 involves much less friction and maintenance than running a server somewhere to serve an API.
  - **Reliability**: Errors in the pipelines can be recovered. Pipelines are monitored and tested. Data is saved in each step (storage is cheap) so it can be used later if needed. For example, adding a new column to a table can be done by extracting the column from the intermediary data without having to query the data source. It is better to support 1 feature that works reliably and has a great UX than 2 that are unreliable or hard to use. One solid step is better than 2 finicky ones.
  - **[[Modularity]]**: Steps are independent, declarative, and [[Idempotence|idempotent]]. This makes pipelines composable.
  - **Consistency**: Same conventions and design patterns across pipelines. If a failure is actionable by the user, clearly let them know what they can do. Schema on write.
- - **Efficiency**: Low event latency when needed. Easy to scale up and down. A user should not be able to configure something that will not work.
+ - **Efficiency**: Low event latency when needed. Easy to scale up and down. A user should not be able to configure something that will not work. Don't mix heterogeneous workloads under the same tooling (e.g., big data warehouses that run simple queries 95% of the time and one big batch once a day).
  - **Flexibility**: Steps change to conform to data points. Changes don't stop the pipeline or lose data. Fail fast and upstream.

  ### Data Flow
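
The write-only, idempotent steps described in the principles above could be sketched roughly as follows. This is only an illustration, not code from the docs being changed: `run_step`, `records`, and `out_dir` are hypothetical names, and naming the output file after the SHA-256 of its content is an assumption made here to get idempotence cheaply.

```python
import hashlib
import json
from pathlib import Path


def run_step(records: list, out_dir: Path) -> Path:
    """Persist a step's output without ever updating or deleting files.

    The file name is the SHA-256 of the serialized content (an assumption of
    this sketch), so re-running the step with the same input writes nothing
    new, and outputs of earlier runs remain available for later steps.
    """
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    out_path = out_dir / f"{digest}.json"
    if not out_path.exists():
        # Write only: a new file is created, existing files are left alone.
        out_dir.mkdir(parents=True, exist_ok=True)
        out_path.write_bytes(payload)
    return out_path
```

Calling `run_step` twice with the same records returns the same path and leaves a single file on disk; different records produce a second file next to the first rather than replacing it.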

+4 -1

Open Data.md

···
    - Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, reports, ...
  - **Reproducible and Verifiable**. People should be able to trust the final datasets without having to recompute everything from scratch. In "reality", events are immutable, data should be too. [Make datasets the center of the tooling](https://dagster.io/blog/software-defined-assets).
    - With immutability and content addressing, you can move backwards in time and run transformations or queries on how the dataset was at a certain point in time.
-   - [Datasets are books, not houses]()!
+   - [Datasets are books, not houses](https://medium.com/qri-io/datasets-are-books-not-houses-760bd4736229)!
  - **Permissionless**. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability. On data.
  - **Aligned Incentives**. Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly (e.g., [number of contributors to Dune](https://github.com/duneanalytics/spellbook/commits/main)).
    - [Bounties](https://www.dolthub.com/bounties) could be created to reward people who add useful but missing datasets.
    - Surfacing and creating great datasets could be rewarded (retroactively or with bounties).
    - Curating the data provides compounding benefits for the entire community!
    - Rewarding dataset creators according to usefulness. E.g., [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for their GPT LLMs. Not sure how well CommonCrawl was compensated.
+   - Governments need to be required to use their own open data. This should create a feedback loop and have them improve the quality and freshness of the data.
+     That forces them to keep up the quality and freshness.
  - **Open Source and Decentralized**. Datasets should be stored in multiple places.
    - Don't create yet another standard. Provide a way for people to integrate current indexers. Work on _adapters_ for different dataset sources. Similar to:
      - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
···
  - **Versioning**. Should be able to manage _diffs_ and _incremental changes_ in a smart way. E.g., only storing the newly added rows or updated columns.
    - Should allow [automated harvesting of new data](https://tech.datopian.com/harvesting/) with sensors (external functions) or scheduled jobs.
    - Each version is referenced by a hash. Git style.
+   - Each version is linked to the code that produced it.
  - **Smart**. Use appropriate protocols for storing the data. E.g., rows/columns shouldn't be duplicated if they don't change.
    - Think at the dataset level and not the file level.
    - Tabular data could be partitioned to make it easier for future retrieval.
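
One possible reading of the versioning bullets above (each version referenced by a hash and linked to the code that produced it) is a small immutable record per version. The sketch below is an assumption for illustration only: `DatasetVersion`, `make_version`, and the `code_ref` field (imagined here as a commit hash) are hypothetical names, not part of any tool mentioned in the docs.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    content_hash: str       # SHA-256 of the dataset bytes, used as the version id
    code_ref: str           # hypothetical: commit hash of the code that produced the data
    parent: Optional[str]   # content hash of the previous version, if any


def make_version(
    data: bytes, code_ref: str, parent: Optional[DatasetVersion] = None
) -> DatasetVersion:
    """Build an immutable version record addressed by the hash of its content."""
    return DatasetVersion(
        content_hash=hashlib.sha256(data).hexdigest(),
        code_ref=code_ref,
        parent=parent.content_hash if parent else None,
    )
```

Because the identifier is derived from the content, the same bytes produced by the same code always yield the same record, and following `parent` hashes walks back through earlier states of the dataset.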
