···11# Cooperatives
2233- A cooperative is an autonomous association of persons united voluntarily to meet their common economic, social and cultural needs and aspirations through a jointly- owned and democratically- controlled enterprise.
44+- There are [many ways to form cooperatives](https://institute.coop/sites/default/files/resources/356%202009_Johnson%20etal_Tech%20Freelancers%20Guide%20to%20Worker%20Co-ops.pdf).
4556## Resources
67
+2
Data/Data Engineering.md
···66 - **Reliability:** how well can systems recover from outages and incidents?
77 - **Speed of execution:** how quickly can you get a new data source up and running?
88- If it can be solved with SQL, stick to SQL.
99+ - SQL will be the abstraction layer in streaming too so you don't have to care about incremental materialization or timely dataflows.
910- A [consistent pattern](https://www.startdataengineering.com/post/design-patterns/) across your data pipelines helps devs communicate easily and understand code better.
1111+- Data Engineering can learn from decentralized systems ideas like Content Addressed Data, Immutability, and [[Idempotence]].
10121113## Data Pipelines
1214
+20-3
Future.md
···11-# Things that Might Look Weird in the Future
11+# Future
22+33+## Things that Might Look Weird in the Future
2435History teaches us that in 100 years from now [[Openness|some of the assumptions we believed will turn out to be wrong]]. A good question to ask is "What might we be wrong about today?". These are a few things that future humans might see as weird behavior:
46···79- Give birth without advanced assistance.
810- Not caring for all the [animal suffering in the wild](https://longtermrisk.org/the-importance-of-wild-animal-suffering/).
911 - Nature is not safe! The default is suffering. The current mentality is that nature is good and disruptions from nature are bad.
1010-- The ignorance of Social Media and its [full impact on society](https://twitter.com/M_B_Petersen/status/1483457679800651787).
1212+- The ignorance of Social Media and its [full impact on s_o_ciety](https://twitter.com/M_B_Petersen/status/1483457679800651787).
1113 - Is "being bad for society" an emergent property of social networks as they grow?
1212-- Voting Systems and not using more Prediction Markets in public.
1414+- Current Voting Systems.
1515+- Not relying more on tools like Prediction Markets.
1616+1717+## Predictions
1818+1319- More experimentation around [[Politics|governance]]:
1420 - Charter cities.
1521 - [Holacracy](https://en.m.wikipedia.org/wiki/Holacracy).
···2026 - More interactive explanations like the ones the awesome [Nicky Case](https://ncase.me/) do!
2127- More concern around systems with weird incentives causing large amount of pain (Moloch).
2228- Work valuation changes (plumbing more expensive than some software development) due to [Moravec's paradox](https://en.wikipedia.org/wiki/Moravec%27s_paradox). We will automate making a full app before a robot is able to master physical arms and legs like a 5 year old.
2929+- Open data will be more important as it can produce better models and help coordinate people by providing shared context.
3030+- The current decentralized protocols (IPFS, ActivityPub, ...) need to evolve more, especially around UX. People don't care about decentralization; they care about UX.
3131+3232+### Exciting Software Engineering Ideas
3333+3434+- Content Addressed Data + Immutability
3535+- CRDTs
3636+- Homomorphic Encryption
3737+- Prolly/Merkle Trees
3838+- Differential/Timely Dataflow
3939+- Zero-Knowledge Proofs
+140-112
Open Data.md
···11# Open Data
22-33-> Bring Open Data to the level of Open Source.
44-> Make Open Data compatible with the modern data ecosystem (tooling, approaches, ...).
22+_Make Open Data compatible with the Modern Data Ecosystem_.
5364## Motivation
75···20182119Open protocols create open systems. Open code creates tools. **Open data creates open knowledge**. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets.
22202323-Iterative improvements over public datasets would yield large amounts of value ([Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions. Open Source code has made a huge impact in the world. Let's make Open Data do the same! [Anyone should be able to fork and re-publish fixed, cleaned, reformatted datasets as easily as people fork code](https://juan.benet.ai/blog/2014-02-21-data-management-problems/).
2121+Iterative improvements over public datasets yield large amounts of value ([check how Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions.
2222+2323+Open Source code has made a huge impact in the world. Let's make Open Data do the same! Let's make it possible for [anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code](https://juan.benet.ai/blog/2014-02-21-data-management-problems/).
24242525### Why Now?
26262727-We have cheaper storage, better compute, and more data. We need to improve our workflows now. How does a world where people collaborate on datasets looks like?
2727+We have better and cheaper infrastructure. That includes things like faster storage, better compute, and larger amounts of data. We need to improve our data workflows now. What does a world where people collaborate on datasets look like? [The data is there. We just need to use it](https://twitter.com/auren/status/1509340748054945794).
28282929-During the last few years, a Cambrian explosion of open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets).
2929+During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets).
30303131-These trends have already quick-started movements like [DeSci](https://ethereum.org/en/desci/) but we still need more tooling around data to make interoperability possible. **We should use the same modern tooling companies are using to manage open datasets**. A sort of [Data Operating system](https://data-operating-system.com/). Having better data will create better and more accessible AI models ([people are working on this](https://github.com/togethercomputer/OpenDataHub)).
3131+These trends are already making their way towards movements like [DeSci](https://ethereum.org/en/desci/) or smaller projects like [Py-Code Datasets](https://py-code.org/datasets). But we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. **We should use tooling and approaches similar to the ones companies are using to manage the open datasets that surround us**. A sort of [Data Operating system](https://data-operating-system.com/).
32323333Data wrangling is a perpetual maintenance commitment, taking a lot of ongoing attention and resources. [Better and modern data tooling can reduce these costs](https://github.com/catalyst-cooperative/pudl).
34343535-Organizations like [Our World in Data](https://ourworldindata.org/) or [538](https://fivethirtyeight.com/) provide useful analysis but have to deal with _dataset management_. They end up building custom tools around their workflows. That works, but limits the potential of these datasets. In the end, there is no `data get OWID/daily-covid-cases`, no `data query "select * from 538/polls"` that could act as entry-point to explore datasets.
3535+Organizations like [Our World in Data](https://ourworldindata.org/) or [538](https://fivethirtyeight.com/) provide useful analysis but have to deal with _dataset management_, spending most of their time building custom tools around their workflows. That works, but limits the potential of these datasets. Sadly, there is no `data get OWID/daily-covid-cases` or `data query "select * from 538/polls"` that could act as a quick and easy entry-point to explore datasets.
36363737-We could have a better ecosystem if we **collaborate on open standards**! So, lets move towards more composable, maintainable, and reproducible open data.
3737+We could have a better data ecosystem if we **collaborate on open standards**! So, let's move towards more [composable](https://voltrondata.com/codex), maintainable, and reproducible open data.
38383939-¹ I think blockchain data is a great place to start building the idea as the data there is open, immutable, and useful.
3939+¹ Blockchain data might be a great place to start building on these ideas as the data there is open, immutable, and useful.
40404141## Design Principles
42424343- **Easy**. Create, curate and share datasets without friction.
4444- - Data is useful only when used! Right now, we're not using most of humanity's datasets. That's not because they're not available but because they're hard to get. They're isolated in different places and formats.
4444+ - Frictionless: Data is useful only when used! Right now, we're not using most of humanity's datasets. That's not because they're not available but because they're hard to get. They're isolated in different places and multiple formats.
4545 - Pragmatism: published data is better than almost-published data that is held back because something is missing. Publishing datasets to the web is too hard now and there are few purpose-built tools that help.
4646- **Versioned and Modular**. Data and metadata (e.g: `relation`) should be [updated, forked and discussed](https://github.com/jbenet/data/blob/master/dev/designdoc.md#data-hashes-and-refs) as code in version controlled repositories.
4747 - Prime composability (e.g: [Arrow ecosystem](https://thenewstack.io/how-apache-arrow-is-changing-the-big-data-ecosystem/)) so tools/services can be swapped.
4848- - Metadata as a first-class citizen.
4949- - Git based approach collaboration. Adopt and integrate with `git` to reduce surface area. Build tooling to adapt revisions, tags, branches, issues, PRs to datasets.
5050- - Provide a declarative way of defining the datasets schema and other meta-properties like _relations_ or _tests_.
5151- - Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, ...
5252-- **Reproducible and Verifiable**. People should be able to trust the final datasets without having to recompute everything from scratch. In real life events are immutable, data should be too. Make datasets the center of the tooling like [software defined assets](https://dagster.io/blog/software-defined-assets).
5353- - Thanks to immutability and content addressing, you can move backwards in time and run transformations or queries on how the dataset was at a certain point in time.
5454-- **Permissionless**. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability.
5555-- **Aligned Incentives**. Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly.
4848+ - Metadata as a first-class citizen. Even if minimal and automated.
4949+ - Git-based collaboration. Adopt and integrate with `git` and GitHub to reduce surface area. Build tooling to adapt revisions, tags, branches, issues, PRs to datasets.
5050+ - Provide a declarative way of defining the dataset's schema and other meta-properties like _relations_ or _tests/checks_.
5151+ - Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, reports, ...
5252+- **Reproducible and Verifiable**. People should be able to trust the final datasets without having to recompute everything from scratch. In "reality", events are immutable, data should be too. [Make datasets the center of the tooling](https://dagster.io/blog/software-defined-assets).
5353+ - With immutability and content addressing, you can move backwards in time and run transformations or queries on how the dataset was at a certain point in time.
5454+- **Permissionless**. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability. On data.
5555+- **Aligned Incentives**. Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly (e.g: [number of contributors to Dune](https://github.com/duneanalytics/spellbook/commits/main)).
5656 - [Bounties](https://www.dolthub.com/bounties) could be created to reward people who add useful but missing datasets.
5757- - Surfacing and creating great datasets should be rewarded.
5757+ - Surfacing and creating great datasets could be rewarded (retroactively or with bounties).
5858 - Curating the data provides compounding benefits for the entire community!
5959 - Rewarding the datasets creators according to the usefulness. E.g: [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for their GPTs LLMs. Not sure how well CommonCrawl was compensated.
6060- **Open Source and Decentralized**. Datasets should be stored in multiple places.
···6363 - [Trustfall](https://github.com/obi1kenobi/trustfall).
6464 - Open source data integration projects like [Airbyte](https://airbyte.com/). They can be used to build open data connectors, making it possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
6565 - Adapters are created by the community so data becomes connected.
6666- - Integrate with the modern data stack to avoid reinventing the wheel.
6767- - Decentralized the computation (where data lives) and then cache copies of the results (or aggregations) in CDNs. Most queries require only reading a small amount of data and going to be similar.
6666+ - Having better data will help create better and more accessible AI models ([people are working on this](https://github.com/togethercomputer/OpenDataHub)).
6767+ - Integrate with the modern data stack to avoid reinventing the wheel and increase surface of the required skill sets.
6868+ - Decentralize the computation (where data lives) and then cache immutable and static copies of the results (or aggregations) in CDNs (IPFS, R2, Torrent). Most end user queries require only reading a small amount of data!
68696970## Modules
7071···72737374Package managers have been hailed among the most important innovations Linux brought to the computing industry. The activities of both publishers and users of datasets resemble those of authors and users of software packages.
74757575-- **Distribution**. Decentralized. No central authority. Can work in a closed network. Cache/CDN friendly.
7676+- **Distribution**. Decentralized. No central authority. Can work in closed and private networks. Cache/CDN friendly.
7677 - A data package is a URI ([like in Deno](https://deno.land/manual@v1.31.2/examples/manage_dependencies)). You can import from a URL (`data add example.com/dataset.yml` or `data add example.com/hub_curated_datasets.yml`).
7777- - As [Rufus Pollock puts it](https://datahub.io/docs/dms/notebook#go-modules-and-dependency-management-re-data-package-management-2020-05-16-rufuspollock), Keep it as simple as possible. Store the table location and schema and get me the data on the hard disk fast.
7878- - [Bootstrap a package registry](https://antonz.org/writing-package-manager/). E.g: a GitHub repository with lots of known datapackages that acts as fallback and quick way to get started with the tool.
7878+ - As [Rufus Pollock puts it](https://datahub.io/docs/dms/notebook#go-modules-and-dependency-management-re-data-package-management-2020-05-16-rufuspollock), Keep it as simple as possible. Store the table location and schema and get me the data on the hard disk (or my browser) fast.
7979+ - [Bootstrap a package registry](https://antonz.org/writing-package-manager/). E.g: a GitHub repository with lots of known `datapackages` that acts as fallback and quick way to get started with the tool (`data list` returns a bunch of known open datasets and integrates with platforms like Huggingface).
7980- **Indexing**. Should be easy to list datasets matching a certain pattern or reading from a certain source.
8080- - Datasets could be linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with description, default visualizations, WASM linked code...
8181- - One repository, one dataset or catalog/hub.
8181+ - Datasets are linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with metadata.
8282+ - One repository, one dataset or one catalog/hub.
8283 - To avoid yet another open dataset portal, build adapters to integrate with other indexes.
8383- - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
8484+ - For example, integrate all [Hugging Face datasets](https://huggingface.co/docs/datasets/index) by making a scheduled job that builds a Frictionless Catalog (bunch of `datapackage.yml`s pointing to their parquet files).
8485 - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
8585-- **Formatting**. Datasets should be saved and exposed in multiple formats (CSV, Parquet, ...). Could be done via WASM transformations or in the fly when pulling data. The package manager should be **format and storage agnostic**.
8686+- **Formatting**. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be **format and storage agnostic**. Give me the dataset with id `xyz` as a CSV in this folder.
8687- **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
8788 - Importing datasets. Making it possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR).
8889 - Have issues and discussions close to the dataset.
8990- **Extensible**. Users could extend the package resource (e.g: [Time Series Tabular Package inherits from Tabular Package](https://specs.frictionlessdata.io/tabular-data-package/)) and add better support for more specific kinds of data (geographical).
9090- - Integrations could be built to ingest/publish data from other hubs (e.g: CKAN)
9191+ - Build integrations to ingest and publish data in other hubs (e.g: CKAN, HuggingFace, ...).
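
A minimal sketch of the "store the table location and schema" idea from the list above. It assumes a Frictionless-style descriptor (`name`, `resources`, `schema.fields`); the helper names `infer_type` and `describe_csv`, the naive type inference, and the example URL are illustrative assumptions, not part of any existing tool.

```python
import csv
import io
import json


def infer_type(values):
    """Guess a field type ("integer", "number" or "string") from string samples."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except (TypeError, ValueError):
            return False

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "number"
    return "string"


def describe_csv(name, path, text):
    """Build a small data package descriptor for CSV content located at `path`."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    fields = [
        {"name": column, "type": infer_type([row[column] for row in rows])}
        for column in (reader.fieldnames or [])
    ]
    return {
        "name": name,
        "resources": [
            {"name": name, "path": path, "format": "csv", "schema": {"fields": fields}}
        ],
    }


if __name__ == "__main__":
    sample = "date,pct,n\n2024-01-01,41.5,900\n2024-01-02,43,1000\n"
    # The URL is a placeholder; any location the tool can reach would work.
    print(json.dumps(describe_csv("polls", "https://example.com/polls.csv", sample), indent=2))
```

A registry could then be as simple as a repository full of these descriptors, which is roughly what the bootstrap idea above suggests.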
91929292-### Storage
9393+### Storage and Serialization
93949494-- **Permanence**. Each [version](https://tech.datopian.com/versioning/) should be permanent and accessible.
9595+- **Permanence**. Each [version](https://tech.datopian.com/versioning/) should be permanent and accessible (look at `git`, `IPFS`, `dolt`, ...).
9596- **Versioning**. Should be able to manage _diffs_ and _incremental changes_ in a smart way. E.g: only storing the new added rows or updated columns.
9697 - Should allow [automated harvesting of new data](https://tech.datopian.com/harvesting/) with sensors (external functions) or scheduled jobs.
9798 - Each version is referenced by a hash. Git style.
···99100 - Think at the dataset level and not the file level.
100101 - Tabular data could be partitioned to make it easier for future retrieval.
101102- **Immutability**. Never remove historical data. Data should be append only.
102102- - Similar to how `git` deals with it. You could force the deletion of something in case that's needed, but not the default.
103103-- **Flexible**. Allow centralized ([S3](https://twitter.com/quiltdata/status/1569447878212591618), GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers.
103103+ - Similar to how `git` deals with it. You _could_ force the deletion of something in case that's needed, but that's not the default behavior.
104104+- **Flexible**. Allow arbitrary backends. Both centralized ([S3](https://twitter.com/quiltdata/status/1569447878212591618), GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers.
104105 - As agnostic as possible, supporting many types of data; tables, geospatial, images, ...
105106 - Can all datasets be represented as tabular datasets? That would make it possible to run SQL (`select, groupbys, joins`) on top of them, which might be the easiest way to start collaborating.
106106- - A dataset could have different formats derived from a common one. Represent all data as Arrow datasets, and build converters between that one format and all others. This is how Pandoc and LLVM work. The protocol could do the transformation (e.g: CSV to Parquet, JSON to Arrow, ...) automatically and some checks at the data level to verify they contain the same information.
107107+ - A dataset could have different formats derived from a common one. Build converters between formats relying on the Apache Arrow in memory standard format. This is similar to how Pandoc and LLVM work! The protocol could do the transformation (e.g: CSV to Parquet, JSON to Arrow, ...) automagically and run some checks at the data level to verify they contain the same information.
107108 - Datasets could be tagged from a library of types (e.g: `ip-adress`) and [conversion functions](https://github.com/jbenet/transformer) (`ip-to-country`). Given that the representation is common (Arrow), the transformations could be written in multiple languages.
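
The permanence, versioning, and immutability points above can be sketched with a tiny content-addressed, append-only store. This is only an illustration under stated assumptions: `content_id` and `VersionStore` are hypothetical names, the store lives in memory, and the hash is a SHA-256 of canonical JSON rather than a real IPFS CID or `git` object id.

```python
import hashlib
import json


def content_id(rows):
    """Hash a canonical JSON encoding so identical data always gets the same id."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionStore:
    """Append-only store: versions are added, never overwritten or removed."""

    def __init__(self):
        self._objects = {}  # content id -> frozen copy of the rows
        self._history = []  # ordered list of content ids

    def commit(self, rows):
        cid = content_id(rows)
        # Committing the same content twice is a no-op (idempotent).
        self._objects.setdefault(cid, json.loads(json.dumps(rows)))
        if not self._history or self._history[-1] != cid:
            self._history.append(cid)
        return cid

    def checkout(self, cid):
        """Return the dataset exactly as it was when `cid` was committed."""
        return self._objects[cid]

    def history(self):
        return list(self._history)


if __name__ == "__main__":
    store = VersionStore()
    v1 = store.commit([{"block": 1}])
    v2 = store.commit([{"block": 1}, {"block": 2}])
    print(store.history() == [v1, v2], store.checkout(v1))
```

Because a version is identified by its content, any backend (S3, IPFS, a local folder) could hold the objects, and a reader can verify what it received by recomputing the hash.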
108109109110### Transformations
···111112- **Deterministic**. Packaged lambda style transformations (WASM/Docker).
112113 - For tabular data, starting with just SQL might be great.
113114 - Pyodide + DuckDB for transformations could cover a large area.
114114- - Datasets could be derived by importing other datasets and applying deterministic transformations in the `Datafile`. Similar to Docker containers. That file will carry [Metadata, Lineage and even some defaults (visualizations, code, ...)](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html)
115115-- **Declarative**. Everything should be defined as code. E.g: YAML files with the source datasets and the transformations. Similar to how Pachyderm/Kamu/Holium do.
116116- - E.g: The tool ends up orchestrating containers that read/write from the storage layer, Pachyderm style.
115115+ - Datasets could be derived by importing other datasets and applying deterministic transformations in the `Datafile`. Similar to Docker containers and [Splitfiles](https://github.com/splitgraph/sgr#build-and-query-versioned-reproducible-datasets). That file will carry [Metadata, Lineage and even some defaults (visualizations, code, ...)](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html)
116116+- **Declarative**. Transformations should be defined as code and be idempotent. Similar to how Pachyderm/Kamu/Holium work.
117117+ - E.g: The transformation tool ends up orchestrating containers/functions that read/write from the storage layer, Pachyderm style.
117118- **Environment agnostic**. Can be run locally and remotely. One machine or a cluster. Streaming or batch.
118119- **Templated**. Having a repository/market of open transformations could empower a bunch of use cases ready to plug in to datasets:
119120 - Detect outliers automatically on tabular data.
···123124 - Enrich data smartly (Match and Augment pattern). If a matcher detects a date, the augmenter can add the day of week. If is something like a latitude and longitude, the augmenter adds country/city. [Some tools do this with closed source data](https://www.getcensus.com/blog/census-enrichment-third-party-data-enrichment-now-in-your-warehouse).
124125 - [Templated validations to make sure datasets conform to certain standards](https://framework.frictionlessdata.io/docs/checks/baseline.html).
125126126126-### Visualizations
127127+### Consumption
128128+- **Accessible**. Datasets are **files**. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases.
129129+- **Documentation**. Surface derived work (e.g: reports, other datasets, ...).
130130+- **Embedded Visualizations**. Know what's in there before downloading it.
131131+ - **Sane Defaults**. Suggest basic charts (bars, lines, time series, clustering). Multiple [views](https://tech.datopian.com/views/).
132132+ - **Exploratory**. Allow drill downs and customization. Offer a [simple way](https://lite.datasette.io/) for people to query/explore the data.
133133+ - **Dynamic**. Use only the data you need. No need to pull 150GB.
134134+- **Default APIs**. For some datasets, allowing REST API / GraphQL endpoints might be useful. Same with providing an SQL interface.
127135128128-- **Sane Defaults**. Suggest basic charts (bars, lines, time series, clustering). Multiple [views](https://tech.datopian.com/views/).
129129-- **Exploratory**. Allow drill downs and customization. Offer a [simple way](https://lite.datasette.io/) for people to query/explore the data.
130130-- **Dynamic**. Use only the data you need. No need to pull 150GB.
136136+## Frequently Asked Questions
131137132132-## Architecture
138138+> I'm not super clear on these answers! Please [reach out](https://davidgasquez.github.io/) if you want to chat about it.
133139134134-
140140+1. What would be a great use case to start with?
135141136136-_[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_
142142+I'd say [chain related data](https://davidgasquez.github.io/blockchain-data-pipelines/). It is open and people are eager to get their hands on it. I'm [working on that area](https://davidgasquez.github.io/gitcoin-data/), so I might be biased.
137143138138-## Extra Thoughts
144144+2. Why should people use this instead of doing their own thing?
139145140140-- [Making a SQL interface](https://twitter.com/josephjacks_/status/1492931290416365568) to query and mix these datasets could be a great step forward since it'll enable tooling like `dbt` to be used on top of it. **Data-as-code**.
141141- - SQL should be enough for unlocking most part of the potential. E.g: joining Wikipedia data to Our World In Data.
142142- - There are some [web3 DAOs already using `dbt` to improve data models](https://github.com/MetricsDAO/harmony_dbt/tree/main/models/metrics)!
146146+[If everybody could converge to it, e.g: _"datapackage.json_" as a metadata and schema description standard, then, an ecosystem of utilities and libraries for processing data would take advantage of it](https://news.ycombinator.com/item?id=15346836).
143147144144-## Open Questions
148148+3. What is the incentive for people to adopt it?
145149146146-- What would be a great use case to start with?
147147- - Why should people use this vs doing their own thing?
148148-- How can datasets be indexed?
149149-- What is the incentive for people to adopt it?
150150- - [If everybody could converge to it, e.g: _"datapackage.json_" as a metadata and schema description standard, then, an ecosystem of utilities and libraries for processing data would take advantage of it](https://news.ycombinator.com/item?id=15346836).
151151- - Is there a way to use web3 mechanisms to incentivize people? DAOs might be a good fit. Also, companies like [Golden](httpfs://golden.com/) and [index.as](https://index.as/) are doing interesting work on monetizing data curation.
152152-- How can LLMs help "building bridges"?
153153- - They're blurring the line between structured and unstructured data.
154154- - E.g: point a GPT wrapper to a GitHub repository and get the auto-generated `datapakage.json`. It should infer files, schema, and types and generate some metadata for us. Then, a "dataset package" can be anything the tool can crawl.
155155-- [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scrapping websites)]] making it easy to [create datasets from random sources](https://tomcritchlow.com/2021/03/29/open-scraping-database/).
156156-- How can we stream new data reliably? E.g: some datasets like Ethereum `blocks` are not static.
157157-- Is it possible to [mount large amount of data](https://rclone.org/commands/rclone_mount/) ([FUSE](https://github.com/datalad/datalad-fuse)) from a remote source and get it dynamically as needed?
158158-- Can new table formats play efficiently with IPFS?
159159- - E.g: Running [`delta-rs`](https://github.com/delta-io/delta-rs) on top of IPFS.
160160- - Parquet could be a great fit if we figure out how to deterministically serialize it and integrate with IPLD.
161161-- How to work with private data?
162162- - Homomorphic encryption?
163163-- How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) works? If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you.
164164- - This probably needs a [[Knowledge Graphs]]!
165165-- How can a [[Knowledge Graphs]] [help with the data catalog](https://docs.atomicdata.dev/usecases/data-catalog.html)?
166166-- [How would a Substack for databases look like](https://tomcritchlow.com/2023/01/27/small-databases/)?
167167- - An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions.
150150+I wonder if there are ways to use novel mechanisms (e.g: DAOs) to incentivize people. Also, companies like [Golden](httpfs://golden.com/) and [index.as](https://index.as/) are doing interesting work on monetizing data curation.
168151169169-### Related Projects
152152+4. How can LLMs help "building bridges"?
153153+154154+LLMs could infer schema, types, and generate some metadata for us. [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scrapping websites)]] making it easy to [create datasets from random sources](https://tomcritchlow.com/2021/03/29/open-scraping-database/).
155155+156156+They're definitely blurring the line between structured and unstructured data too. Imagine pointing an LLM at a GitHub repository with some CSVs and getting an auto-generated `datapackage.json`.
157157+158158+5. How can we stream/update new data reliably? E.g: some datasets like Ethereum `blocks` could be updated every few minutes.
159159+160160+I don't have a great answer. Perhaps just push the new data into partitioned datasets?
161161+162162+6. Is it possible to [mount large amounts of data](https://rclone.org/commands/rclone_mount/) ([FUSE](https://github.com/datalad/datalad-fuse)) from a remote source and get it dynamically as needed?
163163+164164+It should be possible. I wonder if we could mount all datasets locally and explore them as if they were on your laptop.
165165+166166+7. Can new table formats play efficiently with IPFS?
167167+168168+Parquet could be a great fit if we figure out how to deterministically serialize it and integrate with IPLD. This will reduce their size as unchanged columns could be encoded in the same CID.
169169+170170+Later on I think it could be interesting to explore running [`delta-rs`](https://github.com/delta-io/delta-rs) on top of IPFS.
171171+172172+8. How to work with private data?
173173+174174+Homomorphic encryption?
175175+176176+9. How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) work?
177177+178178+If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs]]!
179179+180180+10. How can a [[Knowledge Graphs]] [help with the data catalog](https://docs.atomicdata.dev/usecases/data-catalog.html)?
181181+182182+It could help users connect datasets. With good enough core datasets, it could be used as an LLM backend.
170183171171-#### Computation
184184+11. [What would a Substack for databases look like](https://tomcritchlow.com/2023/01/27/small-databases/)?
172185173173-- [Kamu](https://www.kamu.dev/).
174174-- [Bacalhau](https://www.bacalhau.org/).
175175-- [Holium](https://docs.holium.org/). An open source protocol dedicated to the management of data connected through transformations. Similar to Pachyderm but using WASM and IPFS.
176176-- [Ocean Protocol](https://oceanprotocol.com/technology/compute-to-data).
177177-- [The Graph](https://thegraph.com/).
178178-- [Trino](https://trino.io/).
186186+ An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions.
187187+188188+12. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there?
189189+190190+Indeed! With LLMs on the rise, community curated datasets become more important as they don't appear in the big data dumps.
191191+192192+## Related Projects
179193180180-#### Data Package Managers
194194+### Data Package Managers
181195182196- [Qri](https://qri.io/). An evolution of the classical open portals that added [[Decentralized Protocols]] (IPFS) and computing on top of the data. Sadly, [it came to an end early in 2022](https://qri.io/winding_down).
183183-- [Datalad](https://www.datalad.org/). [Extended to IPFS](https://kinshukk.github.io/posts/gsoc-summary-and-future-thoughts/).
197197+- [Datalad](https://www.datalad.org/). [Extended to IPFS](https://kinshukk.github.io/posts/gsoc-summary-and-future-thoughts/)
184198 - Is a [great tool](https://archive.fosdem.org/2020/schedule/event/open_research_datalad/) and uses Git Annex (distributed binary object tracking layer on top of git).
185199 - Complicated to wrap your head around. Lots of different commands and concepts. On the other hand, it's very powerful and flexible. Git Annex is complex but powerful and flexible.
186186- - The handbook is very good, but it's a lot of reading if you just want to test things out.
187187-- [Huggingface Datasets](https://huggingface.co/docs/datasets).
188188-- [Quilt](https://github.com/quiltdata/quilt).
189189- - Forces both Python and S3.
190190-- [Oxen](https://github.com/Oxen-AI/Oxen).
191191- - Data is not accesible from other tools.
192192- - [Docs](https://github.com/Oxen-AI/oxen-release#-oxen-release) are sparse.
193193- - Definitely more in the Git for Data space than Dataset Package Manager.
194194-- [Frictionless Data](https://frictionlessdata.io/projects/#software-and-standards).
195195-- [Datopian Data CLI](https://github.com/datopian/data-cli). Sucesor of [DPM](https://github.com/frictionlessdata/dpm-js).
196196-- [LakeFS](https://lakefs.io/blog/git-for-data/). More like Git for Data.
197197-- [Datasette](https://lite.datasette.io/).
198198-- [Algovera Metahub](https://github.com/AlgoveraAI/metahub).
199199-- [DVC](https://github.com/iterative/dvc).
200200-- [XVC](https://github.com/iesahin/xvc).
201201-- [ArtiVC](https://artivc.io/).
202202-- [Xetdata](https://xetdata.com/).
203203-- [Dud](https://github.com/kevin-hanselman/dud).
204204-- [Splitgraph](https://github.com/splitgraph/sgr).
205205-- [Deep Lake](https://github.com/activeloopai/deeplake).
206206-- [Dim](https://github.com/c-3lab/dim).
207207- - Hard to grok how to use it from the docs.
208208- - Quite small surface area. You can basically install datasets from URLs, create new ones, or apply some kind of GPT3 transformation on top of them.
209209-- [Juan Benet's data](https://github.com/jbenet/data).
210210-- [Colah's data](https://github.com/colah/data).
200200+- [Huggingface Datasets](https://huggingface.co/docs/datasets)
201201+- [Quilt](https://github.com/quiltdata/quilt)
202202+ - Forces both Python and S3
203203+- [Oxen](https://github.com/Oxen-AI/Oxen)
204204+ - Data is not accessible from other tools
205205+ - [Docs](https://github.com/Oxen-AI/oxen-release#-oxen-release) are sparse
206206+ - Definitely more in the Git for Data space than Dataset Package Manager
207207+- [Frictionless Data](https://frictionlessdata.io/projects/#software-and-standards)
208208+- [Datopian Data CLI](https://github.com/datopian/data-cli). Successor of [DPM](https://github.com/frictionlessdata/dpm-js)
209209+- [LakeFS](https://lakefs.io/blog/git-for-data/). More like Git for Data
210210+- [Datasette](https://lite.datasette.io/)
211211+- [Algovera Metahub](https://github.com/AlgoveraAI/metahub)
212212+- [DVC](https://github.com/iterative/dvc)
213213+- [XVC](https://github.com/iesahin/xvc)
214214+- [ArtiVC](https://artivc.io/)
215215+- [Xetdata](https://xetdata.com/)
216216+- [Dud](https://github.com/kevin-hanselman/dud)
217217+- [Splitgraph](https://github.com/splitgraph/sgr)
218218+- [Deep Lake](https://github.com/activeloopai/deeplake)
219219+- [Dim](https://github.com/c-3lab/dim)
220220+ - Hard to grok how to use it from the docs
221221+ - Quite small surface area. You can basically install datasets from URLs, create new ones, or apply some kind of GPT3 transformation on top of them
222222+- [Juan Benet's data](https://github.com/jbenet/data)
223223+- [Colah's data](https://github.com/colah/data)
211224- [Dolt](https://docs.dolthub.com/) is another interesting project in the space with some awesome data structures. They also [do data bounties](https://www.dolthub.com/repositories/dolthub/us-businesses)!
212225213213-## Open Datasets
226226+### Computation
227227+228228+- [Kamu](https://www.kamu.dev/)
229229+- [Bacalhau](https://www.bacalhau.org/)
230230+- [Holium](https://docs.holium.org/)
231231+- [Ocean Protocol](https://oceanprotocol.com/technology/compute-to-data)
232232+- [The Graph](https://thegraph.com/)
233233+- [Trino](https://trino.io/)
234234+235235+### Large Open Datasets
214236215237- [Wikipedia](https://dumps.wikimedia.org/)
216238- [Github](https://www.gharchive.org/)
···286308287309### Interesting Projects
288310289289-- [Rath](https://rath.kanaries.net/).
290290-- [Perspective](https://perspective.finos.org/).
291291-- [Rill Developer](https://github.com/rilldata/rill-developer).
292292-- [Datastation](https://app.datastation.multiprocess.io/).
311311+- [Rath](https://rath.kanaries.net/)
312312+- [Perspective](https://perspective.finos.org/)
313313+- [Rill Developer](https://github.com/rilldata/rill-developer)
314314+- [Datastation](https://app.datastation.multiprocess.io/)
293315294316#### Datafile
295317···334356- Spec file locator with fallback to the package registry.
335357- Versioning and latest versions.
336358- Asset checksums.
359359+360360+## Architecture
361361+362362+
363363+364364+_[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_