:art: · davidgasquez.com/handbook@edb5142

+1

Cooperatives.md

··· 1 1 # Cooperatives 2 2 3 3 - A cooperative is an autonomous association of persons united voluntarily to meet their common economic, social and cultural needs and aspirations through a jointly-owned and democratically-controlled enterprise. 4 + - There are [many ways to form cooperatives](https://institute.coop/sites/default/files/resources/356%202009_Johnson%20etal_Tech%20Freelancers%20Guide%20to%20Worker%20Co-ops.pdf). 4 5 5 6 ## Resources 6 7

+2

Data/Data Engineering.md

··· 6 6 - **Reliability:** how well can systems recover from outages and incidents? 7 7 - **Speed of execution:** how quickly can you get a new data source up and running? 8 8 - If it can be solved with SQL, stick to SQL. 9 + - SQL will be the abstraction layer in streaming too, so you don't have to care about incremental materialization or timely dataflows. 9 10 - A [consistent pattern](https://www.startdataengineering.com/post/design-patterns/) across your data pipelines helps devs communicate easily and understand code better. 11 + - Data Engineering can learn from decentralized systems ideas like Content Addressed Data, Immutability, and [[Idempotence]]. 10 12 11 13 ## Data Pipelines 12 14

+20 -3

Future.md

··· 1 - # Things that Might Look Weird in the Future 1 + # Future 2 + 3 + ## Things that Might Look Weird in the Future 2 4 3 5 History teaches us that 100 years from now [[Openness|some of the assumptions we believed will turn out to be wrong]]. A good question to ask is "What might we be wrong about today?". These are a few things that future humans might see as weird behavior: 4 6 ··· 7 9 - Give birth without advanced assistance. 8 10 - Not caring for all the [animal suffering in the wild](https://longtermrisk.org/the-importance-of-wild-animal-suffering/). 9 11 - Nature is not safe! The default is suffering. The current mentality is that nature is good and disruptions from nature are bad. 10 - - The ignorance of Social Media and its [full impact on society](https://twitter.com/M_B_Petersen/status/1483457679800651787). 12 + - The ignorance of Social Media and its [full impact on society](https://twitter.com/M_B_Petersen/status/1483457679800651787). 11 13 - Is "being bad for society" an emergent property of social networks as they grow? 12 - - Voting Systems and not using more Prediction Markets in public. 14 + - Current Voting Systems. 15 + - Not relying more on tools like Prediction Markets. 16 + 17 + ## Predictions 18 + 13 19 - More experimentation around [[Politics|governance]]: 14 20 - Charter cities. 15 21 - [Holacracy](https://en.m.wikipedia.org/wiki/Holacracy). ··· 20 26 - More interactive explanations like the ones the awesome [Nicky Case](https://ncase.me/) does! 21 27 - More concern around systems with weird incentives causing large amounts of pain (Moloch). 22 28 - Work valuation changes (plumbing more expensive than some software development) due to [Moravec's paradox](https://en.wikipedia.org/wiki/Moravec%27s_paradox). We will automate making a full app before a robot is able to master physical arms and legs like a 5-year-old. 
29 + - Open data will be more important as it can produce better models and help coordinate people by providing shared context. 30 + - The current decentralized protocols (IPFS, ActivityPub, ...) need to evolve more, especially around UX. People don't care about decentralization; they care about UX. 31 + 32 + ### Exciting Software Engineering Ideas 33 + 34 + - Content Addressed Data + Immutability 35 + - CRDTs 36 + - Homomorphic Encryption 37 + - Prolly/Merkle Trees 38 + - Differential/Timely Dataflow 39 + - Zero-Knowledge Proofs
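
As a rough illustration of two of the ideas listed above (Content Addressed Data + Immutability, and Merkle Trees), here is a minimal Python sketch. `ContentAddressedStore` and `merkle_root` are hypothetical names made up for this example, and SHA-256 is an assumed hash choice. Real systems such as IPFS use their own chunking and identifier formats, so treat this as a toy model rather than how any of those tools work.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store: the key of every blob is the SHA-256 of its bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Same bytes always map to the same address, so repeated writes are
        # no-ops (idempotent) and existing entries are never overwritten.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Readers can verify the content against the address they asked for.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored bytes do not match their address")
        return data


def merkle_root(chunks: list[bytes]) -> str:
    """Hash chunks pairwise, level by level, until a single root remains."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # One common convention (assumed here): duplicate the last node.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


if __name__ == "__main__":
    store = ContentAddressedStore()
    address = store.put(b"id,value\n1,42\n")
    print(address, store.get(address))
    print(merkle_root([b"chunk-0", b"chunk-1", b"chunk-2"]))
```

Because an address is derived from the content, "updating" a dataset means publishing new bytes under a new address while the old version stays reachable, which is what makes moving backwards in time cheap.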

+140 -112

Open Data.md

··· 1 1 # Open Data 2 - 3 - > Bring Open Data to the level of Open Source. 4 - > Make Open Data compatible with the modern data ecosystem (tooling, approaches, ...). 2 + _Make Open Data compatible with the Modern Data Ecosystem_. 5 3 6 4 ## Motivation 7 5 ··· 20 18 21 19 Open protocols create open systems. Open code creates tools. **Open data creates open knowledge**. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets. 22 20 23 - Iterative improvements over public datasets would yield large amounts of value ([Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions. Open Source code has made a huge impact in the world. Let's make Open Data do the same! [Anyone should be able to fork and re-publish fixed, cleaned, reformatted datasets as easily as people fork code](https://juan.benet.ai/blog/2014-02-21-data-management-problems/). 21 + Iterative improvements over public datasets yield large amounts of value ([check how Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions. 22 + 23 + Open Source code has made a huge impact in the world. Let's make Open Data do the same! Let's make it possible for [anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code](https://juan.benet.ai/blog/2014-02-21-data-management-problems/). 24 24 25 25 ### Why Now? 26 26 27 - We have cheaper storage, better compute, and more data. We need to improve our workflows now. How does a world where people collaborate on datasets looks like? 27 + We have better and cheaper infrastructure. That includes things like faster storage, better compute, and, larger amounts of data. 
We need to improve our data workflows now. What does a world where people collaborate on datasets look like? [The data is there. We just need to use it](https://twitter.com/auren/status/1509340748054945794). 28 28 29 - During the last few years, a Cambrian explosion of open source tools has emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets). 29 + During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets). 30 30 31 - These trends have already quick-started movements like [DeSci](https://ethereum.org/en/desci/) but we still need more tooling around data to make interoperability possible. **We should use the same modern tooling companies are using to manage open datasets**. A sort of [Data Operating system](https://data-operating-system.com/). Having better data will create better and more accessible AI models ([people are working on this](https://github.com/togethercomputer/OpenDataHub)). 31 + These trends are already making their way into movements like [DeSci](https://ethereum.org/en/desci/) or smaller projects like [Py-Code Datasets](https://py-code.org/datasets). But we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. **We should use tooling and approaches similar to the ones companies use to manage the open datasets that surround us**. A sort of [Data Operating system](https://data-operating-system.com/). 32 32 33 33 Data wrangling is a perpetual maintenance commitment, taking a lot of ongoing attention and resources. 
[Better and modern data tooling can reduce these costs](https://github.com/catalyst-cooperative/pudl). 34 34 35 - Organizations like [Our World in Data](https://ourworldindata.org/) or [538](https://fivethirtyeight.com/) provide useful analysis but have to deal with _dataset management_. They end up building custom tools around their workflows. That works, but limits the potential of these datasets. In the end, there is no `data get OWID/daily-covid-cases`, no `data query "select * from 538/polls"` that could act as entry-point to explore datasets. 35 + Organizations like [Our World in Data](https://ourworldindata.org/) or [538](https://fivethirtyeight.com/) provide useful analysis but have to deal with _dataset management_, spending most of their time building custom tools around their workflows. That works, but limits the potential of these datasets. Sadly, there is no `data get OWID/daily-covid-cases` or `data query "select * from 538/polls"` that could act as a quick and easy entry-point to explore datasets. 36 36 37 - We could have a better ecosystem if we **collaborate on open standards**! So, lets move towards more composable, maintainable, and reproducible open data. 37 + We could have a better data ecosystem if we **collaborate on open standards**! So, lets move towards more [composable](https://voltrondata.com/codex), maintainable, and reproducible open data. 38 38 39 - ¹ I think blockchain data is a great place to start building the idea as the data there is open, immutable, and useful. 39 + ¹ Blockchain data might be a great place to start building on these ideas as the data there is open, immutable, and useful. 40 40 41 41 ## Design Principles 42 42 43 43 - **Easy**. Create, curate and share datasets without friction. 44 - - Data is useful only when used! Right now, we're not using most of humanity's datasets. That's not because they're not available but because they're hard to get. They're isolated in different places and formats. 
44 + - Frictionless: Data is useful only when used! Right now, we're not using most of humanity's datasets. That's not because they're not available but because they're hard to get. They're isolated in different places and multiple formats. 45 45 - Pragmatism: published data is better than almost published one because something is missing. Publishing datasets to the web is too hard now and there are few purpose-built tools that help. 46 46 - **Versioned and Modular**. Data and metadata (e.g: `relation`) should be [updated, forked and discussed](https://github.com/jbenet/data/blob/master/dev/designdoc.md#data-hashes-and-refs) as code in version controlled repositories. 47 47 - Prime composability (e.g: [Arrow ecosystem](https://thenewstack.io/how-apache-arrow-is-changing-the-big-data-ecosystem/)) so tools/services can be swapped. 48 - - Metadata as a first-class citizen. 49 - - Git based approach collaboration. Adopt and integrate with `git` to reduce surface area. Build tooling to adapt revisions, tags, branches, issues, PRs to datasets. 50 - - Provide a declarative way of defining the datasets schema and other meta-properties like _relations_ or _tests_. 51 - - Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, ... 52 - - **Reproducible and Verifiable**. People should be able to trust the final datasets without having to recompute everything from scratch. In real life events are immutable, data should be too. Make datasets the center of the tooling like [software defined assets](https://dagster.io/blog/software-defined-assets). 53 - - Thanks to immutability and content addressing, you can move backwards in time and run transformations or queries on how the dataset was at a certain point in time. 54 - - **Permissionless**. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability. 55 - - **Aligned Incentives**. 
Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly. 48 + - Metadata as a first-class citizen. Even if minimal and automated. 49 + - Git based approach collaboration. Adopt and integrate with `git` and GitHub to reduce surface area. Build tooling to adapt revisions, tags, branches, issues, PRs to datasets. 50 + - Provide a declarative way of defining the datasets schema and other meta-properties like _relations_ or _tests/checks_. 51 + - Support for integrating non-dataset files. A dataset could be linked to code, visualizations, pipelines, models, reports, ... 52 + - **Reproducible and Verifiable**. People should be able to trust the final datasets without having to recompute everything from scratch. In "reality", events are immutable, data should be too. [Make datasets the center of the tooling](https://dagster.io/blog/software-defined-assets). 53 + - With immutability and content addressing, you can move backwards in time and run transformations or queries on how the dataset was at a certain point in time. 54 + - **Permissionless**. Anyone should be able to add/update/fix datasets or their metadata. GitHub style collaboration, curation, and composability. On data. 55 + - **Aligned Incentives**. Curators should have incentives to improve datasets. Data is messy after all, but a good set of incentives could make great datasets surface and reward contributors accordingly (e.g: [number of contributors to Dune](https://github.com/duneanalytics/spellbook/commits/main)). 56 56 - [Bounties](https://www.dolthub.com/bounties) could be created to reward people that adds useful but missing datasets. 57 - - Surfacing and creating great datasets should be rewarded. 57 + - Surfacing and creating great datasets could be rewarded (retroactively or with bounties). 58 58 - Curating the data provides compounding benefits for the entire community! 
59 59 - Rewarding the dataset creators according to the usefulness. E.g: [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for their GPT LLMs. Not sure how well CommonCrawl was compensated. 60 60 - **Open Source and Decentralized**. Datasets should be stored in multiple places. ··· 63 63 - [Trustfall](https://github.com/obi1kenobi/trustfall). 64 64 - Open source data integration projects like [Airbyte](https://airbyte.com/). They can be used to build open data connectors, making it possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination. 65 65 - Adapters are created by the community so data becomes connected. 66 - - Integrate with the modern data stack to avoid reinventing the wheel. 67 - - Decentralize the computation (where data lives) and then cache copies of the results (or aggregations) in CDNs. Most queries require only reading a small amount of data and are going to be similar. 66 + - Having better data will help create better and more accessible AI models ([people are working on this](https://github.com/togethercomputer/OpenDataHub)). 67 + - Integrate with the modern data stack to avoid reinventing the wheel and increase surface of the required skill sets. 68 + - Decentralize the computation (where data lives) and then cache immutable and static copies of the results (or aggregations) in CDNs (IPFS, R2, Torrent). Most end user queries require only reading a small amount of data! 68 69 69 70 ## Modules 70 71 ··· 72 73 73 74 Package managers have been hailed among the most important innovations Linux brought to the computing industry. The activities of both publishers and users of datasets resemble those of authors and users of software packages. 74 75 75 - - **Distribution**. Decentralized. No central authority. Can work in a closed network. Cache/CDN friendly. 76 + - **Distribution**. Decentralized. No central authority. Can work in closed and private networks. 
Cache/CDN friendly. 76 77 - A data package is an URI ([like in Deno](https://deno.land/manual@v1.31.2/examples/manage_dependencies)). You can import from an URL (`data add example.com/dataset.yml` or `data add example.com/hub_curated_datasets.yml`). 77 - - As [Rufus Pollock puts it](https://datahub.io/docs/dms/notebook#go-modules-and-dependency-management-re-data-package-management-2020-05-16-rufuspollock), Keep it as simple as possible. Store the table location and schema and get me the data on the hard disk fast. 78 - - [Bootstrap a package registry](https://antonz.org/writing-package-manager/). E.g: a GitHub repository with lots of known datapackages that acts as fallback and quick way to get started with the tool. 78 + - As [Rufus Pollock puts it](https://datahub.io/docs/dms/notebook#go-modules-and-dependency-management-re-data-package-management-2020-05-16-rufuspollock), Keep it as simple as possible. Store the table location and schema and get me the data on the hard disk (or my browser) fast. 79 + - [Bootstrap a package registry](https://antonz.org/writing-package-manager/). E.g: a GitHub repository with lots of known `datapackages` that acts as fallback and quick way to get started with the tool (`data list` returns a bunch of known open datasets and integrates with platforms like Huggingface). 79 80 - **Indexing**. Should be easy to list datasets matching a certain pattern or reading from a certain source. 80 - - Datasets could be linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with description, default visualizations, WASM linked code... 81 - - One repository, one dataset or catalog/hub. 81 + - Datasets are linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with metadata. 82 + - One repository, one dataset or one catalog/hub. 82 83 - To avoid yet another open dataset portal, build adapters to integrate with other indexes. 
83 - - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files. 84 + - For example, integrate all [Hugging Face datasets](https://huggingface.co/docs/datasets/index) by making a scheduled job that builds a Frictionless Catalog (a bunch of `datapackage.yml`s pointing to their parquet files). 84 85 - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset). 85 - - **Formatting**. Datasets should be saved and exposed in multiple formats (CSV, Parquet, ...). Could be done via WASM transformations or on the fly when pulling data. The package manager should be **format and storage agnostic**. 86 + - **Formatting**. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be **format and storage agnostic**. Give me the dataset with id `xyz` as a CSV in this folder. 86 87 - **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ... 87 88 - Importing datasets. Making it possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR). 88 89 - Have issues and discussions close to the dataset. 89 90 - **Extensible**. Users could extend the package resource (e.g: [Time Series Tabular Package inherits from Tabular Package](https://specs.frictionlessdata.io/tabular-data-package/)) and add better support for more specific kinds of data (geographical). 90 - - Integrations could be built to ingest/publish data from other hubs (e.g: CKAN) 91 + - Build integrations to ingest and publish data in other hubs (e.g: CKAN, HuggingFace, ...). 91 92 92 - ### Storage 93 + ### Storage and Serialization 93 94 94 - - **Permanence**. 
Each [version](https://tech.datopian.com/versioning/) should be permanent and accessible. 95 + - **Permanence**. Each [version](https://tech.datopian.com/versioning/) should be permanent and accessible (look at `git`, `IPFS`, `dolt`, ...). 95 96 - **Versioning**. Should be able to manage _diffs_ and _incremental changes_ in a smart way. E.g: only storing the newly added rows or updated columns. 96 97 - Should allow [automated harvesting of new data](https://tech.datopian.com/harvesting/) with sensors (external functions) or scheduled jobs. 97 98 - Each version is referenced by a hash. Git style. ··· 99 100 - Think at the dataset level and not the file level. 100 101 - Tabular data could be partitioned to make it easier for future retrieval. 101 102 - **Immutability**. Never remove historical data. Data should be append only. 102 - - Similar to how `git` deals with it. You could force the deletion of something in case that's needed, but not the default. 103 - - **Flexible**. Allow centralized ([S3](https://twitter.com/quiltdata/status/1569447878212591618), GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers. 103 + - Similar to how `git` deals with it. You _could_ force the deletion of something in case that's needed, but that's not the default behavior. 104 + - **Flexible**. Allow arbitrary backends. Both centralized ([S3](https://twitter.com/quiltdata/status/1569447878212591618), GCS, ...) and decentralized (IPFS, Hypercore, Torrent, ...) layers. 104 105 - As agnostic as possible, supporting many types of data: tables, geospatial, images, ... 105 106 - Can all datasets be represented as tabular datasets? This would enable running SQL (`select, groupbys, joins`) on top of them, which might be the easiest way to start collaborating. 106 - - A dataset could have different formats derived from a common one. Represent all data as Arrow datasets, and build converters between that one format and all others. This is how Pandoc and LLVM work. 
The protocol could do the transformation (e.g: CSV to Parquet, JSON to Arrow, ...) automatically and some checks at the data level to verify they contain the same information. 107 + - A dataset could have different formats derived from a common one. Build converters between formats relying on the Apache Arrow in memory standard format. This is similar to how Pandoc and LLVM work! The protocol could do the transformation (e.g: CSV to Parquet, JSON to Arrow, ...) automagically and run some checks at the data level to verify they contain the same information. 107 108 - Datasets could be tagged from a library of types (e.g: `ip-adress`) and [conversion functions](https://github.com/jbenet/transformer) (`ip-to-country`). Given that the representation is common (Arrow), the transformations could be written in multiple languages. 108 109 109 110 ### Transformations ··· 111 112 - **Deterministic**. Packaged lambda style transformations (WASM/Docker). 112 113 - For tabular data, starting with just SQL might be great. 113 114 - Pyodite + DuckDB for transformations could cover a large area. 114 - - Datasets could be derived by importing other datasets and applying deterministic transformations in the `Datafile`. Similar to Docker containers. That file will carry [Metadata, Lineage and even some defaults (visualizations, code, ...)](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html) 115 - - **Declarative**. Everything should be defined as code. E.g: YAML files with the source datasets and the transformations. Similar to how Pachyderm/Kamu/Holium do. 116 - - E.g: The tool ends up orchestrating containers that read/write from the storage layer, Pachyderm style. 115 + - Datasets could be derived by importing other datasets and applying deterministic transformations in the `Datafile`. Similar to Docker containers and [Splitfiles](https://github.com/splitgraph/sgr#build-and-query-versioned-reproducible-datasets). 
That file will carry [Metadata, Lineage and even some defaults (visualizations, code, ...)](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html) 116 + - **Declarative**. Transformations should be defined as code and be idempotent. Similar to how Pachyderm/Kamu/Holium work. 117 + - E.g: The transformation tool ends up orchestrating containers/functions that read/write from the storage layer, Pachyderm style. 117 118 - **Environment agnostic**. Can be run locally and remotely. One machine or a cluster. Streaming or batch. 118 119 - **Templated**. Having a repository/market of open transformations could empower a bunch of use cases ready to plug in to datasets: 119 120 - Detect outliers automatically on tabular data. ··· 123 124 - Enrich data smartly (Match and Augment pattern). If a matcher detects a date, the augmenter can add the day of week. If is something like a latitude and longitude, the augmenter adds country/city. [Some tools do this with closed source data](https://www.getcensus.com/blog/census-enrichment-third-party-data-enrichment-now-in-your-warehouse). 124 125 - [Templated validations to make sure datasets conform to certain standards](https://framework.frictionlessdata.io/docs/checks/baseline.html). 125 126 126 - ### Visualizations 127 + ### Consumption 128 + - **Accessible**. Datasets are **files**. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases. 129 + - **Documentation**. Surface derived work (e.g: reports, other datasets, ...). 130 + - **Embedded Visualizations**. Know what's in there before downloading it. 131 + - **Sane Defaults**. Suggest basic charts (bars, lines, time series, clustering). Multiple [views](https://tech.datopian.com/views/). 132 + - **Exploratory**. Allow drill downs and customization. Offer a [simple way](https://lite.datasette.io/) for people to query/explore the data. 133 + - **Dynamic**. Use only the data you need. No need to pull 150GB. 
134 + - **Default APIs**. For some datasets, allowing REST API / GraphQL endpoints might be useful. Same with providing an SQL interface. 127 135 128 - - **Sane Defaults**. Suggest basic charts (bars, lines, time series, clustering). Multiple [views](https://tech.datopian.com/views/). 129 - - **Exploratory**. Allow drill downs and customization. Offer a [simple way](https://lite.datasette.io/) for people to query/explore the data. 130 - - **Dynamic**. Use only the data you need. No need to pull 150GB. 136 + ## Frequently Asked Questions 131 137 132 - ## Architecture 138 + > I'm not super clear on these answers! Please [reach out](https://davidgasquez.github.io/) if you want to chat about it. 133 139 134 - ![Architecture](https://user-images.githubusercontent.com/1682202/224966685-b2406d5f-b162-4a93-a68a-af0afca45ebe.png) 140 + 1. What would be a great use case to start with? 135 141 136 - _[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_ 142 + I'd say [chain related data](https://davidgasquez.github.io/blockchain-data-pipelines/). Is open and people are eager to get their hands on it. I'm [working on that area](https://davidgasquez.github.io/gitcoin-data/), so I might be biased. 137 143 138 - ## Extra Thoughts 144 + 2. Why should people use this instead of doing their own thing? 139 145 140 - - [Making a SQL interface](https://twitter.com/josephjacks_/status/1492931290416365568) to query and mix these datasets could be a great step forward since it'll enable tooling like `dbt` to be used on top of it. **Data-as-code**. 141 - - SQL should be enough for unlocking most part of the potential. E.g: joining Wikipedia data to Our World In Data. 142 - - There are some [web3 DAOs already using `dbt` to improve data models](https://github.com/MetricsDAO/harmony_dbt/tree/main/models/metrics)! 
146 + [If everybody could converge to it, e.g: _"datapackage.json_" as a metadata and schema description standard, then, an ecosystem of utilities and libraries for processing data would take advantage of it](https://news.ycombinator.com/item?id=15346836). 143 147 144 - ## Open Questions 148 + 3. What is the incentive for people to adopt it? 145 149 146 - - What would be a great use case to start with? 147 - - Why should people use this vs doing their own thing? 148 - - How can datasets be indexed? 149 - - What is the incentive for people to adopt it? 150 - - [If everybody could converge to it, e.g: _"datapackage.json_" as a metadata and schema description standard, then, an ecosystem of utilities and libraries for processing data would take advantage of it](https://news.ycombinator.com/item?id=15346836). 151 - - Is there a way to use web3 mechanisms to incentivize people? DAOs might be a good fit. Also, companies like [Golden](httpfs://golden.com/) and [index.as](https://index.as/) are doing interesting work on monetizing data curation. 152 - - How can LLMs help "building bridges"? 153 - - They're blurring the line between structured and unstructured data. 154 - - E.g: point a GPT wrapper to a GitHub repository and get the auto-generated `datapakage.json`. It should infer files, schema, and types and generate some metadata for us. Then, a "dataset package" can be anything the tool can crawl. 155 - - [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scrapping websites)]] making it easy to [create datasets from random sources](https://tomcritchlow.com/2021/03/29/open-scraping-database/). 156 - - How can we stream new data reliably? E.g: some datasets like Ethereum `blocks` are not static. 157 - - Is it possible to [mount large amount of data](https://rclone.org/commands/rclone_mount/) ([FUSE](https://github.com/datalad/datalad-fuse)) from a remote source and get it dynamically as needed? 
158 - - Can new table formats play efficiently with IPFS? 159 - - E.g: Running [`delta-rs`](https://github.com/delta-io/delta-rs) on top of IPFS. 160 - - Parquet could be a great fit if we figure out how to deterministically serialize it and integrate with IPLD. 161 - - How to work with private data? 162 - - Homomorphic encryption? 163 - - How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) work? If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. 164 - - This probably needs a [[Knowledge Graphs]]! 165 - - How can a [[Knowledge Graphs]] [help with the data catalog](https://docs.atomicdata.dev/usecases/data-catalog.html)? 166 - - [How would a Substack for databases look like](https://tomcritchlow.com/2023/01/27/small-databases/)? 167 - - An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions. 150 + I wonder if there are ways to use novel mechanisms (e.g: DAOs) to incentivize people? Also, companies like [Golden](httpfs://golden.com/) and [index.as](https://index.as/) are doing interesting work on monetizing data curation. 168 151 169 - ### Related Projects 152 + 4. How can LLMs help "building bridges"? 153 + 154 + LLMs could infer schema, types, and generate some metadata for us. [[Large Language Models|LLMs can parse unstructured data (CSV) and also generate structure from any data source (scraping websites)]] making it easy to [create datasets from random sources](https://tomcritchlow.com/2021/03/29/open-scraping-database/). 155 + 156 + They're definitely blurring the line between structured and unstructured data too. Imagine pointing an LLM at a GitHub repository with some CSVs and getting an auto-generated `datapackage.json`. 
157 + 158 + 5. How can we stream/update new data reliably? E.g: some datasets like Ethereum `blocks` could be updated every few minutes. 159 + 160 + I don't have a great answer. Perhaps just push the new data into partitioned datasets? 161 + 162 + 6. Is it possible to [mount large amounts of data](https://rclone.org/commands/rclone_mount/) ([FUSE](https://github.com/datalad/datalad-fuse)) from a remote source and get it dynamically as needed? 163 + 164 + It should be possible. I wonder if we could mount all datasets locally and explore them as if they were on your laptop. 165 + 166 + 7. Can new table formats play efficiently with IPFS? 167 + 168 + Parquet could be a great fit if we figure out how to deterministically serialize it and integrate with IPLD. This would reduce dataset size, as unchanged columns could be encoded in the same CID. 169 + 170 + Later on I think it could be interesting to explore running [`delta-rs`](https://github.com/delta-io/delta-rs) on top of IPFS. 171 + 172 + 8. How to work with private data? 173 + 174 + Homomorphic encryption? 175 + 176 + 9. How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) work? 177 + 178 + If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs]]! 179 + 180 + 10. How can a [[Knowledge Graphs]] [help with the data catalog](https://docs.atomicdata.dev/usecases/data-catalog.html)? 181 + 182 + It could help users connect datasets. With good enough core datasets, it could be used as an LLM backend. 170 183 171 - #### Computation 184 + 11. [What would a Substack for databases look like](https://tomcritchlow.com/2023/01/27/small-databases/)? 172 185 173 - - [Kamu](https://www.kamu.dev/). 174 - - [Bacalhau](https://www.bacalhau.org/). 175 - - [Holium](https://docs.holium.org/). 
An open source protocol dedicated to the management of data connected through transformations. Similar to Pachyderm but using WASM and IPFS. 176 - - [Ocean Protocol](https://oceanprotocol.com/technology/compute-to-data). 177 - - [The Graph](https://thegraph.com/). 178 - - [Trino](https://trino.io/). 186 + An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions. 187 + 188 + 12. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there? 189 + 190 + Indeed! With LLMs on the rise, community curated datasets become more important as they don't appear in the big data dumps. 191 + 192 + ## Related Projects 179 193 180 - #### Data Package Managers 194 + ### Data Package Managers 181 195 182 196 - [Qri](https://qri.io/). An evolution of the classical open portals that added [[Decentralized Protocols]] (IPFS) and computing on top of the data. Sadly, [it came to an end early in 2022](https://qri.io/winding_down). 183 - - [Datalad](https://www.datalad.org/). [Extended to IPFS](https://kinshukk.github.io/posts/gsoc-summary-and-future-thoughts/). 197 + - [Datalad](https://www.datalad.org/). [Extended to IPFS](https://kinshukk.github.io/posts/gsoc-summary-and-future-thoughts/) 184 198 - Is a [great tool](https://archive.fosdem.org/2020/schedule/event/open_research_datalad/) and uses Git Annex (distributed binary object tracking layer on top of git). 185 199 - Complicated to wrap your head around. Lots of different commands and concepts. On the other hand, it's very powerful and flexible. Git Annex is complex but powerful and flexible. 186 - - The handbook is very good, but it's a lot of reading if you just want to test things out. 187 - - [Huggingface Datasets](https://huggingface.co/docs/datasets). 188 - - [Quilt](https://github.com/quiltdata/quilt). 
189 - - Forces both Python and S3. 190 - - [Oxen](https://github.com/Oxen-AI/Oxen). 191 - - Data is not accesible from other tools. 192 - - [Docs](https://github.com/Oxen-AI/oxen-release#-oxen-release) are sparse. 193 - - Definitely more in the Git for Data space than Dataset Package Manager. 194 - - [Frictionless Data](https://frictionlessdata.io/projects/#software-and-standards). 195 - - [Datopian Data CLI](https://github.com/datopian/data-cli). Sucesor of [DPM](https://github.com/frictionlessdata/dpm-js). 196 - - [LakeFS](https://lakefs.io/blog/git-for-data/). More like Git for Data. 197 - - [Datasette](https://lite.datasette.io/). 198 - - [Algovera Metahub](https://github.com/AlgoveraAI/metahub). 199 - - [DVC](https://github.com/iterative/dvc). 200 - - [XVC](https://github.com/iesahin/xvc). 201 - - [ArtiVC](https://artivc.io/). 202 - - [Xetdata](https://xetdata.com/). 203 - - [Dud](https://github.com/kevin-hanselman/dud). 204 - - [Splitgraph](https://github.com/splitgraph/sgr). 205 - - [Deep Lake](https://github.com/activeloopai/deeplake). 206 - - [Dim](https://github.com/c-3lab/dim). 207 - - Hard to grok how to use it from the docs. 208 - - Quite small surface area. You can basically install datasets from URLs, create new ones, or apply some kind of GPT3 transformation on top of them. 209 - - [Juan Benet's data](https://github.com/jbenet/data). 210 - - [Colah's data](https://github.com/colah/data). 200 + - [Huggingface Datasets](https://huggingface.co/docs/datasets) 201 + - [Quilt](https://github.com/quiltdata/quilt) 202 + - Forces both Python and S3 203 + - [Oxen](https://github.com/Oxen-AI/Oxen) 204 + - Data is not accessible from other tools 205 + - [Docs](https://github.com/Oxen-AI/oxen-release#-oxen-release) are sparse 206 + - Definitely more in the Git for Data space than Dataset Package Manager 207 + - [Frictionless Data](https://frictionlessdata.io/projects/#software-and-standards) 208 + - [Datopian Data CLI](https://github.com/datopian/data-cli). 
Successor of [DPM](https://github.com/frictionlessdata/dpm-js) 209 + - [LakeFS](https://lakefs.io/blog/git-for-data/). More like Git for Data 210 + - [Datasette](https://lite.datasette.io/) 211 + - [Algovera Metahub](https://github.com/AlgoveraAI/metahub) 212 + - [DVC](https://github.com/iterative/dvc) 213 + - [XVC](https://github.com/iesahin/xvc) 214 + - [ArtiVC](https://artivc.io/) 215 + - [Xetdata](https://xetdata.com/) 216 + - [Dud](https://github.com/kevin-hanselman/dud) 217 + - [Splitgraph](https://github.com/splitgraph/sgr) 218 + - [Deep Lake](https://github.com/activeloopai/deeplake) 219 + - [Dim](https://github.com/c-3lab/dim) 220 + - Hard to grok how to use it from the docs 221 + - Quite small surface area. You can basically install datasets from URLs, create new ones, or apply some kind of GPT3 transformation on top of them 222 + - [Juan Benet's data](https://github.com/jbenet/data) 223 + - [Colah's data](https://github.com/colah/data) 211 224 - [Dolt](https://docs.dolthub.com/) is another interesting project in the space with some awesome data structures. They also [do data bounties](https://www.dolthub.com/repositories/dolthub/us-businesses)! 212 225 213 - ## Open Datasets 226 + ### Computation 227 + 228 + - [Kamu](https://www.kamu.dev/) 229 + - [Bacalhau](https://www.bacalhau.org/) 230 + - [Holium](https://docs.holium.org/) 231 + - [Ocean Protocol](https://oceanprotocol.com/technology/compute-to-data) 232 + - [The Graph](https://thegraph.com/) 233 + - [Trino](https://trino.io/) 234 + 235 + ### Large Open Datasets 214 236 215 237 - [Wikipedia](https://dumps.wikimedia.org/) 216 238 - [Github](https://www.gharchive.org/) ··· 286 308 287 309 ### Interesting Projects 288 310 289 - - [Rath](https://rath.kanaries.net/). 290 - - [Perspective](https://perspective.finos.org/). 291 - - [Rill Developer](https://github.com/rilldata/rill-developer). 292 - - [Datastation](https://app.datastation.multiprocess.io/). 
311 + - [Rath](https://rath.kanaries.net/) 312 + - [Perspective](https://perspective.finos.org/) 313 + - [Rill Developer](https://github.com/rilldata/rill-developer) 314 + - [Datastation](https://app.datastation.multiprocess.io/) 293 315 294 316 #### Datafile 295 317 ··· 334 356 - Spec file locator with fallback to the package registry. 335 357 - Versioning and latest versions. 336 358 - Asset checksums. 359 + 360 + ## Architecture 361 + 362 + ![Architecture](https://user-images.githubusercontent.com/1682202/224966685-b2406d5f-b162-4a93-a68a-af0afca45ebe.png) 363 + 364 + _[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_
