:art: · davidgasquez.com/handbook@7977c73

+1 -1

Data/Data Culture.md

···
6 6 - If analysis is not actionable, it does not really matter. Analysis must drive to action. [Clear results won't spur action themselves](https://www.linkedin.com/posts/eric-weber-060397b7_data-analytics-machinelearning-activity-6675746028144205824-CQxW/). The organization needs to be ready to pivot when something isn't working.
7 7 - [Data's impact is tough to measure — it doesn't always translate to value](https://dfrieds.com/articles/data-science-reality-vs-expectations.html).
8 8 - The Data Team should be building and iterating the [Data Product](https://locallyoptimistic.com/post/run-your-data-team-like-a-product-team/).
9 - - Data is fundamentally a collaborative design process rather than a tool, an analysis, or even a product. [Data works best when the entire feedback loop from ideation to production is an iterative process](https://pedram.substack.com/p/data-can-learn-from-design).
9 + - Data is fundamentally a collaborative design process rather than a tool, an analysis, or even a product. [Data works best when the entire feedback loop from idea to production is an iterative process](https://pedram.substack.com/p/data-can-learn-from-design).
10 10 - [To get buy in, explain how the business could benefit from better data](https://youtu.be/Mlz1VwxZuDs) (e.g: more and better insights). Start small and show value.
11 11 - Run *[Purpose Meetings](https://www.avo.app/blog/tracking-the-right-product-metrics)* or [Business Metrics Review](https://youtu.be/nlMn572Dabc).
12 12 - Purpose Meetings are 30 min meetings in which stakeholders, engineers and data align on the goal of a release and what is the best way to evaluate the impact and understand its success. Align on the goal, commit on metrics and design the data.

+1

Knowledge Graphs.md

···
13 13 - It offers no protection against some team inside the company breaking the whole web by moving to a different URI or refactoring their domain model in incompatible ways.
14 14 - For the Semantic Web to work, the infrastructure behind it needs to permanently keep all of the necessary sources that a file relies on. This could be a place where [[IPFS]] or others [[Decentralized Protocols]] could help!
15 15 - It tends to assume that the world fits into neat categories. Instead, we live in a world where membership in categories is partial, probabilistic, contested (Pluto), and changes over time.
16 + - Knowledge graphs might be a great way to give AI a "world view".
16 17 - The status quo of the semantic web space is still SPARQL.
17 18 - You can build [a knowledge graph database on top of a relational engine](https://twitter.com/RelationalAI).
18 19 - Knowledge Graphs act as a semantic layer.

+15 -6

Open Data.md

···
59 59 - Rewarding the datasets creators according to the usefulness. E.g: [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for their GPTs LLMs. Not sure how well CommonCrawl was compensated.
60 60 - **Open Source and Decentralized**. Datasets should be stored in multiple places.
61 61 - Don't create yet another standard. Provide a way for people to integrate current indexers. Work on _adapters_ for different datasets sources. Similar to:
62 - - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
63 - - [Trustfall](https://github.com/obi1kenobi/trustfall).
64 - - Open source data integration projects like [Airbyte](https://airbyte.com/). They can used to build open data connectors making possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
65 - - Adapters are created by the community so data becomes connected.
62 + - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
63 + - [Trustfall](https://github.com/obi1kenobi/trustfall).
64 + - Open source data integration projects like [Airbyte](https://airbyte.com/). They can used to build open data connectors making possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
65 + - Adapters are created by the community so data becomes connected.
66 66 - Integrate with the modern data stack to avoid reinventing the wheel.
67 67 - Decentralized the computation (where data lives) and then cache copies of the results (or aggregations) in CDNs. Most queries require only reading a small amount of data and going to be similar.
68 68
···
75 75 - **Distribution**. Decentralized. No central authority. Can work in a closed network. Cache/CDN friendly.
76 76 - A data package is an URI ([like in Deno](https://deno.land/manual@v1.31.2/examples/manage_dependencies)). You can import from an URL (`data add example.com/dataset.yml` or `data add example.com/hub_curated_datasets.yml`).
77 77 - As [Rufus Pollock puts it](https://datahub.io/docs/dms/notebook#go-modules-and-dependency-management-re-data-package-management-2020-05-16-rufuspollock), Keep it as simple as possible. Store the table location and schema and get me the data on the hard disk fast.
78 + - [Bootstrap a package registry](https://antonz.org/writing-package-manager/). E.g: a GitHub repository with lots of known datapackages that acts as fallback and quick way to get started with the tool.
78 79 - **Indexing**. Should be easy to list datasets matching a certain pattern or reading from a certain source.
79 80 - Datasets could be linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with description, default visualizations, WASM linked code...
80 81 - One repository, one dataset or catalog/hub.
81 82 - To avoid yet another open dataset portal, build adapters to integrate with other indexes.
82 - - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
83 - - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
83 + - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
84 + - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
84 85 - **Formatting**. Datasets should be saved and exposed in multiple formats (CSV, Parquet, ...). Could be done via WASM transformations or in the fly when pulling data. The package manager should be **format and storage agnostic**.
85 86 - **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
86 87 - Importing datasets. Making possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR).
···
321 322 primary_key: "country_name"
322 323 metadata: "..."
323 324 ```
325 +
326 + #### Simple Package Manager Design
327 + #
328 + - A package spec file describing a package.
329 + - A hierarchical owner/name folder structure for installed packages.
330 + - Spec file locator with fallback to the package registry.
331 + - Versioning and latest versions.
332 + - Asset checksums.
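
The "Simple Package Manager Design" bullets added in this commit could be sketched roughly as below. This is only an illustration under stated assumptions, not code from the handbook: `REGISTRY`, `install_dir`, `locate_spec`, and `verify_checksum` are hypothetical names, the registry entry points at a placeholder URL, and treating only `http://`/`https://` strings as direct spec locations is an assumption.

```python
import hashlib
from pathlib import Path

# Hypothetical fallback registry mapping "owner/name" to a spec location.
# In the notes this could be a GitHub repository of known datapackages.
REGISTRY = {
    "example/countries": "https://example.com/countries/datapackage.yml",
}


def install_dir(root: Path, package: str, version: str = "latest") -> Path:
    """Hierarchical owner/name/version folder for an installed package."""
    owner, name = package.split("/", 1)
    return root / owner / name / version


def locate_spec(package: str, registry: dict = REGISTRY) -> str:
    """Return a spec location: a direct URL as-is, else fall back to the registry."""
    if package.startswith(("http://", "https://")):
        return package
    try:
        return registry[package]
    except KeyError:
        raise LookupError(f"no spec found for {package!r}") from None


def verify_checksum(path: Path, expected_sha256: str) -> bool:
    """Compare a downloaded asset against the SHA-256 listed in its spec."""
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256
```

Keeping the version as the last path segment means a `latest` folder and pinned versions can sit side by side under the same `owner/name` directory, which is one simple way to cover the "versioning and latest versions" bullet.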
