···66 - If analysis is not actionable, it does not really matter. Analysis must drive to action. [Clear results won't spur action themselves](https://www.linkedin.com/posts/eric-weber-060397b7_data-analytics-machinelearning-activity-6675746028144205824-CQxW/). The organization needs to be ready to pivot when something isn't working.
77 - [Data's impact is tough to measure — it doesn't always translate to value](https://dfrieds.com/articles/data-science-reality-vs-expectations.html).
88 - The Data Team should be building and iterating the [Data Product](https://locallyoptimistic.com/post/run-your-data-team-like-a-product-team/).
99-- Data is fundamentally a collaborative design process rather than a tool, an analysis, or even a product. [Data works best when the entire feedback loop from ideation to production is an iterative process](https://pedram.substack.com/p/data-can-learn-from-design).
99+- Data is fundamentally a collaborative design process rather than a tool, an analysis, or even a product. [Data works best when the entire feedback loop from idea to production is an iterative process](https://pedram.substack.com/p/data-can-learn-from-design).
1010 - [To get buy in, explain how the business could benefit from better data](https://youtu.be/Mlz1VwxZuDs) (e.g: more and better insights). Start small and show value.
1111 - Run *[Purpose Meetings](https://www.avo.app/blog/tracking-the-right-product-metrics)* or [Business Metrics Review](https://youtu.be/nlMn572Dabc).
1212 - Purpose Meetings are 30-minute meetings in which stakeholders, engineers, and the data team align on the goal of a release and on the best way to evaluate its impact and understand its success. Align on the goal, commit to metrics, and design the data.
+1
Knowledge Graphs.md
···1313 - It offers no protection against some team inside the company breaking the whole web by moving to a different URI or refactoring their domain model in incompatible ways.
1414 - For the Semantic Web to work, the infrastructure behind it needs to permanently keep all of the necessary sources that a file relies on. This could be a place where [[IPFS]] or other [[Decentralized Protocols]] could help!
1515 - It tends to assume that the world fits into neat categories. Instead, we live in a world where membership in categories is partial, probabilistic, contested (Pluto), and changes over time.
1616+- Knowledge graphs might be a great way to give AI a "world view".
1617- The status quo of the semantic web space is still SPARQL.
1718 - You can build [a knowledge graph database on top of a relational engine](https://twitter.com/RelationalAI).
1819- Knowledge Graphs act as a semantic layer.
+15-6
Open Data.md
···5959 - Rewarding dataset creators according to the usefulness of their datasets. E.g: [CommonCrawl built an amazing repository](https://commoncrawl.org/) that OpenAI has used for their GPT LLMs. Not sure how well CommonCrawl was compensated.
6060- **Open Source and Decentralized**. Datasets should be stored in multiple places.
6161 - Don't create yet another standard. Provide a way for people to integrate current indexers. Work on _adapters_ for different dataset sources. Similar to:
6262- - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
6363- - [Trustfall](https://github.com/obi1kenobi/trustfall).
6464- - Open source data integration projects like [Airbyte](https://airbyte.com/). They can used to build open data connectors making possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
6565- - Adapters are created by the community so data becomes connected.
6262+ - [Foreign Data Wrappers in PostgreSQL](https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
6363+ - [Trustfall](https://github.com/obi1kenobi/trustfall).
6464+ - Open source data integration projects like [Airbyte](https://airbyte.com/). They can be used to build open data connectors, making it possible to replicate something from `$RANDOM_SOURCE` (e.g: spreadsheets, Ethereum Blocks, URL, ...) to any destination.
6565+ - Adapters are created by the community so data becomes connected.
6666 - Integrate with the modern data stack to avoid reinventing the wheel.
6767 - Decentralize the computation (run it where the data lives) and then cache copies of the results (or aggregations) in CDNs. Most queries only need to read a small amount of data and are going to be similar.
6868···7575- **Distribution**. Decentralized. No central authority. Can work in a closed network. Cache/CDN friendly.
7676 - A data package is a URI ([like in Deno](https://deno.land/manual@v1.31.2/examples/manage_dependencies)). You can import from a URL (`data add example.com/dataset.yml` or `data add example.com/hub_curated_datasets.yml`).
7777 - As [Rufus Pollock puts it](https://datahub.io/docs/dms/notebook#go-modules-and-dependency-management-re-data-package-management-2020-05-16-rufuspollock), keep it as simple as possible. Store the table location and schema and get me the data on the hard disk fast.
7878+ - [Bootstrap a package registry](https://antonz.org/writing-package-manager/). E.g: a GitHub repository with lots of known datapackages that acts as a fallback and a quick way to get started with the tool.
7879- **Indexing**. Should be easy to list datasets matching a certain pattern or reading from a certain source.
7980 - Datasets could be linked to a [[Open Data#Datafile|Datafile]]/`datapackage.yml` with description, default visualizations, WASM linked code...
8081 - One repository, one dataset or catalog/hub.
8182 - To avoid yet another open dataset portal, build adapters to integrate with other indexes.
8282- - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
8383- - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
8383+ - For example, bring all HF datasets by making a simple PR on their repository that generates a `datapackage.yml` reusing their parquet files.
8484+ - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
8485- **Formatting**. Datasets should be saved and exposed in multiple formats (CSV, Parquet, ...). Could be done via WASM transformations or on the fly when pulling data. The package manager should be **format and storage agnostic**.
8586- **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
8687 - Importing datasets. Making it possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR).
···321322 primary_key: "country_name"
322323metadata: "..."
323324```
325325+
326326+#### Simple Package Manager Design
327327+
328328+- A package spec file describing a package.
329329+- A hierarchical owner/name folder structure for installed packages.
330330+- Spec file locator with fallback to the package registry.
331331+- Versioning and latest versions.
332332+- Asset checksums.
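
The five pieces above can be sketched in a few functions. This is a minimal illustration, not the design of any existing tool: the `spec.json` file name, the spec layout (`versions`, `assets`, `url`, `filename`, `sha256`), the in-memory `registry` dict, and the `fetch` callable are all hypothetical, and JSON is used instead of YAML only to stay within the Python standard library.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_version(versions: list[str]) -> str:
    """Pick the highest version from a list of dotted version strings."""
    return max(versions, key=parse_version)


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest used as the asset checksum."""
    return hashlib.sha256(data).hexdigest()


def package_dir(root: Path, package: str) -> Path:
    """Hierarchical owner/name folder for an installed package."""
    owner, name = package.split("/", 1)
    return root / owner / name


def locate_spec(package: str, root: Path, registry: dict[str, dict]) -> dict:
    """Look for a locally installed spec first, then fall back to the registry."""
    spec_path = package_dir(root, package) / "spec.json"  # hypothetical file name
    if spec_path.exists():
        return json.loads(spec_path.read_text())
    if package in registry:
        return registry[package]
    raise LookupError(f"no spec found for {package!r}")


def install(
    package: str,
    root: Path,
    registry: dict[str, dict],
    fetch: Callable[[str], bytes],
) -> Path:
    """Install the latest version of a package, verifying every asset checksum.

    `fetch` stands in for whatever downloads an asset URL (hypothetical).
    """
    spec = locate_spec(package, root, registry)
    version = latest_version(list(spec["versions"]))
    target = package_dir(root, package) / version
    target.mkdir(parents=True, exist_ok=True)
    for asset in spec["versions"][version]["assets"]:
        data = fetch(asset["url"])
        if sha256_of(data) != asset["sha256"]:
            raise ValueError(f"checksum mismatch for {asset['url']}")
        (target / asset["filename"]).write_bytes(data)
    # Keep the spec next to the installed versions so later lookups work offline.
    (package_dir(root, package) / "spec.json").write_text(json.dumps(spec))
    return target
```

With a registry entry for `user/data`, `install("user/data", Path("packages"), registry, fetch)` would write assets under `packages/user/data/<latest version>/`. Versions are compared as integer tuples so that `1.10.0` sorts after `1.9.3`, which plain string comparison would get wrong.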