···2020- [Feature Engineering Ideas](https://github.com/aikho/awesome-feature-engineering)
2121- [Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html). [Simple tutorial](https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics).
22222323+## Exploratory Data Analysis Resources
2424+2525+- [HiPlot](https://facebookresearch.github.io/hiplot/)
2626+2327### Scikit Learn Compatible Transformers
24282529- [LEGO](https://github.com/koaning/scikit-lego)
···4549- [Awesome Collection](https://github.com/MaxBenChrist/awesome_time_series_in_python)
4650- [Video with great ideas](https://www.youtube.com/watch?v=9QtL7m3YS9I)
4751- [Tutorial Kaggle Notebook](https://www.kaggle.com/code/tumpanjawat/s3e19-course-eda-fe-lightgbm)
4848-- Think about adding external datasets like [related Google Trends search](https://trends.google.com/trends/), PiPy Packages downloads, weather, ...
5252+- Think about adding external datasets like [related Google Trends searches](https://trends.google.com/trends/), PyPI package downloads, [Statista](https://www.statista.com/), weather, ...
49535054## Datathon Platforms
5155
+2-1
Decentralized Protocols.md
···2020 - A decentralized protocol can work with a centralized provider. It has the benefits of both (it can be fast without locking users in).
2121- A major downside of decentralized protocols/networks is that they tend to perform poorly. Hubs are efficient.
2222- [It's the properties decentralization gives us that we care about, not decentralization itself](https://haseebq.com/why-decentralization-isnt-as-important-as-you-think/). Decentralization is a global, emergent property. You can feel latency, you can feel transaction fees, but networks ostensibly feel the same whether they’re centralized or decentralized. Decentralization is valuable when it lets you do new things fundamentally better, not old things fundamentally worse.
2323+- Ultimately, [users don't care about decentralization](https://news.ycombinator.com/item?id=38694551). Most of the time, it doesn't matter if the service is distributed or comes from a single server sitting in someone's basement. Users want to use services (chat, write mails, watch videos, have a website, buy stuff, sell stuff) and not run infrastructure of any kind. Decentralization is a means to an end, not an end in itself.
2324- If a system requires a centralized part, a great alternative is to give the user the ability to point to other centralized things taking care of that part.
2425- If you have a protocol, try enforcing the desired behavior using the protocol. Your ideas of how to solve it might not be the best, and adding a protocol restriction (incentives/penalties) will let people figure it out.
2525-- When building a technology, consider: [does this centralize or decentralize power?](https://geohot.github.io/blog/jekyll/update/2021/01/18/technology-without-industry.html)
2626+- When building a technology, consider: [does this centralize or decentralize power?](https://geohot.github.io/blog/jekyll/update/2021/01/18/technology-without-industry.html)
+6
Large Language Models.md
···5858- [Generate structured data from text](https://thecaglereport.com/2023/03/16/nine-chatgpt-tricks-for-knowledge-graph-workers/).
5959- Make API requests to SQL Semantic Layers (less prone to errors and hallucinated metric definitions)
60606161+## Cool Prompts for DALLE 3
6262+6363+- For logo generation:
6464+ - A 2d, symmetrical, flat logo for a company working on `[SOMETHING]` that is sleek and simple. Blue and Green. No text.
6565+ - Minimalistic `[SOMETHING]` design logo from word parlatur, open data, banksy, protocol, universe, interplanetary, white background, illustration.
6666+6167### Resources
62686369- [Official GPT Guide](https://platform.openai.com/docs/guides/gpt-best-practices).
+17-10
Open Data.md
···11# Open Data
22+23_Make Open Data compatible with the Modern Data Ecosystem_.
3445## Motivation
···18191920Open protocols create open systems. Open code creates tools. **Open data creates open knowledge**. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets.
20212121-Iterative improvements over public datasets yield large amounts of value ([check how Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions.
2222+Iterative improvements over public datasets yield large amounts of value ([check how Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new businesses and make better decisions.
22232324Open Source code has made a huge impact in the world. Let's make Open Data do the same! Let's make it possible for [anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code](https://juan.benet.ai/blog/2014-02-21-data-management-problems/).
2425···28292930During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets).
30313131-These trends are already making it's way towards movements like [DeSci](https://ethereum.org/en/desci/) or smaller projects like [Py-Code Datasets](https://py-code.org/datasets). But, we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. **We should use similar tooling and approaches companies are using to manage the open datasets that surrounds us**. A sort of [Data Operating system](https://data-operating-system.com/).
3232+These trends are already making their way towards movements like [DeSci](https://ethereum.org/en/desci/) or smaller projects like [Py-Code Datasets](https://py-code.org/datasets). But we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. **We should use similar tooling and approaches companies are using to manage the open datasets that surround us**. A sort of [Data Operating System](https://data-operating-system.com/).
32333334Data wrangling is a perpetual maintenance commitment, taking a lot of ongoing attention and resources. [Better and modern data tooling can reduce these costs](https://github.com/catalyst-cooperative/pudl).
···8586 - For example, integrate all [Hugging Face datasets](https://huggingface.co/docs/datasets/index) by creating a scheduled job that builds a Frictionless Catalog (a bunch of `datapackage.yml`s pointing to their Parquet files).
8687 - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
8788 - [FAIR](https://www.go-fair.org/fair-principles/).
8888-- **Formatting**. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be **format and storage agnostic**. Give me the dataset with id `xyz` as a CSV in this folder.
8989+- **Formatting**. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be **format and storage agnostic**. Give me the dataset with id `xyz` as a CSV in this folder.
8990- **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
9091 - Importing datasets. Make it possible to `data fork user/data`, improve something, and publish the resulting dataset back (via something like a PR).
9192 - Have issues and discussions close to the dataset.
···127128 - [Templated validations to make sure datasets conform to certain standards](https://framework.frictionlessdata.io/docs/checks/baseline.html).
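As a rough illustration of the JSON-LD idea mentioned earlier for Google Dataset Search, here is a minimal Python sketch that renders schema.org `Dataset` metadata. The `dataset_jsonld` helper, the example dataset, and the `example.org` URLs are hypothetical; only the property names (`name`, `description`, `license`, `distribution`, ...) come from the public schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, url, license_url, files):
    """Build a schema.org Dataset description as a JSON-LD dict.

    `files` is a list of (content_url, encoding_format) tuples, one per
    downloadable file. All argument names are illustrative.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": content_url,
                "encodingFormat": encoding_format,
            }
            for content_url, encoding_format in files
        ],
    }


# Hypothetical dataset, used only to show the shape of the output.
metadata = dataset_jsonld(
    name="Example city bike trips",
    description="Daily bike trip counts for an imaginary city.",
    url="https://example.org/datasets/bike-trips",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    files=[
        ("https://example.org/datasets/bike-trips/data.csv", "text/csv"),
        ("https://example.org/datasets/bike-trips/data.json", "application/json"),
    ],
)

# The resulting JSON would be embedded in the dataset page inside a
# <script type="application/ld+json"> tag.
jsonld = json.dumps(metadata, indent=2)
print(jsonld)
```

A scheduled job could generate one such document per dataset from the same metadata that feeds the catalog, so the two never drift apart.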
128129129130### Consumption
130130-- **Accessible**. Datasets are **files**. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases.
131131+132132+- **Accessible**. Datasets are **files**. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases.
131133- **Documentation**. Surface derived work (e.g: reports, other datasets, ...).
132134- **Embedded Visualizations**. Know what's in there before downloading it.
133135 - **Sane Defaults**. Suggest basic charts (bars, lines, time series, clustering). Multiple [views](https://tech.datopian.com/views/).
···135137 - **Dynamic**. Use only the data you need. No need to pull 150GB.
136138- **Default APIs**. For some datasets, allowing REST API / GraphQL endpoints might be useful. Same with providing an SQL interface.
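
One way to read the "Datasets are files" and "Dynamic" points together: if a dataset is a static file behind an HTTP server that supports range requests, a client can fetch only the bytes it needs. A minimal Python sketch under that assumption; `build_range_request`, `fetch_range`, and the `example.org` URL are hypothetical names used for illustration.

```python
import urllib.request


def build_range_request(url, start, end):
    """Create a GET request for bytes start..end (inclusive) of a remote file."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


def fetch_range(url, start, end, timeout=30):
    """Download only the requested byte range.

    Servers that honor the header answer with 206 Partial Content; servers
    that ignore it answer 200 with the full body, so we check the status.
    """
    request = build_range_request(url, start, end)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 206:
            raise RuntimeError("server did not return a partial response")
        return response.read()


# Hypothetical URL; no network call happens here, we only build the request.
request = build_range_request("https://example.org/data/trips.parquet", 0, 1023)
print(request.get_header("Range"))
```

Columnar formats such as Parquet keep their metadata in a footer, which is what lets a query engine combine a few requests like this to read selected columns from a large remote file instead of downloading all of it.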
137139138138-## Frequently Asked Questions
140140+## Frequently Asked Questions
139141140140-> I'm not super clear on these answers! Please [reach out](https://davidgasquez.github.io/) if you want to chat about it.
142142+> I'm not super clear on these answers! Please [reach out](https://davidgasquez.github.io/) if you want to chat about it.
1411431421441. What would be a great use case to start with?
143145···175177176178Homomorphic encryption?
177179178178-9. How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) works?
180180+9. How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) work?
179181180182If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs]]!
···18518718618811. [What would a Substack for databases look like](https://tomcritchlow.com/2023/01/27/small-databases/)?
187189188188- An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions.
189189-190190+An easy tool for creating, maintaining, and publishing databases with the ability to restrict parts or all of it behind a paywall. Pair it with the ability to send email updates to your audience about changes and additions.
191191+19019212. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there?
191193192194Indeed! With LLMs on the rise, community curated datasets become more important as they don't appear in the big data dumps.
···248250- [World Bank](https://data.worldbank.org/indicator)
249251- [Ecosyste.ms](https://repos.ecosyste.ms/open-data)
250252- [Deps.dev](https://deps.dev/)
253253+- [Twitter Community Notes](https://twitter.com/i/communitynotes/download-data)
251254252255### Open Data Organizations
253256···269272- [Datahub](https://datahub.io/awesome)
270273- [HuggingFace Datasets](https://huggingface.co/datasets)
271274- [Data World](https://data.world/datasets/open-data)
275275+- [Statista](https://www.statista.com/)
272276- [Enigma](https://enigma.com/)
273277- [DoltHub](https://www.dolthub.com/discover)
274278- [Socrata](https://dev.socrata.com/)
···287291- [Open Data Inception](https://opendatainception.io/)
288292- [Victoriano's Data Sources](https://victorianoi.notion.site/Data-Sources-79b28912c6d941af99e6ef102c578fa0)
289293- [Data is Plural](https://www.data-is-plural.com/)
294294+- [Open Sustainable Technology](https://opensustain.tech/)
290295- [Public APIs](https://github.com/public-api-lists/public-api-lists)
291296- [Real Time Datasets](https://github.com/bytewax/awesome-public-real-time-datasets)
297297+- [Environmental Data Initiative](https://edirepository.org/)
298298+- [Data One](https://www.dataone.org/)
292299293300## Open Source Web Data IDE
294301···370377371378
373373-_[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_
380380+_[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_
+1
Teamwork.md
···115115 - Beware of [Normalization of Deviance](https://danluu.com/wat/).
116116- When meeting/emailing interesting people ask if they know anyone else you can meet with. [Try to expand your network with successful folks in the area/space!](https://twitter.com/AdamRy_n/status/1297920306900865024)
117117- Keep a [private work log](https://youtu.be/HiF83i1OLOM?list=PLYXaKIsOZBsu3h2SSKEovRn7rGy7wkUAV). It'll make it easier for everyone to advocate for what you did.
118118+- [Don't sabotage the team](https://erikbern.com/2023/12/13/simple-sabotage-for-software)!
118119119120## [How Small Teams Work](https://posthog.com/handbook/people/team-structure/why-small-teams#how-it-works)
120121