:art: · davidgasquez.com/handbook@13ea853

+5 -1

Datathons.md

@@ -20,6 +20,10 @@
 - [Feature Engineering Ideas](https://github.com/aikho/awesome-feature-engineering)
 - [Deep Feature Synthesis](https://featuretools.alteryx.com/en/stable/getting_started/afe.html). [Simple tutorial](https://www.kaggle.com/willkoehrsen/automated-feature-engineering-basics).
 
+## Exploratory Data Analysis Resources
+
+- [HiPlot](https://facebookresearch.github.io/hiplot/)
+
 ### Scikit Learn Compatible Transformers
 
 - [LEGO](https://github.com/koaning/scikit-lego)
@@ -45,7 +49,7 @@
 - [Awesome Collection](https://github.com/MaxBenChrist/awesome_time_series_in_python)
 - [Video with great ideas](https://www.youtube.com/watch?v=9QtL7m3YS9I)
 - [Tutorial Kaggle Notebook](https://www.kaggle.com/code/tumpanjawat/s3e19-course-eda-fe-lightgbm)
-- Think about adding external datasets like [related Google Trends search](https://trends.google.com/trends/), PiPy Packages downloads, weather, ...
+- Think about adding external datasets like [related Google Trends search](https://trends.google.com/trends/), PiPy Packages downloads, [Statista](https://www.statista.com/), weather, ...
 
 ## Datathon Platforms
 
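The changed bullet suggests enriching competition data with external datasets (Google Trends, weather, ...). A minimal sketch of what that could look like, assuming both sources share a daily `date` key. The `sales` and `weather` values and the `add_external_features` helper are hypothetical and only illustrate a left join, not anything from the handbook itself:

```python
from datetime import date

# Hypothetical competition data: one row per day with the target to predict.
sales = [
    {"date": date(2023, 7, 1), "units": 120},
    {"date": date(2023, 7, 2), "units": 95},
    {"date": date(2023, 7, 3), "units": 130},
]

# Hypothetical external dataset (e.g. a weather export) keyed by the same date.
weather = {
    date(2023, 7, 1): {"temp_c": 31.0},
    date(2023, 7, 3): {"temp_c": 27.5},
}


def add_external_features(rows, external, key="date", prefix="ext_"):
    """Left-join external columns onto rows; missing keys become None."""
    # Collect every column name that appears anywhere in the external data.
    columns = sorted({col for values in external.values() for col in values})
    enriched = []
    for row in rows:
        extra = external.get(row[key], {})
        new_row = dict(row)  # copy so the original rows stay untouched
        for col in columns:
            new_row[prefix + col] = extra.get(col)
        enriched.append(new_row)
    return enriched


features = add_external_features(sales, weather)
```

Keeping the join a left join means no training row is dropped when the external source has gaps; the resulting `None` values can then be imputed like any other missing feature.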

+2 -1

Decentralized Protocols.md

@@ -20,6 +20,7 @@
 - A decentralized protocol can work with a centralized provider. It has the benefits of both (might be fast but no lock users in).
 - A major downside of decentralized protocols/networks is that they tend to perform poorly. Hubs are efficient.
 - [It's the properties decentralization gives us that we care about, not decentralization itself](https://haseebq.com/why-decentralization-isnt-as-important-as-you-think/). Decentralization is a global, emergent property. You can feel latency, you can feel transaction fees, but networks ostensibly feel the same whether they’re centralized or decentralized. Decentralization is valuable when it lets you do new things fundamentally better, not old things fundamentally worse.
+- Ultimately, [users don't care about decentralization](https://news.ycombinator.com/item?id=38694551). Most of the time, it doesn't matter if the service is distributed or comes from a single server sitting in someone's basement. Users want to use services (chat, write mails, watch videos, have a website, buy stuff, sell stuff) and not run infrastructure of any kind. Decentralization is a means to an end, not an end in itself.
 - If a system requires a centralized part, a great alternative is give the user the ability to point to other centralized things taking care of that part.
 - If you have a protocol, try enforcing the desired behavior using the protocol. Your ideas of how to solve it might not be the best and adding a protocol restriction (incentives/penalties) will make people figure out.
-- When building a technology, consider: [does this centralize or decentralize power?](https://geohot.github.io/blog/jekyll/update/2021/01/18/technology-without-industry.html)
+- When building a technology, consider: [does this centralize or decentralize power?](https://geohot.github.io/blog/jekyll/update/2021/01/18/technology-without-industry.html)

+6

Large Language Models.md

@@ -58,6 +58,12 @@
 - [Generate structured data from text](https://thecaglereport.com/2023/03/16/nine-chatgpt-tricks-for-knowledge-graph-workers/).
 - Do API request to SQL Semantic Layers (less prone for errors or hallucinating metric definitions)
 
+## Cool Prompts for DALLE 3
+
+- For logo generation:
+- A 2d, symmetrical, flat logo for a company working on `[SOMETHING]` that is sleek and simple. Blue and Green. No text.
+- Minimalistic `[SOMETHING]` design logo from word parlatur, open data, banksy, protocol, universe, interplanetary, white background, illustration.
+
 ### Resources
 
 - [Official GPT Guide](https://platform.openai.com/docs/guides/gpt-best-practices).

+17 -10

Open Data.md

@@ -1,4 +1,5 @@
 # Open Data
+
 _Make Open Data compatible with the Modern Data Ecosystem_.
 
 ## Motivation
@@ -18,7 +19,7 @@
 
 Open protocols create open systems. Open code creates tools. **Open data creates open knowledge**. We need better tools, protocols, and mechanisms to improve the Open Data ecosystem. It should be easy to find, download, process, publish, and collaborate on open datasets.
 
-Iterative improvements over public datasets yield large amounts of value ([check how Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions.
+Iterative improvements over public datasets yield large amounts of value ([check how Dune did it with blockchain data](https://dune.com/blog/the-community-data-platform))¹. Access to data gives people the opportunity to create new business and make better decisions.
 
 Open Source code has made a huge impact in the world. Let's make Open Data do the same! Let's make it possible for [anyone to fork and re-publish fixed, cleaned, reformatted datasets as easily as we do the same things with code](https://juan.benet.ai/blog/2014-02-21-data-management-problems/).
 
@@ -28,7 +29,7 @@
 
 During the last few years, a large number of new data and open source tools have emerged. There are new query engines (e.g: DuckDB, DataFusion, ...), execution frameworks (WASM), data standards (Arrow, Parquet, ...), and a growing set of open data marketplaces (Datahub, HuggingFace Datasets, Kaggle Datasets).
 
-These trends are already making it's way towards movements like [DeSci](https://ethereum.org/en/desci/) or smaller projects like [Py-Code Datasets](https://py-code.org/datasets). But, we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. **We should use similar tooling and approaches companies are using to manage the open datasets that surrounds us**. A sort of [Data Operating system](https://data-operating-system.com/).
+These trends are already making it's way towards movements like [DeSci](https://ethereum.org/en/desci/) or smaller projects like [Py-Code Datasets](https://py-code.org/datasets). But, we still need more tooling around data to improve interoperability as much as possible. Lots of companies have figured out how to make the most of their datasets. **We should use similar tooling and approaches companies are using to manage the open datasets that surrounds us**. A sort of [Data Operating system](https://data-operating-system.com/).
 
 Data wrangling is a perpetual maintenance commitment, taking a lot of ongoing attention and resources. [Better and modern data tooling can reduce these costs](https://github.com/catalyst-cooperative/pudl).
 
@@ -85,7 +86,7 @@
 - For example, integrate all [Hugging Face datasets](https://huggingface.co/docs/datasets/index) by making an scheduled job that builds a Frictionless Catalog (bunch of `datapackage.yml`s pointing to their parquet files).
 - [Expose a JSON-LD so Google Dataset Search can index it](https://developers.google.com/search/docs/appearance/structured-data/dataset).
 - [FAIR](https://www.go-fair.org/fair-principles/).
-- **Formatting**. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be **format and storage agnostic**. Give me the dataset with id `xyz` as a CSV in this folder.
+- **Formatting**. Datasets are saved and exposed in multiple formats (CSV, Parquet, ...). Could be done in the backend, or in the client when pulling data (WASM). The package manager should be **format and storage agnostic**. Give me the dataset with id `xyz` as a CSV in this folder.
 - **Social**. Allow users, organizations, stars, citations, attaching default visualizations (d3, [Vega](https://vega.github.io/), [Vegafusion](https://github.com/vegafusion/vegafusion/), and others), ...
 - Importing datasets. Making possible to `data fork user/data`, improve something and publish the resulting dataset back (via something like a PR).
 - Have issues and discussions close to the dataset.
@@ -127,7 +128,8 @@
 - [Templated validations to make sure datasets conform to certain standards](https://framework.frictionlessdata.io/docs/checks/baseline.html).
 
 ### Consumption
-- **Accessible**. Datasets are **files**. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases.
+
+- **Accessible**. Datasets are **files**. Datasets are static assets living somewhere. Don't get in the middle with libraries or gated databases.
 - **Documentation**. Surface derived work (e.g: reports, other datasets, ...).
 - **Embedded Visualizations**. Know what's in there before downloading it.
 - **Sane Defaults**. Suggest basic charts (bars, lines, time series, clustering). Multiple [views](https://tech.datopian.com/views/).
@@ -135,9 +137,9 @@
 - **Dynamic**. Use only the data you need. No need to pull 150GB.
 - **Default APIs**. For some datasets, allowing REST API / GraphQL endpoints might be useful. Same with providing an SQL interface.
 
-## Frequently Asked Questions
+## Frequently Asked Questions
 
-> I'm not super clear on these answers! Please [reach out](https://davidgasquez.github.io/) if you want to chat about it.
+> I'm not super clear on these answers! Please [reach out](https://davidgasquez.github.io/) if you want to chat about it.
 
 1. What would be a great use case to start with?
 
@@ -175,7 +177,7 @@
 
 Homomorphic encryption?
 
-9. How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) works?
+9. How could something like [Ver](https://raulcastrofernandez.com/data-discovery-updates/) works?
 
 If you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This probably needs a [[Knowledge Graphs]]!
 
@@ -185,8 +187,8 @@
 
 11. [How would a Substack for databases look like](https://tomcritchlow.com/2023/01/27/small-databases/)?
 
-An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions.
-
+An easy tool for creating, maintaining and publishing databases with the ability to restrict parts or all of it behind a pay wall. Pair it with the ability to send email updates to your audience about changes and additions.
+
 12. Curated and small data (e.g: at the community level) is not reachable by Google. How can we help there?
 
 Indeed! With LLMs on the rise, community curated datasets become more important as they don't appear in the big data dumps.
@@ -248,6 +250,7 @@
 - [World Bank](https://data.worldbank.org/indicator)
 - [Ecosyste.ms](https://repos.ecosyste.ms/open-data)
 - [Deps.dev](https://deps.dev/)
+- [Twitter Community Notes](https://twitter.com/i/communitynotes/download-data)
 
 ### Open Data Organizations
 
@@ -269,6 +272,7 @@
 - [Datahub](https://datahub.io/awesome)
 - [HuggingFace Datasets](https://huggingface.co/datasets)
 - [Data World](https://data.world/datasets/open-data)
+- [Statista](https://www.statista.com/)
 - [Enigma](https://enigma.com/)
 - [DoltHub](https://www.dolthub.com/discover)
 - [Socrata](https://dev.socrata.com/)
@@ -287,8 +291,11 @@
 - [Open Data Inception](https://opendatainception.io/)
 - [Victoriano's Data Sources](https://victorianoi.notion.site/Data-Sources-79b28912c6d941af99e6ef102c578fa0)
 - [Data is Plural](https://www.data-is-plural.com/)
+- [Open Sustainable Technology](https://opensustain.tech/)
 - [Public APIs](https://github.com/public-api-lists/public-api-lists)
 - [Real Time Datasets](https://github.com/bytewax/awesome-public-real-time-datasets)
+- [Environmental Data Initiative](https://edirepository.org/)
+- [Data One](https://www.dataone.org/)
 
 ## Open Source Web Data IDE
 
@@ -370,4 +377,4 @@
 
 ![Architecture](https://user-images.githubusercontent.com/1682202/224966685-b2406d5f-b162-4a93-a68a-af0afca45ebe.png)
 
-_[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_
+_[Edit on Excalidraw](https://excalidraw.com/#json=RLkinyHZE-4Px_cl21UDI,z8D-l20khdaB-lRumpzN7w)_
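One of the context lines in this file mentions exposing JSON-LD so that Google Dataset Search can index a dataset. A minimal sketch of such a descriptor, assuming the schema.org `Dataset` vocabulary. The `dataset_jsonld` helper, the dataset name, and every `example.org` URL are placeholders made up for illustration, and the linked Google guidelines remain the place to check which fields are currently required:

```python
import json


def dataset_jsonld(name, description, url, downloads):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    `downloads` maps an encoding format (e.g. "text/csv") to a file URL.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # One DataDownload entry per available file format.
        "distribution": [
            {"@type": "DataDownload", "encodingFormat": fmt, "contentUrl": link}
            for fmt, link in sorted(downloads.items())
        ],
    }


# All values below are placeholders, not a real dataset.
doc = dataset_jsonld(
    name="Example open dataset",
    description="A placeholder description of what the dataset contains.",
    url="https://example.org/datasets/xyz",
    downloads={
        "text/csv": "https://example.org/datasets/xyz.csv",
        "application/json": "https://example.org/datasets/xyz.json",
    },
)

# The serialized document would be embedded in the dataset's page inside
# a <script type="application/ld+json"> element.
print(json.dumps(doc, indent=2))
```

Listing one `DataDownload` per format also fits the "format agnostic" idea from the same hunk: the descriptor advertises every representation of the dataset and lets the client pick.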

+1

Teamwork.md

@@ -115,6 +115,7 @@
 - Beware of [Normalization of Deviance](https://danluu.com/wat/).
 - When meeting/emailing interesting people ask if they know anyone else you can meet with. [Try to expand your network with successful folks in the area/space!](https://twitter.com/AdamRy_n/status/1297920306900865024)
 - Keep a [private work log](https://youtu.be/HiF83i1OLOM?list=PLYXaKIsOZBsu3h2SSKEovRn7rGy7wkUAV). It'll make easier for everyone to advocate what you did.
+- [Don't sabotage the team](https://erikbern.com/2023/12/13/simple-sabotage-for-software)!
 
 ## [How Small Teams Work](https://posthog.com/handbook/people/team-structure/why-small-teams#how-it-works)
 
