:art: · davidgasquez.com/handbook@6979a8d

+10 -7

2 changed files

Datathons.md

IPFS.md

+9 -7

Datathons.md

···
5 5 1. Learn more about the problem. Search for similar Kaggle competitions. Check the task in [Papers with Code](https://paperswithcode.com/).
6 6 2. Do a basic data exploration. Try to understand the problem and gather a sense of what can be important.
7 7 3. Get baseline model working.
8 - 4. Create `scikit-learn` compatible metric if needed.
8 + 4. Design an evaluation method as close as the final evaluation. Plot local evaluation metrics against the public ones (correlation) to validate how well your validation strategy works.
9 9 5. Try different approaches for preprocessing (encodings, Deep Feature Synthesis, lags, aggregations, imputers, ...). If you're working as a group, split preprocessing feature generation between files.
10 - 6. Plot learning curves ([sklearn](https://scikit-learn.org/stable/modules/learning_curve.html) or [external tools](https://github.com/reiinakano/scikit-plot)) to avoid overfitting.
11 - 7. Tune hyper-parameters once you've settled on an specific approach. ([optuna](https://optuna.readthedocs.io/)).
12 - 8. Plot and visualize the predictions (histograms, random prediction, ...) to make sure they're doing as expected. Explain the predictions with [SHAP](https://github.com/slundberg/shap).
13 - 9. Think about what postprocessing heuristics can be done to improve or correct predictions.
14 - 10. [Stack](https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html) classifiers ([example](https://www.kaggle.com/couyang/featuretools-sklearn-pipeline#ML-Pipeline)).
15 - 11. Try AutoML models. For tabular data: [TPOT](https://github.com/EpistasisLab/tpot), [AutoSklearn](https://github.com/automl/auto-sklearn), [AutoGluon](https://auto.gluon.ai/stable/index.html), Google AI Platform, [PyCaret](https://github.com/pycaret/pycaret), [Fast.ai](https://docs.fast.ai/), [Alex](https://github.com/Alex-Lekov/AutoML_Alex).For time series: [AtsPy](https://github.com/firmai/atspy), [DeepAR](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html).
10 + 6. Plot learning curves ([sklearn](https://scikit-learn.org/stable/modules/learning_curve.html) or [external tools](https://github.com/reiinakano/scikit-plot)) to avoid overfitting.
11 + 7. Plot real and predicted target distribution to see how well your model understand the underlying distribution. Apply any postprocessing that might fix small things.
12 + 8. Tune hyper-parameters once you've settled on an specific approach ([hyperopt](target distribution), [optuna](https://optuna.readthedocs.io/)).
13 + 9. Plot and visualize the predictions (histograms, random prediction, ...) to make sure they're doing as expected. Explain the predictions with [SHAP](https://github.com/slundberg/shap).
14 + 10. Think about what postprocessing heuristics can be done to improve or correct predictions.
15 + 11. [Stack](https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html) classifiers ([example](https://www.kaggle.com/couyang/featuretools-sklearn-pipeline#ML-Pipeline)).
16 + 12. Try AutoML models. For tabular data: [TPOT](https://github.com/EpistasisLab/tpot), [AutoSklearn](https://github.com/automl/auto-sklearn), [AutoGluon](https://auto.gluon.ai/stable/index.html), Google AI Platform, [PyCaret](https://github.com/pycaret/pycaret), [Fast.ai](https://docs.fast.ai/), [Alex](https://github.com/Alex-Lekov/AutoML_Alex).For time series: [AtsPy](https://github.com/firmai/atspy), [DeepAR](https://docs.aws.amazon.com/forecast/latest/dg/aws-forecast-recipe-deeparplus.html).
16 17
17 18 ## Preprocessing Resources
18 19
···
43 44 - [Sktime](https://github.com/alan-turing-institute/sktime) / [Aeon](https://github.com/aeon-toolkit/aeon)
44 45 - [Awesome Collection](https://github.com/MaxBenChrist/awesome_time_series_in_python)
45 46 - [Video with great ideas](https://www.youtube.com/watch?v=9QtL7m3YS9I)
47 + - [Tutorial Kaggle Notebook](https://www.kaggle.com/code/tumpanjawat/s3e19-course-eda-fe-lightgbm)
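
Review note on the new step 4: the "local vs public (correlation)" check could be sketched roughly as below. This is only an illustration, assuming you have logged a local validation score and a public leaderboard score for each submission. The score values are made up, and `pearson` is a hypothetical helper written here, not something from the handbook or from any library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must not be constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five submissions (illustration only).
local_cv = [0.71, 0.74, 0.78, 0.80, 0.83]
public_lb = [0.69, 0.73, 0.75, 0.79, 0.80]

r = pearson(local_cv, public_lb)
print(f"local vs public correlation: {r:.3f}")
```

A correlation near 1 suggests that improvements measured locally tend to carry over to the public score. A low or negative value is a hint that the local validation split does not resemble the final evaluation.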

IPFS.md

···
2 2
3 3 - It's a file system with [content based addressing](https://www.youtube.com/watch?v=5Uj6uR3fp-U).
4 4 - Files are automatically deduplicated.
5 + - [It chunks, hashes and organizes blobs in a smart way](https://docs.google.com/presentation/d/1Gx8vSqrWZ7X-3SCgITXqQdinZQeXIAA7ITqL25SsPN8/edit#slide=id.g741b4d76cd_0_13).
5 6 - Once something is added, it can't be changed anymore.
6 7 - IPFS supports versioning using commits.
7 8 - Keeping files available is a challenge. If the nodes storing a file go down, it'll disappear from the network. Filecoin can help with this adding incentives to the equation.
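
Review note on the IPFS bullets: the ideas of content-based addressing, chunking and deduplication can be shown with a toy in-memory store. This is a simplified sketch and not how IPFS itself is implemented. The fixed 4-byte chunk size, the newline-separated manifest, the plain SHA-256 hex keys and the `ToyStore` class are all assumptions made for the demo.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the demo data splits into several chunks


class ToyStore:
    """Toy content-addressed store: every block is keyed by its SHA-256 digest."""

    def __init__(self):
        self.blocks = {}  # hex digest -> raw bytes

    def put_block(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content always maps to the same key, so it is stored once.
        self.blocks[digest] = data
        return digest

    def add(self, data: bytes) -> str:
        """Split data into chunks, store them, and return the manifest's address."""
        chunk_ids = [
            self.put_block(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)
        ]
        manifest = "\n".join(chunk_ids).encode()
        return self.put_block(manifest)

    def get(self, root: str) -> bytes:
        """Rebuild the original bytes from a manifest address."""
        manifest = self.blocks[root].decode()
        return b"".join(self.blocks[c] for c in manifest.split("\n") if c)


store = ToyStore()
first = store.add(b"aaaabbbbaaaa")
again = store.add(b"aaaabbbbaaaa")   # same content, same address
blocks_after_duplicate = len(store.blocks)
other = store.add(b"aaaabbbbcccc")   # shares two chunks with the first file

print(first == again)
print(blocks_after_duplicate, len(store.blocks))
```

Because the address is derived from the bytes, changing a file produces a new address instead of modifying the old one, which is the sense in which added content "can't be changed anymore". The second file only adds one new chunk and one new manifest to the store.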
