Data preparation is famously the least-loved aspect of Data Science. If done right, however, it needn’t be such a headache.
While scikit-learn has fallen out of vogue as a modelling library in recent years, given the meteoric rise of PyTorch, LightGBM, and XGBoost, it remains easily one of the best data preparation libraries out there.
And I’m not just talking about that old chestnut: train_test_split. If you’re prepared to dig a little deeper, you’ll find a treasure trove of helpful tools for more advanced data preparation techniques, all of which work seamlessly alongside other libraries like lightgbm, xgboost and catboost for subsequent modelling.
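For anyone who needs a refresher, here's that old chestnut in a minimal sketch (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples with 2 features each, plus labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; fix random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # → (8, 2) (2, 2)
```

The arrays that come out the other side are plain NumPy arrays (or pandas objects, if that's what you passed in), which is exactly why the outputs slot straight into any downstream modelling library.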
In this article, I’ll walk through four scikit-learn classes which significantly speed up my data preparation workflows in my day-to-day job as a Data Scientist.