Missing Data Demystified: The Absolute Primer for Data Scientists

Earlier this year, I started a piece on several data quality issues (or characteristics) that heavily compromise our machine learning models.

One of them was, unsurprisingly, Missing Data.

I’ve been studying this topic for many years now (I know, right?!), but across some of the projects I contribute to in the Data-Centric Community, I realized that many data scientists still haven’t grasped the full complexity of the problem, which inspired me to create this comprehensive tutorial.

Today, we will delve into the intricacies of the problem of missing data, discover the different types of missing data we may find in the wild, and explore how we can identify and mark missing values in real-world datasets.

The Problem of Missing Data

Missing Data is an interesting data imperfection since it may arise naturally due to the nature of the domain, or be inadvertently created during data collection, transmission, or processing.

In essence, missing data is characterized by absent values in the data, i.e., values missing from some records or observations in the dataset. It can be either univariate (one feature has missing values) or multivariate (several features have missing values).
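To make the univariate versus multivariate distinction concrete, here is a minimal sketch using pandas. The helper name `missingness_pattern` and the toy columns (`age`, `income`) are made up for illustration and are not part of any library; the sketch also assumes that missing values are already encoded as `NaN`, which is not always the case in real-world datasets.

```python
import numpy as np
import pandas as pd


def missingness_pattern(df: pd.DataFrame) -> str:
    """Return 'complete', 'univariate', or 'multivariate' for a DataFrame.

    Hypothetical helper: it only counts how many features (columns)
    contain at least one value that pandas recognizes as missing.
    """
    n_features_with_missing = int(df.isna().any(axis=0).sum())
    if n_features_with_missing == 0:
        return "complete"
    if n_features_with_missing == 1:
        return "univariate"
    return "multivariate"


# Toy data, invented purely for illustration.
univariate_df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],        # the only feature with a missing value
    "income": [3200, 4100, 2900, 5000],
})
multivariate_df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],        # missing in the second record
    "income": [3200, 4100, np.nan, 5000],  # missing in the third record
})

print(missingness_pattern(univariate_df))
print(missingness_pattern(multivariate_df))

# Per-feature count of missing values, handy as a first inspection step.
print(multivariate_df.isna().sum())
```

Counting per feature rather than per record is a deliberate choice here: the definition above is about how many features are affected, not how many observations.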
