Dealing with Missing Values in Datasets
Author: Jimmy Rousseau | Published: 8/28/2023

How missing values can affect learning

Most machine learning libraries will throw errors if you try to build a model using data that contains missing values. Here I will outline the main approaches to dealing with them.

Let's start with the example from Kaggle Learning:
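Below is a minimal sketch of that setup, following the lesson: load the Melbourne Housing data, keep only the numeric predictors, and split off a validation set. The melb_data.csv path is a placeholder; point it at wherever your copy of the data lives.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data (path is a placeholder for your local copy)
data = pd.read_csv('melb_data.csv')

# Select the prediction target
y = data.Price

# To keep things simple, use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide the data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)
```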

Next, create a helper function to measure the quality of the predictions from each approach.
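A sketch of such a helper, modeled on the lesson's score_dataset function: it fits a random forest on the training data and returns the mean absolute error (MAE) of its predictions on the validation set. The model settings (n_estimators=10, random_state=0) are taken from the lesson; treat them as assumptions rather than recommendations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Score one approach: train a model on X_train and measure MAE on X_valid
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```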

A Simple Option: Drop Columns with Missing Values

The easiest solution is to simply drop any column that contains missing values. This is only a good idea when most of the values in that column are missing; otherwise we may be throwing away important information our algorithm could learn from.
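A sketch of this approach, assuming the X_train/X_valid split and score_dataset helper from above. Note that the same columns must be dropped from both the training and validation data:

```python
# Names of columns in the training data with at least one missing value
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop those columns from both the training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
```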

MAE from Approach 1 (Drop columns with missing values): 183550.22137772635

A Better Option: Imputation

Here we fill in the missing values with some number, most commonly the column's mean. The imputed value usually won't be exactly right, but this generally leads to more accurate models than dropping the column altogether.
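A sketch using scikit-learn's SimpleImputer, whose default strategy fills each missing entry with the mean of its column. The imputer is fit on the training data only, then applied to both splits:

```python
from sklearn.impute import SimpleImputer

# Fill each missing entry with the column mean (SimpleImputer's default)
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# fit_transform returns a plain array, so restore the column names
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
```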

MAE from Approach 2 (Imputation): 178166.46269899711

An Extension to Imputation

Sometimes imputation leaves the filled-in values systematically above or below the actual values (which, perhaps, were simply never collected), or the rows with missing values are unique in some other way. In that case we can extend imputation by adding a new column that tags whether each row's value had to be filled in. This gives the algorithm some indication that those values are different or special, as a new feature to learn from.
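A sketch of the extension, reusing the cols_with_missing list from Approach 1: before imputing, add a boolean column per affected feature marking which rows were originally missing. The _was_missing suffix is just a naming convention from the lesson:

```python
from sklearn.impute import SimpleImputer

# Work on copies so the original splits are untouched
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Tag which rows were missing in each affected column
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Impute as before, then restore the column names
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
```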

MAE from Approach 3 (An Extension to Imputation): 178927.503183954

Note that on this dataset the extension actually scored slightly worse than plain imputation; whether the indicator columns help depends on the data.

The information for this article comes from the Kaggle Learning lesson "Missing Values" by Alexis Cook.