How missing values can affect learning
Most machine learning libraries (including scikit-learn) will throw an error if you try to build a model using data with missing values. Here I will outline the main approaches to dealing with them.
Let's start with the example from Kaggle Learn:
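A minimal sketch of the setup, following the Kaggle lesson: it assumes the Melbourne Housing Snapshot file melb_data.csv, with Price as the prediction target and only the numerical columns kept as predictors.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data (the Kaggle lesson uses the Melbourne Housing Snapshot)
data = pd.read_csv('melb_data.csv')

# Select the prediction target
y = data.Price

# To keep things simple, use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide the data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)
```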
Create a helper function to measure the quality of predictions for each approach.
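One way to write such a helper, as in the Kaggle lesson: fit a random forest on the training data and score it by mean absolute error (MAE) on the held-out validation data.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Fit a random forest on the training data and report the
# mean absolute error of its predictions on the validation data
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```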
A Simple Option: Drop Columns with Missing Values
The easiest solution is simply to drop any column that contains missing values. This only works well when most of the values in the column are missing; otherwise we risk throwing away important information our algorithm could learn from.
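A sketch of this approach with pandas, assuming the X_train/X_valid split and the score_dataset helper from above:

```python
# Names of the columns with missing values in the training data
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop those columns from both the training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
```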
MAE from Approach 1 (Drop columns with missing values): 183550.22137772635
A Better Option: Imputation
Here we fill in the missing values with some number, most commonly the mean of the column. This usually leads to more accurate models than dropping the columns altogether.
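A sketch using scikit-learn's SimpleImputer, whose default strategy replaces each missing value with the mean of its column:

```python
from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of its column (the default strategy)
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation stripped the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
```

Note that the imputer is fit on the training data only and then applied to the validation data, so no information leaks from the validation set into the column means.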
MAE from Approach 2 (Imputation): 178166.46269899711
Extending Imputation
Imputed values may end up systematically above or below the actual values (which, perhaps, simply were not collected in the dataset), or rows with missing values may be unique in some other way. In that case we can extend imputation by adding a new column that flags which rows had values filled in, as shown below. This gives the algorithm an indication that those values are different or special in some way, with a new feature to learn from.
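A sketch of the extension, reusing cols_with_missing from Approach 1; the boolean "_was_missing" columns are added before imputing:

```python
# Make copies so the original data is left untouched
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# For each column with missing values, add a column marking which rows were missing
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Impute as before
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
```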
MAE from Approach 3 (An Extension to Imputation): 178927.503183954
Interestingly, the extension performed slightly worse than plain imputation here; which approach works best varies from dataset to dataset.
Information for this article is from the Kaggle Learn lesson Missing Values, by Alexis Cook.