Abstract

In machine learning tasks, it is common to handle missing data by removing observations with missing values, or by replacing each missing value with the mean of its feature. To show why this is problematic, we apply listwise deletion and mean imputation to artificially created datasets with missing values, and we compare the resulting models against ones fitted with full information. Unless quite strong independence assumptions are met, we observe large biases in the resulting coefficients and an increase in the model’s prediction error. We include a set of recommendations for handling missing data safely, and a case study showing how to put those recommendations into practice.

Keywords: data science, missing data, imputation