# How to impute missing values in training datasets

I have a training dataset in which some feature values are either blank or NaN. The dataset is a NumPy matrix. How can I replace those missing values with suitable values?


Imputing noisy or missing feature values is still an open research question, but there are established methods you can use. scikit-learn's sklearn.impute module provides both univariate and multivariate imputers.
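For the multivariate case, here is a minimal sketch using IterativeImputer, which estimates each missing value from the other columns. The array X_missing and the name mv_imp are made up for illustration, and IterativeImputer is still marked experimental in scikit-learn, so treat this as a starting point rather than a recommended setup.

```python
import numpy as np
# IterativeImputer is experimental and has to be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical training matrix with two missing entries.
X_missing = np.array([[1.0, 2.0, np.nan],
                      [3.0, 4.0, 3.0],
                      [np.nan, 6.0, 5.0],
                      [8.0, 8.0, 7.0]])

# Each column with missing values is modelled as a function of the others.
mv_imp = IterativeImputer(max_iter=10, random_state=0)
X_filled = mv_imp.fit_transform(X_missing)

# Observed entries are kept; only the NaN positions are filled in.
print(X_filled)
```

Whether this beats a simple column statistic depends on how strongly the features are related, so it is worth comparing both on your own data.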

Here is an example of univariate imputation with SimpleImputer. Missing values can be replaced with a constant you provide, or with a statistic (mean, median, or most frequent value) of the column in which they occur. In this example the imputer learns the column means from X_train and uses them to fill the NaNs in X_test.

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> X_train = np.array([[4, 2, 3], [6, 1, 1], [7, 6, 5], [4, 9, 10]])
>>> X_train
array([[ 4,  2,  3],
       [ 6,  1,  1],
       [ 7,  6,  5],
       [ 4,  9, 10]])
>>> X_test = np.array([[np.nan, 2, 3], [6, np.nan, 1], [7, 6, 5], [4, 9, np.nan]])
>>> X_test
array([[nan,  2.,  3.],
       [ 6., nan,  1.],
       [ 7.,  6.,  5.],
       [ 4.,  9., nan]])
>>> imp.fit(X_train)
SimpleImputer()
>>> imp.transform(X_test)
array([[5.25, 2.  , 3.  ],
       [6.  , 4.5 , 1.  ],
       [7.  , 6.  , 5.  ],
       [4.  , 9.  , 4.75]])
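
If the NaNs are in the training matrix itself, as in the question, the same imputer can be fitted and applied in one step with fit_transform. The sketch below assumes a small made-up matrix X_nan and shows the median and constant strategies; the fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training matrix with one missing value per affected column.
X_nan = np.array([[np.nan, 2.0, 3.0],
                  [6.0, np.nan, 1.0],
                  [7.0, 6.0, 5.0],
                  [4.0, 9.0, np.nan]])

# Replace each NaN with the median of the observed values in its column.
median_imp = SimpleImputer(missing_values=np.nan, strategy="median")
X_median = median_imp.fit_transform(X_nan)

# Replace each NaN with a fixed value chosen by the user (0 here).
const_imp = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
X_const = const_imp.fit_transform(X_nan)

print(median_imp.statistics_)  # per-column medians learned during fit
print(X_median)
print(X_const)
```

Keep the fitted imputer around and call transform on any later test data, so the test set is filled with statistics learned from the training set only.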