Missingness Classifier

Module to predict missingness in data and generate imputation test cases.

This module contains the MissingnessClassifier, which predicts missingness within a dataset using information derived from other features. The MissingnessClassifier also generates test cases for imputation. Often, we do not and never will have the true value of a missing data point, so it's challenging to validate an imputation model’s performance. The MissingnessClassifier generates missing “test” samples from observed values that have a high likelihood of being missing, which a user can then “impute”. This practice is useful for validating imputation models on data that contain truly missing values.

class autoimpute.imputations.mis_classifier.MissingnessClassifier(classifier=None, predictors='all')[source]

Classify values as missing or not, based on missingness patterns.

The class has numerous use cases. First, it fits columns of a DataFrame and predicts whether or not an observation is missing, based on all available information in other columns. The class supports both class prediction and class probabilities.

Second, the class can generate test cases for imputation analysis. Test cases are values that are truly observed but have a high probability of being missing. These cases make the imputation process supervised as opposed to unsupervised. A user never knows the true value of missing data but can verify imputation methods on test cases for which the true value is known.
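
The test-case idea can be illustrated with a minimal sketch, assuming a single column with missing values and one fully observed predictor. It uses scikit-learn's LogisticRegression as a stand-in classifier rather than the library's default, and the toy data and variable names are illustrative only. It is not the library's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative): income tends to be missing for older respondents.
df = pd.DataFrame({
    "age": [20, 22, 25, 28, 30, 60, 62, 65, 68, 70],
    "income": [30, 32, 35, 38, 40, np.nan, np.nan, 90, np.nan, np.nan],
})

# Response: 1 where income is missing, 0 where it is observed.
is_missing = df["income"].isna().astype(int)

# Stand-in classifier: predict missingness of income from age.
clf = LogisticRegression(max_iter=1000).fit(df[["age"]], is_missing)
p_missing = pd.Series(clf.predict_proba(df[["age"]])[:, 1], index=df.index)

# Test cases: observed values whose predicted P(missing) is high.
is_test_case = (is_missing == 0) & (p_missing >= 0.5)
test_cases = df.index[is_test_case.to_numpy()].tolist()
print(test_cases)
```

The rows listed in test_cases have a known income, so an imputation method can be scored on them after their values are temporarily hidden.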

__init__(classifier=None, predictors='all')[source]

Create an instance of the MissingnessClassifier.

The MissingnessClassifier inherits from sklearn BaseEstimator and ClassifierMixin. This inheritance and this class’s implementation ensure that the MissingnessClassifier is a valid classifier that will work in an sklearn pipeline.

Parameters:
  • classifier (classifier, optional) – valid classifier from sklearn. If None, default is xgboost. Note that classifier must conform to sklearn style. This means it must implement the predict_proba method and act as a proper classifier.
  • predictors (str, iter, dict, optional) – defaults to all, i.e. use all predictors. If all, every column will be used for every class prediction. If a list, the subset of columns used for all predictions. If a dict, specify which columns to use as predictors for each imputation. Columns not specified in the dict will receive all by default.
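
One way to read the three accepted forms of predictors is sketched below. The helper resolve_predictors is hypothetical and not part of autoimpute. It also assumes that a column is never used to predict its own missingness, which the parameter description does not state.

```python
def resolve_predictors(columns, predictors="all"):
    """Map each target column to the predictor columns used for it.

    Hypothetical helper, not part of autoimpute. Assumes a target column
    is never used to predict its own missingness.
    """
    columns = list(columns)
    resolved = {}
    for target in columns:
        if isinstance(predictors, dict):
            # Columns absent from the dict fall back to "all".
            spec = predictors.get(target, "all")
        else:
            spec = predictors
        if isinstance(spec, str):
            if spec != "all":
                raise ValueError(f"unsupported predictors value: {spec!r}")
            chosen = columns
        else:
            chosen = list(spec)
        resolved[target] = [c for c in chosen if c != target]
    return resolved


cols = ["age", "income", "score"]
print(resolve_predictors(cols, "all"))
print(resolve_predictors(cols, ["age", "score"]))
print(resolve_predictors(cols, {"income": ["age"]}))
```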
classifier

Property getter to return the value of the classifier property

fit(X, **kwargs)[source]

Fit an individual classifier for each column in the DataFrame.

For each feature in the DataFrame, a classifier (default: xgboost) is fit with the feature as the response (y) and all other features as covariates (X). The resulting classifiers are stored in the class instance statistics. One classifier is fit for each column in the dataset. Column specification will be supported as well.

Parameters:
  • X (pd.DataFrame) – DataFrame on which to fit classifiers
  • **kwargs – keyword arguments used by classifiers
Returns:

instance of MissingnessClassifier

Return type:

self
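
The per-column fitting described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: fit_missingness_models is a hypothetical helper, a scikit-learn DecisionTreeClassifier stands in for the default classifier, and covariates are median-filled only so that the stand-in accepts them.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


def fit_missingness_models(df, classifier):
    """Fit one clone of `classifier` per column (hypothetical helper).

    Response: 1 if the column's value is missing, else 0.
    Covariates: all other columns, median-filled so that the stand-in
    classifier accepts them (an assumption, not the library's preprocessing).
    """
    models = {}
    for col in df.columns:
        y = df[col].isna().astype(int)
        others = df.drop(columns=col)
        covariates = others.fillna(others.median())
        models[col] = clone(classifier).fit(covariates, y)
    return models


df = pd.DataFrame({
    "age": [20, 22, np.nan, 28, 30, 60, 62, 65, np.nan, 70],
    "income": [30, 32, 35, 38, 40, np.nan, np.nan, 90, np.nan, np.nan],
    "score": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
})
stand_in = DecisionTreeClassifier(max_depth=2, random_state=0)
models = fit_missingness_models(df, stand_in)
print(sorted(models))
```

Cloning the classifier for each column keeps the fitted models independent, which mirrors the "one fit for each column" behaviour described above.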

fit_predict(X)[source]

Convenience method for fit and class prediction.

Parameters:X (pd.DataFrame) – DataFrame to fit classifier and predict class.
Returns:DataFrame of class predictions.
Return type:pd.DataFrame
fit_predict_proba(X)[source]

Convenience method for fit and class probability prediction.

Parameters:X (pd.DataFrame) – DataFrame to fit classifier and predict probabilities.
Returns:DataFrame of class probability predictions.
Return type:pd.DataFrame
gen_test_df(X, thresh=0.5, m=0.05, inplace=False, use_exist=False)[source]

Generate new DataFrame with values of false positives set to missing.

Method generates new DataFrame with the locations (indices) of false positives set to missing. Utilizes gen_test_indices to detect index of false positives.

Parameters:
  • X (pd.DataFrame) – DataFrame from which test indices generated. Data first goes through fit_predict_proba.
  • thresh (float, optional) – Threshold for generating false positives. If the raw value is observed and P(missing) >= thresh, then the observation is considered a false positive and its index is stored.
  • m (float, optional) – Threshold on the % of false positives for issuing a warning. If % <= m, issue a warning with the % of test cases.
  • use_exist (bool, optional) – Whether or not to use existing fit and classifiers. Default is False.
Returns:

DataFrame with false positives set to missing.

Return type:

pd.DataFrame
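
A minimal sketch of the masking step, assuming the probabilities of missingness are already available as a DataFrame aligned with the data. The helper mask_false_positives and its proba argument are hypothetical, and checking the m threshold per column is an assumption about how the warning is applied.

```python
import warnings

import numpy as np
import pandas as pd


def mask_false_positives(df, proba, thresh=0.5, m=0.05):
    """Return a copy of `df` with false positives set to missing.

    Hypothetical helper. `proba` holds P(missing) for each cell and shares
    the index and columns of `df`. The warning is checked per column, which
    is an assumption about how the `m` threshold is applied.
    """
    false_pos = df.notna() & (proba >= thresh)
    for col in df.columns:
        share = false_pos[col].mean()
        if share <= m:
            warnings.warn(f"{col}: only {share:.1%} of rows became test cases")
    # DataFrame.mask replaces values with NaN wherever the condition is True.
    return df.mask(false_pos)


df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [5.0, np.nan, 7.0, 8.0]})
proba = pd.DataFrame({"a": [0.1, 0.7, 0.9, 0.2], "b": [0.6, 0.8, 0.3, 0.5]})
test_df = mask_false_positives(df, proba, thresh=0.5)
print(test_df)
```

Because the sketch returns a new DataFrame, the original values stay available for scoring an imputation of the masked cells.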

gen_test_indices(X, thresh=0.5, use_exist=False)[source]

Generate indices of false positives for each fitted column.

Method generates the locations (indices) of false positives returned from classifiers. These are instances that have a high probability of being missing even though true value is observed. Use this method to get indices without mutating the actual DataFrame. To set the values to missing for the actual DataFrame, use gen_test_df.

Parameters:
  • X (pd.DataFrame) – DataFrame from which test indices generated. Data first goes through fit_predict_proba.
  • thresh (float, optional) – Threshold for generating false positives. If the raw value is observed and P(missing) >= thresh, then the observation is considered a false positive and its index is stored.
  • use_exist (bool, optional) – Whether or not to use existing fit and classifiers. Default is False.
Returns:

test_indices, available from self.test_indices

Return type:

self
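
The false-positive rule (value observed and P(missing) >= thresh) can be sketched without touching the data itself. The helper false_positive_indices and the proba DataFrame are hypothetical stand-ins for the fitted classifiers' output, and the dictionary layout of the result is an assumption.

```python
import numpy as np
import pandas as pd


def false_positive_indices(df, proba, thresh=0.5):
    """Map each column to the index labels of its false positives.

    Hypothetical helper: a false positive is a value that is observed in
    `df` but whose P(missing) in `proba` is at least `thresh`.
    """
    indices = {}
    for col in df.columns:
        is_false_pos = df[col].notna() & (proba[col] >= thresh)
        indices[col] = df.index[is_false_pos.to_numpy()]
    return indices


df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [5.0, np.nan, 7.0, 8.0]})
proba = pd.DataFrame({"a": [0.1, 0.7, 0.9, 0.2], "b": [0.6, 0.8, 0.3, 0.5]})
indices = false_positive_indices(df, proba)
print({col: ix.tolist() for col, ix in indices.items()})
```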

predict(X, **kwargs)[source]

Predict class of each feature. 1 for missing; 0 for not missing.

First checks to ensure data has been fit. If fit, predict method uses the respective classifier of each feature (stored in statistics) and predicts class membership for each observation of each feature. 1 = missing; 0 = not missing. Prediction is binary, as class membership is hard. If probability is desired, use the predict_proba method.

Parameters:
  • X (pd.DataFrame) – DataFrame used to create predictions.
  • kwargs – keyword arguments. Used by the classifier.
Returns:

DataFrame with class prediction for each observation.

Return type:

pd.DataFrame

predict_proba(X, **kwargs)[source]

Predict probability of missing class membership of each feature.

First checks to ensure data has been fit. If fit, predict_proba method uses the respective classifier of each feature (in statistics) and predicts probability of missing class membership for each observation of each feature. Prediction is probability of missing. Therefore, probability of not missing is 1-P(missing). For hard class membership prediction, use predict.

Parameters:X (pd.DataFrame) – DataFrame used to create probabilities.
Returns:DataFrame with probability of missing class for each observation.
Return type:pd.DataFrame
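
The relationship between hard class prediction and P(missing) can be illustrated with a single sklearn-style classifier. This is a sketch on toy data with LogisticRegression as a stand-in. It shows only how one binary classifier behaves, not how MissingnessClassifier assembles its output DataFrame.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative): one predictor and a 0/1 missingness indicator.
X = np.array([[20], [25], [30], [60], [65], [70]], dtype=float)
is_missing = np.array([0, 0, 0, 1, 0, 1])

clf = LogisticRegression(max_iter=1000).fit(X, is_missing)

# predict_proba returns one column per class, ordered as in clf.classes_.
proba = clf.predict_proba(X)
p_missing = proba[:, list(clf.classes_).index(1)]
p_observed = 1.0 - p_missing  # P(not missing) = 1 - P(missing)

# For this binary classifier, the hard labels from predict agree with
# thresholding P(missing) at 0.5.
hard = clf.predict(X)
print(np.column_stack([p_missing.round(3), hard]))
```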