DataFrame Imputers

This section documents the DataFrame Imputers within Autoimpute.

DataFrame Imputers are the primary feature of the package. The SingleImputer imputes each column within a DataFrame one time, while the MultipleImputer imputes each column within a DataFrame multiple times using independent runs. Under the hood, the MultipleImputer creates separate instances of the SingleImputer to handle each run. The MiceImputer takes the MultipleImputer one step further, iteratively improving the imputations in each column k times for each of the n runs the MultipleImputer performs.

The base class for all imputers is the BaseImputer. While you should not use the BaseImputer directly unless you’re creating your own imputer class, you should understand what it provides to the other imputers. The BaseImputer also contains the strategy “key-value store”, i.e. the set of imputation methods that Autoimpute currently supports.

Base Imputer

class autoimpute.imputations.BaseImputer(strategy, imp_kwgs, visit)[source]

Building blocks for more advanced DataFrame imputers.

The BaseImputer is not a stand-alone class and thus serves no purpose other than as a parent to Imputers. Therefore, the BaseImputer should not be used directly unless creating an Imputer. That being said, all DataFrame Imputers should inherit from BaseImputer. It contains base functionality for any new DataFrame Imputer, and it holds the set of strategies that make up this imputation library.

univariate_strategies

univariate imputation methods. Key = imputation name; Value = function to perform imputation.
  • univariate default – mean for numerical, mode for categorical.
  • time default – interpolate for numerical, mode for categorical.
  • mean – imputes missing values with the average of the series.
  • median – imputes missing values with the median of the series.
  • mode – imputes missing values with the mode of the series. Method handles more than one mode (see ModeImputer for info).
  • random – imputes random choice from set of series unique vals.
  • norm – imputes series w/ random draws from normal distribution. Mean and std calculated from observed values of the series.
  • categorical – imputes series using random draws from pmf. Proportions calculated from non-missing category instances.
  • interpolate – imputes series using chosen interpolation method. Default is linear. See InterpolateImputer for more info.
  • locf – imputes series carrying last observation moving forward.
  • nocb – imputes series carrying next observation moving backward.
  • normal unit variance – imputes using unit variance w/ norm.

Type:dict
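
A few of the univariate strategies listed above can be illustrated without the library. The following is a minimal sketch in plain Python, not Autoimpute's implementation: the helper names (impute_constant, locf, nocb) are hypothetical, a list with None stands in for a Series with missing values, and the handling of multiple modes and of leading or trailing gaps is an assumption of the sketch.

```python
from statistics import mean, median, multimode


def impute_constant(series, how):
    """Fill None entries with one summary statistic of the observed values."""
    observed = [v for v in series if v is not None]
    if how == "mean":
        fill = mean(observed)
    elif how == "median":
        fill = median(observed)
    elif how == "mode":
        # Assumption: with several modes, the sketch takes the first one.
        fill = multimode(observed)[0]
    else:
        raise ValueError(f"unknown strategy: {how}")
    return [fill if v is None else v for v in series]


def locf(series):
    """Carry the last observation forward; leading gaps stay missing."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out


def nocb(series):
    """Carry the next observation backward; trailing gaps stay missing."""
    return locf(series[::-1])[::-1]


series = [None, 2.0, None, 4.0, 4.0, None]
print(impute_constant(series, "median"))
print(locf(series))
print(nocb(series))
```

The statistic-based strategies learn a single value from the observed data, while locf and nocb depend on the order of the rows, which is why they suit time series.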
predictive_strategies

predictive imputation methods. Key = imputation name; Value = function to perform imputation.
  • predictive default – pmm for numerical, logistic for categorical.
  • least squares – predict missing values from linear regression.
  • binary logistic – predict missing values with 2 classes.
  • multinomial logistic – predict missing values with multiclass.
  • stochastic – linear regression + random draw from norm w/ mse std.
  • bayesian least squares – draw from the posterior predictive distribution for each missing value, using OLS model.
  • bayesian binary logistic – draw from the posterior predictive distribution for each missing value, using logistic model.
  • pmm – imputes series using predictive mean matching. PMM is a semi-supervised method using bayesian & hot-deck imputation.
  • lrd – imputes series using local residual draws. LRD is a semi-supervised method using bayesian & hot-deck imputation.

Type:dict
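
The hot-deck idea behind predictive mean matching can be sketched in a few lines. This is a simplified illustration only: it swaps the Bayesian model described above for a plain least squares fit on a single predictor, and the names fit_line, pmm_sketch, and neighbors are hypothetical rather than part of Autoimpute.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def pmm_sketch(x, y, neighbors=3, seed=0):
    """Impute None entries of y by borrowing values from similar donors."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    a, b = fit_line([x[i] for i in obs], [y[i] for i in obs])
    pred = [a + b * xi for xi in x]
    out = list(y)
    for i, v in enumerate(y):
        if v is None:
            # Donors: observed rows whose predicted mean is closest to row i's.
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:neighbors]
            # The imputed value is a real observed value, not the prediction.
            out[i] = y[rng.choice(donors)]
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, None, 4.2, None, 6.1]
imputed = pmm_sketch(x, y)
print(imputed)
```

Because every imputed value is copied from a donor row, the imputations always fall within the range of the observed data.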
__init__(strategy, imp_kwgs, visit)[source]

Initialize the BaseImputer.

Parameters:
  • strategy (str, iter, dict; optional) – strategies for imputation. Default value is str -> predictive default. If str, single strategy broadcast to all series in DataFrame. If iter, must provide 1 strategy per column. Each method w/in iterator applies to column with same index value in DataFrame. If dict, must provide key = column name, value = imputer. Dict is the most flexible and PREFERRED way to create custom imputation strategies if not using the default. Dict does not require a method for every column; just those specified as keys.
  • imp_kwgs (dict, optional) – keyword arguments for each imputer. Default is None, which means default imputer created to match specific strategy. imp_kwgs keys can be either columns or strategies. If strategies, each column given that strategy is instantiated with same arguments.
  • visit (str, None) – order to visit columns for imputation. Default is 'default', which visits columns left to right. More strategies (random, monotone, etc.) TBD.
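
The three accepted forms of strategy can be pictured as different ways of building one column-to-strategy mapping. The function below is a hypothetical sketch of that resolution, not the library's validation code; in particular, raising ValueError for unknown columns or a wrong number of strategies is an assumption.

```python
def resolve_strategy(columns, strategy):
    """Map each column to the strategy name that would impute it."""
    if isinstance(strategy, str):
        # A single string is broadcast to every column.
        return {col: strategy for col in columns}
    if isinstance(strategy, dict):
        unknown = [col for col in strategy if col not in columns]
        if unknown:
            raise ValueError(f"columns not in data: {unknown}")
        # Only the columns named as keys are imputed.
        return dict(strategy)
    # Any other iterable must supply one strategy per column, in order.
    strategy = list(strategy)
    if len(strategy) != len(columns):
        raise ValueError("need exactly one strategy per column")
    return dict(zip(columns, strategy))


columns = ["age", "income", "gender"]
print(resolve_strategy(columns, "mean"))
print(resolve_strategy(columns, ["mean", "median", "mode"]))
print(resolve_strategy(columns, {"income": "pmm"}))
```

The dict form is the only one that can leave columns untouched, which is one reason it is the preferred way to customize imputation.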
imp_kwgs

Property getter to return the value of imp_kwgs.

strategy

Property getter to return the value of the strategy property.

visit

Property getter to return the value of the visit property.

Single Imputer

class autoimpute.imputations.SingleImputer(strategy='default predictive', predictors='all', imp_kwgs=None, copy=True, seed=None, visit='default')[source]

Techniques to impute Series with missing values one time.

The SingleImputer class takes a DataFrame and performs imputations on each Series within the DataFrame. The Imputer does one pass for each column, and it supports numerous imputation methods for each column.

The SingleImputer delegates imputation to respective SeriesImputers, each of which maps to a specific strategy supported by the SingleImputer. Most of the SeriesImputers are inductive (fit and transform for new data). Transductive SeriesImputers (such as InterpolateImputer) still perform a “mock” fit stage but do all the imputation work in the transform step. The fit stage is performed to remain consistent with the sklearn API. The class is a valid sklearn transformer that can be used in an sklearn Pipeline because it inherits from the TransformerMixin and implements both fit and transform methods.

__init__(strategy='default predictive', predictors='all', imp_kwgs=None, copy=True, seed=None, visit='default')[source]

Create an instance of the SingleImputer class.

As with sklearn classes, all arguments take default values. Therefore, SingleImputer() creates a valid class instance. The instance is used to set up a SingleImputer and perform checks on arguments.

Parameters:
  • strategy (str, iter, dict; optional) – strategy for single imputer. Default value is the str 'default predictive'. See BaseImputer for all available strategies. If str, single strategy broadcast to all series in DataFrame. If iter, must provide 1 strategy per column. Each method w/in iterator applies to column with same index value in DataFrame. If dict, must provide key = column name, value = imputer. Dict is the most flexible and PREFERRED way to create custom imputation strategies if not using the default. Dict does not require a method for every column; just those specified as keys.
  • predictors (str, iter, dict, optional) – defaults to all, i.e. use all predictors. If all, every column will be used for every class prediction. If a list, subset of columns used for all predictions. If a dict, specify which columns to use as predictors for each imputation. Columns not specified in dict but present in strategy receive all other cols as preds. Note predictors are IGNORED for univariate imputation methods, so specifying is meaningless unless strategy is predictive.
  • imp_kwgs (dict, optional) – keyword args for each SeriesImputer. Default is None, which means default imputer created to match specific strategy. imp_kwgs keys can be either columns or strategies. If strategies, each column given that strategy is instantiated with same arguments. When strategy is default, imp_kwgs is ignored.
  • copy (bool, optional) – create copy of DataFrame or operate inplace. Default value is True. Copy created.
  • seed (int, optional) – seed setting for reproducible results. Default is None. No validation, but values should be integer.
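
The rules for predictors described above can be sketched as a small resolution step. The function name resolve_predictors is hypothetical, and excluding a target column from its own predictors is an assumption of the sketch rather than documented behavior.

```python
def resolve_predictors(columns, targets, predictors="all"):
    """Return the predictor columns used for each imputed column."""
    resolved = {}
    for target in targets:
        others = [c for c in columns if c != target]
        if isinstance(predictors, dict):
            # Targets missing from the dict fall back to all other columns.
            chosen = predictors.get(target, "all")
        else:
            # A string or a list applies to every target alike.
            chosen = predictors
        if chosen == "all":
            chosen = others
        resolved[target] = [c for c in chosen if c != target]
    return resolved


columns = ["age", "income", "gender"]
targets = ["age", "income"]  # columns that have a predictive strategy
print(resolve_predictors(columns, targets))
print(resolve_predictors(columns, targets, ["gender"]))
print(resolve_predictors(columns, targets, {"age": ["income"]}))
```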
fit(X, y=None, imp_ixs=None)[source]

Fit specified imputation methods to each column within a DataFrame.

The fit method calculates the statistics necessary to later transform a dataset (i.e. perform actual imputations). Inductive methods calculate statistic on the fit data, then impute new missing data with that value. Most currently supported methods are inductive.

It’s important to note that we have to fit X regardless of whether any data is missing. Transform step may have missing data if new data is used, so fit each column that appears in the given strategies.

Parameters:
  • X (pd.DataFrame) – pandas DataFrame on which imputer is fit.
  • y (pd.Series, pd.DataFrame, optional) – response. Default is None. Determined internally in fit method. Arg is present to remain compatible with sklearn Pipelines.
  • imp_ixs (dict, optional) – dictionary of lists of indices that indicate which data elements to impute in each column. Default is None, in which case the elements to impute are identified from the missing values in each column.
Returns:

instance of the SingleImputer class.

Return type:

self

Raises:
  • ValueError – error in specification of strategies. Raised through check_strategy_fit. See its docstrings for more info.
  • ValueError – error in specification of predictors. Raised through check_predictors_fit. See its docstrings for more info.
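
The role of imp_ixs in fit can be pictured with a small helper. This is a sketch under stated assumptions: resolve_imp_ixs is a hypothetical name, a dict of lists stands in for a DataFrame, and falling back to the missing positions for columns absent from imp_ixs is a guess at the intended behavior.

```python
def resolve_imp_ixs(data, imp_ixs=None):
    """Return, per column, the row positions that should be imputed."""
    resolved = {}
    for col, values in data.items():
        ixs = None if imp_ixs is None else imp_ixs.get(col)
        if ixs is None:
            # No explicit indices: impute wherever a value is missing.
            ixs = [i for i, v in enumerate(values) if v is None]
        resolved[col] = list(ixs)
    return resolved


data = {"age": [30, None, 41], "income": [None, 52.0, None]}
print(resolve_imp_ixs(data))
print(resolve_imp_ixs(data, {"age": [0, 1]}))
```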
fit_transform(X, y=None, **trans_kwargs)[source]

Convenience method to fit then transform the same dataset.

Parameters:
  • X (pd.DataFrame) – DataFrame used for fit and transform steps.
  • y (pd.DataFrame, pd.Series, Optional) – response. Default is None. Set internally by fit method.
  • **trans_kwargs – dict, optional args for bayesian.
Returns:

imputed in place or copy of original.

Return type:

X (pd.DataFrame)

transform(X, imp_ixs=None, **trans_kwargs)[source]

Impute each column within a DataFrame using fit imputation methods.

The transform step performs the actual imputations. Given a dataset previously fit, transform imputes each column with its respective imputed values from fit (in the case of inductive) or performs new fit and transform in one sweep (in the case of transductive).

Parameters:
  • X (pd.DataFrame) – DataFrame to impute (same as fit or new data).
  • imp_ixs (dict, optional) – dictionary of lists of indices that indicate which data elements to impute in each column. Default is None, in which case the elements to impute are identified from the missing values in each column.
  • **trans_kwargs – dict, optional args for bayesian.
Returns:

imputed in place or copy of original.

Return type:

X (pd.DataFrame)

Raises:

ValueError – same columns must appear in fit and transform. Raised through _transform_strategy_validator.
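
The inductive fit and transform contract described above can be sketched with a toy imputer. This is an illustration only, not the SingleImputer itself: MeanImputerSketch and copy_data are hypothetical names, a dict of lists stands in for a DataFrame, and only mean imputation is covered.

```python
import copy


class MeanImputerSketch:
    """Toy inductive imputer for a dict-of-lists stand-in for a DataFrame."""

    def __init__(self, copy_data=True):
        self.copy_data = copy_data

    def fit(self, X):
        # Learn one statistic per column from the observed values only.
        self.statistics_ = {}
        for col, values in X.items():
            observed = [v for v in values if v is not None]
            self.statistics_[col] = sum(observed) / len(observed)
        return self

    def transform(self, X):
        if set(X) != set(self.statistics_):
            raise ValueError("same columns must appear in fit and transform")
        # Work on a copy unless the caller asked for in-place imputation.
        out = copy.deepcopy(X) if self.copy_data else X
        for col, values in out.items():
            fill = self.statistics_[col]
            out[col] = [fill if v is None else v for v in values]
        return out

    def fit_transform(self, X):
        return self.fit(X).transform(X)


train = {"age": [20.0, None, 40.0]}
new = {"age": [None, 50.0]}
imputer = MeanImputerSketch().fit(train)
print(imputer.transform(new))
print(MeanImputerSketch().fit_transform(train))
```

The value used to fill the new data comes from the data seen in fit, which is what makes the method inductive.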

Multiple Imputer

class autoimpute.imputations.MultipleImputer(n=5, strategy='default predictive', predictors='all', imp_kwgs=None, seed=None, visit='default', return_list=False)[source]

Techniques to impute Series with missing values multiple times.

The MultipleImputer class applies imputation multiple times. It leverages the methods found in the BaseImputer. This imputer passes all the work for each imputation to the SingleImputer, but it controls the arguments each imputer receives. The args are flexible depending on what the user specifies for each imputation.

Note that the Imputer allows for one imputation method per column only. Therefore, the behavior of strategy is the same as the SingleImputer, but the predictors are allowed to change for each imputation.

__init__(n=5, strategy='default predictive', predictors='all', imp_kwgs=None, seed=None, visit='default', return_list=False)[source]

Create an instance of the MultipleImputer class.

As with sklearn classes, all arguments take default values. Therefore, MultipleImputer() creates a valid class instance. The instance is used to set up an imputer and perform checks on arguments.

Parameters:
  • n (int, optional) – number of imputations to perform. Default is 5. Value must be greater than or equal to 1.
  • strategy (str, iter, dict; optional) – strategy for single imputer. Default value is the str 'default predictive'. See BaseImputer for all available strategies. If str, single strategy broadcast to all series in DataFrame. If iter, must provide 1 strategy per column. Each method w/in iterator applies to column with same index value in DataFrame. If dict, must provide key = column name, value = imputer. Dict is the most flexible and PREFERRED way to create custom imputation strategies if not using the default. Dict does not require a method for every column; just those specified as keys.
  • predictors (str, iter, dict, optional) – defaults to all, i.e. use all predictors. If all, every column will be used for every class prediction. If a list, subset of columns used for all predictions. If a dict, specify which columns to use as predictors for each imputation. Columns not specified in dict but present in strategy receive all other cols as preds.
  • imp_kwgs (dict, optional) – keyword arguments for each imputer. Default is None, which means default imputer created to match specific strategy. imp_kwgs keys can be either columns or strategies. If strategies, each column given that strategy is instantiated with same arguments. When strategy is default, imp_kwgs is ignored.
  • seed (int, optional) – seed setting for reproducible results. Default is None. No validation, but values should be integer.
  • return_list (bool, optional) – return the n imputations as a list or a generator. Default is False, in which case the imputations are returned as a generator, which is more memory efficient. Set return_list=True to return a list.
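
The difference between the generator and list return modes can be sketched in plain Python. The function impute_n_times is hypothetical, it uses a random-choice strategy on a single list, and deriving one seed per run from the base seed is an assumption rather than documented behavior. What each returned element looks like in the library is not specified here; the sketch yields plain lists.

```python
import random
from types import GeneratorType


def impute_n_times(series, n=5, seed=None, return_list=False):
    """Produce n independently imputed copies of a series."""
    uniques = sorted({v for v in series if v is not None})

    def one_run(run):
        # Assumption: each run draws from its own seeded generator.
        rng = random.Random(None if seed is None else seed + run)
        return [rng.choice(uniques) if v is None else v for v in series]

    # Lazy by default: a run is only computed when the caller asks for it.
    runs = (one_run(run) for run in range(n))
    return list(runs) if return_list else runs


series = [1.0, None, 3.0, None]
lazy = impute_n_times(series, n=3, seed=42)
print(isinstance(lazy, GeneratorType))
for imputed in lazy:
    print(imputed)

eager = impute_n_times(series, n=3, seed=42, return_list=True)
print(len(eager))
```

A generator can be consumed only once, so a list is the more convenient choice when the imputations need to be indexed or reused.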
fit(X, y=None)[source]

Fit imputation methods to each column within a DataFrame.

The fit method calculates the statistics necessary to later transform a dataset (i.e. perform actual imputations). Inductive methods calculate statistics on the fit data, then impute new missing data with those values. Most currently supported methods are inductive.

Parameters:

X (pd.DataFrame) – pandas DataFrame on which imputer is fit.

Returns:

instance of the MultipleImputer class.

Return type:

self

fit_transform(X, y=None, **trans_kwargs)[source]

Convenience method to fit then transform the same dataset.

n

Property getter to return the value of the n property.

transform(X, **trans_kwargs)[source]

Impute each column within a DataFrame using fit imputation methods.

The transform step performs the actual imputations. Given a dataset previously fit, transform imputes each column with its respective imputed values from fit (in the case of inductive) or performs new fit and transform in one sweep (in the case of transductive).

Parameters:
  • X (pd.DataFrame) – fit DataFrame to impute.
  • **trans_kwargs – dict, optional args for bayesian.
Returns:

imputed in place or copy of original.

Return type:

X (pd.DataFrame)

Raises:

ValueError – same columns must appear in fit and transform.

Mice Imputer

class autoimpute.imputations.MiceImputer(k=3, n=5, strategy='default predictive', predictors='all', imp_kwgs=None, seed=None, visit='default', return_list=False)[source]

Techniques to impute Series with missing values multiple times using repeated fits and applications to reach a stable imputation.

The MiceImputer class implements multiple imputation, i.e., a series or repetition of applications of imputation to reach a stable imputation, similar to the functioning of the R package MICE. It leverages the methods found in the BaseImputer. This imputer passes all the work for each imputation to the SingleImputer, but it controls the arguments each imputer receives. The args are flexible depending on what the user specifies for each imputation.

Note that the Imputer allows for one imputation method per column only. Therefore, the behavior of strategy is the same as the SingleImputer, but the predictors are allowed to change for each imputation.

__init__(k=3, n=5, strategy='default predictive', predictors='all', imp_kwgs=None, seed=None, visit='default', return_list=False)[source]

Create an instance of the MiceImputer class.

As with sklearn classes, all arguments take default values. Therefore, MiceImputer() creates a valid class instance. The instance is used to set up an imputer and perform checks on arguments.

Parameters:
  • k (int, optional) – number of repeated fits and transformations to apply to reach a stable imputation. Default is 3. Value must be greater than or equal to 1.
  • n (int, optional) – number of imputations to perform. Default is 5. Value must be greater than or equal to 1.
  • strategy (str, iter, dict; optional) – strategy for single imputer. Default value is the str 'default predictive'. See BaseImputer for all available strategies. If str, single strategy broadcast to all series in DataFrame. If iter, must provide 1 strategy per column. Each method w/in iterator applies to column with same index value in DataFrame. If dict, must provide key = column name, value = imputer. Dict is the most flexible and PREFERRED way to create custom imputation strategies if not using the default. Dict does not require a method for every column; just those specified as keys.
  • predictors (str, iter, dict, optional) – defaults to all, i.e. use all predictors. If all, every column will be used for every class prediction. If a list, subset of columns used for all predictions. If a dict, specify which columns to use as predictors for each imputation. Columns not specified in dict but present in strategy receive all other cols as preds.
  • imp_kwgs (dict, optional) – keyword arguments for each imputer. Default is None, which means default imputer created to match specific strategy. imp_kwgs keys can be either columns or strategies. If strategies, each column given that strategy is instantiated with same arguments. When strategy is default, imp_kwgs is ignored.
  • seed (int, optional) – seed setting for reproducible results. Default is None. No validation, but values should be integer.
  • return_list (bool, optional) – return the n imputations as a list or a generator. Default is False, in which case the imputations are returned as a generator, which is more memory efficient. Set return_list=True to return a list.
k

Property getter to return the value of the k property.

transform(X)[source]

Impute each column within a DataFrame using fit imputation methods.

The transform step performs the actual imputations. Given a dataset previously fit, transform imputes each column with its respective imputed values from fit (in the case of inductive) or performs new fit and transform in one sweep (in the case of transductive). The fits and transformations are repeated self.k times to reach a stable imputation.

Parameters:

X (pd.DataFrame) – fit DataFrame to impute.

Returns:

imputed in place or copy of original.

Return type:

X (pd.DataFrame)

Raises:

ValueError – same columns must appear in fit and transform.
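
The loop structure described above, n independent runs that each refine their imputations k times, can be sketched without the library. This is a rough illustration under stated assumptions, not the MiceImputer's implementation: mice_sketch and fit_line are hypothetical names, the data is a two-column dict of lists, each run starts from random observed values, and a plain least squares fit stands in for the configured strategies.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def mice_sketch(data, k=3, n=5, seed=0):
    """Return n imputed copies of a two-column dict of lists."""
    cols = list(data)
    missing = {c: [i for i, v in enumerate(data[c]) if v is None] for c in cols}
    results = []
    for run in range(n):
        rng = random.Random(seed + run)
        # Start each run by filling gaps with a random observed value.
        current = {}
        for c in cols:
            observed = [v for v in data[c] if v is not None]
            current[c] = [rng.choice(observed) if v is None else v for v in data[c]]
        for _ in range(k):
            # Visit columns left to right, refitting on the latest imputations.
            for target in cols:
                other = cols[1] if target == cols[0] else cols[0]
                rows = [i for i in range(len(data[target])) if i not in missing[target]]
                a, b = fit_line([current[other][i] for i in rows],
                                [current[target][i] for i in rows])
                for i in missing[target]:
                    current[target][i] = a + b * current[other][i]
        results.append(current)
    return results


data = {"x": [1.0, 2.0, None, 4.0, 5.0], "y": [2.1, None, 6.2, 7.9, 10.1]}
results = mice_sketch(data, k=3, n=5, seed=0)
for imputed in results:
    print(imputed)
```

Within a run, each pass uses the other column's most recent imputations as predictors, so later passes depend less on the arbitrary starting values.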