Deletion and Imputation Strategies

This section documents deletion and imputation strategies within Autoimpute.

Deletion is implemented through a single function, listwise_delete, documented below.

Imputation strategies are implemented as classes. The authors of this package refer to these classes as “series-imputers”. Each series-imputer maps to an imputation method - either univariate or multivariate - that imputes missing values within a pandas Series. The imputation methods are the workhorses of the DataFrame Imputers, the SingleImputer, MultipleImputer, and MiceImputer. Refer to the imputers documentation for more information on the DataFrame Imputers.

For more information regarding the relationship between DataFrame Imputers and series-imputers, refer to the following tutorial. The tutorial covers series-imputers in detail as well as the design patterns behind Autoimpute Imputers.

Deletion Methods

autoimpute.imputations.listwise_delete(data, inplace=False, verbose=False)[source]

Delete all rows from a DataFrame where any missing values exist.

Deletion is one way to handle missing values. This method removes any record that has a missing value in any of the features. This package focuses on imputation, not deletion. That being said, listwise deletion is a necessary component of any imputation package, as it's the default method most people (and software) use to handle missing data.

Parameters:
  • data (pd.DataFrame) – DataFrame used to delete missing rows.
  • inplace (boolean, optional) – perform operation inplace. Defaults to False.
  • verbose (boolean, optional) – print information to console. Defaults to False.
Returns:

rows with missing values removed.

Return type:

pd.DataFrame

Raises:

ValueError – columns with all data missing. Raised through decorator.
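The behavior described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Autoimpute's implementation: rows are modeled as a list of dicts instead of a DataFrame, None and NaN both count as missing, and the names is_missing and listwise_delete_sketch are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def listwise_delete_sketch(rows):
    """Return only the rows in which no column is missing.

    `rows` is a list of dicts sharing the same keys, standing in for a DataFrame.
    """
    if rows:
        for column in rows[0]:
            if all(is_missing(row[column]) for row in rows):
                # Mirrors the documented ValueError for fully missing columns.
                raise ValueError(f"column {column!r} has all data missing")
    # A new list is returned; the input is left untouched (like inplace=False).
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


records = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Lima"},
    {"age": 51, "city": None},
    {"age": 29, "city": "Pune"},
]
complete = listwise_delete_sketch(records)
print(complete)
```

Only the first and last records survive, which shows how quickly listwise deletion can shrink a dataset when missingness is spread across columns.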

Imputation Strategies

Manage the series imputations folder from the autoimpute package.

This module handles imports from the series imputations folder that should be accessible whenever someone imports autoimpute.imputations.series. Although these imputers are stand-alone classes, their direct use is discouraged. More robust imputers from the dataframe folder delegate work to these imputers whenever their respective strategies are requested.

This module handles from autoimpute.imputations.series import * with the __all__ variable below. This command imports the main public classes and methods from autoimpute.imputations.series.

class autoimpute.imputations.series.DefaultUnivarImputer(num_imputer=<class 'autoimpute.imputations.series.mean.MeanImputer'>, cat_imputer=<class 'autoimpute.imputations.series.mode.ModeImputer'>, num_kwgs=None, cat_kwgs={'fill_strategy': 'random'})[source]

Impute missing data using default methods for univariate imputation.

This imputer is the default for univariate imputation. The imputer determines how to impute based on the column type of each column in a dataframe. The imputer can be used directly, but such behavior is discouraged. DefaultUnivarImputer does not have the flexibility / robustness of more complex imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”default univariate”).

__init__(num_imputer=<class 'autoimpute.imputations.series.mean.MeanImputer'>, cat_imputer=<class 'autoimpute.imputations.series.mode.ModeImputer'>, num_kwgs=None, cat_kwgs={'fill_strategy': 'random'})[source]

Create an instance of the DefaultUnivarImputer class.

The dataframe imputers delegate work to the DefaultUnivarImputer if strategy=”default univariate”. The DefaultUnivarImputer then determines how to impute numerical and categorical columns by default. It does so by passing its arguments to the DefaultBaseImputer, which handles validation and instantiation of numerical and categorical imputers.

Parameters:
  • num_imputer (Imputer, Optional) – valid Imputer for numerical data. Default is MeanImputer.
  • cat_imputer (Imputer, Optional) – valid Imputer for categorical data. Default is ModeImputer.
  • num_kwgs (dict, optional) – Keyword args for numerical imputer. Default is None.
  • cat_kwgs (dict, optional) – keyword args for categorical imputer. Default is {“fill_strategy”: “random”}.
Returns:

self. Instance of class.

fit(X, y=None)[source]

Defer fit to the DefaultBaseImputer.

impute(X)[source]

Defer impute to the DefaultBaseImputer.
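The type-based dispatch described above (mean for numerical columns, mode with a random tie-break for categorical columns) can be sketched as follows. This is a plain-Python illustration rather than the library's code: columns are lists with None for missing values, and the helper names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def is_numeric_column(values):
    """A column counts as numeric if every observed value is an int or float."""
    observed = [v for v in values if v is not None]
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed)


def impute_mean(values):
    fill = mean([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_mode(values, rng):
    counts = Counter(v for v in values if v is not None)
    top = max(counts.values())
    modes = [v for v, n in counts.items() if n == top]
    # With several modes, pick one at random per missing value.
    return [rng.choice(modes) if v is None else v for v in values]


def default_univariate_sketch(columns, seed=0):
    """Choose an imputation method per column based on its type."""
    rng = random.Random(seed)
    return {
        name: impute_mean(values) if is_numeric_column(values) else impute_mode(values, rng)
        for name, values in columns.items()
    }


table = {
    "height": [170.0, None, 180.0, 175.0],
    "color": ["red", None, "red", "blue"],
}
filled = default_univariate_sketch(table)
print(filled)
```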

class autoimpute.imputations.series.DefaultTimeSeriesImputer(num_imputer=<class 'autoimpute.imputations.series.interpolation.InterpolateImputer'>, cat_imputer=<class 'autoimpute.imputations.series.mode.ModeImputer'>, num_kwgs={'fill_strategy': 'linear'}, cat_kwgs={'fill_strategy': 'random'})[source]

Impute missing data using default methods for time series.

This imputer is the default imputer for time series imputation. The imputer determines how to impute based on the column type of each column in a dataframe. The imputer can be used directly, but such behavior is discouraged. DefaultTimeSeriesImputer does not have the flexibility / robustness of more complex imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”default time”).

__init__(num_imputer=<class 'autoimpute.imputations.series.interpolation.InterpolateImputer'>, cat_imputer=<class 'autoimpute.imputations.series.mode.ModeImputer'>, num_kwgs={'fill_strategy': 'linear'}, cat_kwgs={'fill_strategy': 'random'})[source]

Create an instance of the DefaultTimeSeriesImputer class.

The dataframe imputers delegate work to the DefaultTimeSeriesImputer if strategy=”default time”. The DefaultTimeSeriesImputer then determines how to impute numerical and categorical columns by default. It does so by passing its arguments to the DefaultBaseImputer, which handles validation and instantiation of default numerical and categorical imputers.

Parameters:
  • num_imputer (Imputer, Optional) – valid Imputer for numerical data. Default is InterpolateImputer.
  • cat_imputer (Imputer, Optional) – valid Imputer for categorical data. Default is ModeImputer.
  • num_kwgs (dict, optional) – Keyword args for numerical imputer. Default is {“fill_strategy”: “linear”}.
  • cat_kwgs (dict, optional) – keyword args for categorical imputer. Default is {“fill_strategy”: “random”}.
Returns:

self. Instance of class.

fit(X, y=None)[source]

Defer fit to the DefaultBaseImputer.

impute(X)[source]

Defer impute to the DefaultBaseImputer.

class autoimpute.imputations.series.DefaultPredictiveImputer(num_imputer=<class 'autoimpute.imputations.series.pmm.PMMImputer'>, cat_imputer=<class 'autoimpute.imputations.series.logistic_regression.MultinomialLogisticImputer'>, num_kwgs=None, cat_kwgs=None)[source]

Impute missing data using default methods for prediction.

This imputer is the default imputer for the MultipleImputer class. When an end-user does not supply a strategy, the DefaultPredictiveImputer determines how to impute based on the column type of each column in a dataframe. The imputer can be used directly, but such behavior is discouraged. DefaultPredictiveImputer does not have the flexibility / robustness of more complex imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”default predictive”).

__init__(num_imputer=<class 'autoimpute.imputations.series.pmm.PMMImputer'>, cat_imputer=<class 'autoimpute.imputations.series.logistic_regression.MultinomialLogisticImputer'>, num_kwgs=None, cat_kwgs=None)[source]

Create an instance of the DefaultPredictiveImputer class.

The dataframe imputers delegate work to the DefaultPredictiveImputer if strategy=”default predictive” or if no strategy is given when the class is instantiated. The DefaultPredictiveImputer determines how to impute numerical and categorical columns by default. It does so by passing its arguments to the DefaultBaseImputer, which handles validation and instantiation of default numerical and categorical imputers.

Parameters:
  • num_imputer (Imputer, Optional) – valid Imputer for numerical data. Default is PMMImputer.
  • cat_imputer (Imputer, Optional) – valid Imputer for categorical data. Default is MultinomialLogisticImputer.
  • num_kwgs (dict, optional) – Keyword args for numerical imputer. Default is None.
  • cat_kwgs (dict, optional) – keyword args for categorical imputer. Default is None.
Returns:

self. Instance of class.

fit(X, y)[source]

Defer fit to the DefaultBaseImputer.

impute(X)[source]

Defer impute to the DefaultBaseImputer.

class autoimpute.imputations.series.MeanImputer[source]

Impute missing values with the mean of the observed data.

This imputer imputes missing values with the mean of observed data. The imputer can be used directly, but such behavior is discouraged. MeanImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”mean”).

__init__()[source]

Create an instance of the MeanImputer class.

fit(X, y)[source]

Fit the Imputer to the dataset and calculate the mean.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Missing values in a given dataset are replaced with the respective mean from fit.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: float – imputed dataset.

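The fit / impute / fit_impute split used by the series-imputers can be sketched with a small stand-in class. MeanImputerSketch is hypothetical and works on plain lists with None for missing values. One deliberate difference: the documented return type of impute is float, which suggests the library returns the fill value itself, whereas this sketch returns the filled list for readability.

```python
from statistics import mean


class MeanImputerSketch:
    """Hypothetical stand-in that mimics the fit / impute / fit_impute split."""

    def fit(self, X, y=None):
        # y is ignored; it is accepted only to mirror the documented signature.
        observed = [x for x in X if x is not None]
        self.mean_ = mean(observed)
        return self

    def impute(self, X):
        # Replace each missing entry with the mean learned during fit.
        return [self.mean_ if x is None else x for x in X]

    def fit_impute(self, X, y=None):
        return self.fit(X, y).impute(X)


train = [2.0, 4.0, None, 6.0]
imputer = MeanImputerSketch().fit(train)
print(imputer.impute(train))         # [2.0, 4.0, 4.0, 6.0]
print(imputer.impute([None, 10.0]))  # [4.0, 10.0]
```

Swapping statistics.mean for statistics.median gives the corresponding median sketch.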
class autoimpute.imputations.series.MedianImputer[source]

Impute missing values with the median of the observed data.

This imputer imputes missing values with the median of observed data. The imputer can be used directly, but such behavior is discouraged. MedianImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”median”).

__init__()[source]

Create an instance of the MedianImputer class.

fit(X, y=None)[source]

Fit the Imputer to the dataset and calculate the median.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Missing values in a given dataset are replaced with the respective median from fit.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: float – imputed dataset.

class autoimpute.imputations.series.ModeImputer(fill_strategy=None)[source]

Impute missing values with the mode of the observed data.

The mode imputer calculates the mode of the observed dataset and uses it to impute missing observations. When there is more than one mode, the user can supply a fill_strategy to choose the mode. The imputer can be used directly, but such behavior is discouraged. ModeImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”mode”).

__init__(fill_strategy=None)[source]

Create an instance of the ModeImputer class.

Parameters: fill_strategy (str, Optional) – strategy to pick the mode if there are multiple modes. Default is None, which means the first mode is taken. Options are None, first, last, and random. None or first selects the first of the modes, last selects the last of the modes, and random samples from the modes with replacement.

fill_strategy

Property getter to return the value of fill_strategy property.

fit(X, y=None)[source]

Fit the Imputer to the dataset and calculate the mode.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

This method handles the actual imputation. Missing values in a given dataset are replaced with the mode observed from fit. Note that there can be more than one mode. If more than one mode, use the fill_strategy to determine how to use the modes.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: float or np.array – imputed dataset.

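The effect of fill_strategy when several modes exist can be sketched in plain Python. This is an illustration, not the library's code: find_modes and impute_mode are hypothetical names, and returning the modes in sorted order is an assumption made so that "first" and "last" are well defined.

```python
import random
from collections import Counter


def find_modes(values):
    """Return all modes of the observed values, sorted (ordering is an assumption)."""
    counts = Counter(v for v in values if v is not None)
    top = max(counts.values())
    return sorted(v for v, n in counts.items() if n == top)


def impute_mode(values, fill_strategy=None, seed=None):
    modes = find_modes(values)
    rng = random.Random(seed)
    pickers = {
        None: lambda: modes[0],       # default: first mode
        "first": lambda: modes[0],
        "last": lambda: modes[-1],
        "random": lambda: rng.choice(modes),  # sampled with replacement
    }
    if fill_strategy not in pickers:
        raise ValueError(f"unknown fill_strategy: {fill_strategy!r}")
    pick = pickers[fill_strategy]
    return [pick() if v is None else v for v in values]


data = ["a", "b", None, "b", "a", "c", None]
print(find_modes(data))                        # ['a', 'b']
print(impute_mode(data))                       # missing values become 'a'
print(impute_mode(data, fill_strategy="last")) # missing values become 'b'
```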
class autoimpute.imputations.series.RandomImputer[source]

Impute missing data using random draws from observed data.

The RandomImputer samples with replacement from observed data. The imputer can be used directly, but such behavior is discouraged. RandomImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”random”).

__init__()[source]

Create an instance of the RandomImputer class.

fit(X, y=None)[source]

Fit the Imputer to the dataset and get unique observed to sample.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Each missing value in a given dataset is replaced with a random draw from the unique set of observed values determined during the fit stage.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: np.array – imputed dataset.

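A minimal sketch of random imputation follows, assuming the reading given above: fit collects the unique observed values, and impute draws from them with replacement. The function names are hypothetical and the data is a plain list with None for missing values.

```python
import random


def fit_unique_observed(values):
    """Collect the unique observed values, keeping first-seen order."""
    observed = [v for v in values if v is not None]
    return list(dict.fromkeys(observed))


def impute_random(values, candidates, seed=None):
    """Replace each missing value with an independent draw from candidates."""
    rng = random.Random(seed)
    return [rng.choice(candidates) if v is None else v for v in values]


series = [3, None, 7, 3, None, 9]
candidates = fit_unique_observed(series)
imputed = impute_random(series, candidates, seed=42)
print(candidates)
print(imputed)
```

Under this reading every unique value is equally likely to be drawn, regardless of how often it was observed.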
class autoimpute.imputations.series.NormImputer[source]

Impute missing data with draws from normal distribution.

The NormImputer constructs a normal distribution using the sample mean and variance of the observed data. The imputer then randomly samples from this distribution to impute missing data. The imputer can be used directly, but such behavior is discouraged. NormImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”norm”).

__init__()[source]

Create an instance of the NormImputer class.

fit(X, y=None)[source]

Fit Imputer to dataset and calculate mean and sample variance.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. It constructs a normal distribution using the sample mean and variance from fit. It then imputes missing values with a random draw from the respective distribution.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: np.array – imputed dataset.

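The two stages described above can be sketched as follows. This is a plain-Python illustration with hypothetical function names: fit_norm estimates the mean and the sample standard deviation (square root of the sample variance, n - 1 denominator) from the observed values, and impute_norm draws one value per missing entry.

```python
import random
from statistics import mean, stdev


def fit_norm(values):
    """Estimate the mean and sample standard deviation of the observed values."""
    observed = [v for v in values if v is not None]
    return mean(observed), stdev(observed)


def impute_norm(values, mu, sigma, seed=None):
    """Replace each missing value with an independent draw from N(mu, sigma^2)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


series = [10.0, None, 12.0, 14.0, None]
mu, sigma = fit_norm(series)   # mean 12.0, sample standard deviation 2.0
filled = impute_norm(series, mu, sigma, seed=3)
print(mu, sigma)
print(filled)
```

Because draws come from an unbounded distribution, imputed values can fall outside the observed range, which matters for variables that must stay positive.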
class autoimpute.imputations.series.CategoricalImputer[source]

Impute missing data w/ draw from dataset’s categorical distribution.

The categorical imputer computes the proportion of observed values for each category within a discrete dataset. The imputer then samples the distribution to impute missing values with a respective random draw. The imputer can be used directly, but such behavior is discouraged. CategoricalImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”categorical”).

__init__()[source]

Create an instance of the CategoricalImputer class.

fit(X, y=None)[source]

Fit the Imputer to the dataset and calculate proportions.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. It constructs a categorical distribution for each feature using the proportions of observed values from fit. It then imputes missing values with a random draw from the respective distribution.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: np.array – imputed dataset.

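The proportion-then-sample idea can be sketched in plain Python. The function names are hypothetical; the data is a list with None for missing values, and random.choices performs the weighted draw.

```python
import random
from collections import Counter


def fit_proportions(values):
    """Compute the share of each category among the observed values."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {category: n / len(observed) for category, n in counts.items()}


def impute_categorical(values, proportions, seed=None):
    """Replace each missing value with a draw weighted by the fitted proportions."""
    rng = random.Random(seed)
    categories = list(proportions)
    weights = [proportions[c] for c in categories]
    return [
        rng.choices(categories, weights=weights, k=1)[0] if v is None else v
        for v in values
    ]


pets = ["cat", "dog", None, "cat", "bird", None]
proportions = fit_proportions(pets)
print(proportions)   # {'cat': 0.5, 'dog': 0.25, 'bird': 0.25}
print(impute_categorical(pets, proportions, seed=0))
```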
class autoimpute.imputations.series.NOCBImputer(end=None)[source]

Impute missing data by carrying the next observation backward.

NOCBImputer carries the next observation backward to impute missing data. The imputer can be used directly, but such behavior is discouraged. NOCBImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”nocb”).

__init__(end=None)[source]

Create an instance of the NOCBImputer class.

Parameters: end (any, optional) – value to impute at the end of the series if the last observation is missing. Default is None, in which case the last observed value is used. Can also use “mean” to end with the mean of the series.
Returns: self. Instance of class.

fit(X, y=None)[source]

Fit the Imputer to the dataset and calculate the mean.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Missing values in a given dataset are replaced with the next observation carried backward.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: np.array – imputed dataset.

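Next observation carried backward, including the role of the end argument, can be sketched as below. impute_nocb is a hypothetical plain-Python function, not the library's implementation; it walks the series from the end so that each gap takes the next observed value.

```python
from statistics import mean


def impute_nocb(values, end=None):
    """Fill each missing value with the next observed value.

    `end` controls what fills trailing gaps, following the description above:
    None uses the last observed value, "mean" uses the mean of the observed
    values, and anything else is used as given.
    """
    observed = [v for v in values if v is not None]
    if end is None:
        tail_fill = observed[-1]
    elif end == "mean":
        tail_fill = mean(observed)
    else:
        tail_fill = end
    out = list(values)
    next_value = tail_fill
    for i in range(len(out) - 1, -1, -1):
        if out[i] is None:
            out[i] = next_value
        else:
            next_value = out[i]
    return out


series = [None, 2, None, None, 5, None]
print(impute_nocb(series))              # [2, 2, 5, 5, 5, 5]
print(impute_nocb(series, end=0))       # [2, 2, 5, 5, 5, 0]
print(impute_nocb(series, end="mean"))  # [2, 2, 5, 5, 5, 3.5]
```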
class autoimpute.imputations.series.LOCFImputer(start=None)[source]

Impute missing values by carrying the last observation forward.

LOCFImputer carries the last observation forward to impute missing data. The imputer can be used directly, but such behavior is discouraged. LOCFImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”locf”).

__init__(start=None)[source]

Create an instance of the LOCFImputer class.

Parameters: start (any, optional) – value to impute at the start of the series if the first observation is missing. Default is None, in which case the first observed value is used. Can also use “mean” to start with the mean of the series.
Returns: self. Instance of class.

fit(X, y=None)[source]

Fit the Imputer to the dataset.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Missing values in a given dataset are replaced with the last observation carried forward.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: np.array – imputed dataset.

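Last observation carried forward is the mirror image of NOCB. The sketch below is a hypothetical plain-Python function, not the library's implementation; the start argument follows the parameter description above.

```python
from statistics import mean


def impute_locf(values, start=None):
    """Fill each missing value with the most recent observed value.

    `start` controls what fills leading gaps: None uses the first observed
    value, "mean" uses the mean of the observed values, and anything else
    is used as given.
    """
    observed = [v for v in values if v is not None]
    if start is None:
        head_fill = observed[0]
    elif start == "mean":
        head_fill = mean(observed)
    else:
        head_fill = start
    out = []
    last_value = head_fill
    for v in values:
        if v is None:
            out.append(last_value)
        else:
            out.append(v)
            last_value = v
    return out


series = [None, 2, None, None, 5, None]
print(impute_locf(series))                # [2, 2, 2, 2, 5, 5]
print(impute_locf(series, start="mean"))  # [3.5, 2, 2, 2, 5, 5]
```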
class autoimpute.imputations.series.InterpolateImputer(fill_strategy='linear', start=None, end=None, order=None)[source]

Impute missing values using interpolation techniques.

The InterpolateImputer imputes missing values using a valid pd.Series interpolation strategy. See __init__ method docs for supported strategies. The imputer can be used directly, but such behavior is discouraged. InterpolateImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”interpolate”).

__init__(fill_strategy='linear', start=None, end=None, order=None)[source]

Create an instance of the InterpolateImputer class.

Parameters:
  • fill_strategy (str, Optional) – type of interpolation to perform. Default is linear. Other strategies supported include: time, quadratic, cubic, spline, barycentric, polynomial.
  • start (int, Optional) – value to impute if first number in Series is missing. Default is None, but first valid used when required for quadratic, cubic, polynomial.
  • end (int, Optional) – value to impute if last number in Series is missing. Default is None, but last valid used when required for quadratic, cubic, polynomial.
  • order (int, Optional) – if strategy is spline or polynomial, order must be number. Otherwise not considered.
Returns:

self. Instance of the class.

fill_strategy

Property getter to return the value of fill_strategy property.

fit(X, y=None)[source]

Fit the Imputer to the dataset. Nothing to calculate.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Missing values in a given dataset are replaced with results from interpolation.

Parameters: X (pd.Series) – Dataset to impute missing data from fit.
Returns: np.array – imputed dataset.

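Linear interpolation, the default fill_strategy, can be sketched in plain Python. interpolate_linear is a hypothetical function that assumes evenly spaced observations. As a simplification, leading and trailing gaps stay missing unless start or end is supplied, which may differ from how pandas or the library treats the edges.

```python
def interpolate_linear(values, start=None, end=None):
    """Fill interior gaps on the straight line between neighboring observations."""
    out = list(values)
    # Optional boundary values, loosely mirroring the start / end arguments.
    if out[0] is None and start is not None:
        out[0] = start
    if out[-1] is None and end is not None:
        out[-1] = end
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


print(interpolate_linear([1.0, None, None, 4.0, None, 6.0]))
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(interpolate_linear([None, None, 4.0], start=0.0))
# [0.0, 2.0, 4.0]
```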
class autoimpute.imputations.series.LeastSquaresImputer(**kwargs)[source]

Impute missing values using predictions from least squares regression.

The LeastSquaresImputer produces predictions using the least squares methodology. The predictions from the line of best fit given a set of predictors become the imputations. To implement least squares, the imputer wraps the sklearn LinearRegression class. The imputer can be used directly, but such behavior is discouraged. LeastSquaresImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”least squares”).

__init__(**kwargs)[source]

Create an instance of the LeastSquaresImputer class.

Parameters: **kwargs – keyword arguments passed to LinearRegression.

fit(X, y)[source]

Fit the Imputer to the dataset by fitting linear model.

Parameters:
  • X (pd.Dataframe) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.Dataframe) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X)[source]

Generate imputations using predictions from the fit linear model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the predictions from the least squares regression line of best fit. The impute method returns those predictions.

Parameters: X (pd.DataFrame) – predictors to determine imputed values.
Returns: imputed dataset.
Return type: np.array

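Regression-based imputation can be sketched with a single predictor and closed-form least squares. This is a dependency-free illustration, not the library's code, which wraps sklearn's LinearRegression and accepts several predictors. The names fit_line and impute_least_squares are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def impute_least_squares(x, y):
    """Fit on rows where y is observed, then predict y where it is missing."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in observed], [p[1] for p in observed])
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, None, 9.0, None]   # observed points lie on y = 2x + 1
imputed = impute_least_squares(x, y)
print(imputed)                    # missing entries are close to 7.0 and 11.0
```

Every imputed value lies exactly on the fitted line, so this approach understates the natural spread of the response.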
class autoimpute.imputations.series.StochasticImputer(**kwargs)[source]

Impute missing values adding error to least squares regression preds.

The StochasticImputer predicts using the least squares methodology. The imputer then samples from the regression’s error distribution and adds the random draw to the prediction. This draw adds the stochastic element to the imputations. The imputer can be used directly, but such behavior is discouraged. StochasticImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”stochastic”).

__init__(**kwargs)[source]

Create an instance of the StochasticImputer class.

Parameters: **kwargs – keyword arguments passed to LinearRegression.

fit(X, y)[source]

Fit the Imputer to the dataset by fitting linear model.

The fit step also generates predictions on the observed data. These predictions are necessary to derive the mean_squared_error, which is passed as a parameter to the impute phase. The MSE is used to create the normal error distribution from which the imputer draws.

Parameters:
  • X (pd.Dataframe) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.Dataframe) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute
Returns:

imputed dataset.

Return type:

np.array

impute(X)[source]

Generate imputations using predictions from the fit linear model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the predictions from the least squares regression line of best fit plus a random draw from the normal error distribution.

Parameters: X (pd.DataFrame) – predictors to determine imputed values.
Returns: imputed dataset.
Return type: np.array

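Stochastic regression imputation adds a random error to each least squares prediction. The sketch below uses one predictor and plain Python. It assumes the error distribution is normal with mean 0 and standard deviation sqrt(MSE), where the MSE comes from residuals on the observed rows; the exact construction inside the library may differ. Function names are hypothetical.

```python
import math
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def impute_stochastic(x, y, seed=None):
    """Predict missing y values, then add a draw from N(0, MSE) to each."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in observed], [p[1] for p in observed])
    # Mean squared error of the fit on the observed rows.
    mse = mean([(yi - (intercept + slope * xi)) ** 2 for xi, yi in observed])
    sigma = math.sqrt(mse)  # assumption: error std. deviation is sqrt(MSE)
    return [
        intercept + slope * xi + rng.gauss(0.0, sigma) if yi is None else yi
        for xi, yi in zip(x, y)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, None, 8.2, 9.8, None]
print(impute_stochastic(x, y, seed=7))
```

The added noise restores some of the variability that deterministic regression imputation removes.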
class autoimpute.imputations.series.BinaryLogisticImputer(**kwargs)[source]

Impute missing values w/ predictions from binary logistic regression.

The BinaryLogisticImputer produces predictions using logistic regression with two classes. The class predictions given a set of predictors become the imputations. To implement logistic regression, the imputer wraps the sklearn LogisticRegression class with a default solver (liblinear). The imputer can be used directly, but such behavior is discouraged. BinaryLogisticImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”binary logistic”).

__init__(**kwargs)[source]

Create an instance of the BinaryLogisticImputer class.

Parameters: **kwargs – keyword arguments passed to LogisticRegression.

fit(X, y)[source]

Fit the Imputer to the dataset by fitting logistic model.

Parameters:
  • X (pd.Dataframe) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.Dataframe) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X)[source]

Generate imputations using predictions from the fit logistic model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the predictions from the logistic regression class specification.

Parameters: X (pd.DataFrame) – predictors to determine imputed values.
Returns: imputed dataset.
Return type: np.array

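Imputing a binary response from class predictions can be sketched as follows. To stay dependency-free, this sketch swaps in unregularized batch gradient descent with one predictor in place of sklearn's LogisticRegression and its liblinear solver, so the fitted coefficients will not match the library's. The 0.5 cutoff and all function names are assumptions of the sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.1, steps=20000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(x, y)]
        w -= lr * sum(e * xi for e, xi in zip(errors, x)) / n
        b -= lr * sum(errors) / n
    return w, b


def impute_binary(x, y):
    """Fit on rows where y is observed, then predict a class where it is missing."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    w, b = fit_logistic([p[0] for p in observed], [p[1] for p in observed])
    return [
        (1 if sigmoid(w * xi + b) >= 0.5 else 0) if yi is None else yi
        for xi, yi in zip(x, y)
    ]


x = [0.5, 0.8, 1.0, 1.5, 3.0, 3.5, 3.8, 4.0]
y = [0, None, 0, 0, 1, 1, None, 1]
print(impute_binary(x, y))   # [0, 0, 0, 0, 1, 1, 1, 1]
```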
class autoimpute.imputations.series.MultinomialLogisticImputer(**kwargs)[source]

Impute missing values w/ preds from multinomial logistic regression.

The MultinomialLogisticImputer produces predictions using logistic regression with more than two classes. Class predictions given a set of predictors become the imputations. To implement logistic regression, the imputer wraps the sklearn LogisticRegression class with a default solver (saga) and default multi_class set to multinomial. The imputer can be used directly, but such behavior is discouraged. MultinomialLogisticImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”multinomial logistic”).

__init__(**kwargs)[source]

Create an instance of the MultinomialLogisticImputer class.

Parameters: **kwargs – keyword arguments passed to LogisticRegression.

fit(X, y)[source]

Fit the Imputer to the dataset by fitting logistic model.

Parameters:
  • X (pd.Dataframe) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.Dataframe) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X)[source]

Generate imputations using predictions from the fit logistic model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the predictions from the logistic regression class specification.

Parameters: X (pd.DataFrame) – predictors to determine imputed values.
Returns: imputed dataset.
Return type: np.array

class autoimpute.imputations.series.BayesianLeastSquaresImputer(**kwargs)[source]

Impute missing values using bayesian least squares regression.

The BayesianLeastSquaresImputer produces predictions using the bayesian approach to least squares. Prior distributions are fit for the model parameters of interest (alpha, beta, epsilon). Imputations for missing values are samples from the posterior predictive distribution of each missing point. To implement bayesian least squares, the imputer uses the pymc library. The imputer can be used directly, but such behavior is discouraged. BayesianLeastSquaresImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”bayesian least squares”).

__init__(**kwargs)[source]

Create an instance of the BayesianLeastSquaresImputer class.

The class requires multiple arguments necessary to create priors for a bayesian linear regression equation. The regression is: alpha + beta * X + epsilon. Because parameters are treated as random variables, we must specify their distributions, including the parameters of those distributions. In the init method we also include arguments used to sample the posterior distributions.

Parameters:
  • **kwargs – default keyword arguments used for bayesian analysis. Note - kwargs popped for default arguments defined below. Rest of kwargs passed as params to sampling (see pymc).
  • am (float, Optional) – mean of alpha prior. Default 0.
  • asd (float, Optional) – std. deviation of alpha prior. Default 10.
  • bm (float, Optional) – mean of beta priors. Default 0.
  • bsd (float, Optional) – std. deviation of beta priors. Default 10.
  • sig (float, Optional) – parameter of sigma prior. Default 1.
  • sample (int, Optional) – number of posterior samples per chain. Default = 1000. More samples, longer to run, but better approximation of the posterior & chance of convergence.
  • tune (int, Optional) – parameter for tuning. Draws done in addition to sample. Default = 1000.
  • init (str, Optional) – MCMC algo to use for posterior sampling. Default = ‘auto’. See pymc docs for more info on choices.
  • fill_value (str, Optional) – How to draw from the posterior to create imputations. Default is None. ‘random’ and ‘mean’ supported for explicit options.
fit(X, y)[source]

Fit the Imputer to the dataset by fitting bayesian model.

Parameters:
  • X (pd.Dataframe) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.Dataframe) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X, k=None)[source]

Generate imputations using predictions from the fit bayesian model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with samples from the posterior predictive distribution of each missing data point.

Parameters:
  • X (pd.DataFrame) – predictors to determine imputed values.
  • k (integer) – optional, pass if and only if receiving from MICE
Returns:

imputed dataset.

Return type:

np.array

class autoimpute.imputations.series.BayesianBinaryLogisticImputer(**kwargs)[source]

Impute missing values using bayesian binary logistic regression.

The BayesianBinaryLogisticImputer produces predictions using the bayesian approach to logistic regression. Prior distributions are fit for the model parameters of interest (alpha, beta, epsilon). Imputations for missing values are samples from the posterior predictive distribution of each missing point. To implement bayesian logistic regression, the imputer uses the pymc library. The imputer can be used directly, but such behavior is discouraged. BayesianBinaryLogisticImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”bayesian binary logistic”).

__init__(**kwargs)[source]

Create an instance of the BayesianBinaryLogisticImputer class.

The class requires multiple arguments necessary to create priors for a bayesian logistic regression equation. The parameters are the same as linear regression, but the regression equation is transformed using pymc’s invlogit method. Because parameters are treated as random variables, we must specify their distributions, including the parameters of those distributions. In the init method we also include arguments used to sample the posterior distributions.

Parameters:
  • **kwargs – default keyword arguments used for bayesian analysis. Note - kwargs popped for default arguments defined below. Rest of kwargs passed as params to sampling (see pymc).
  • am (float, Optional) – mean of alpha prior. Default 0.
  • asd (float, Optional) – std. deviation of alpha prior. Default 10.
  • bm (float, Optional) – mean of beta priors. Default 0.
  • bsd (float, Optional) – std. deviation of beta priors. Default 10.
  • thresh (float, Optional) – threshold for class membership. Default 0.5. Max = 1, min = 0. Tune threshold depending on class imbalance. Same as with logistic regression equation.
  • sample (int, Optional) – number of posterior samples per chain. Default = 1000. More samples take longer to run but give a better approximation of the posterior and a better chance of convergence.
  • tune (int, Optional) – number of tuning draws, done in addition to sample. Default = 1000.
  • init (str, Optional) – initialization method passed to pymc's sampler. Default = ‘auto’. See pymc docs for more info on choices.
  • fill_value (str, Optional) – How to draw from the posterior to create imputations. Default is None. ‘random’ and ‘mean’ supported for explicit options.
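
The role of thresh can be sketched with the standard library alone. The helper names invlogit and classify, and the use of >= at the threshold, are assumptions for illustration; the sketch shows the general inverse-logit-then-threshold pattern, not Autoimpute's code.

```python
import math


def invlogit(z):
    """Inverse logit (sigmoid): maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, alpha, betas, thresh=0.5):
    """Return (probability, class label) for one row of predictors.

    Hypothetical helper: the linear predictor alpha + sum(beta_i * x_i)
    is passed through invlogit, then compared against thresh.
    """
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    p = invlogit(z)
    return p, int(p >= thresh)


# Linear predictor: -1.0 + 0.5 * 1.0 + 0.5 * 2.0 = 0.5
p, label_default = classify([1.0, 2.0], alpha=-1.0, betas=[0.5, 0.5])
_, label_strict = classify([1.0, 2.0], alpha=-1.0, betas=[0.5, 0.5], thresh=0.7)
```

Raising thresh above the predicted probability flips the imputed class from 1 to 0, which is why the threshold is worth tuning when classes are imbalanced.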
fit(X, y)[source]

Fit the Imputer to the dataset by fitting bayesian model.

Parameters:
  • X (pd.DataFrame) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.DataFrame) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X, k=None)[source]

Generate imputations using predictions from the fit bayesian model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the samples from the posterior predictive distribution of each missing data point.

Parameters:
  • X (pd.DataFrame) – predictors to determine imputed values.
  • k (integer) – optional, pass if and only if receiving from MICE
Returns:

imputed dataset.

Return type:

np.array

class autoimpute.imputations.series.PMMImputer(**kwargs)[source]

Impute missing values using predictive mean matching.

The PMMImputer produces predictions using a combination of bayesian approach to least squares and least squares itself. For each missing value PMM finds the n closest neighbors from a least squares regression prediction set, and samples from the corresponding true values for y of each of those n predictions. The imputation is the resulting sample. To implement bayesian least squares, the imputer utilizes the pymc library. The imputer can be used directly, but such behavior is discouraged. PMMImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”pmm”).
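
The neighbor-matching step can be sketched as follows, using only the standard library. The function pmm_impute and its arguments are hypothetical: pred_missing stands in for the bayesian model's predictions for rows where y is missing, and pred_observed for the least squares predictions of rows where y is observed. This is a minimal sketch of predictive mean matching in general, not Autoimpute's implementation.

```python
import random


def pmm_impute(pred_missing, pred_observed, y_observed, neighbors=5, rng=None):
    """For each prediction of a missing y, sample an observed y from the
    `neighbors` donors whose own predictions are closest to it."""
    rng = rng or random.Random()
    if not 0 < neighbors < len(y_observed):
        raise ValueError("neighbors must be > 0 and less than the number observed")
    imputations = []
    for p in pred_missing:
        # Rank donors by the distance between their prediction and p.
        ranked = sorted(
            range(len(pred_observed)), key=lambda i: abs(pred_observed[i] - p)
        )
        donor = rng.choice(ranked[:neighbors])
        # The imputation is the donor's actually observed value.
        imputations.append(y_observed[donor])
    return imputations


# Made-up predictions and observed responses.
pred_observed = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_observed = [1.1, 1.9, 3.2, 9.8, 11.3, 12.1]
imps = pmm_impute([2.1, 10.9], pred_observed, y_observed,
                  neighbors=3, rng=random.Random(42))
```

Because every imputation is copied from a donor, imputed values always lie within the range of the observed data.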

__init__(**kwargs)[source]

Create an instance of the PMMImputer class.

The class requires multiple arguments necessary to create priors for a bayesian linear regression equation and least squares itself. Therefore, PMM arguments include all of those seen in bayesian least squares and least squares itself. New parameters include neighbors, or the number of neighbors that PMM uses to sample observed values.

Parameters:
  • **kwargs – default keyword arguments for lm & bayesian analysis. Note - kwargs popped for default arguments defined below. Next set of kwargs popped and sent to linear regression. Rest of kwargs passed as params to sampling (see pymc).
  • am (float, Optional) – mean of alpha prior. Default 0.
  • asd (float, Optional) – std. deviation of alpha prior. Default 10.
  • bm (float, Optional) – mean of beta priors. Default 0.
  • bsd (float, Optional) – std. deviation of beta priors. Default 10.
  • sig (float, Optional) – parameter of sigma prior. Default 1.
  • sample (int, Optional) – number of posterior samples per chain. Default = 1000. More samples take longer to run but give a better approximation of the posterior and a better chance of convergence.
  • tune (int, Optional) – number of tuning draws, done in addition to sample. Default = 1000.
  • init (str, Optional) – initialization method passed to pymc's sampler. Default = ‘auto’. See pymc docs for more info on choices.
  • fill_value (str, Optional) – How to draw from the posterior to create imputations. Default is “random”. ‘random’ and ‘mean’ supported for explicit options.
  • neighbors (int, Optional) – number of neighbors. Default is 5. Value should be greater than 0 and less than # observed, although anything greater than 10-20 is generally too high unless the dataset is massive.
  • fit_intercept (bool, Optional) – sklearn LinearRegression param.
  • normalize (bool, Optional) – sklearn LinearRegression param.
  • copy_x (bool, Optional) – sklearn LinearRegression param.
  • n_jobs (int, Optional) – sklearn LinearRegression param.
fit(X, y)[source]

Fit the Imputer to the dataset by fitting bayesian and LS model.

Parameters:
  • X (pd.DataFrame) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.DataFrame) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X)[source]

Generate imputations using predictions from the fit bayesian model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the random selection from the PMM process. Again, PMM imputes actually observed values, and the observed values are selected by finding the closest least squares predictions to a given prediction from the bayesian model.

Parameters:
  • X (pd.DataFrame) – predictors to determine imputed values.
Returns:

imputed dataset.

Return type:

np.array

class autoimpute.imputations.series.LRDImputer(**kwargs)[source]

Impute missing values using local residual draws.

The LRDImputer produces predictions using a combination of bayesian approach to least squares and least squares itself. For each missing value LRD finds the n closest neighbors from a least squares regression prediction set, and samples from the corresponding true values for y of each of those n predictions. The imputation is the resulting sample plus the residual, or the distance between the prediction and the neighbor. To implement bayesian least squares, the imputer utilizes the pymc library. The imputer can be used directly, but such behavior is discouraged. LRDImputer does not have the flexibility / robustness of dataframe imputers, nor is its behavior identical. Preferred use is MultipleImputer(strategy=”lrd”).
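
One common textbook formulation of local residual draws is sketched below with the standard library: the imputation is the prediction for the missing row plus the residual (observed value minus prediction) of a randomly chosen nearby donor. The function lrd_impute and its arguments are hypothetical, and whether Autoimpute defines the residual in exactly this way is an assumption.

```python
import random


def lrd_impute(pred_missing, pred_observed, y_observed, neighbors=5, rng=None):
    """For each prediction of a missing y, pick a donor among the `neighbors`
    closest predictions and add that donor's residual to the prediction."""
    rng = rng or random.Random()
    if not 0 < neighbors < len(y_observed):
        raise ValueError("neighbors must be > 0 and less than the number observed")
    imputations = []
    for p in pred_missing:
        # Rank donors by the distance between their prediction and p.
        ranked = sorted(
            range(len(pred_observed)), key=lambda i: abs(pred_observed[i] - p)
        )
        donor = rng.choice(ranked[:neighbors])
        # Residual of the donor: observed value minus its own prediction.
        residual = y_observed[donor] - pred_observed[donor]
        imputations.append(p + residual)
    return imputations


# Made-up predictions and observed responses; neighbors=1 makes the
# donor choice deterministic (always the single closest prediction).
pred_observed = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_observed = [1.1, 1.9, 3.2, 9.8, 11.3, 12.1]
lrd_imps = lrd_impute([2.1, 10.9], pred_observed, y_observed, neighbors=1)
```

Unlike predictive mean matching, the result need not equal any observed value, since the donor's residual is added to the missing row's own prediction.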

__init__(**kwargs)[source]

Create an instance of the LRDImputer class.

The class requires multiple arguments necessary to create priors for a bayesian linear regression equation and least squares itself. Therefore, LRD arguments include all of those seen in bayesian least squares and least squares itself. New parameters include neighbors, or the number of neighbors that LRD uses to sample observed values.

Parameters:
  • **kwargs – default keyword arguments for lm & bayesian analysis. Note - kwargs popped for default arguments defined below. Next set of kwargs popped and sent to linear regression. Rest of kwargs passed as params to sampling (see pymc).
  • am (float, Optional) – mean of alpha prior. Default 0.
  • asd (float, Optional) – std. deviation of alpha prior. Default 10.
  • bm (float, Optional) – mean of beta priors. Default 0.
  • bsd (float, Optional) – std. deviation of beta priors. Default 10.
  • sig (float, Optional) – parameter of sigma prior. Default 1.
  • sample (int, Optional) – number of posterior samples per chain. Default = 1000. More samples take longer to run but give a better approximation of the posterior and a better chance of convergence.
  • tune (int, Optional) – number of tuning draws, done in addition to sample. Default = 1000.
  • init (str, Optional) – initialization method passed to pymc's sampler. Default = ‘auto’. See pymc docs for more info on choices.
  • fill_value (str, Optional) – How to draw from the posterior to create imputations. Default is “random”. ‘random’ and ‘mean’ supported for explicit options.
  • neighbors (int, Optional) – number of neighbors. Default is 5. Value should be greater than 0 and less than # observed, although anything greater than 10-20 is generally too high unless the dataset is massive.
  • fit_intercept (bool, Optional) – sklearn LinearRegression param.
  • copy_x (bool, Optional) – sklearn LinearRegression param.
  • n_jobs (int, Optional) – sklearn LinearRegression param.
fit(X, y)[source]

Fit the Imputer to the dataset by fitting bayesian and LS model.

Parameters:
  • X (pd.DataFrame) – dataset to fit the imputer.
  • y (pd.Series) – response, which is eventually imputed.
Returns:

self. Instance of the class.

fit_impute(X, y)[source]

Fit impute method to generate imputations where y is missing.

Parameters:
  • X (pd.DataFrame) – predictors in the dataset.
  • y (pd.Series) – response w/ missing values to impute.
Returns:

imputed dataset.

Return type:

np.array

impute(X)[source]

Generate imputations using predictions from the fit bayesian model.

The impute method returns the values for imputation. Missing values in a given dataset are replaced with the random selection from the LRD process. As with PMM, the observed values are selected by finding the closest least squares predictions to a given prediction from the bayesian model; LRD then adds the residual to the sampled value, as described in the class summary.

Parameters:
  • X (pd.DataFrame) – predictors to determine imputed values.
Returns:

imputed dataset.

Return type:

np.array

class autoimpute.imputations.series.NormUnitVarianceImputer[source]

Impute missing values assuming normally distributed data with unknown mean and known variance.
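
For intuition, the textbook normal model with unknown mean and known unit variance can be sketched with the standard library. Under a flat prior on the mean, the posterior of the mean is Normal(ybar, 1/n), and a missing value is a draw from Normal(mu, 1). The function name, the use of None for missing entries, the flat prior, and the random draws are all assumptions of this sketch and may differ from what the class actually does.

```python
import math
import random
import statistics


def norm_unit_variance_impute(values, rng=None):
    """Fill None entries assuming y ~ Normal(mu, 1) with a flat prior on mu.

    Hypothetical helper, not part of Autoimpute.
    """
    rng = rng or random.Random()
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    n = len(observed)
    ybar = statistics.fmean(observed)
    # Known variance 1 and a flat prior give mu | data ~ Normal(ybar, 1/n),
    # so the posterior standard deviation is 1 / sqrt(n).
    mu = rng.gauss(ybar, 1.0 / math.sqrt(n))
    # Each missing entry is a draw from Normal(mu, 1); observed values are kept.
    return [rng.gauss(mu, 1.0) if v is None else v for v in values]


filled = norm_unit_variance_impute([1.0, None, 3.0, None], rng=random.Random(1))
```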

__init__()[source]

Create an instance of the NormUnitVarianceImputer class.

fit(X, y)[source]

Fit the Imputer to the dataset and calculate the mean.

Parameters:
  • X (pd.Series) – Dataset to fit the imputer.
  • y (None) – ignored, None to meet requirements of base class
Returns:

self. Instance of the class.

fit_impute(X, y=None)[source]

Convenience method to perform fit and imputation in one go.

impute(X)[source]

Perform imputations using the statistics generated from fit.

The impute method handles the actual imputation. Missing values in a given dataset are replaced with the respective mean from fit.

Parameters:
  • X (pd.Series) – Dataset to impute missing data from fit.
Returns:

imputed dataset.

Return type:

np.array