Analysis Models

This section documents analysis models within Autoimpute and their respective diagnostics.

The MiLinearRegression and MiLogisticRegression classes extend linear and logistic regression to multiply imputed datasets. Under the hood, each regression class uses a MiceImputer to handle missing data prior to supervised analysis. Users of each regression class can tweak the underlying MiceImputer through the mi_kwgs argument or pass a pre-configured instance to the mi argument (recommended).

Users can also specify whether the classes should use sklearn or statsmodels to implement linear or logistic regression. The default is statsmodels. When statsmodels is used, end users get more detailed parameter diagnostics for regression on multiply imputed data.

Finally, this section documents diagnostic helper methods that assess the bias of parameters estimated from a regression model.

Linear Regression for Multiply Imputed Data

class autoimpute.analysis.MiLinearRegression(mi=None, model_lib='statsmodels', mi_kwgs=None, model_kwgs=None)[source]

Linear Regression wrapper for multiply imputed datasets.

The MiLinearRegression class wraps the sklearn and statsmodels libraries to extend linear regression to multiply imputed datasets. The class wraps statsmodels as well as sklearn because sklearn alone does not provide sufficient functionality to pool estimates under Rubin's rules. sklearn is designed for machine learning; therefore, it lacks important inference capabilities, such as easily calculating standard error estimates for parameters. If users want inference from regression analysis of multiply imputed data, they should utilize the statsmodels implementation in this class instead.

linear_models

linear models used by supported python libs.

Type:dict
__init__(mi=None, model_lib='statsmodels', mi_kwgs=None, model_kwgs=None)[source]

Create an instance of the Autoimpute MiLinearRegression class.

Parameters:
  • mi (MiceImputer, Optional) – An instance of a MiceImputer. Default is None. Can create one through mi_kwgs instead.
  • model_lib (str, Optional) – library the regressor will use to implement regression. Options are sklearn and statsmodels. Default is statsmodels.
  • mi_kwgs (dict, Optional) – keyword args to instantiate MiceImputer. Default is None. If valid MiceImputer passed as mi argument, then mi_kwgs ignored.
  • model_kwgs (dict, Optional) – keyword args to instantiate regressor. Default is None.
Returns:

self. Instance of the class.

fit(X, y)[source]

Fit model specified to multiply imputed dataset.

Fit a linear regression on multiply imputed datasets. The method first creates multiply imputed data using the MiceImputer instantiated when creating an instance of the class. It then runs a linear model on each of the m datasets. The linear model comes from sklearn or statsmodels. Finally, the fit method calculates pooled parameters from the m linear models. Note that variance for pooled parameters using Rubin's rules is available for statsmodels only. sklearn does not implement parameter inference out of the box. Autoimpute sklearn pooling TBD.

Parameters:
  • X (pd.DataFrame) – predictors to use. can contain missingness.
  • y (pd.Series, pd.DataFrame) – response. can contain missingness.
Returns:

self. Instance of the class
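The pooling step described above can be sketched with numpy. This is a concept sketch of Rubin's rules, not Autoimpute's internals; the estimates, variances, and m = 3 are made-up values for illustration.

```python
import numpy as np

# Coefficient estimates and their variances from m = 3 fitted models:
# one row per imputed dataset, one column per parameter (illustrative values).
estimates = np.array([[1.02, 0.48],
                      [0.98, 0.52],
                      [1.00, 0.50]])
variances = np.array([[0.04, 0.01],
                      [0.05, 0.01],
                      [0.04, 0.01]])
m = estimates.shape[0]

q_bar = estimates.mean(axis=0)         # pooled point estimates
w = variances.mean(axis=0)             # within-imputation variance
b = estimates.var(axis=0, ddof=1)      # between-imputation variance
t = w + (1 + 1 / m) * b                # total variance under Rubin's rules
pooled_se = np.sqrt(t)                 # pooled standard errors
```

The between-imputation term is what sklearn alone cannot supply a standard error for, which is why the statsmodels backend is required for full inference.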

predict(X)[source]

Make predictions using statistics generated from fit.

The regression uses the pooled parameters from each of the imputed datasets to generate a set of single predictions. The pooled params come from multiply imputed datasets, but the predictions themselves follow the same rules as an ordinary linear regression.

Parameters:X (pd.DataFrame) – data to make predictions using pooled params.
Returns:predictions.
Return type:np.array
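Prediction with pooled parameters works like an ordinary linear regression; a minimal numpy sketch, where the pooled coefficients and inputs are illustrative rather than taken from a fitted model:

```python
import numpy as np

q_bar = np.array([2.0, 0.5, -1.0])   # pooled [intercept, beta1, beta2] (illustrative)
X = np.array([[1.0, 3.0],
              [2.0, 1.0]])           # new observations, one row each

X_design = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
preds = X_design @ q_bar                           # ordinary linear predictions
```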
summary()[source]

Provide a summary for model parameters, variance, and metrics.

The summary method brings together the statistics generated from fit as well as the variance ratios, if available. The statistics are far more valuable when using statsmodels than sklearn.

Returns:summary statistics
Return type:pd.DataFrame
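The shape of the table summary returns can be sketched with pandas, continuing the Rubin's-rules pooling idea. The column and index names here are illustrative, not necessarily the ones Autoimpute emits.

```python
import numpy as np
import pandas as pd

q_bar = np.array([1.00, 0.50])       # pooled coefficients (illustrative)
pooled_se = np.array([0.21, 0.10])   # pooled standard errors (illustrative)

summary = pd.DataFrame(
    {"coefs": q_bar, "std": pooled_se},
    index=["intercept", "x1"],
)
```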

Logistic Regression for Multiply Imputed Data

class autoimpute.analysis.MiLogisticRegression(mi=None, model_lib='statsmodels', mi_kwgs=None, model_kwgs=None)[source]

Logistic Regression wrapper for multiply imputed datasets.

The MiLogisticRegression class wraps the sklearn and statsmodels libraries to extend logistic regression to multiply imputed datasets. The class wraps statsmodels as well as sklearn because sklearn alone does not provide sufficient functionality to pool estimates under Rubin's rules. sklearn is designed for machine learning; therefore, it lacks important inference capabilities, such as easily calculating standard error estimates for parameters. If users want inference from regression analysis of multiply imputed data, they should utilize the statsmodels implementation in this class instead.

logistic_models

logistic models used by supported python libs.

Type:dict
__init__(mi=None, model_lib='statsmodels', mi_kwgs=None, model_kwgs=None)[source]

Create an instance of the Autoimpute MiLogisticRegression class.

Parameters:
  • mi (MiceImputer, Optional) – An instance of a MiceImputer. Default is None. Can create one through mi_kwgs instead.
  • model_lib (str, Optional) – library the regressor will use to implement regression. Options are sklearn and statsmodels. Default is statsmodels.
  • mi_kwgs (dict, Optional) – keyword args to instantiate MiceImputer. Default is None. If valid MiceImputer passed as mi argument, then mi_kwgs ignored.
  • model_kwgs (dict, Optional) – keyword args to instantiate regressor. Default is None.
Returns:

self. Instance of the class.

fit(X, y)[source]

Fit model specified to multiply imputed dataset.

Fit a logistic regression on multiply imputed datasets. The method first creates multiply imputed data using the MiceImputer instantiated when creating an instance of the class. It then runs a logistic model on each of the m datasets. The logistic model comes from sklearn or statsmodels. Finally, the fit method calculates pooled parameters from the m logistic models. Note that variance for pooled parameters using Rubin's rules is available for statsmodels only. sklearn does not implement parameter inference out of the box.

Parameters:
  • X (pd.DataFrame) – predictors to use. can contain missingness.
  • y (pd.Series, pd.DataFrame) – response. can contain missingness.
Returns:

self. Instance of the class

predict(X, threshold=0.5)[source]

Make predictions using statistics generated from fit.

The predict method calls the predict_proba method, which returns the probability of class membership for each prediction. These probabilities range from 0 to 1. Anything below the set threshold is assigned to class 0, while anything above the threshold is assigned to class 1. The default threshold is 0.5, which is appropriate for a balanced dataset.

Parameters:
  • X (pd.DataFrame) – data to make predictions using pooled params.
  • threshold (float, Optional) – boundary for class membership. Default is 0.5. Values can range from 0 to 1.
Returns:

predictions.

Return type:

np.array

predict_proba(X)[source]

Predict probabilities of class membership for logistic regression.

The regression uses the pooled parameters from each of the imputed datasets to generate a set of single predictions. The pooled params come from multiply imputed datasets, but the predictions themselves follow the same rules as an ordinary logistic regression. Because this is logistic regression, the sigmoid function is applied to the linear predictor, giving us probabilities between 0 and 1 for each prediction. This method returns those probabilities.

Parameters:X (pd.DataFrame) – predictors to predict response
Returns:prob of class membership for predicted observations.
Return type:np.array
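The predict_proba/predict pair can be sketched with numpy: apply the sigmoid to the linear predictor to get probabilities, then threshold at 0.5 for class labels. The pooled coefficients here are illustrative, not from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

q_bar = np.array([-1.0, 2.0])        # pooled [intercept, beta] (illustrative)
X = np.array([[0.0], [1.0], [2.0]])  # new observations

X_design = np.column_stack([np.ones(len(X)), X])
probs = sigmoid(X_design @ q_bar)    # predict_proba: class-membership probabilities
preds = (probs > 0.5).astype(int)    # predict: apply the default 0.5 threshold
```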
summary()[source]

Provide a summary for model parameters, variance, and metrics.

The summary method brings together the statistics generated from fit as well as the variance ratios, if available. The statistics are far more valuable when using statsmodels than sklearn.

Returns:summary statistics
Return type:pd.DataFrame

Diagnostics

autoimpute.analysis.raw_bias(Q_bar, Q)[source]

Calculate raw bias between estimated coefficients Q_bar and true coefficients Q.

Q_bar can be one estimate (scalar) or a vector of estimates. This equation subtracts Q from the expected Q_bar, element-wise. The result is the bias of each coefficient from its true value.

Parameters:
  • Q_bar (number, array) – single estimate or array of estimates.
  • Q (number, array) – single truth or array of truths.
Returns:

element-wise difference between estimates and truths.

Return type:

scalar, array
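Taking the Returns description above (the difference between estimates and truths) as Q_bar − Q, raw bias can be illustrated with numpy; the values are made up:

```python
import numpy as np

Q_bar = np.array([1.05, 0.48, -0.90])   # pooled estimates (illustrative)
Q = np.array([1.00, 0.50, -1.00])       # true coefficients (illustrative)

raw = Q_bar - Q   # element-wise raw bias of each estimate from its truth
```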

autoimpute.analysis.percent_bias(Q_bar, Q)[source]

Calculate percent bias between estimated coefficients Q_bar and true coefficients Q.

Q_bar can be one estimate (scalar) or a vector of estimates. This equation subtracts Q from the expected Q_bar, element-wise. The result is the bias of each coefficient from its true value. We then divide this number by Q itself, again in element-wise fashion, to produce the percent bias.

Parameters:
  • Q_bar (number, array) – single estimate or array of estimates.
  • Q (number, array) – single truth or array of truths.
Returns:

element-wise percent bias of estimates relative to truths.

Return type:

scalar, array
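Percent bias follows by dividing the raw bias by the truth; a numpy sketch with made-up values (whether the library reports a fraction or a percentage is an implementation detail — here we scale by 100):

```python
import numpy as np

Q_bar = np.array([1.05, 0.48])   # pooled estimates (illustrative)
Q = np.array([1.00, 0.50])       # true coefficients (illustrative)

pct = 100 * (Q_bar - Q) / Q      # element-wise percent bias
```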
