Utility Methods

Methods to numerically assess patterns in missing data.

This module is a collection of methods to explore missing data and its patterns. The module’s methods are heavily influenced by those found in section 4.1 of Flexible Imputation of Missing Data (Van Buuren). Their main purpose is to identify trends and patterns in missing data that can help inform what type of imputation method may apply or what cautions to take when performing imputations in general.

autoimpute.utils.patterns.flux(data)

Calculates inbound, influx, outbound, outflux, and pobs for a DataFrame.

Port of Van Buuren’s flux method in R. Calculates:
  • pobs: Proportion observed (column from the proportions method).
  • ainb: Average inbound statistic.
  • aout: Average outbound statistic.
  • influx: Influx coefficient, Ij (from the influx method).
  • outflux: Outflux coefficient, Oj (from the outflux method).

Parameters:data (pd.DataFrame) – DataFrame to calculate relevant statistics.
Returns:
one column for each summary statistic.
Columns of DataFrame equal the name of the summary statistics. Indices of DataFrame equal the original DataFrame columns.
Return type:pd.DataFrame
autoimpute.utils.patterns.get_stat_for(func, data)

Generic method to get a missing data statistic from data.

This method can be used directly with helper methods, but this behavior is discouraged. Instead, use specific public methods below. These special methods utilize this function internally to compute summary statistics.

Parameters:
  • func (function) – Function that calculates a statistic.
  • data (pd.DataFrame) – DataFrame on which to run the function.
Returns:

Output from statistic chosen.

Return type:

np.ndarray

autoimpute.utils.patterns.inbound(data)

Calculates proportion of usable cases (Ijk) from Van Buuren 4.1.

Method ported from VB, called “inbound statistic”, Ijk. Ijk = 1 if variable Yk observed in all records where Yj missing. Used to quickly select potential predictors Yk for imputing Yj. High values are preferred.

Parameters:data (pd.DataFrame) – DataFrame to calculate inbound statistic.
Returns:
inbound statistic between each of the features.
Inbound between a feature and itself is 0.
Return type:pd.DataFrame
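
To make the definition concrete, the inbound statistic can be derived from pairwise counts with plain pandas. This is a minimal sketch, not autoimpute's own code: the helper name `inbound_sketch` and the toy DataFrame are hypothetical, and the formula used is Ijk = mr / (mr + mm), where mr counts rows with Yj missing and Yk observed and mm counts rows with both missing.

```python
import numpy as np
import pandas as pd

def inbound_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: proportion of usable cases, I_jk."""
    r = data.notnull().astype(int)   # 1 = observed
    m = 1 - r                        # 1 = missing
    mr = m.T.dot(r)                  # rows: Yj missing, columns: Yk observed
    mm = m.T.dot(m)                  # Yj and Yk both missing
    return mr / (mr + mm)            # NaN where Yj is never missing

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})
inb = inbound_sketch(df)
print(inb)
```

In this toy data, `c` is observed in every row where `a` is missing, so the entry in row `a`, column `c` is 1, which marks `c` as a candidate predictor for imputing `a`.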
autoimpute.utils.patterns.influx(data)

Calculates the influx coefficient (Ij) from Van Buuren 4.1.

Method ported from VB, called “influx coefficient”, Ij. Ij = # pairs (Yj,Yk) w/ Yj missing & Yk observed / # observed data cells. Value depends on the proportion of missing data of the variable. Influx of a completely observed variable is equal to 0. Influx for completely missing variables is equal to 1. For two variables with the same proportion of missing data:
  • Variable with higher influx is better connected to the observed data.
  • Variable with higher influx might thus be easier to impute.

Parameters:data (pd.DataFrame) – DataFrame to calculate influx coefficient.
Returns:influx coefficient for each column.
Return type:pd.DataFrame
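
The ratio described above can be checked by hand on a small example. The following sketch uses only pandas and is not autoimpute's implementation; `influx_sketch` and the toy data are hypothetical, and the result is returned as a Series for brevity.

```python
import numpy as np
import pandas as pd

def influx_sketch(data: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: I_j = (# pairs Yj missing, Yk observed) / # observed cells."""
    r = data.notnull().astype(int)   # 1 = observed
    m = 1 - r                        # 1 = missing
    mr = m.T.dot(r)                  # pair counts: Yj missing, Yk observed
    return mr.sum(axis=1) / r.values.sum()

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],            # completely observed
    "d": [np.nan, np.nan, np.nan, np.nan],  # completely missing
})
influx_values = influx_sketch(df)
print(influx_values)
```

The completely observed column `c` gets an influx of 0 and the completely missing column `d` gets 1, matching the boundary cases stated above.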
autoimpute.utils.patterns.md_locations(data, both=False)

Produces locations where values are missing in a DataFrame.

Takes in a DataFrame and identifies locations where data is complete or missing. Normally, fully complete data issues a warning and fully missing data throws an error. This method simply shows missingness locations, so the general requirement of a mixed complete-missing pattern does not apply. Method marks 1 = missing, 0 = not missing.

Parameters:
  • data (pd.DataFrame) – DataFrame to find missing & complete observations.
  • both (boolean, optional) – return data along with missingness indicator. Defaults to False, so just missingness indicator returned.
Returns:

Missingness indicator DataFrame or, if both is True, the missingness indicator DataFrame concatenated column-wise with the original DataFrame.

Return type:

pd.DataFrame

Raises:

TypeError – if data is not a DataFrame. Error raised through decorator.
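
The 1/0 coding described above corresponds to a null check in plain pandas. This is a sketch under assumptions, not the library's code: the `_missing` suffix and the column order of the concatenated frame are illustrative choices, since the source does not specify how the combined output is labeled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, 3.0],
})

# Missingness indicator: 1 = missing, 0 = not missing
indicator = df.isnull().astype(int)

# Rough analogue of both=True: original data next to its indicator.
# The suffix is a hypothetical naming choice to keep column names unique.
combined = pd.concat([df, indicator.add_suffix("_missing")], axis=1)
print(combined)
```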

autoimpute.utils.patterns.md_pairs(data)

Calculates pairwise missing data statistics.

This method mimics the behavior of MICE md.pairs.
  • rr: response-response pairs
  • rm: response-missing pairs
  • mr: missing-response pairs
  • mm: missing-missing pairs
Returns an n x n square matrix for each, where n = number of columns.

Parameters:data (pd.DataFrame) – DataFrame to calculate pairwise stats.
Returns:keys are pair types, values are DataFrames w/ pair stats.
Return type:dict
Raises:TypeError – if data is not a DataFrame. Error raised through decorator.
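
The four pair counts can be sketched as products of the observed and missing indicator matrices, which is how md.pairs in mice computes them. The helper name `md_pairs_sketch` and the toy data are hypothetical; this is an illustration rather than autoimpute's source.

```python
import numpy as np
import pandas as pd

def md_pairs_sketch(data: pd.DataFrame) -> dict:
    """Hypothetical helper: pairwise counts keyed by pair type."""
    r = data.notnull().astype(int)   # 1 = observed (response)
    m = 1 - r                        # 1 = missing
    return {
        "rr": r.T.dot(r),   # both observed
        "rm": r.T.dot(m),   # row variable observed, column variable missing
        "mr": m.T.dot(r),   # row variable missing, column variable observed
        "mm": m.T.dot(m),   # both missing
    }

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})
pairs = md_pairs_sketch(df)
print(pairs["mr"])
```

For any pair of columns, every row falls into exactly one of the four categories, so the four matrices sum to the number of rows in each cell.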
autoimpute.utils.patterns.md_pattern(data)

Calculates row-wise missing data statistics in input data.

Method is a port of md.pattern method from VB 4.1. The number of rows indicates the number of different row patterns of missingness. The ‘nmis’ column is the number of missing values in a given row pattern. The ‘count’ is number of total rows with a given row pattern. Cells in the pattern are binary (0/1) missingness indicators.

Parameters:data (pd.DataFrame) – DataFrame to calculate missing data pattern.
Returns:
DataFrame with missing data pattern and two
additional columns w/ row-wise stats: count and nmis.
Return type:pd.DataFrame
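
A row-pattern table of this kind can be approximated by grouping on the missingness indicator. The sketch below is not a port of the library code: `md_pattern_sketch` is a hypothetical helper, and the coding it uses (1 = missing, 0 = observed) is the sketch's own choice, so the layout and coding of the real output may differ.

```python
import numpy as np
import pandas as pd

def md_pattern_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per distinct missingness pattern."""
    cols = list(data.columns)
    ind = data.isnull().astype(int)                  # 1 = missing in this sketch
    pattern = ind.groupby(cols).size().reset_index(name="count")
    pattern["nmis"] = pattern[cols].sum(axis=1)      # missing values per pattern
    return pattern.sort_values("nmis").reset_index(drop=True)

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [1.0, 2.0, np.nan, np.nan, 5.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
})
pattern = md_pattern_sketch(df)
print(pattern)
```

Here five rows collapse into four patterns, and the fully observed pattern occurs twice.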
autoimpute.utils.patterns.nullility_corr(data, method='pearson')

Calculates the nullility correlation between features in a DataFrame.

Leverages pandas method to calculate correlation of nullility. Note that this method drops NA values to compute correlation. It also employs check_missingness decorator to ensure DataFrame not fully missing. If a DataFrame is fully observed, nothing is returned, as there is no nullility.

Parameters:
  • data (pd.DataFrame) – DataFrame to calculate nullility correlation.
  • method (string, optional) – correlation method to use. Default pearson, but spearman should be used with categorical or ordinal encoding.
Returns:

DataFrame with nullility correlation b/w each feature.

Return type:

pd.DataFrame

Raises:
  • TypeError – If data not pd.DataFrame. Raised through decorator.
  • ValueError – If DataFrame values all missing and none complete. Also raised through decorator.
  • ValueError – If method for correlation not an accepted method.
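
Nullility correlation amounts to correlating the missingness indicators of each feature. The following is a minimal pandas sketch, not the library's implementation; dropping columns whose indicator never varies is an assumption made here so that undefined correlations do not appear in the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],   # missing in the same rows as "a"
    "c": [1.0, 2.0, np.nan, np.nan],   # missing exactly where "a" is observed
    "d": [1.0, 2.0, 3.0, 4.0],         # fully observed, no nullility to correlate
})

nullity = df.isnull().astype(int)               # 1 = missing
nullity = nullity.loc[:, nullity.var() > 0]     # assumption: skip constant indicators
corr = nullity.corr(method="pearson")
print(corr)
```

A value near 1 means two features tend to be missing together, and a value near -1 means one tends to be missing when the other is observed.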
autoimpute.utils.patterns.nullility_cov(data)

Calculates the nullility covariance between features in a DataFrame.

Leverages pandas method to calculate covariance of nullility. Note that this method drops NA values to compute covariance. It also employs check_missingness decorator to ensure DataFrame not fully missing. If a DataFrame is fully observed, nothing is returned, as there is no nullility.

Parameters:

data (pd.DataFrame) – DataFrame to calculate nullility covariance.

Returns:

DataFrame with nullility covariance b/w each feature.

Return type:

pd.DataFrame

Raises:
  • TypeError – If data not pd.DataFrame. Raised through decorator.
  • ValueError – If DataFrame values all missing and none complete. Also raised through decorator.
autoimpute.utils.patterns.outbound(data)

Calculates the outbound statistic (Ojk) from Van Buuren 4.1.

Method ported from VB, called “outbound statistic”, Ojk. Ojk measures how observed data Yj connect to rest of missing data. Ojk = 1 if Yj observed in all records where Yk is missing. Used to evaluate whether Yj is a potential predictor for imputing Yk. High values are preferred.

Parameters:data (pd.DataFrame) – DataFrame to calculate outbound statistic.
Returns:
outbound statistic between each of the features.
Outbound between a feature and itself is 0.
Return type:pd.DataFrame
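
As an illustration, the outbound statistic can be formed from pairwise counts as Ojk = rm / (rm + rr), where rm counts rows with Yj observed and Yk missing and rr counts rows with both observed. This is a sketch with a hypothetical helper name and toy data, not autoimpute's code.

```python
import numpy as np
import pandas as pd

def outbound_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: outbound statistic, O_jk."""
    r = data.notnull().astype(int)   # 1 = observed
    m = 1 - r                        # 1 = missing
    rm = r.T.dot(m)                  # rows: Yj observed, columns: Yk missing
    rr = r.T.dot(r)                  # Yj and Yk both observed
    return rm / (rm + rr)            # NaN where Yj is never observed

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})
outb = outbound_sketch(df)
print(outb)
```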
autoimpute.utils.patterns.outflux(data)

Calculates the outflux coefficient (Oj) from Van Buuren 4.1.

Method ported from VB, called “outflux coefficient”, Oj. Oj = # pairs w/ Yj observed and Yk missing / # incomplete data cells. Value depends on the proportion of missing data of the variable. Outflux of a completely observed variable is equal to 1. Outflux of a completely missing variable is equal to 0. For two variables having the same proportion of missing data:
  • Variable with higher outflux is better connected to the missing data.
  • Variable with higher outflux more useful for imputing other variables.

Parameters:data (pd.DataFrame) – DataFrame to calculate outflux coefficient.
Returns:outflux coefficient for each column.
Return type:pd.DataFrame
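
The outflux ratio can be verified on a small example with plain pandas. The helper `outflux_sketch` and the data are hypothetical, and the Series return type is a simplification; treat this as a sketch of the formula rather than the library's implementation.

```python
import numpy as np
import pandas as pd

def outflux_sketch(data: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: O_j = (# pairs Yj observed, Yk missing) / # missing cells."""
    r = data.notnull().astype(int)   # 1 = observed
    m = 1 - r                        # 1 = missing
    rm = r.T.dot(m)                  # pair counts: Yj observed, Yk missing
    return rm.sum(axis=1) / m.values.sum()

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],            # completely observed
    "d": [np.nan, np.nan, np.nan, np.nan],  # completely missing
})
outflux_values = outflux_sketch(df)
print(outflux_values)
```

The completely observed column `c` reaches an outflux of 1 and the completely missing column `d` gets 0, in line with the boundary cases above.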
autoimpute.utils.patterns.proportions(data)

Calculates the proportions of the data missing and data observed.

Method calculates two arrays:
  • poms: Proportion of missing data.
  • pobs: Proportion of observed data.

Parameters:data (pd.DataFrame) – DataFrame to calculate proportions.
Returns:
two columns, one for poms and one for pobs.
The sum of each row should equal 1. Index = original data cols.
Return type:pd.DataFrame
Raises:TypeError – if data not DataFrame. Error raised through decorator.
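
Both proportions are column means of the null and non-null indicators. A minimal pandas sketch of the described output shape, not the library's code, with hypothetical toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Index = original columns; one column per proportion
props = pd.DataFrame({
    "poms": df.isnull().mean(),    # proportion missing
    "pobs": df.notnull().mean(),   # proportion observed
})
print(props)
```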