Visualization Methods

This section documents visualization methods within Autoimpute.

Visualization methods support all functionality within Autoimpute, from missing data exploration to imputation analysis. The documentation below describes each visualization method and groups the methods into their respective categories. The categories correspond to other modules within Autoimpute.

NOTE: The visualization module is currently under development. While the functions outlined below are stable in 0.12.x, they might change thereafter.

Utility

Autoimpute comes with a number of utility methods to examine missing data before imputation takes place. This package supports these methods with a number of visualization techniques to explore patterns within missing data. The primary techniques wrap the excellent missingno package. Autoimpute simply wraps missingno so that its plots are available through an API consistent with the rest of this package. The methods appear below:

autoimpute.visuals.plot_md_locations(data, **kwargs)

Plot the locations where data is missing within a DataFrame.

Parameters:
  • data (pd.DataFrame) – DataFrame to plot.
  • **kwargs – Keyword arguments for plot. Passed to missingno.matrix.
Returns:

missingness location plot.

Return type:

matplotlib.axes._subplots.AxesSubplot

Raises:

TypeError – if data is not a DataFrame. Error raised through decorator.
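To make concrete what a missingness location plot displays, the sketch below builds the boolean mask of missing cells that such a plot renders. It is a plain-Python illustration and not Autoimpute's or missingno's implementation; the helper names `is_missing` and `missing_mask`, the sample rows, and the choice to treat `None` and NaN as missing are all assumptions made for this example.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_mask(rows, columns):
    """Return one list of booleans per row: True where a value is missing."""
    return [[is_missing(row.get(col)) for col in columns] for row in rows]


# Hypothetical sample data: one dict per row
rows = [
    {"age": 31, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 45, "income": float("nan")},
]
mask = missing_mask(rows, ["age", "income"])

for row_mask in mask:
    # '#' marks an observed cell, '.' marks a missing one
    print("".join("." if m else "#" for m in row_mask))
```

Each printed line corresponds to one row of the data, which is the same row-by-column layout a missingness matrix uses.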

autoimpute.visuals.plot_md_percent(data, **kwargs)

Plot the percentage of missing data by column within a DataFrame.

Parameters:
  • data (pd.DataFrame) – DataFrame to plot.
  • **kwargs – Keyword arguments for plot. Passed to missingno.bar.
Returns:

missingness percent plot.

Return type:

matplotlib.axes._subplots.AxesSubplot

Raises:

TypeError – if data is not a DataFrame. Error raised through decorator.
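The per-column quantity described above can be sketched in a few lines. This is a minimal illustration of the percentage of missing data by column, assuming rows are plain dictionaries and that `None` and NaN count as missing; `percent_missing` is a hypothetical helper, not part of Autoimpute or missingno.

```python
import math


def percent_missing(rows, columns):
    """Percentage of missing entries (None or NaN) for each column."""
    if not rows:
        raise ValueError("rows must not be empty")

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    n = len(rows)
    return {
        col: 100.0 * sum(is_missing(row.get(col)) for row in rows) / n
        for col in columns
    }


# Hypothetical sample data: 'age' is missing in 2 of 4 rows, 'income' in 1 of 4
rows = [
    {"age": 31, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 45, "income": float("nan")},
    {"age": None, "income": 61000.0},
]
pct = percent_missing(rows, ["age", "income"])
print(pct)
```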

autoimpute.visuals.plot_nullility_corr(data, **kwargs)

Plot the nullility correlation of missing data within a DataFrame.

Parameters:
  • data (pd.DataFrame) – DataFrame to plot.
  • **kwargs – Keyword arguments for plot. Passed to missingno.heatmap.
Returns:

nullility correlation plot.

Return type:

matplotlib.axes._subplots.AxesSubplot

Raises:
  • TypeError – if data is not a DataFrame. Error raised through decorator.
  • ValueError – dataset fully observed. Raised through helper method.

autoimpute.visuals.plot_nullility_dendogram(data, **kwargs)

Plot the nullility dendrogram of missing data within a DataFrame.

Parameters:
  • data (pd.DataFrame) – DataFrame to plot.
  • **kwargs – Keyword arguments for plot. Passed to missingno.dendrogram.
Returns:

nullility dendrogram plot.

Return type:

matplotlib.axes._subplots.AxesSubplot

Raises:
  • TypeError – if data is not a DataFrame. Error raised through decorator.
  • ValueError – dataset fully observed. Raised through helper method.
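The nullility plots above summarize how missingness in one column relates to missingness in another. As a rough sketch of that idea, the code below computes the Pearson correlation between the missingness indicators of two columns and raises a ValueError when both are fully observed, mirroring the documented error. It is an illustration under stated assumptions, not the code Autoimpute or missingno runs: `nullility_corr` is a hypothetical helper, and treating `None` and NaN as missing is a choice made for this example.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nullility_corr(rows, col_a, col_b):
    """Pearson correlation between the missingness indicators of two columns."""
    a = [1.0 if is_missing(r.get(col_a)) else 0.0 for r in rows]
    b = [1.0 if is_missing(r.get(col_b)) else 0.0 for r in rows]
    if not any(a) and not any(b):
        # Mirrors the documented ValueError for fully observed data
        raise ValueError("data is fully observed; nothing to correlate")
    n = len(rows)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when one indicator never varies
    return cov / math.sqrt(var_a * var_b)


# Hypothetical sample data: age and income are missing together,
# while score is missing exactly when age is observed
rows = [
    {"age": 31, "income": 52000.0, "score": None},
    {"age": None, "income": None, "score": 0.7},
    {"age": 45, "income": 61000.0, "score": None},
    {"age": None, "income": None, "score": 0.4},
]
together = nullility_corr(rows, "age", "income")
opposite = nullility_corr(rows, "age", "score")
print(together, opposite)
```

A value near 1 means two columns tend to be missing in the same rows, and a value near -1 means one tends to be missing where the other is observed.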

Imputation

Two main classes within Autoimpute are the SingleImputer and MultipleImputer. The visualization module within this package contains a number of techniques to visually assess the quality and performance of these imputers. The important methods appear below:

autoimpute.visuals.helpers._validate_data(d, mi, imp_col=None)

Private helper method to validate data vs multiple imputations.

Parameters:
  • d (list) – dataset returned from multiple imputation.
  • mi (MultipleImputer) – multiple imputer used to generate d.
  • imp_col (str) – column to plot. Should be a column with imputations.
Raises:

autoimpute.visuals.plot_imp_scatter(d, x, y, strategy, color=None, title='Jointplot after Imputation', h=8.27, imp_kwgs=None, a=0.5, marginals=None, obs_color='navy', imp_color='red', **plot_kwgs)

Plot the joint scatter and density plot after single imputation.

Use this method to visualize a scatterplot between two features, x and y, where y is imputed and x is a predictor used to impute y. This method performs single imputation and is useful for inspecting how an imputation strategy behaves on the data.

Parameters:
  • d (pd.DataFrame) – DataFrame with data to impute and plot.
  • x (str) – column to plot on x axis.
  • y (str) – column to plot on y axis and set color for imputation.
  • strategy (str) – imputation method for SingleImputer.
  • color (str, Optional) – which variable to color with imputations. Default is None, which means y is colored. The other option is to color “x”. Color should be the same as “x” or “y”.
  • title (str, Optional) – title of plot. Default is “Jointplot after Imputation”.
  • h (float, Optional) – height of the jointplot. Default is 8.27.
  • imp_kwgs (dict, Optional) – imp kwgs for SingleImputer procedure. Default is None.
  • a (float, Optional) – alpha for plot color. Default is 0.5.
  • marginals (dict, Optional) – dictionary of marginal plot args. Default is None, in which case the defaults are configured within the function itself.
  • obs_color (str, Optional) – color of observed. Default is navy.
  • imp_color (str, Optional) – color of imputations. Default is red.
  • **plot_kwgs – keyword arguments used by sns.set.
Raises:

ValueError – x and y must be names of columns in data
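The interaction between x, y, color, obs_color, and imp_color can be sketched without any plotting. The example below is a simplified reading of the parameter descriptions above, not Autoimpute's implementation: it uses mean imputation as a stand-in for the `strategy` argument, works on a list of dictionaries instead of a DataFrame, and the helper name `colored_points` is hypothetical.

```python
from statistics import mean


def colored_points(rows, x, y, color=None, obs_color="navy", imp_color="red"):
    """Mean-impute the colored column and tag each point with a plot color."""
    columns = set()
    for row in rows:
        columns.update(row)
    if x not in columns or y not in columns:
        raise ValueError("x and y must be names of columns in data")
    target = y if color is None else color  # default: color by y
    if target not in (x, y):
        raise ValueError("color must be the same as x or y")

    # Stand-in for a SingleImputer strategy: fill with the observed mean
    fill = mean(r[target] for r in rows if r[target] is not None)
    points = []
    for r in rows:
        was_missing = r[target] is None
        filled = dict(r, **{target: fill}) if was_missing else r
        points.append(
            (filled[x], filled[y], imp_color if was_missing else obs_color)
        )
    return points


# Hypothetical sample data: 'score' is missing for the second row
rows = [
    {"hours": 1.0, "score": 50.0},
    {"hours": 2.0, "score": None},
    {"hours": 3.0, "score": 70.0},
]
points = colored_points(rows, "hours", "score")
print(points)
```

Points whose colored variable was imputed receive imp_color, and all other points receive obs_color, which is what lets imputations stand out on the scatterplot.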

autoimpute.visuals.plot_imp_dists(d, mi, imp_col, title='Distributions after Imputation', include_observed=True, separate_observed=True, side_by_side=False, hist_observed=False, hist_imputed=False, gw=(0.5, 0.5), gh=(0.5, 0.5), **plot_kwgs)

Plot the density between imputations for a given column.

Use this method to plot the density of a given column after multiple imputation. The function can also plot the observed data from the column prior to imputation. Further, the user can specify whether the observed data should be separated into its own plot.

Parameters:
  • d (list) – dataset returned from multiple imputation.
  • mi (MultipleImputer) – multiple imputer used to generate d.
  • imp_col (str) – column to plot. Should be a column with imputations.
  • title (str, Optional) – title of plot. Default is “Distributions after Imputation”.
  • include_observed (bool, Optional) – whether or not to include observed data in the plot. Default is True. If False, observed data for imp_col will not be included as a distribution for density.
  • separate_observed (bool, Optional) – whether or not to separate the observed data when plotting against imputed. Default is True. If False, observed data distribution will be plotted on same plot as the imputed data distribution. Note, this attribute matters if and only if include_observed=True.
  • side_by_side (bool, Optional) – whether columns should be plotted next to each other or stacked vertically. Default is False. If True, plots will be plotted side-by-side. Note, this attribute matters if and only if include_observed=True.
  • hist_observed (bool, Optional) – whether histogram should be plotted along with the density for observed values. Default is False. Note, this attribute matters if and only if include_observed=True.
  • hist_imputed (bool, Optional) – whether histogram should be plotted along with the density for imputed values. Default is False. Note, this attribute matters if and only if include_observed=True.
  • gw (tuple, Optional) – if side-by-side plot, the width ratios for each plot. Default is (.5, .5), so each plot will be same width. Matters if and only if include_observed=True and side_by_side=True.
  • gh (tuple, Optional) – if stacked plot, the height ratios for each plot. Default is (.5, .5), so each plot will be the same height. Matters if and only if include_observed=True and side_by_side=False.
  • **plot_kwgs – keyword arguments used by sns.set.
Returns:

density plot for observed and/or imputed data

Return type:

sns.distplot

Raises:

ValueError – see _validate_data method
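Several of the parameters above only matter in combination with others. The sketch below encodes those dependencies as a small function that returns the resulting panel layout. It reflects one reading of the parameter descriptions and is not taken from Autoimpute's source; `resolve_layout` is a hypothetical helper, and the `width_ratios` and `height_ratios` keys are named after the matplotlib GridSpec arguments on the assumption that a grid layout of that kind is used.

```python
def resolve_layout(include_observed=True, separate_observed=True,
                   side_by_side=False, gw=(0.5, 0.5), gh=(0.5, 0.5)):
    """Describe the panels implied by the plot_imp_dists layout flags."""
    if not include_observed:
        # The remaining layout flags are ignored in this case
        return {"panels": ["imputed"], "rows": 1, "cols": 1}
    if not separate_observed:
        # Observed and imputed densities share one set of axes
        return {"panels": ["observed + imputed"], "rows": 1, "cols": 1}
    if side_by_side:
        # Two columns; gw sets the relative widths
        return {"panels": ["observed", "imputed"], "rows": 1, "cols": 2,
                "width_ratios": gw}
    # Two stacked rows; gh sets the relative heights
    return {"panels": ["observed", "imputed"], "rows": 2, "cols": 1,
            "height_ratios": gh}


default_layout = resolve_layout()
wide_layout = resolve_layout(side_by_side=True, gw=(0.7, 0.3))
print(default_layout)
print(wide_layout)
```

With the defaults, the observed and imputed distributions are stacked vertically at equal heights; gw only takes effect once side_by_side=True.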

autoimpute.visuals.plot_imp_boxplots(d, mi, imp_col, side_by_side=False, title='Observed vs. Imputed Boxplots', obs_kwgs=None, imp_kwgs=None, **plot_kwgs)

Plot the boxplots between observed and imputations for a given column.

Use this method to plot the boxplots of a given column after multiple imputation. The function also plots the boxplot of the observed data from the column prior to imputation taking place. Further, the user can specify additional arguments to tailor the design of the plots themselves.

Parameters:
  • d (list) – dataset returned from multiple imputation.
  • mi (MultipleImputer) – multiple imputer used to generate d.
  • imp_col (str) – column to plot. Should be a column with imputations.
  • side_by_side (bool, Optional) – whether columns should be plotted next to each other or stacked vertically. Default is False. If True, plots will be plotted side-by-side.
  • title (str, Optional) – title of boxplots. Default is “Observed vs. Imputed Boxplots”.
  • obs_kwgs (dict, Optional) – dictionary of arguments to unpack for observed boxplot. Default is None, so no additional tailoring.
  • imp_kwgs (dict, Optional) – dictionary of arguments to unpack for imputed boxplots. Default is None, so no additional tailoring.
  • **plot_kwgs – keyword arguments used by sns.set.
Returns:

boxplots for observed and imputed data

Return type:

sns.distplot

Raises:

ValueError – see _validate_data method.
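A boxplot comparison of this kind rests on summary statistics computed once for the observed values and once per imputation. The sketch below computes a five-number summary for each group. It makes several assumptions that are not confirmed by this page: `d` is modeled as a list of (imputation number, values) pairs in place of the real multiple-imputation output, each imputation is summarized over its full completed column, and `box_stats` is a hypothetical helper.

```python
from statistics import quantiles


def box_stats(values):
    """Five-number summary using linear interpolation between data points."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med,
            "q3": q3, "max": max(values)}


# Hypothetical column before imputation (None marks a missing value)
original = [2.0, None, 4.0, 6.0, None, 8.0]
# Hypothetical stand-in for the multiple-imputation output
d = [
    (1, [2.0, 5.0, 4.0, 6.0, 5.0, 8.0]),
    (2, [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]),
]

observed = [v for v in original if v is not None]
stats = {"observed": box_stats(observed)}
for number, values in d:
    stats[f"imputation {number}"] = box_stats(values)

for group, summary in stats.items():
    print(group, summary)
```

Comparing the spread of each imputation against the observed summary is a quick check on whether the imputations stay within a plausible range.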

autoimpute.visuals.plot_imp_swarm(d, mi, imp_col, palette=None, title='Imputation Swarm', **plot_kwgs)

Create the swarm plot for multiply imputed data.

Parameters:
  • d (list) – dataset returned from multiple imputation.
  • mi (MultipleImputer) – multiple imputer used to generate d.
  • imp_col (str) – column to plot. Should be a column with imputations.
  • title (str, Optional) – title of plot. Default is “Imputation Swarm”.
  • palette (list, tuple, Optional) – colors for the imps and observed. Default is None. If None, colors default to [“r”, “c”].
  • **plot_kwgs – keyword arguments used by sns.set.
Returns:

swarmplot for imputed data

Return type:

sns.distplot

Raises:

ValueError – see _validate_data method.

autoimpute.visuals.plot_imp_strip(d, mi, imp_col, palette=None, title='Imputation Strip', **plot_kwgs)

Create the strip plot for multiply imputed data.

Parameters:
  • d (list) – dataset returned from multiple imputation.
  • mi (MultipleImputer) – multiple imputer used to generate d.
  • imp_col (str) – column to plot. Should be a column with imputations.
  • title (str, Optional) – title of plot. Default is “Imputation Strip”.
  • palette (list, tuple, Optional) – colors for the imps and observed. Default is None. If None, colors default to [“r”, “c”].
  • **plot_kwgs – keyword arguments used by sns.set.
Returns:

stripplot for imputed data

Return type:

sns.distplot

Raises:

ValueError – see _validate_data method.
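Swarm and strip plots of multiply imputed data need one record per value, labeled by imputation number and by whether the value was imputed or observed. The sketch below shows that reshaping step in plain Python. It is an illustration only: `d` is modeled as a list of (imputation number, values) pairs, which is an assumption about the shape of the multiple-imputation output, `to_long_format` is a hypothetical helper, and the palette mapping assumes the default colors are listed with imputations first and observed second.

```python
def to_long_format(original, d):
    """Flatten imputations into (imputation, value, status) records."""
    records = []
    for number, values in d:
        for before, after in zip(original, values):
            status = "imputed" if before is None else "observed"
            records.append((number, after, status))
    return records


# Hypothetical column before imputation (None marks a missing value)
original = [2.0, None, 4.0]
# Hypothetical stand-in for the multiple-imputation output
d = [
    (1, [2.0, 3.1, 4.0]),
    (2, [2.0, 2.7, 4.0]),
]
long_rows = to_long_format(original, d)

# Assumed reading of the default palette: imputations first, observed second
palette = {"imputed": "r", "observed": "c"}
for number, value, status in long_rows:
    print(number, value, status, palette[status])
```

Each imputation then forms one group on the categorical axis, with the imputed values colored differently from the observed ones.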