Key Driver Analysis

Key driver analysis yields clues about potential causal relationships in your data by identifying variables with high predictive power, high correlation with the outcome, and similar signals.

KeyDriverAnalysis

 KeyDriverAnalysis (df, outcome_col='outcome', text_col=None,
                    include_cols=[], ignore_cols=[], verbose=1)

Performs key driver analysis


KeyDriverAnalysis.correlations

 KeyDriverAnalysis.correlations (outcome_only=True)

Computes correlations between the independent variables and the outcome.

import pandas as pd
from causalnlp.key_driver_analysis import KeyDriverAnalysis
df = pd.read_csv('sample_data/houses.csv')
kda = KeyDriverAnalysis(df, outcome_col='SalePrice', ignore_cols=['Id', 'YearSold'])
outcome column (numerical): SalePrice
treatment column: CausalNLP_temp_treatment
numerical/categorical covariates: ['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']
preprocess time:  0.3556947708129883  sec
df_results = kda.correlations()
df_results.head()
             SalePrice
OverallQual   0.790982
GrLivArea     0.708624
GarageCars    0.640409
GarageArea    0.623431
TotalBsmtSF   0.613581
assert df_results.iloc[[0]].index.values[0] == 'OverallQual'
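
To make the idea behind `correlations` concrete, here is a minimal sketch in plain pandas. It is not the library's implementation: the toy data, the column names, and the helper `outcome_correlations` are all hypothetical, and it assumes that "correlation with the outcome" means the Pearson correlation of each numeric column with the outcome column, sorted from highest to lowest.

```python
import pandas as pd

# Toy data standing in for a real dataset; all values are made up for illustration.
toy = pd.DataFrame({
    "quality": [1, 2, 3, 4, 5],
    "area": [50, 80, 70, 120, 150],
    "noise": [3, 1, 4, 1, 5],
    "price": [100, 150, 200, 250, 300],
})


def outcome_correlations(frame: pd.DataFrame, outcome_col: str) -> pd.DataFrame:
    """Hypothetical helper: Pearson correlation of each numeric column with the outcome."""
    # Restrict to numeric columns, then keep only the outcome's column of the matrix.
    corr = frame.select_dtypes(include="number").corr()[[outcome_col]]
    # The outcome always correlates perfectly with itself, so drop that row.
    corr = corr.drop(index=outcome_col)
    return corr.sort_values(by=outcome_col, ascending=False)


toy_corr = outcome_correlations(toy, "price")
print(toy_corr)
```

In this toy frame `price` is an exact linear function of `quality`, so `quality` ranks first. A high correlation only flags a candidate driver; it says nothing on its own about causation.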

KeyDriverAnalysis.importances

 KeyDriverAnalysis.importances (plot=True, split_pct=0.2, use_shap=False,
                                shap_background_size=50, rf_model=None,
                                n_estimators=100, n_jobs=-1,
                                random_state=42)

Identifies important predictors using a RandomForest model.

Example: Variable Importances for Housing Prices

df_results = kda.importances()
df_results.head()
R^2 Training Score: 0.98 
OOB Score: 0.85 
R^2 Validation Score: 0.89
    Driver       Importance
3   OverallQual    0.557707
15  GrLivArea      0.121145
11  TotalBsmtSF    0.035977
13  2ndFlrSF       0.033758
8   BsmtFinSF1     0.028563
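
The three scores and the importance table above can be approximated with scikit-learn directly. The following is a rough sketch, not the library's code: the synthetic data and column names are invented, and it assumes that `split_pct=0.2` corresponds to holding out 20% of the rows for validation and that the reported importances are the forest's impurity-based `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (made up): one strong driver, one weak driver, one irrelevant column.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "irrelevant": rng.normal(size=n),
})
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=n)

# Assumption: hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# oob_score=True also scores each tree on the training rows it did not sample.
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
model.fit(X_train, y_train)

print(f"R^2 Training Score: {model.score(X_train, y_train):.2f}")
print(f"OOB Score: {model.oob_score_:.2f}")
print(f"R^2 Validation Score: {model.score(X_val, y_val):.2f}")

# Rank columns by impurity-based importance, highest first.
toy_importances = pd.DataFrame(
    {"Driver": X.columns, "Importance": model.feature_importances_}
).sort_values("Importance", ascending=False)
print(toy_importances)
```

A training score well above the OOB and validation scores, as in the housing output, is typical for random forests and is one reason to read the held-out scores rather than the training score when judging how much to trust the ranking.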

Example: Variable Importances for Probability of Making Over $50K

import pandas as pd
df = pd.read_csv('sample_data/adult-census.csv')
kda = KeyDriverAnalysis(df, outcome_col='class', ignore_cols=['fnlwgt'])
df_results = kda.importances(use_shap=True, plot=True)
df_results.head()
replaced ['<=50K', '>50K'] in column "class" with [0, 1]
outcome column (categorical): class
treatment column: CausalNLP_temp_treatment
numerical/categorical covariates: ['age', 'workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
preprocess time:  0.5094420909881592  sec
R^2 Training Score: 0.98 
OOB Score: 0.85 
R^2 Validation Score: 0.85

    Driver                             Importance
2   capital-gain                         0.102854
0   age                                  0.036508
1   education-num                        0.035481
32  marital-status_Married-civ-spouse    0.031246
52  relationship_Husband                 0.028451
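
Driver names such as `marital-status_Married-civ-spouse` follow the `column_value` pattern produced by one-hot encoding, and the log line about `class` shows the two labels being mapped to 0 and 1. The sketch below reproduces both steps with plain pandas on a made-up four-row frame. It only illustrates where such names can come from; the library's actual preprocessing may differ.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the census data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "marital-status": [
        "Never-married",
        "Married-civ-spouse",
        "Divorced",
        "Married-civ-spouse",
    ],
    "class": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Map the two outcome labels to 0 and 1, mirroring the log message above.
toy["class"] = toy["class"].map({"<=50K": 0, ">50K": 1})

# One-hot encode the categorical column; new columns are named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=["marital-status"])
print(list(encoded.columns))
```

Because each category becomes its own column, importances are reported per category value (for example, being a married civilian spouse) rather than per original variable.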