Key driver analysis to yield clues into **potential** causal relationships in your data by determining variables with high predictive power, high correlation with outcome, etc.

class KeyDriverAnalysis[source]

KeyDriverAnalysis(df, outcome_col='outcome', text_col=None, include_cols=[], ignore_cols=[], verbose=1)

Performs key driver analysis

KeyDriverAnalysis.correlations[source]

KeyDriverAnalysis.correlations(outcome_only=True)

Computes corelations between independent variables and outcome

import pandas as pd
df = pd.read_csv('sample_data/houses.csv')
from causalnlp.key_driver_analysis import KeyDriverAnalysis
kda = KeyDriverAnalysis(df, outcome_col='SalePrice', ignore_cols=['Id', 'YearSold'])
outcome column (numerical): SalePrice
treatment column: CausalNLP_temp_treatment
numerical/categorical covariates: ['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition']
preprocess time:  0.3556947708129883  sec
df_results = kda.correlations()
df_results.head()
SalePrice
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
assert df_results.iloc[[0]].index.values[0] == 'OverallQual'

KeyDriverAnalysis.importances[source]

KeyDriverAnalysis.importances(plot=True, split_pct=0.2, use_shap=False, shap_background_size=50, rf_model=None, n_estimators=100, n_jobs=-1, random_state=42)

Identifies important predictors using a RandomForest model.

Example: Variable Importances for Housing Prices

df_results = kda.importances()
df_results.head()
R^2 Training Score: 0.98 
OOB Score: 0.85 
R^2 Validation Score: 0.89
Driver Importance
3 OverallQual 0.557707
15 GrLivArea 0.121145
11 TotalBsmtSF 0.035977
13 2ndFlrSF 0.033758
8 BsmtFinSF1 0.028563

Example: Variable Importances for Probability of Making Over $50K

import pandas as pd
df = pd.read_csv('sample_data/adult-census.csv')
kda = KeyDriverAnalysis(df, outcome_col='class', ignore_cols=['fnlwgt'])
df_results = kda.importances(use_shap=True, plot=True)
df_results.head()
replaced ['<=50K', '>50K'] in column "class" with [0, 1]
outcome column (categorical): class
treatment column: CausalNLP_temp_treatment
numerical/categorical covariates: ['age', 'workclass', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
preprocess time:  0.5094420909881592  sec
R^2 Training Score: 0.98 
OOB Score: 0.85 
R^2 Validation Score: 0.85
Driver Importance
2 capital-gain 0.102854
0 age 0.036508
1 education-num 0.035481
32 marital-status_Married-civ-spouse 0.031246
52 relationship_Husband 0.028451