*Infers causality from the data contained in df using a metalearner.
Usage:
>>> cm = CausalInferenceModel(df, treatment_col='Is_Male?', outcome_col='Post_Shared?', text_col='Post_Text', ignore_cols=['id', 'email'])
>>> cm.fit()
Parameters:
df : pandas.DataFrame containing dataset
method : metalearner model to use. One of {‘t-learner’, ‘s-learner’, ‘x-learner’, ‘r-learner’} (Default: ‘t-learner’)
metalearner_type : Alias of method for backwards compatibility. Overrides method if not None.
treatment_col : treatment variable; column should contain binary values: 1 for treated, 0 for untreated.
outcome_col : outcome variable; column should contain the categorical or numeric outcome values
text_col : (optional) text column containing the strings (e.g., articles, reviews, emails).
ignore_cols : columns to ignore in the analysis
include_cols : columns to include as covariates (e.g., possible confounders)
treatment_effect_col : name of column to hold causal effect estimations. Does not need to exist. Created by CausalNLP.
learner : an instance of a custom learner. If None, Log/Lin Regression is used for the S-Learner and a default LightGBM model will be used for all other metalearner types. Example: learner = LGBMClassifier(num_leaves=1000)
effect_learner : used for X-Learner/R-Learner; must be a regression model
min_df : min_df parameter used for text processing using sklearn
max_df : max_df parameter used for text processing using sklearn
ngram_range: ngrams used for text vectorization. default: (1,1)
stop_words : stop words used for text processing (from sklearn)
verbose : If 1, print informational messages. If 0, suppress.*
Fits a causal inference model and estimates outcome with and without treatment for each observation. For X-Learner and R-Learner, propensity scores will be computed using default propensity model unless p is not None. Parameter p is not used for other methods.
Tunes the hyperparameters of a default LightGBM model, replaces CausalInferenceModel.learner, and returns the best parameters. Should be invoked prior to running CausalInferenceModel.fit. If scoring is None, then ‘roc_auc’ is used for classification and ‘negative_mean_squared_error’ is used for regression.
Estimates the treatment effect for each observation in df. The DataFrame represented by df should be the same format as the one supplied to CausalInferenceModel.__init__. For X-Learner and R-Learner, propensity scores will be computed using default propensity model unless p is not None. Parameter p is not used for other methods.
Estimates the treatment effect for each observation in self.df.
The bool_mask parameter can be used to estimate the conditional average treatment effect (CATE). For instance, to estimate the average treatment effect for only those individuals over 18 years of age:
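For example, assuming the DataFrame contains a hypothetical age column:

```python
# CATE for observations over 18 ('age' is a hypothetical column used for illustration)
cm.estimate_ate(cm.df['age'] > 18)
```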
Evaluates robustness on four sensitivity measures (see CausalML package for details on these methods):
- Placebo Treatment: ATE should become zero.
- Random Cause: ATE should not change.
- Random Replacement: ATE should not change.
- Subset Data: ATE should not change.
*Explain the treatment effect estimate of a single observation using SHAP.
Parameters:
- df (pd.DataFrame): a pd.DataFrame of test data in the same format as the original training DataFrame
- row_num (int): raw row number in the DataFrame to explain (default: 0, the first row)
- background_size (int): size of background data (SHAP parameter)
- nsamples (int): number of samples (SHAP parameter)*
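A usage sketch follows, assuming this is the explain method of a fitted CausalInferenceModel and that test_df is a held-out DataFrame in the training format (both the method name and test_df are assumptions here):

```python
# Hedged sketch: explain the treatment effect estimate for the first row of a test DataFrame
cm.explain(test_df, row_num=0, background_size=50, nsamples=500)
```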
Usage Example: Do social media posts by women get shared more often than those by men?
Let’s create a simulated dataset.
import itertools
import pandas as pd

data = ((*a, b) for (a, b) in zip(itertools.product([0,1], [0,1], [0,1]), [36, 234, 25, 55, 6, 81, 71, 192]))
df = pd.DataFrame(data, columns=['Is_Male?', 'Post_Text', 'Post_Shared?', 'N'])
df = df.loc[df.index.repeat(df['N'])].reset_index(drop=True).drop(columns=['N'])
values = sorted(df['Post_Text'].unique())
df['Post_Text'].replace(values, ['I really love my job!', 'My boss is pretty terrible.'], inplace=True)
original_df = df.copy()
df = None
original_df.head()
|   | Is_Male? | Post_Text | Post_Shared? |
|---|----------|-----------|--------------|
| 0 | 0 | I really love my job! | 0 |
| 1 | 0 | I really love my job! | 0 |
| 2 | 0 | I really love my job! | 0 |
| 3 | 0 | I really love my job! | 0 |
| 4 | 0 | I really love my job! | 0 |
At first glance, it seems like posts by women get shared more often. More specifically, it appears that being male reduces the chance your post is shared by 4.5 percentage points:
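This naive estimate is simply the difference in share rates between men's and women's posts, ignoring sentiment. A minimal sketch of that calculation (not the original notebook's code):

```python
# Naive (unadjusted) estimate: difference in share rates by gender, ignoring post sentiment
male_rate = original_df.loc[original_df['Is_Male?'] == 1, 'Post_Shared?'].mean()
female_rate = original_df.loc[original_df['Is_Male?'] == 0, 'Post_Shared?'].mean()
print(male_rate - female_rate)  # roughly -0.045
```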
However, this is inaccurate. In fact, this is an example of Simpson’s Paradox, and the true causal effect of being male in this simulated dataset is roughly 0.05 (as opposed to -0.045), with men’s posts being more likely to be shared. The reason is that women in this simulation tend to make more positive posts, which tend to be shared more often here. Post sentiment, then, is a [mediator](https://en.wikipedia.org/wiki/Mediation_(statistics)), which is statistically similar to a confounder.
When controlling for the sentiment of the post (the mediator variable in this dataset), it is revealed that men’s posts are, in fact, shared more often (for both negative posts and positive posts). This can be quickly and easily estimated in CausalNLP.
Causal Inference from Text with Autocoders
Let’s first use the Autocoder to transform the raw text into sentiment. We can then control for sentiment when estimating the causal effect.
When autocoding the raw text for sentiment, we have chosen to use the raw “probabilities” with binarize=False. A binary variable can also be used with binarize=True.
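A sketch of this autocoding step is shown below; the exact Autocoder method name and signature used here (code_sentiment) are assumptions that should be checked against the CausalNLP documentation. The call is assumed to append negative and positive probability columns to the DataFrame:

```python
from causalnlp import Autocoder

ac = Autocoder()
# Assumed call: score each post for sentiment and append 'negative'/'positive' probability columns
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=False)
df.head()
```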
Next, let’s estimate the treatment effects. We will ignore the positive and Post_Text columns, as their information is captured by the negative column in this example. We will use the T-Learner. See this paper for more information on metalearner types.
from causalnlp import CausalInferenceModel
cm = CausalInferenceModel(df, method='t-learner', treatment_col='Is_Male?', outcome_col='Post_Shared?', include_cols=['negative'])
cm.fit()
outcome column (categorical): Post_Shared?
treatment column: Is_Male?
numerical/categorical covariates: ['negative']
preprocess time: 0.013550996780395508 sec
start fitting causal inference model
time to fit causal inference model: 0.8901166915893555 sec
Upon controlling for sentiment, we see that the overall average treatment effect is correctly estimated as roughly 0.05.
ate = cm.estimate_ate()
ate
{'ate': 0.05366850622769351}
Since this is a small, simulated, toy problem, we can manually calculate the adjusted treatment effect by controlling for the single confounder (i.e., post negativity):
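One way to do this by hand (a sketch, not the original notebook's code) is a simple back-door adjustment: compute the male/female difference in share rates within each sentiment stratum and weight by stratum size:

```python
# Back-door adjustment by hand: weight per-stratum gender differences by stratum size
adjusted_ate = 0.0
for _, grp in original_df.groupby('Post_Text'):
    diff = (grp.loc[grp['Is_Male?'] == 1, 'Post_Shared?'].mean()
            - grp.loc[grp['Is_Male?'] == 0, 'Post_Shared?'].mean())
    adjusted_ate += diff * len(grp) / len(original_df)
print(adjusted_ate)  # roughly 0.05, in line with the model's estimate
```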
CausalNLP allows you to easily compute conditional or individualized treatment effects. For instance, for negative posts, being male increases the chance of your post being shared by about 4 percentage points:
cm.estimate_ate(cm.df['negative']>0.9)
{'ate': 0.042535751074149745}
For positive posts, being male increases the chance of your post being shared by about 6 percentage points:
cm.estimate_ate(cm.df['negative']<0.1)
{'ate': 0.06436468274776497}
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Predictions can be made for new observations. We just have to make sure it contains the relevant columns included in the DataFrame supplied to CausalInferenceModel.fit. In this case, it must include Is_Male? and negative. This can be verified with the CausalInferenceModel.get_required_columns method:
cm.get_required_columns()
['Is_Male?', 'negative']
test_df = pd.DataFrame({'text' : ['I love my life.'], 'Is_Male?' : [0], 'negative' : [0]})
effect = cm.predict(test_df)
assert effect[0][0] < 0.065
assert effect[0][0] > 0.064
print(effect)
[[0.06436468]]
Causal Inference Using Raw Text as a Confounder/Mediator
In the example above, we approached the problem under the assumption that a specific linguistic property (sentiment) was an important mediator or confounder for which to control. In some cases, there may also be other unknown linguistic properties that are potential confounders/mediators (e.g., topic, politeness, toxic language, readability).
In CausalNLP, we can also use the raw text as the potential confounder/mediator.
cm = CausalInferenceModel(df, method='t-learner', treatment_col='Is_Male?', outcome_col='Post_Shared?', text_col='Post_Text', ignore_cols=['negative', 'positive'])
cm.fit()
outcome column (categorical): Post_Shared?
treatment column: Is_Male?
numerical/categorical covariates: []
text covariate: Post_Text
preprocess time: 0.015369415283203125 sec
start fitting causal inference model
time to fit causal inference model: 0.5458502769470215 sec
Although we have excluded the negative and positive columns as extra covariates here, traditional categorical/numerical covariates can be used in combination with a text field covariate (if they exist as extra columns in the DataFrame).
Here, we see that the same causal estimates are returned, as the positive or negative sentiment of each post is easy to infer from its correlation with the outcome in this problem.
ate = cm.estimate_ate()
ate
{'ate': 0.05366850622769351}
cm.estimate_ate(df['Post_Text'] == 'My boss is pretty terrible.')
{'ate': 0.042535751074149745}
cm.estimate_ate(df['Post_Text'] == 'I really love my job!')
{'ate': 0.06436468274776497}
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Make predictions on new data. Again, make sure the DataFrame contains the relevant columns included in the original DataFrame supplied to `CausalInferenceModel.fit`:
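For example, a sketch with an illustrative new post (the values below are made up for illustration); this text-based model requires the Is_Male? and Post_Text columns:

```python
# Illustrative prediction with the text-based model (requires treatment and raw-text columns)
test_df = pd.DataFrame({'Is_Male?': [0], 'Post_Text': ['My boss is pretty terrible.']})
effect = cm.predict(test_df)
print(effect)
```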
Suppose we were interested in estimating the causal impact of sentiment on the outcome. That is, the sentiment of the text is the treatment, and gender is a potential confounder. As we did above, we can use the Autocoder to create the treatment variable. The only difference is that we would supply binarize=True as an argument.
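A rough sketch of that setup is below; the column names produced with binarize=True and the exact Autocoder call are assumptions to verify against the CausalNLP documentation:

```python
# Assumed workflow: binarize sentiment, then use 'positive' as the binary treatment variable,
# controlling for gender as a covariate (the 'positive' column name is an assumption)
ac = Autocoder()
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=True)
cm = CausalInferenceModel(df, treatment_col='positive', outcome_col='Post_Shared?',
                          include_cols=['Is_Male?'])
cm.fit()
```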