The bool_mask
parameter can be used to estimate the conditional average treatment effect (CATE).
For instance, to estimate the average treatment effect for only those individuals over 18 years of age:
cm.estimate_ate(cm.df['age']>18)
import itertools
import pandas as pd
data = ((*a, b) for (a, b) in zip(itertools.product([0,1], [0,1], [0,1]), [36, 234, 25, 55, 6, 81, 71, 192]))
df = pd.DataFrame(data, columns=['Is_Male?', 'Post_Text', 'Post_Shared?', 'N'])
df = df.loc[df.index.repeat(df['N'])].reset_index(drop=True).drop(columns=['N'])
values = sorted(df['Post_Text'].unique())
df['Post_Text'] = df['Post_Text'].replace(values, ['I really love my job!', 'My boss is pretty terrible.'])
original_df = df.copy()
df = None
original_df.head()
At first glance, it seems that posts by women are shared more often. More specifically, being male appears to reduce the chance your post is shared by 4.5 percentage points:
male_probability = original_df[(original_df['Is_Male?']==1)]['Post_Shared?'].value_counts(normalize=True)[1]
male_probability
female_probability = original_df[(original_df['Is_Male?']==0)]['Post_Shared?'].value_counts(normalize=True)[1]
female_probability
male_probability-female_probability
However, this is inaccurate. In fact, this is an example of Simpson's Paradox: the true causal effect of being male in this simulated dataset is roughly 0.05 (as opposed to -0.045), with men's posts being more likely to be shared. The reason is that women in this simulation tend to make more positive posts, and positive posts tend to be shared more often here. Post sentiment, then, is a mediator, which is statistically similar to a confounder.
When controlling for the sentiment of the post (the mediator variable in this dataset), it is revealed that men's posts are, in fact, shared more often (for both negative posts and positive posts). This can be quickly and easily estimated in CausalNLP.
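The paradox is visible directly in the raw sharing rates. The sketch below rebuilds the simulated counts from above (keeping Post_Text as its original 0/1 coding, where 0 is the positive post) and compares the overall rates by gender with the rates within each sentiment group:

```python
import itertools
import pandas as pd

# Rebuild the simulated data: (Is_Male?, Post_Text, Post_Shared?) with counts
data = [(*a, b) for a, b in zip(itertools.product([0, 1], repeat=3),
                                [36, 234, 25, 55, 6, 81, 71, 192])]
df = pd.DataFrame(data, columns=['Is_Male?', 'Post_Text', 'Post_Shared?', 'N'])
df = df.loc[df.index.repeat(df['N'])].reset_index(drop=True).drop(columns=['N'])

# Overall: men's posts are shared LESS often (about -4.5 percentage points)
overall = df.groupby('Is_Male?')['Post_Shared?'].mean()
print(overall[1] - overall[0])

# Within each sentiment group: men's posts are shared MORE often
by_sentiment = df.groupby(['Post_Text', 'Is_Male?'])['Post_Shared?'].mean()
for s in (0, 1):
    print(s, by_sentiment[s, 1] - by_sentiment[s, 0])
```

Overall, men's posts look about 4.5 points less likely to be shared, yet within both the positive and the negative group they are more likely to be shared: the classic signature of Simpson's Paradox.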
Causal Inference from Text with Autocoders
Let's first use the Autocoder
to transform the raw text into sentiment. We can then control for sentiment when estimating the causal effect.
from causalnlp.autocoder import Autocoder
ac = Autocoder()
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=False, batch_size=16)
df.head()
When autocoding the raw text for sentiment, we have chosen to use the raw "probabilities" with binarize=False
. A binary variable can also be used with binarize=True
.
Next, let's estimate the treatment effects. We will ignore the positive
and Post_Text
columns, as their information is captured by the negative
column in this example. We will use the T-Learner. See this paper for more information on metalearner types.
from causalnlp.causalinference import CausalInferenceModel
cm = CausalInferenceModel(df, method='t-learner',
treatment_col='Is_Male?', outcome_col='Post_Shared?',
include_cols=['negative'])
cm.fit()
Upon controlling for sentiment, we see that the overall average treatment effect is correctly estimated as roughly 0.05.
ate = cm.estimate_ate()
ate
Since this is a small, simulated, toy problem, we can manually calculate the adjusted treatment effect by controlling for the single confounder (i.e., post negativity):
from collections import defaultdict
import numpy as np

def ATE_adjusted(C, T, Y):
    """Average the treatment effect within each stratum of the confounder C."""
    x = defaultdict(list)
    for c, t, y in zip(C, T, Y):
        x[c, t].append(y)
    C0_ATE = np.mean(x[0, 1]) - np.mean(x[0, 0])  # effect among C == 0
    C1_ATE = np.mean(x[1, 1]) - np.mean(x[1, 0])  # effect among C == 1
    return np.mean([C0_ATE, C1_ATE])
ATE_adjusted((df['negative']>0.5).astype('int'), df['Is_Male?'].values, df['Post_Shared?'].values)
We see that this value is close to our estimate.
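The same adjusted effect can also be read straight off the simulated counts: within positive posts, men's posts are shared at a rate of 81/87 versus 234/270 for women; within negative posts, 192/263 versus 55/80. Averaging the two within-group differences:

```python
# Within-group sharing rates from the simulated counts above
pos_diff = 81 / 87 - 234 / 270   # men vs. women, positive posts (~0.064)
neg_diff = 192 / 263 - 55 / 80   # men vs. women, negative posts (~0.043)
adjusted_ate = (pos_diff + neg_diff) / 2
print(adjusted_ate)  # ~0.053
```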
CausalNLP allows you to easily compute conditional or individualized treatment effects. For instance, for negative posts, being male increases the chance of your post being shared by about 4 percentage points:
cm.estimate_ate(cm.df['negative']>0.9)
For positive posts, being male increases the chance of your post being shared by about 6 percentage points:
cm.estimate_ate(cm.df['negative']<0.1)
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Predictions can be made for new observations. We just have to make sure the new data contain the relevant columns included in the DataFrame supplied to CausalInferenceModel.fit
. In this case, it must include Is_Male?
and negative
. This can be verified with the CausalInferenceModel.get_required_columns
method:
cm.get_required_columns()
test_df = pd.DataFrame({
'text' : ['I love my life.'],
'Is_Male?' : [0],
'negative' : [0]
})
effect = cm.predict(test_df)
assert effect[0][0] < 0.065
assert effect[0][0] > 0.064
print(effect)
Causal Inference Using Raw Text as a Confounder/Mediator
In the example above, we approached the problem under the assumption that a specific linguistic property (sentiment) was an important mediator or confounder for which to control. In some cases, there may also be other unknown linguistic properties that are potential confounders/mediators (e.g., topic, politeness, toxic language, readability).
In CausalNLP, we can also use the raw text as the potential confounder/mediator.
cm = CausalInferenceModel(df, method='t-learner',
treatment_col='Is_Male?', outcome_col='Post_Shared?', text_col='Post_Text',
ignore_cols=['negative', 'positive'])
cm.fit()
Although we have excluded the negative and positive columns as extra covariates, you can use traditional categorical/numerical covariates in combination with a text field covariate (if they exist as extra columns in the dataframe).
Here, we see that the same causal estimates are returned, as the text is easily inferred as positive or negative based on its correlation with the outcome in this problem.
ate = cm.estimate_ate()
ate
cm.estimate_ate(df['Post_Text'] == 'My boss is pretty terrible.')
cm.estimate_ate(df['Post_Text'] == 'I really love my job!')
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Make predictions on new data. Again, make sure the DataFrame contains the relevant columns included in the original DataFrame supplied to [`CausalInferenceModel.fit`](/causalnlp/causalinference.html#CausalInferenceModel.fit):
cm.get_required_columns()
test_df = pd.DataFrame({
'Post_Text' : ['I love my life.'],
'New Column' : [1],
'Is_Male?' : [0],
'negative' : [0]
})
effect = cm.predict(test_df)
assert effect[0][0] < 0.065
assert effect[0][0] > 0.064
print(effect)
cm.interpret(plot=False, method='feature_importance')
cm.interpret(plot=True, method='feature_importance')
cm.interpret(plot=True, method='shap_values')
Suppose we were instead interested in estimating the causal impact of sentiment on the outcome. That is, the sentiment of the text is the treatment, and gender is a potential confounder. As we did above, we can use the Autocoder
to create the treatment variable. The only difference is that we would supply binarize=True
as an argument.
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=True, batch_size=16)
df.head()
cm = CausalInferenceModel(df, method='t-learner',
treatment_col='positive', outcome_col='Post_Shared?',
include_cols=['Is_Male?'])
cm.fit()
ate = cm.estimate_ate()
ate
assert ate['ate'] > 0.18
assert ate['ate'] < 0.2
cm.get_required_columns()
test_df = pd.DataFrame({
'Is_Male?' : [1],
'positive' : [1]
})
effect = cm.predict(test_df)
print(effect)