The bool_mask
parameter can be used to estimate the conditional average treatment effect (CATE).
For instance, to estimate the average treatment effect for only those individuals over 18 years of age:
cm.estimate_ate(cm.df['age']>18)
import itertools
import pandas as pd
data = ((*a, b) for (a, b) in zip(itertools.product([0,1], [0,1], [0,1]), [36, 234, 25, 55, 6, 81, 71, 192]))
df = pd.DataFrame(data, columns=['Is_Male?', 'Post_Text', 'Post_Shared?', 'N'])
df = df.loc[df.index.repeat(df['N'])].reset_index(drop=True).drop(columns=['N'])
values = sorted(df['Post_Text'].unique())
df['Post_Text'] = df['Post_Text'].replace(values, ['I really love my job!', 'My boss is pretty terrible.'])
original_df = df.copy()
df = None
original_df.head()
At first glance, it seems that posts by women are shared more often. More specifically, being male appears to reduce the chance your post is shared by 4.5 percentage points:
male_probability = original_df[(original_df['Is_Male?']==1)]['Post_Shared?'].value_counts(normalize=True)[1]
male_probability
female_probability = original_df[(original_df['Is_Male?']==0)]['Post_Shared?'].value_counts(normalize=True)[1]
female_probability
male_probability-female_probability
However, this is inaccurate. In fact, this is an example of Simpson's Paradox: the true causal effect of being male in this simulated dataset is roughly 0.05 (as opposed to -0.045), with men's posts being more likely to be shared. The reason is that women in this simulation tend to make more positive posts, and positive posts tend to be shared more often here. Post sentiment, then, is a mediator, which is statistically similar to a confounder.
When controlling for the sentiment of the post (the mediator variable in this dataset), it is revealed that men's posts are, in fact, shared more often (for both negative posts and positive posts). This can be quickly and easily estimated in CausalNLP.
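The paradox is visible directly in the raw sharing rates. The sketch below rebuilds the simulated counts from above (keeping Post_Text as its original 0/1 coding, where 0 is the positive post) and compares the overall rates by gender with the rates within each sentiment group:

```python
import itertools
import pandas as pd

# Rebuild the simulated data: (Is_Male?, Post_Text, Post_Shared?) with counts
data = [(*a, b) for a, b in zip(itertools.product([0, 1], repeat=3),
                                [36, 234, 25, 55, 6, 81, 71, 192])]
df = pd.DataFrame(data, columns=['Is_Male?', 'Post_Text', 'Post_Shared?', 'N'])
df = df.loc[df.index.repeat(df['N'])].reset_index(drop=True).drop(columns=['N'])

# Overall: men's posts are shared LESS often (about -4.5 percentage points)
overall = df.groupby('Is_Male?')['Post_Shared?'].mean()
print(overall[1] - overall[0])

# Within each sentiment group: men's posts are shared MORE often
by_sentiment = df.groupby(['Post_Text', 'Is_Male?'])['Post_Shared?'].mean()
for s in (0, 1):
    print(s, by_sentiment[s, 1] - by_sentiment[s, 0])
```

Overall, men's posts look about 4.5 points less likely to be shared, yet within both the positive and the negative group they are more likely to be shared: the classic signature of Simpson's Paradox.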
Causal Inference from Text with Autocoders
Let's first use the Autocoder
to transform the raw text into sentiment. We can then control for sentiment when estimating the causal effect.
from causalnlp.autocoder import Autocoder
ac = Autocoder()
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=False, batch_size=16)
df.head()
When autocoding the raw text for sentiment, we have chosen to use the raw "probabilities" with binarize=False
. A binary variable can also be used with binarize=True
.
Next, let's estimate the treatment effects. We will ignore the positive
and Post_Text
columns, as their information is captured by the negative
column in this example. We will use the T-Learner. See this paper for more information on metalearner types.
from causalnlp.causalinference import CausalInferenceModel
cm = CausalInferenceModel(df, method='t-learner',
treatment_col='Is_Male?', outcome_col='Post_Shared?',
include_cols=['negative'])
cm.fit()
Upon controlling for sentiment, we see that the overall average treatment effect is correctly estimated as roughly 0.05.
ate = cm.estimate_ate()
ate
Since this is a small, simulated, toy problem, we can manually calculate the adjusted treatment effect by controlling for the single confounder (i.e., post negativity):
from collections import defaultdict
import numpy as np

def ATE_adjusted(C, T, Y):
    """Average the treatment effect within each stratum of the confounder C."""
    x = defaultdict(list)
    for c, t, y in zip(C, T, Y):
        x[c, t].append(y)
    C0_ATE = np.mean(x[0, 1]) - np.mean(x[0, 0])  # effect among C == 0
    C1_ATE = np.mean(x[1, 1]) - np.mean(x[1, 0])  # effect among C == 1
    return np.mean([C0_ATE, C1_ATE])
ATE_adjusted((df['negative']>0.5).astype('int'), df['Is_Male?'].values, df['Post_Shared?'].values)
We see that this value is close to our estimate.
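The same adjusted effect can also be read straight off the simulated counts: within positive posts, men's posts are shared at a rate of 81/87 versus 234/270 for women; within negative posts, 192/263 versus 55/80. Averaging the two within-group differences:

```python
# Within-group sharing rates from the simulated counts above
pos_diff = 81 / 87 - 234 / 270   # men vs. women, positive posts (~0.064)
neg_diff = 192 / 263 - 55 / 80   # men vs. women, negative posts (~0.043)
adjusted_ate = (pos_diff + neg_diff) / 2
print(adjusted_ate)  # ~0.053
```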
CausalNLP allows you to easily compute conditional or individualized treatment effects. For instance, for negative posts, being male increases the chance of your post being shared by about 4 percentage points:
cm.estimate_ate(cm.df['negative']>0.9)
For positive posts, being male increases the chance of your post being shared by about 6 percentage points:
cm.estimate_ate(cm.df['negative']<0.1)
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Predictions can be made for new observations. We just have to make sure the new data contain the relevant columns included in the DataFrame supplied to CausalInferenceModel.fit
. In this case, it must include Is_Male?
and negative
. This can be verified with the CausalInferenceModel.get_required_columns
method:
cm.get_required_columns()
test_df = pd.DataFrame({
'text' : ['I love my life.'],
'Is_Male?' : [0],
'negative' : [0]
})
effect = cm.predict(test_df)
assert effect[0][0] < 0.065
assert effect[0][0] > 0.064
print(effect)
Causal Inference Using Raw Text as a Confounder/Mediator
In the example above, we approached the problem under the assumption that a specific linguistic property (sentiment) was an important mediator or confounder for which to control. In some cases, there may also be other unknown linguistic properties that are potential confounders/mediators (e.g., topic, politeness, toxic language, readability).
In CausalNLP, we can also use the raw text as the potential confounder/mediator.
cm = CausalInferenceModel(df, method='t-learner',
treatment_col='Is_Male?', outcome_col='Post_Shared?', text_col='Post_Text',
ignore_cols=['negative', 'positive'])
cm.fit()
Although we have excluded the negative and positive columns as extra covariates, you can use traditional categorical/numerical covariates in combination with a text field covariate (if they exist as extra columns in the dataframe).
Here, we see that the same causal estimates are returned, as the text is easily inferred as positive or negative based on its correlation with the outcome in this problem.
ate = cm.estimate_ate()
ate
cm.estimate_ate(df['Post_Text'] == 'My boss is pretty terrible.')
cm.estimate_ate(df['Post_Text'] == 'I really love my job!')
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Make predictions on new data. Again, make sure the DataFrame contains the relevant columns included in the original DataFrame supplied to [`CausalInferenceModel.fit`](/causalnlp/causalinference.html#CausalInferenceModel.fit):
cm.get_required_columns()
test_df = pd.DataFrame({
'Post_Text' : ['I love my life.'],
'New Column' : [1],
'Is_Male?' : [0],
'negative' : [0]
})
effect = cm.predict(test_df)
assert effect[0][0] < 0.065
assert effect[0][0] > 0.064
print(effect)
cm.interpret(plot=False, method='feature_importance')
cm.interpret(plot=True, method='feature_importance')
cm.interpret(plot=True, method='shap_values')
Suppose we were instead interested in estimating the causal impact of sentiment on the outcome. That is, the sentiment of the text is the treatment, and gender is a potential confounder. As we did above, we can use the Autocoder
to create the treatment variable. The only difference is that we would supply binarize=True
as an argument.
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=True, batch_size=16)
df.head()
cm = CausalInferenceModel(df, method='t-learner',
treatment_col='positive', outcome_col='Post_Shared?',
include_cols=['Is_Male?'])
cm.fit()
ate = cm.estimate_ate()
ate
assert ate['ate'] > 0.18
assert ate['ate'] < 0.2
cm.get_required_columns()
test_df = pd.DataFrame({
'Is_Male?' : [1],
'positive' : [1]
})
effect = cm.predict(test_df)
print(effect)