*Infers causality from the data contained in df using a metalearner.
Usage:
>>> cm = CausalInferenceModel(df, treatment_col='Is_Male?', outcome_col='Post_Shared?', text_col='Post_Text', ignore_cols=['id', 'email'])
>>> cm.fit()
Parameters:
df : pandas.DataFrame containing dataset
method : metalearner model to use. One of {‘t-learner’, ‘s-learner’, ‘x-learner’, ‘r-learner’} (Default: ‘t-learner’)
metalearner_type : Alias of method for backwards compatibility. Overrides method if not None.
treatment_col : treatment variable; column should contain binary values: 1 for treated, 0 for untreated.
outcome_col : outcome variable; column should contain the categorical or numeric outcome values
text_col : (optional) text column containing the strings (e.g., articles, reviews, emails).
ignore_cols : columns to ignore in the analysis
include_cols : columns to include as covariates (e.g., possible confounders)
treatment_effect_col : name of column to hold causal effect estimations. Does not need to exist. Created by CausalNLP.
learner : an instance of a custom learner. If None, Log/Lin Regression is used for the S-Learner and a default LightGBM model will be used for all other metalearner types. Example: learner = LGBMClassifier(num_leaves=1000)
effect_learner : used for X-Learner/R-Learner; must be a regression model
min_df : min_df parameter used for text processing using sklearn
max_df : max_df parameter used for text processing using sklearn
ngram_range: ngrams used for text vectorization. default: (1,1)
stop_words : stop words used for text processing (from sklearn)
verbose : If 1, print informational messages. If 0, suppress.*
Fits a causal inference model and estimates outcome with and without treatment for each observation. For X-Learner and R-Learner, propensity scores will be computed using default propensity model unless p is not None. Parameter p is not used for other methods.
Tunes the hyperparameters of a default LightGBM model, replaces CausalInferenceModel.learner, and returns the best parameters. Should be invoked prior to running CausalInferenceModel.fit. If scoring is None, then ‘roc_auc’ is used for classification and ‘negative_mean_squared_error’ is used for regression.
Estimates the treatment effect for each observation in df. The DataFrame represented by df should be the same format as the one supplied to CausalInferenceModel.__init__. For X-Learner and R-Learner, propensity scores will be computed using default propensity model unless p is not None. Parameter p is not used for other methods.
Estimates the treatment effect for each observation in self.df.
The bool_mask parameter can be used to estimate the conditional average treatment effect (CATE). For instance, to estimate the average treatment effect for only those individuals over 18 years of age:
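For example, assuming the DataFrame contains a hypothetical age column:

```python
# CATE for observations over 18 ('age' is a hypothetical column used for illustration)
cm.estimate_ate(cm.df['age'] > 18)
```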
Evaluates robustness on four sensitivity measures (see CausalML package for details on these methods):
- Placebo Treatment: ATE should become zero.
- Random Cause: ATE should not change.
- Random Replacement: ATE should not change.
- Subset Data: ATE should not change.
*Explain the treatment effect estimate of a single observation using SHAP.
Parameters:
- df (pd.DataFrame): a pd.DataFrame of test data in the same format as the original training DataFrame
- row_num (int): raw row number in the DataFrame to explain (default: 0, the first row)
- background_size (int): size of background data (SHAP parameter)
- nsamples (int): number of samples (SHAP parameter)*
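A usage sketch follows, assuming this is the explain method of a fitted CausalInferenceModel and that test_df is a held-out DataFrame in the training format (both the method name and test_df are assumptions here):

```python
# Hedged sketch: explain the treatment effect estimate for the first row of a test DataFrame
cm.explain(test_df, row_num=0, background_size=50, nsamples=500)
```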
Usage Example: Do social media posts by women get shared more often than those by men?
Let’s create a simulated dataset.
import itertools
import pandas as pd

data = ((*a, b) for (a, b) in zip(itertools.product([0,1], [0,1], [0,1]), [36, 234, 25, 55, 6, 81, 71, 192]))
df = pd.DataFrame(data, columns=['Is_Male?', 'Post_Text', 'Post_Shared?', 'N'])
df = df.loc[df.index.repeat(df['N'])].reset_index(drop=True).drop(columns=['N'])
values = sorted(df['Post_Text'].unique())
df['Post_Text'].replace(values, ['I really love my job!', 'My boss is pretty terrible.'], inplace=True)
original_df = df.copy()
df = None
original_df.head()
|   | Is_Male? | Post_Text | Post_Shared? |
|---|----------|-----------|--------------|
| 0 | 0 | I really love my job! | 0 |
| 1 | 0 | I really love my job! | 0 |
| 2 | 0 | I really love my job! | 0 |
| 3 | 0 | I really love my job! | 0 |
| 4 | 0 | I really love my job! | 0 |
At first glance, it seems like posts by women get shared more often. More specifically, it appears that being male reduces the chance your post is shared by 4.5 percentage points:
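This naive estimate is simply the difference in share rates between men's and women's posts, ignoring sentiment. A minimal sketch of that calculation (not the original notebook's code):

```python
# Naive (unadjusted) estimate: difference in share rates by gender, ignoring post sentiment
male_rate = original_df.loc[original_df['Is_Male?'] == 1, 'Post_Shared?'].mean()
female_rate = original_df.loc[original_df['Is_Male?'] == 0, 'Post_Shared?'].mean()
print(male_rate - female_rate)  # roughly -0.045
```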
However, this is inaccurate. In fact, this is an example of Simpson’s Paradox, and the true causal effect of being male in this simulated dataset is roughly 0.05 (as opposed to -0.045), with men’s posts being more likely to be shared. The reason is that women in this simulation tend to make more positive posts, which tend to be shared more often here. Post sentiment, then, is a [mediator](https://en.wikipedia.org/wiki/Mediation_(statistics)), which is statistically similar to a confounder.
When controlling for the sentiment of the post (the mediator variable in this dataset), it is revealed that men’s posts are, in fact, shared more often (for both negative posts and positive posts). This can be quickly and easily estimated in CausalNLP.
Causal Inference from Text with Autocoders
Let’s first use the Autocoder to transform the raw text into sentiment. We can then control for sentiment when estimating the causal effect.
When autocoding the raw text for sentiment, we have chosen to use the raw “probabilities” with binarize=False. A binary variable can also be used with binarize=True.
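A sketch of this autocoding step is shown below; the exact Autocoder method name and signature used here (code_sentiment) are assumptions that should be checked against the CausalNLP documentation. The call is assumed to append negative and positive probability columns to the DataFrame:

```python
from causalnlp import Autocoder

ac = Autocoder()
# Assumed call: score each post for sentiment and append 'negative'/'positive' probability columns
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=False)
df.head()
```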
Next, let’s estimate the treatment effects. We will ignore the positive and Post_Text columns, as their information is captured by the negative column in this example. We will use the T-Learner. See this paper for more information on metalearner types.
from causalnlp import CausalInferenceModel
cm = CausalInferenceModel(df, method='t-learner', treatment_col='Is_Male?', outcome_col='Post_Shared?', include_cols=['negative'])
cm.fit()
outcome column (categorical): Post_Shared?
treatment column: Is_Male?
numerical/categorical covariates: ['negative']
preprocess time: 0.013550996780395508 sec
start fitting causal inference model
time to fit causal inference model: 0.8901166915893555 sec
Upon controlling for sentiment, we see that the overall average treatment effect is correctly estimated as roughly 0.05.
ate = cm.estimate_ate()
ate
{'ate': 0.05366850622769351}
Since this is a small, simulated, toy problem, we can manually calculate the adjusted treatment effect by controlling for the single confounder (i.e., post negativity):
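One way to do this by hand (a sketch, not the original notebook's code) is a simple back-door adjustment: compute the male/female difference in share rates within each sentiment stratum and weight by stratum size:

```python
# Back-door adjustment by hand: weight per-stratum gender differences by stratum size
adjusted_ate = 0.0
for _, grp in original_df.groupby('Post_Text'):
    diff = (grp.loc[grp['Is_Male?'] == 1, 'Post_Shared?'].mean()
            - grp.loc[grp['Is_Male?'] == 0, 'Post_Shared?'].mean())
    adjusted_ate += diff * len(grp) / len(original_df)
print(adjusted_ate)  # roughly 0.05, in line with the model's estimate
```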
CausalNLP allows you to easily compute conditional or individualized treatment effects. For instance, for negative posts, being male increases the chance of your post being shared by about 4 percentage points:
cm.estimate_ate(cm.df['negative']>0.9)
{'ate': 0.042535751074149745}
For positive posts, being male increases the chance of your post being shared by about 6 percentage points:
cm.estimate_ate(cm.df['negative']<0.1)
{'ate': 0.06436468274776497}
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Predictions can be made for new observations. We just have to make sure it contains the relevant columns included in the DataFrame supplied to CausalInferenceModel.fit. In this case, it must include Is_Male? and negative. This can be verified with the CausalInferenceModel.get_required_columns method:
cm.get_required_columns()
['Is_Male?', 'negative']
test_df = pd.DataFrame({'text' : ['I love my life.'], 'Is_Male?' : [0], 'negative' : [0]})
effect = cm.predict(test_df)
assert effect[0][0] < 0.065
assert effect[0][0] > 0.064
print(effect)
[[0.06436468]]
Causal Inference Using Raw Text as a Confounder/Mediator
In the example above, we approached the problem under the assumption that a specific linguistic property (sentiment) was an important mediator or confounder for which to control. In some cases, there may also be other unknown linguistic properties that are potential confounders/mediators (e.g., topic, politeness, toxic language, readability).
In CausalNLP, we can also use the raw text as the potential confounder/mediator.
cm = CausalInferenceModel(df, method='t-learner', treatment_col='Is_Male?', outcome_col='Post_Shared?', text_col='Post_Text', ignore_cols=['negative', 'positive'])
cm.fit()
outcome column (categorical): Post_Shared?
treatment column: Is_Male?
numerical/categorical covariates: []
text covariate: Post_Text
preprocess time: 0.015369415283203125 sec
start fitting causal inference model
time to fit causal inference model: 0.5458502769470215 sec
Although we have excluded the negative and positive columns as extra covariates here, traditional categorical/numerical covariates can be used in combination with a text field covariate (if they exist as extra columns in the DataFrame).
Here, we see that the same causal estimates are returned, as the positive or negative sentiment of each post is easy to infer from its correlation with the outcome in this problem.
ate = cm.estimate_ate()
ate
{'ate': 0.05366850622769351}
cm.estimate_ate(df['Post_Text'] == 'My boss is pretty terrible.')
{'ate': 0.042535751074149745}
cm.estimate_ate(df['Post_Text'] == 'I really love my job!')
{'ate': 0.06436468274776497}
assert ate['ate'] > 0.05
assert ate['ate'] < 0.055
Make predictions on new data. Again, make sure the DataFrame contains the relevant columns included in the original DataFrame supplied to `CausalInferenceModel.fit`:
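For example, a sketch with an illustrative new post (the values below are made up for illustration); this text-based model requires the Is_Male? and Post_Text columns:

```python
# Illustrative prediction with the text-based model (requires treatment and raw-text columns)
test_df = pd.DataFrame({'Is_Male?': [0], 'Post_Text': ['My boss is pretty terrible.']})
effect = cm.predict(test_df)
print(effect)
```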
Suppose we were interested in estimating the causal impact of sentiment on the outcome. That is, the sentiment of the text is the treatment, and gender is a potential confounder. As we did above, we can use the Autocoder to create the treatment variable. The only difference is that we would supply binarize=True as an argument.
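A rough sketch of that setup is below; the column names produced with binarize=True and the exact Autocoder call are assumptions to verify against the CausalNLP documentation:

```python
# Assumed workflow: binarize sentiment, then use 'positive' as the binary treatment variable,
# controlling for gender as a covariate (the 'positive' column name is an assumption)
ac = Autocoder()
df = ac.code_sentiment(original_df['Post_Text'].values, original_df, binarize=True)
cm = CausalInferenceModel(df, treatment_col='positive', outcome_col='Post_Shared?',
                          include_cols=['Is_Male?'])
cm.fit()
```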