Preprocessing

Utilities for preprocessing a dataset for causal inference.

DataframePreprocessor

 DataframePreprocessor (treatment_col='treatment', outcome_col='outcome',
                        text_col=None, include_cols=[], ignore_cols=[],
                        verbose=1)

Preprocesses a pandas DataFrame for causal inference.


DataframePreprocessor.preprocess

 DataframePreprocessor.preprocess (df, training=False, min_df=0.05,
                                   max_df=0.5, ngram_range=(1, 1),
                                   stop_words='english', na_cont_value=-1,
                                   na_cat_value='MISSING')

Preprocess a dataframe for causal inference.
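
The na_cont_value and na_cat_value defaults in the signature suggest that missing numerical values and missing categorical values are filled with different placeholders. The following is a minimal sketch of that idea in plain pandas. The helper name fill_missing and the dtype-based split are assumptions made for illustration and are not taken from the library's source.

```python
import numpy as np
import pandas as pd


def fill_missing(df, na_cont_value=-1, na_cat_value="MISSING"):
    """Hypothetical helper (not the library's code): fill NaNs by column type."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numerical column: use the continuous placeholder
            out[col] = out[col].fillna(na_cont_value)
        else:
            # Everything else is treated as categorical here (an assumption)
            out[col] = out[col].fillna(na_cat_value)
    return out


toy = pd.DataFrame({"age": [31.0, np.nan], "product": ["mp3 music", None]})
filled = fill_missing(toy)
print(filled)
```

Using a sentinel such as -1 or 'MISSING' keeps every row in the data while still letting a downstream model distinguish missing entries from observed ones.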

import pandas as pd

df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')

pp = DataframePreprocessor(treatment_col='T_ac', outcome_col='Y_sim', 
                           text_col='text', include_cols=['C_true', 'product'])

df, X, Y, T = pp.preprocess(df, training=True)

outcome column (categorical): Y_sim
treatment column: T_ac
numerical/categorical covariates: ['product', 'C_true']
text covariate: text
preprocess time:  1.49556303024292  sec

X.head()

	C_true	product_audio cd	product_mp3 music	v_album	v_buy	v_cd	v_don	v_heard	v_know	v_like	v_love	v_music	v_original
0	0	0	1	0.25232	0.850798	0.251679	0.386181	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
1	0	0	1	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000
2	1	1	0	0.00000	0.000000	0.542250	0.000000	0.000000	0.000000	0.000000	0.625138	0.561398	0.00000
3	0	0	1	0.00000	0.000000	0.000000	0.000000	0.000000	0.629106	0.000000	0.000000	0.777319	0.00000
4	1	1	0	0.00000	0.000000	0.000000	0.000000	0.527751	0.000000	0.392572	0.372982	0.334952	0.56219

test_df = pd.DataFrame({
    'C_true': [0, 1],
    'product': ['vinyl', 'mp3 music'],
    'text': ['This record hurts my ears.', "The music of Yanni is beautiful and breath-taking."],
    'Y_sim': [0, 1],
    'T_ac': [0, 1],
})
test_df.head()

	C_true	product	text	Y_sim	T_ac
0	0	vinyl	This record hurts my ears.	0	0
1	1	mp3 music	The music of Yanni is beautiful and breath-tak...	1	1

_, X_test, _, _ = pp.preprocess(test_df, training=False)

# Every training column must appear at the same position in the test features
assert all(X_test.columns.values[i] == col for i, col in enumerate(X.columns.values))
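
The assertion above checks that the test features line up with the training columns even though the test data contains a product value ('vinyl') that does not appear among the training columns shown earlier. One common way to get this behavior is to one-hot encode the new data and then reindex it against the columns recorded at training time. The sketch below uses only pandas, and the variable names are illustrative. It is not a description of how DataframePreprocessor is implemented.

```python
import pandas as pd

# Toy stand-ins for the training and test frames
train = pd.DataFrame({"product": ["audio cd", "mp3 music"]})
test = pd.DataFrame({"product": ["vinyl", "mp3 music"]})

# "Fit": one-hot encode the training data and remember its column layout
train_X = pd.get_dummies(train, columns=["product"], dtype=int)
train_columns = list(train_X.columns)

# "Transform": encode the test data, then force it onto the training layout.
# Unseen categories (product_vinyl) are dropped; absent ones are filled with 0.
test_X = pd.get_dummies(test, columns=["product"], dtype=int).reindex(
    columns=train_columns, fill_value=0
)

print(test_X)
```

With this approach, a row whose category was never seen in training ends up with zeros in every one-hot column for that variable.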

# 'C_true' is listed in include_cols but is missing here,
# so preprocess is expected to raise a ValueError
test_df = pd.DataFrame({
    'product': ['vinyl', 'mp3 music'],
    'text': ['This record hurts my ears.', "The music of Yanni is beautiful and breath-taking."],
    'Y_sim': [0, 1],
    'T_ac': [0, 1],
})
error = False
try:
    _, X_test, _, _ = pp.preprocess(test_df, training=False)
except ValueError:
    error = True
assert error is True
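
The test above expects a ValueError when a column named in include_cols is absent from the input. A check of this kind can be sketched as follows. The function name check_required_columns and the error message are hypothetical and are not part of the library's API.

```python
import pandas as pd


def check_required_columns(df, required):
    """Hypothetical validation step: fail early if expected columns are absent."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")


frame = pd.DataFrame({"product": ["vinyl"], "text": ["This record hurts my ears."]})

try:
    check_required_columns(frame, ["C_true", "product"])
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
    print(message)
```

Failing early with the names of the missing columns is usually easier to debug than an error raised later during encoding.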