Preprocessing

Preprocesses a dataset for causal inference.

DataframePreprocessor

 DataframePreprocessor (treatment_col='treatment', outcome_col='outcome',
                        text_col=None, include_cols=[], ignore_cols=[],
                        verbose=1)

Preprocesses a pandas DataFrame for causal inference.


DataframePreprocessor.preprocess

 DataframePreprocessor.preprocess (df, training=False, min_df=0.05,
                                   max_df=0.5, ngram_range=(1, 1),
                                   stop_words='english', na_cont_value=-1,
                                   na_cat_value='MISSING')

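Preprocesses a DataFrame for causal inference. As the example below shows, the call returns the processed DataFrame together with the covariate features X, the outcome Y, and the treatment T. Use training=True on the training data, then training=False on new data.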

import pandas as pd
# Load the sample data, skipping malformed rows
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')
pp = DataframePreprocessor(treatment_col='T_ac', outcome_col='Y_sim',
                           text_col='text', include_cols=['C_true', 'product'])
# training=True fits the preprocessor on this data
df, X, Y, T = pp.preprocess(df, training=True)
outcome column (categorical): Y_sim
treatment column: T_ac
numerical/categorical covariates: ['product', 'C_true']
text covariate: text
preprocess time:  1.49556303024292  sec
X.head()
C_true product_audio cd product_mp3 music product_vinyl v_album v_albums v_band v_beautiful v_best v_better v_bought v_buy v_cd v_collection v_did v_don v_excellent v_fan v_favorite v_good v_got v_great v_hear v_heard v_just v_know v_like v_listen v_listening v_love v_music v_new v_old v_original v_really v_record v_recording v_rock v_song v_songs v_sound v_sounds v_think v_time v_track v_tracks v_ve v_voice v_way v_work v_years
0 0 0 1 0 0.25232 0.0 0.0 0.0 0.0 0.0 0.0 0.850798 0.251679 0.0 0.0 0.386181 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0 0 1 0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1 1 0 0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.542250 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.625138 0.561398 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0 0 1 0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.629106 0.000000 0.0 0.0 0.000000 0.777319 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1 1 0 0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.527751 0.0 0.000000 0.392572 0.0 0.0 0.372982 0.334952 0.0 0.0 0.56219 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
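The frame above mixes three kinds of columns: the numeric covariate C_true, one indicator column per product value, and v_-prefixed text weights. As a rough illustration of the first two, and of what na_cont_value and na_cat_value plausibly do, the sketch below fills missing values and one-hot encodes with plain pandas. This is an assumption about the behavior, not DataframePreprocessor's actual code, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing numeric and one missing categorical value.
toy = pd.DataFrame({
    "C_true": [0.0, np.nan, 1.0],
    "product": ["vinyl", None, "mp3 music"],
})

# Assumed meaning of the defaults: -1 for missing continuous values,
# 'MISSING' for missing categorical values.
toy["C_true"] = toy["C_true"].fillna(-1)
toy["product"] = toy["product"].fillna("MISSING")

# One indicator column per category, named <column>_<value>.
encoded = pd.get_dummies(toy, columns=["product"], dtype=int)
print(encoded)
```

Filling the gap with an explicit 'MISSING' category keeps the row in the data and gives missingness its own indicator column.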
test_df = pd.DataFrame({
    'C_true': [0, 1],
    'product': ['vinyl', 'mp3 music'],
    'text': ['This record hurts my ears.', "The music of Yanni is beautiful and breath-taking."],
    'Y_sim': [0, 1],
    'T_ac': [0, 1],
})
test_df.head()
   C_true    product                                               text  Y_sim  T_ac
0       0      vinyl                         This record hurts my ears.      0     0
1       1  mp3 music  The music of Yanni is beautiful and breath-tak...      1     1
# Preprocess unseen data with the already-fitted preprocessor
_, X_test, _, _ = pp.preprocess(test_df, training=False)
# The test features must have the same columns, in the same order, as the training features
assert list(X_test.columns) == list(X.columns)
# Drop the 'C_true' covariate that was present during training
test_df = pd.DataFrame({
    'product': ['vinyl', 'mp3 music'],
    'text': ['This record hurts my ears.', "The music of Yanni is beautiful and breath-taking."],
    'Y_sim': [0, 1],
    'T_ac': [0, 1],
})
# Preprocessing must fail with a ValueError when a training column is missing
error = False
try:
    _, X_test, _, _ = pp.preprocess(test_df, training=False)
except ValueError:
    error = True
assert error
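The last check expects a ValueError because C_true, one of the include_cols used during training, is absent from the new frame. A minimal sketch of that kind of validation follows; check_columns is a hypothetical helper written for illustration and is not part of DataframePreprocessor.

```python
import pandas as pd

def check_columns(df, required_cols):
    """Hypothetical helper: raise ValueError if any required column is absent."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

# A frame that lacks the 'C_true' covariate.
frame = pd.DataFrame({
    "product": ["vinyl"],
    "text": ["This record hurts my ears."],
})

try:
    check_columns(frame, ["C_true", "product", "text"])
    raised = False
except ValueError as err:
    raised = True
    print(err)  # missing columns: ['C_true']
```

Failing early with a clear message is usually preferable to silently producing a feature matrix whose columns no longer line up with the training features.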