import pandas as pd
Preprocessing
Preprocesses dataset
DataframePreprocessor
DataframePreprocessor (treatment_col='treatment', outcome_col='outcome', text_col=None, include_cols=[], ignore_cols=[], verbose=1)
Preprocesses a pandas DataFrame for causal inference.
DataframePreprocessor.preprocess
DataframePreprocessor.preprocess (df, training=False, min_df=0.05, max_df=0.5, ngram_range=(1, 1), stop_words='english', na_cont_value=-1, na_cat_value='MISSING')
Preprocess a dataframe for causal inference.
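Judging only by their names and defaults, `na_cont_value` and `na_cat_value` presumably supply the fill values for missing entries in continuous and categorical columns, respectively. The following is a minimal sketch of that idea in plain pandas. The helper `fill_missing` and the sample frame are hypothetical and are not taken from the library's implementation.

```python
import pandas as pd

def fill_missing(df, na_cont_value=-1, na_cat_value='MISSING'):
    """Hypothetical helper (not the library's code): fill NaNs by column type."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric columns get the continuous placeholder.
            out[col] = out[col].fillna(na_cont_value)
        else:
            # Everything else is treated as categorical here (an assumption).
            out[col] = out[col].fillna(na_cat_value)
    return out

# Made-up data with one missing value per column.
raw = pd.DataFrame({'age': [31.0, None], 'product': ['vinyl', None]})
filled = fill_missing(raw)
print(filled)
```

Keeping a distinct placeholder such as `'MISSING'` lets missingness show up as its own category after encoding instead of silently dropping rows.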
df = pd.read_csv('sample_data/music_seed50.tsv', sep='\t', on_bad_lines='skip')
pp = DataframePreprocessor(treatment_col='T_ac', outcome_col='Y_sim',
                           text_col='text', include_cols=['C_true', 'product'])
df, X, Y, T = pp.preprocess(df, training=True)
outcome column (categorical): Y_sim
treatment column: T_ac
numerical/categorical covariates: ['product', 'C_true']
text covariate: text
preprocess time: 1.49556303024292 sec
X.head()
| | C_true | product_audio cd | product_mp3 music | product_vinyl | v_album | v_albums | v_band | v_beautiful | v_best | v_better | v_bought | v_buy | v_cd | v_collection | v_did | v_don | v_excellent | v_fan | v_favorite | v_good | v_got | v_great | v_hear | v_heard | v_just | v_know | v_like | v_listen | v_listening | v_love | v_music | v_new | v_old | v_original | v_really | v_record | v_recording | v_rock | v_song | v_songs | v_sound | v_sounds | v_think | v_time | v_track | v_tracks | v_ve | v_voice | v_way | v_work | v_years |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0.25232 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.850798 | 0.251679 | 0.0 | 0.0 | 0.386181 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0 | 0 | 1 | 0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1 | 1 | 0 | 0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.542250 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.625138 | 0.561398 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0 | 0 | 1 | 0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.629106 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.777319 | 0.0 | 0.0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1 | 1 | 0 | 0 | 0.00000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.527751 | 0.0 | 0.000000 | 0.392572 | 0.0 | 0.0 | 0.372982 | 0.334952 | 0.0 | 0.0 | 0.56219 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
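The `product_*` columns in the table above look like a one-hot encoding of the categorical `product` covariate. As an illustration only, `pandas.get_dummies` produces columns with the same naming pattern. Whether the preprocessor uses this exact call internally is an assumption, and the three-row frame below is made up.

```python
import pandas as pd

# Made-up covariates mirroring the column names from the example above.
covariates = pd.DataFrame({
    'C_true': [0, 1, 0],
    'product': ['vinyl', 'mp3 music', 'audio cd'],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(covariates, columns=['product'], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 indicator column named `<column>_<value>`, which matches the `product_audio cd`, `product_mp3 music`, and `product_vinyl` headers shown in `X.head()`.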
test_df = pd.DataFrame({
    'C_true' : [0, 1],
    'product': ['vinyl', 'mp3 music'],
    'text' : ['This record hurts my ears.', "The music of Yanni is beautiful and breath-taking."],
    'Y_sim' : [0, 1],
    'T_ac' : [0, 1],
})
test_df.head()
| | C_true | product | text | Y_sim | T_ac |
---|---|---|---|---|---|
0 | 0 | vinyl | This record hurts my ears. | 0 | 0 |
1 | 1 | mp3 music | The music of Yanni is beautiful and breath-tak... | 1 | 1 |
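_, X_test, _, _ = pp.preprocess(test_df, training=False)
# With training=False, the test features must have the same columns, in the same order, as the training features.
assert sum([X_test.columns.values[i] == col for i,col in enumerate(X.columns.values)]) == len(X.columns.values)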
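The assertion above checks that features built at test time line up with the training columns. For readers doing this kind of alignment by hand, one common pandas idiom is `DataFrame.reindex` with a fixed column list. This is a generic sketch under that assumption, not a description of what `preprocess` does internally; `train_cols` and `X_new` are hypothetical.

```python
import pandas as pd

# Hypothetical column order remembered from training.
train_cols = ['C_true', 'product_audio cd', 'product_mp3 music', 'product_vinyl']

# New data encoded separately: different column order, one category never seen.
X_new = pd.DataFrame({
    'product_vinyl': [1, 0],
    'C_true': [0, 1],
    'product_mp3 music': [0, 1],
})

# Reorder to the training layout and fill columns absent from X_new with 0.
aligned = X_new.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

A downstream model trained on `train_cols` can then consume `aligned` without a shape or ordering mismatch.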
# 'C_true' was listed in include_cols but is absent here, so preprocessing should raise a ValueError.
test_df = pd.DataFrame({
    'product': ['vinyl', 'mp3 music'],
    'text' : ['This record hurts my ears.', "The music of Yanni is beautiful and breath-taking."],
    'Y_sim' : [0, 1],
    'T_ac' : [0, 1],
})
error = False
try:
    _, X_test, _, _ = pp.preprocess(test_df, training=False)
except ValueError:
    error = True
assert error is True
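The test above expects a `ValueError` when a required covariate is missing from the input. A minimal sketch of such a fail-fast check is shown below. The function `check_required_columns` is hypothetical and only illustrates the pattern; the library's own validation logic and error message may differ.

```python
import pandas as pd

def check_required_columns(df, required_cols):
    """Hypothetical check (not the library's code): fail fast on missing columns."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

# Made-up frame that lacks the 'C_true' covariate.
incomplete = pd.DataFrame({
    'product': ['vinyl'],
    'text': ['Nice.'],
    'Y_sim': [0],
    'T_ac': [0],
})

try:
    check_required_columns(incomplete, ['C_true', 'product', 'text'])
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
    print(message)
```

Raising early with the names of the missing columns makes the failure easier to diagnose than a later shape mismatch inside a model.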