Metalearner Utils

source

get_xgboost_objective_metric

 get_xgboost_objective_metric (objective)

Get the xgboost version-compatible objective and evaluation metric from a potentially version-incompatible input.

| | Type | Details |
|---|---|---|
| objective | | |
| Returns | | A tuple with the translated objective and evaluation metric. |

source

clean_xgboost_objective

 clean_xgboost_objective (objective)

Translate the objective so that it is compatible with the loaded xgboost version.

| | Type | Details |
|---|---|---|
| objective | | |
| Returns | | The translated objective, or the original if no translation was required. |
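
The translation described above can be pictured with a minimal sketch. It assumes the only rename that matters is `reg:linear` to `reg:squarederror` (deprecated in favor of the new name as of xgboost 0.90), and it takes the version as a plain tuple instead of reading it from the installed package. The function names, the `xgb_version` parameter, and the objective-to-metric mapping are hypothetical and are not taken from the library source.

```python
def clean_objective_sketch(objective, xgb_version):
    """Return an objective name that matches the given xgboost version.

    xgb_version is a (major, minor) tuple such as (1, 7); this is an
    assumption of the sketch, not the library's interface.
    """
    if xgb_version >= (0, 90) and objective == "reg:linear":
        return "reg:squarederror"
    if xgb_version < (0, 90) and objective == "reg:squarederror":
        return "reg:linear"
    return objective  # no translation required


def objective_and_metric_sketch(objective, xgb_version):
    """Return (translated objective, evaluation metric)."""
    objective = clean_objective_sketch(objective, xgb_version)
    # Hypothetical mapping for illustration only.
    metric_by_objective = {"reg:squarederror": "rmse", "reg:linear": "rmse"}
    return objective, metric_by_objective[objective]


print(objective_and_metric_sketch("reg:linear", (1, 7)))
```

An objective that needs no translation, such as `binary:logistic`, passes through `clean_objective_sketch` unchanged.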

source

check_explain_conditions

 check_explain_conditions (method, models, X=None, treatment=None, y=None)

source

check_p_conditions

 check_p_conditions (p, t_groups)

source

check_treatment_vector

 check_treatment_vector (treatment, control_name=None)

source

convert_pd_to_np

 convert_pd_to_np (*args)

source

regression_metrics

 regression_metrics (y, p, w=None, metrics={'RMSE': rmse, 'sMAPE': smape,
                     'Gini': gini})

Log metrics for regressors.

Args:
- y (numpy.array): target
- p (numpy.array): prediction
- w (numpy.array, optional): a treatment vector (1 or True: treatment, 0 or False: control). If given, metrics are logged for the treatment and control groups separately.
- metrics (dict, optional): a dictionary of metric names and functions
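
The per-group behavior of `w` can be illustrated with a short sketch. It uses plain lists and returns a dictionary rather than writing to a logger; `metrics_by_group` and `rmse_fn` are hypothetical names used only for illustration, and the same pattern would apply to classifier metrics.

```python
import math


def rmse_fn(y, p):
    """Root mean squared error over two equal-length sequences."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))


def metrics_by_group(y, p, w=None, metrics=None):
    """Evaluate each metric overall, or per group when w is given."""
    metrics = metrics or {"RMSE": rmse_fn}
    if w is None:
        groups = {"all": list(range(len(y)))}
    else:
        groups = {
            "treatment": [i for i, wi in enumerate(w) if wi],
            "control": [i for i, wi in enumerate(w) if not wi],
        }
    report = {}
    for group_name, idx in groups.items():
        y_g = [y[i] for i in idx]
        p_g = [p[i] for i in idx]
        report[group_name] = {name: fn(y_g, p_g) for name, fn in metrics.items()}
    return report


report = metrics_by_group([1, 2, 3, 4], [1, 2, 5, 4], w=[1, 1, 0, 0])
for group_name, values in report.items():
    print(group_name, values)
```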


source

gini

 gini (y, p)

Normalized Gini Coefficient.

Args:
- y (numpy.array): target
- p (numpy.array): prediction

Returns:
- e (numpy.float64): normalized Gini coefficient
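
As a rough illustration, one widely used formulation of the normalized Gini coefficient ranks the targets by descending prediction, accumulates their share of the total, and divides the result by the value obtained under a perfect ranking. The sketch below follows that formulation with plain lists; it is an assumption that the library computes the same quantity, and the helper names are hypothetical. It requires `sum(y)` to be nonzero.

```python
def _gini_raw(y, p):
    """Unnormalized Gini: cumulative share of y when sorted by p descending."""
    n = len(y)
    # Ties in p are broken by original position.
    order = sorted(range(n), key=lambda i: (-p[i], i))
    total = sum(y)
    cumulative = 0.0
    gini_sum = 0.0
    for i in order:
        cumulative += y[i]
        gini_sum += cumulative / total
    gini_sum -= (n + 1) / 2.0
    return gini_sum / n


def normalized_gini_sketch(y, p):
    """Raw Gini of the prediction divided by the raw Gini of a perfect ranking."""
    return _gini_raw(y, p) / _gini_raw(y, y)


y = [1, 0, 3, 2]
print(normalized_gini_sketch(y, y))
print(normalized_gini_sketch(y, [-v for v in y]))
```

A prediction that ranks the targets perfectly scores 1, and one that ranks them in exactly the reverse order scores -1.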


source

rmse

 rmse (y, p)

Root Mean Squared Error (RMSE).

Args:
- y (numpy.array): target
- p (numpy.array): prediction

Returns:
- e (numpy.float64): RMSE
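
For reference, RMSE is the square root of the mean squared difference between target and prediction. A minimal sketch with plain lists instead of numpy arrays (`rmse_sketch` is a hypothetical name):

```python
import math


def rmse_sketch(y, p):
    """Square root of the mean of squared errors."""
    squared_errors = [(yi - pi) ** 2 for yi, pi in zip(y, p)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3).
print(rmse_sketch([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))
```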


source

smape

 smape (y, p)

Symmetric Mean Absolute Percentage Error (sMAPE).

Args:
- y (numpy.array): target
- p (numpy.array): prediction

Returns:
- e (numpy.float64): sMAPE
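
Several definitions of sMAPE are in circulation, and the docstring does not say which one is implemented. The sketch below assumes the common variant that averages `2 * |y - p| / (|y| + |p|)`; other variants drop the factor of 2 or scale to a percentage. `smape_sketch` is a hypothetical name.

```python
def smape_sketch(y, p):
    """Mean of 2 * |y - p| / (|y| + |p|); undefined when y and p are both 0."""
    terms = [2.0 * abs(yi - pi) / (abs(yi) + abs(pi)) for yi, pi in zip(y, p)]
    return sum(terms) / len(terms)


print(smape_sketch([100.0, 1.0], [100.0, 3.0]))
```

Under this variant the error is bounded between 0 and 2, which is what makes it symmetric with respect to over- and under-prediction.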


source

mape

 mape (y, p)

Mean Absolute Percentage Error (MAPE).

Args:
- y (numpy.array): target
- p (numpy.array): prediction

Returns:
- e (numpy.float64): MAPE


source

ape

 ape (y, p)

Absolute Percentage Error (APE).

Args:
- y (float): target
- p (float): prediction

Returns:
- e (float): APE
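
APE is defined on a single pair of values, and MAPE is its mean over a sample. A minimal sketch of that relationship, assuming the usual definition `|p - y| / |y|` (which is undefined for a zero target); the function names are hypothetical:

```python
def ape_sketch(y, p):
    """Absolute percentage error for one target/prediction pair."""
    return abs(p - y) / abs(y)


def mape_sketch(y, p):
    """Mean of the per-pair absolute percentage errors."""
    return sum(ape_sketch(yi, pi) for yi, pi in zip(y, p)) / len(y)


print(ape_sketch(100.0, 110.0))
print(mape_sketch([100.0, 200.0], [110.0, 180.0]))
```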


source

classification_metrics

 classification_metrics (y, p, w=None, metrics={'AUC': roc_auc_score,
                         'Log Loss': logloss})

Log metrics for classifiers.

Args:
- y (numpy.array): target
- p (numpy.array): prediction
- w (numpy.array, optional): a treatment vector (1 or True: treatment, 0 or False: control). If given, metrics are logged for the treatment and control groups separately.
- metrics (dict, optional): a dictionary of metric names and functions


source

logloss

 logloss (y, p)

Bounded log loss error.

Args:
- y (numpy.array): target
- p (numpy.array): prediction

Returns: bounded log loss error
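
"Bounded" here presumably means that predictions are clipped away from 0 and 1 so the logarithm stays finite. The sketch below shows that idea for binary targets; the clipping constant `eps` and the name `bounded_logloss_sketch` are assumptions, since the docstring does not give the bound.

```python
import math


def bounded_logloss_sketch(y, p, eps=1e-15):
    """Binary log loss with predictions clipped to [eps, 1 - eps]."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # keep log() finite
        total += -(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
    return total / len(y)


print(bounded_logloss_sketch([1, 0], [0.5, 0.5]))
print(bounded_logloss_sketch([1], [0.0]))  # large but finite
```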


source

MatchOptimizer

 MatchOptimizer (treatment_col='is_treatment', ps_col='pihat',
                 user_col=None, matching_covariates=['pihat'],
                 max_smd=0.1, max_deviation=0.1, caliper_range=(0.01,
                 0.5), max_pihat_range=(0.95, 0.999),
                 max_iter_per_param=5, min_users_per_group=1000,
                 smd_cols=['pihat'], dev_cols_transformations={'pihat':
                 mean}, dev_factor=1.0, verbose=True)

Finds the set of parameters that gives the best matching result.

Score = (number of features with SMD > max_smd) + (sum of deviations for important variables * deviation factor)

The logic behind the scoring is that we are most concerned with minimizing the number of features whose SMD exceeds a certain threshold (max_smd). However, we would also like the matched dataset not to deviate too much from the original dataset in terms of key variable(s), so that we still retain a similar user base.

Args:
- treatment_col (str): name of the treatment column
- ps_col (str): name of the propensity score column
- max_smd (float): maximum acceptable SMD
- max_deviation (float): maximum acceptable deviation for important variables
- caliper_range (tuple): low and high bounds for the caliper search range
- max_pihat_range (tuple): low and high bounds for the max pihat search range
- max_iter_per_param (int): maximum number of search values per parameter
- min_users_per_group (int): minimum number of users per group in the matched set
- smd_cols (list): the score is more sensitive to these features exceeding max_smd
- dev_factor (float): importance weight factor for dev_cols (e.g. dev_factor=1 means a 10% deviation leads to a penalty of 1 in the score)
- dev_cols_transformations (dict): dict of transformations to be made on dev_cols
- verbose (bool): boolean flag for printing statements

Returns: The best matched dataset (pd.DataFrame)
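
The score formula can be made concrete with a small sketch. It assumes deviations are expressed as relative fractions (0.10 for 10%) and scales them by `10 * dev_factor`, which is the scaling implied by the statement that `dev_factor=1` turns a 10% deviation into a penalty of 1. The extra sensitivity for `smd_cols` is left out, and `matching_score_sketch` is a hypothetical name rather than a method of the class.

```python
def matching_score_sketch(smd_by_feature, deviations, max_smd=0.1, dev_factor=1.0):
    """Lower is better: count of unbalanced features plus a deviation penalty."""
    n_over_threshold = sum(1 for value in smd_by_feature.values() if value > max_smd)
    # A relative deviation of 0.10 with dev_factor=1.0 contributes a penalty of 1.
    deviation_penalty = sum(10.0 * dev_factor * abs(d) for d in deviations.values())
    return n_over_threshold + deviation_penalty


smd_by_feature = {"age": 0.05, "income": 0.20, "pihat": 0.12}
deviations = {"pihat": 0.10}  # matched mean differs from the original by 10%
print(matching_score_sketch(smd_by_feature, deviations))
```

In this example two features exceed `max_smd` and the single tracked variable deviates by 10%, so the two terms contribute 2 and 1 respectively.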


source

NearestNeighborMatch

 NearestNeighborMatch (caliper=0.2, replace=False, ratio=1, shuffle=True,
                       random_state=None)

Propensity score matching based on the nearest neighbor algorithm.

Attributes:
- caliper (float): threshold to be considered as a match
- replace (bool): whether to match with replacement or not
- ratio (int): ratio of control / treatment to be matched. Used only if replace=True.
- shuffle (bool): whether to shuffle the treatment group data before matching
- random_state (numpy.random.RandomState or int): RandomState or an int seed
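
To show what nearest neighbor matching with a caliper does, here is a minimal greedy 1:1 sketch without replacement. It treats the caliper as an absolute distance on the propensity score, which is an assumption (the class may scale it differently), and it skips shuffling. `greedy_match_sketch` and the dictionary inputs are hypothetical and do not reflect the class interface.

```python
def greedy_match_sketch(treated, control, caliper=0.2):
    """Pair each treated unit with its nearest unused control unit.

    treated, control: dicts mapping unit id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls not yet used (no replacement)
    pairs = []
    for t_id, t_score in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
        # Otherwise the treated unit stays unmatched.
    return pairs


treated = {"t1": 0.30, "t2": 0.80}
control = {"c1": 0.32, "c2": 0.50, "c3": 0.95}
print(greedy_match_sketch(treated, control, caliper=0.1))
print(greedy_match_sketch(treated, control, caliper=0.2))
```

With the tighter caliper, `t2` has no control within range and is dropped; widening the caliper lets it pair with `c3`.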


source

create_table_one

 create_table_one (data, treatment_col, features)

Report balance in input features between the treatment and control groups.

References:
- R’s tableone at CRAN: https://github.com/kaz-yos/tableone
- Python’s tableone at PyPI: https://github.com/tompollard/tableone

Args:
- data (pandas.DataFrame): total or matched sample data
- treatment_col (str): the column name for the treatment
- features (list of str): the column names of features

Returns:
- (pandas.DataFrame): A table with the means and standard deviations in the treatment and control groups, and the SMD between the two groups for the features.
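
The shape of such a balance table can be sketched without pandas. The example below takes a list of row dictionaries, reports "mean (SD)" per group, and adds an SMD column computed with the pooled standard deviation; the output layout, the column labels, and the name `table_one_sketch` are assumptions for illustration only.

```python
import math
import statistics


def table_one_sketch(rows, treatment_col, features):
    """Return {feature: {"Control": ..., "Treatment": ..., "SMD": ...}}."""
    treated = [r for r in rows if r[treatment_col]]
    control = [r for r in rows if not r[treatment_col]]
    table = {}
    for feature in features:
        t_vals = [r[feature] for r in treated]
        c_vals = [r[feature] for r in control]
        t_mean, c_mean = statistics.mean(t_vals), statistics.mean(c_vals)
        t_sd, c_sd = statistics.stdev(t_vals), statistics.stdev(c_vals)
        pooled_sd = math.sqrt((t_sd ** 2 + c_sd ** 2) / 2.0)
        table[feature] = {
            "Control": f"{c_mean:.2f} ({c_sd:.2f})",
            "Treatment": f"{t_mean:.2f} ({t_sd:.2f})",
            "SMD": round((t_mean - c_mean) / pooled_sd, 4),
        }
    return table


rows = [
    {"is_treatment": 1, "age": 30},
    {"is_treatment": 1, "age": 40},
    {"is_treatment": 0, "age": 20},
    {"is_treatment": 0, "age": 30},
]
print(table_one_sketch(rows, "is_treatment", ["age"]))
```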


source

smd

 smd (feature, treatment)

Calculate the standardized mean difference (SMD) of a feature between the treatment and control groups.

The definition is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/#s11title

Args:
- feature (pandas.Series): a column of a feature to calculate SMD for
- treatment (pandas.Series): a column that indicates whether a row is in the treatment group or not

Returns:
- (float): The SMD of the feature
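
For a continuous feature, the referenced definition divides the difference in group means by the square root of the average of the two group variances. A minimal sketch with plain lists in place of `pandas.Series`, using sample variances (`smd_sketch` is a hypothetical name, and the library may handle edge cases differently):

```python
import math
import statistics


def smd_sketch(feature, treatment):
    """(mean_t - mean_c) / sqrt((var_t + var_c) / 2) for a continuous feature."""
    t_vals = [x for x, w in zip(feature, treatment) if w]
    c_vals = [x for x, w in zip(feature, treatment) if not w]
    pooled_sd = math.sqrt(
        (statistics.variance(t_vals) + statistics.variance(c_vals)) / 2.0
    )
    return (statistics.mean(t_vals) - statistics.mean(c_vals)) / pooled_sd


# Treatment mean is 2, control mean is 3, both sample variances are 1.
print(smd_sketch([1, 2, 3, 2, 3, 4], [1, 1, 1, 0, 0, 0]))
```

The sign shows the direction of the imbalance; balance checks typically compare the absolute value against a threshold such as 0.1.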