Metalearner Utils
get_xgboost_objective_metric
get_xgboost_objective_metric (objective)
Get the xgboost version-compatible objective and evaluation metric from a potentially version-incompatible input.
| | Type | Details |
|---|---|---|
| objective | | |
| Returns | | A tuple with the translated objective and evaluation metric. |
clean_xgboost_objective
clean_xgboost_objective (objective)
Translate objective to be compatible with loaded xgboost version
| | Type | Details |
|---|---|---|
| objective | | |
| Returns | | The translated objective, or the original if no translation was required. |
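As a rough illustration of what such a translation could look like, the sketch below assumes the incompatibility in question is the rename of `reg:linear` to `reg:squarederror` in XGBoost 0.90. The function name `translate_objective` and the `xgb_version` tuple argument are hypothetical and are not part of this module.

```python
def translate_objective(objective, xgb_version):
    """Map an objective name to the spelling an XGBoost version expects.

    Hypothetical helper: `xgb_version` is a (major, minor) tuple such as (1, 7).
    Assumption: the only rename handled is reg:linear <-> reg:squarederror.
    """
    uses_new_name = xgb_version >= (0, 90)
    if uses_new_name and objective == "reg:linear":
        return "reg:squarederror"
    if not uses_new_name and objective == "reg:squarederror":
        return "reg:linear"
    # Any other objective is passed through unchanged.
    return objective


print(translate_objective("reg:linear", (1, 7)))
print(translate_objective("binary:logistic", (1, 7)))
```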
check_explain_conditions
check_explain_conditions (method, models, X=None, treatment=None, y=None)
check_p_conditions
check_p_conditions (p, t_groups)
check_treatment_vector
check_treatment_vector (treatment, control_name=None)
convert_pd_to_np
convert_pd_to_np (*args)
regression_metrics
regression_metrics (y, p, w=None, metrics={'RMSE': rmse, 'sMAPE': smape, 'Gini': gini})
Log metrics for regressors.
Args:
- y (numpy.array): target
- p (numpy.array): prediction
- w (numpy.array, optional): a treatment vector (1 or True: treatment, 0 or False: control). If given, metrics are logged for the treatment and control groups separately
- metrics (dict, optional): a dictionary of the metric names and functions
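The per-group behaviour described for `w` can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the helper name `metrics_by_group` is hypothetical, it returns the values in a dictionary instead of logging them, and the local `rmse` is a stand-in metric.

```python
import numpy as np


def rmse(y, p):
    # Stand-in metric for the illustration.
    return np.sqrt(np.mean((y - p) ** 2))


def metrics_by_group(y, p, w=None, metrics=None):
    """Evaluate each metric overall, or per group when `w` is given."""
    metrics = metrics or {"RMSE": rmse}
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    if w is None:
        return {name: fn(y, p) for name, fn in metrics.items()}
    treated = np.asarray(w).astype(bool)  # 1/True: treatment, 0/False: control
    return {
        name: {
            "control": fn(y[~treated], p[~treated]),
            "treatment": fn(y[treated], p[treated]),
        }
        for name, fn in metrics.items()
    }


result = metrics_by_group([1, 2, 3, 4], [1, 2, 4, 6], w=[0, 0, 1, 1])
print(result)
```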
gini
gini (y, p)
Normalized Gini Coefficient.
Args:
- y (numpy.array): target
- p (numpy.array): prediction
Returns:
- e (numpy.float64): normalized Gini coefficient
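One common way to compute a normalized Gini coefficient is to compare the Lorenz-curve gap obtained when targets are ranked by the prediction with the gap obtained when they are ranked by the target itself. The sketch below follows that definition. It is an assumption about what "normalized" means here, and `gini_sketch` is a hypothetical name. It also assumes the targets are not all equal and do not sum to zero.

```python
import numpy as np


def gini_sketch(y, p):
    """Ratio of the Lorenz gap under the predicted ranking to the ideal one."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    n = len(y)
    diagonal = np.arange(1, n + 1) / n  # cumulative share under a random ranking

    def lorenz_gap(scores):
        # Rank targets by descending score; a stable sort keeps ties deterministic.
        ranked = y[np.argsort(-scores, kind="stable")]
        lorenz = np.cumsum(ranked) / np.sum(ranked)
        return np.sum(lorenz - diagonal)

    return lorenz_gap(p) / lorenz_gap(y)


y = np.array([1.0, 2.0, 3.0, 4.0])
print(gini_sketch(y, y))   # ranking identical to the target
print(gini_sketch(y, -y))  # ranking fully reversed
```

Under this definition a ranking that matches the target order scores 1 and a fully reversed ranking scores -1.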
rmse
rmse (y, p)
Root Mean Squared Error (RMSE).
Args:
- y (numpy.array): target
- p (numpy.array): prediction
Returns:
- e (numpy.float64): RMSE
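For reference, RMSE is the square root of the mean squared difference between target and prediction. A minimal NumPy sketch (the name `rmse_sketch` is hypothetical and used only to avoid implying this is the module's source):

```python
import numpy as np


def rmse_sketch(y, p):
    """Square root of the mean squared error."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.sqrt(np.mean((y - p) ** 2))


# Errors are 0, 0 and 2, so the result is sqrt(4 / 3).
print(rmse_sketch([1, 2, 3], [1, 2, 5]))
```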
smape
smape (y, p)
Symmetric Mean Absolute Percentage Error (sMAPE).
Args:
- y (numpy.array): target
- p (numpy.array): prediction
Returns:
- e (numpy.float64): sMAPE
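Several sMAPE variants exist. The sketch below assumes the variant that divides the absolute error by the mean of |y| and |p|, which bounds the result between 0 and 2. Whether this module uses exactly that variant is not stated here, and `smape_sketch` is a hypothetical name. Rows where both values are zero would divide by zero and are not handled.

```python
import numpy as np


def smape_sketch(y, p):
    """Mean of 2 * |y - p| / (|y| + |p|), a value between 0 and 2."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.mean(2.0 * np.abs(y - p) / (np.abs(y) + np.abs(p)))


print(smape_sketch([100, 200], [110, 180]))
```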
mape
mape (y, p)
Mean Absolute Percentage Error (MAPE).
Args:
- y (numpy.array): target
- p (numpy.array): prediction
Returns:
- e (numpy.float64): MAPE
ape
ape (y, p)
Absolute Percentage Error (APE).
Args:
- y (float): target
- p (float): prediction
Returns:
- e (float): APE
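APE is the per-observation quantity and MAPE is its mean over all observations. The sketch below shows that relationship. The names `ape_sketch` and `mape_sketch` are hypothetical, and dropping rows with a zero target is an assumption made to avoid division by zero, not documented behaviour.

```python
import numpy as np


def ape_sketch(y, p):
    """Absolute percentage error of one prediction; y must be non-zero."""
    return abs(p / y - 1.0)


def mape_sketch(y, p):
    """Mean APE over all rows with a non-zero target (assumption)."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    keep = y != 0
    return np.mean(np.abs(p[keep] / y[keep] - 1.0))


print(ape_sketch(100.0, 110.0))             # one prediction that is 10% too high
print(mape_sketch([100, 200], [110, 180]))  # both rows are off by 10%
```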
classification_metrics
classification_metrics (y, p, w=None, metrics={'AUC': roc_auc_score, 'Log Loss': logloss})
Log metrics for classifiers.
Args:
- y (numpy.array): target
- p (numpy.array): prediction
- w (numpy.array, optional): a treatment vector (1 or True: treatment, 0 or False: control). If given, metrics are logged for the treatment and control groups separately
- metrics (dict, optional): a dictionary of the metric names and functions
logloss
logloss (y, p)
Bounded log loss error.
Args:
- y (numpy.array): target
- p (numpy.array): prediction
Returns:
- bounded log loss error
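"Bounded" most plausibly means that predictions are clipped away from 0 and 1 before the logarithm is taken, so that a confident wrong prediction yields a large but finite loss. The sketch below assumes that reading. The bound `EPS = 1e-15` and the name `bounded_logloss` are illustrative assumptions.

```python
import numpy as np

EPS = 1e-15  # assumed clipping bound, not taken from the module


def bounded_logloss(y, p, eps=EPS):
    """Binary log loss with predictions clipped to [eps, 1 - eps]."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


print(bounded_logloss([1, 0], [0.5, 0.5]))  # equals log(2)
print(bounded_logloss([1, 0], [0.0, 1.0]))  # finite thanks to clipping
```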
MatchOptimizer
MatchOptimizer (treatment_col='is_treatment', ps_col='pihat', user_col=None, matching_covariates=['pihat'], max_smd=0.1, max_deviation=0.1, caliper_range=(0.01, 0.5), max_pihat_range=(0.95, 0.999), max_iter_per_param=5, min_users_per_group=1000, smd_cols=['pihat'], dev_cols_transformations={'pihat': mean}, dev_factor=1.0, verbose=True)
Finds the set of parameters that gives the best matching result.
Score = (number of features with SMD > max_smd) + (sum of deviations for important variables * deviation factor)
A lower score is better. The primary goal is to minimize the number of features whose SMD exceeds a given threshold (max_smd). At the same time, the matched dataset should not deviate too much from the original dataset in terms of the key variable(s), so that it retains a similar user base.
Args:
- treatment_col (str): name of the treatment column
- ps_col (str): name of the propensity score column
- max_smd (float): maximum acceptable SMD
- max_deviation (float): maximum acceptable deviation for important variables
- caliper_range (tuple): low and high bounds for caliper search range
- max_pihat_range (tuple): low and high bounds for max pihat search range
- max_iter_per_param (int): maximum number of search values per parameter
- min_users_per_group (int): minimum number of users per group in matched set
- smd_cols (list): score is more sensitive to these features exceeding max_smd
- dev_factor (float): importance weight factor for dev_cols (e.g. dev_factor=1 means a 10% deviation leads to penalty of 1 in score)
- dev_cols_transformations (dict): dict of transformations to be made on dev_cols
- verbose (bool): boolean flag for printing statements
Returns:
- The best matched dataset (pd.DataFrame)
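The scoring rule can be sketched in a few lines. The helper `match_score` and its dictionary inputs are hypothetical. The factor of 10 applied to each deviation is an assumption inferred from the `dev_factor` description (a 10% deviation with `dev_factor=1` costs 1 point), and the extra sensitivity for `smd_cols` is not modeled.

```python
def match_score(smd_by_feature, deviation_by_col, max_smd=0.1, dev_factor=1.0):
    """Lower is better: unbalanced-feature count plus a deviation penalty.

    smd_by_feature: {feature name: SMD after matching}
    deviation_by_col: {column name: relative deviation from the original data}
    """
    n_unbalanced = sum(abs(v) > max_smd for v in smd_by_feature.values())
    # Assumed scaling: a deviation of 0.10 contributes 1.0 when dev_factor is 1.
    penalty = sum(10.0 * abs(d) * dev_factor for d in deviation_by_col.values())
    return n_unbalanced + penalty


score = match_score(
    smd_by_feature={"pihat": 0.05, "age": 0.20},  # only "age" exceeds max_smd
    deviation_by_col={"pihat": 0.10},             # 10% shift in mean pihat
)
print(score)
```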
NearestNeighborMatch
NearestNeighborMatch (caliper=0.2, replace=False, ratio=1, shuffle=True, random_state=None)
Propensity score matching based on the nearest neighbor algorithm.
Attributes:
- caliper (float): threshold to be considered as a match
- replace (bool): whether to match with replacement or not
- ratio (int): ratio of control / treatment to be matched. Used only if replace=True
- shuffle (bool): whether to shuffle the treatment group data before matching
- random_state (numpy.random.RandomState or int): RandomState or an int seed
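To make the roles of `caliper` and `replace=False` concrete, here is a simplified greedy 1:1 matcher on a single propensity score. It is a sketch of the general technique rather than this class's implementation. The function name `greedy_match` is hypothetical, and treating the caliper as an absolute distance on the raw score is an assumption.

```python
import numpy as np


def greedy_match(ps_treatment, ps_control, caliper=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns a list of (treatment index, control index) pairs.
    """
    ps_treatment = np.asarray(ps_treatment, dtype=float)
    ps_control = np.asarray(ps_control, dtype=float)
    available = np.ones(len(ps_control), dtype=bool)
    pairs = []
    for i, score in enumerate(ps_treatment):
        if not available.any():
            break  # every control unit has been used
        distance = np.abs(ps_control - score)
        distance[~available] = np.inf  # already-matched controls are excluded
        j = int(np.argmin(distance))
        if distance[j] <= caliper:  # accept only matches within the caliper
            pairs.append((i, j))
            available[j] = False
    return pairs


pairs = greedy_match([0.30, 0.80], [0.32, 0.29, 0.10], caliper=0.05)
print(pairs)
```

In this example the first treated unit is paired with the control at 0.29, while the second has no control within the caliper and stays unmatched. Because the loop is greedy, the result depends on the order of the treated units, which is one reason a matcher may offer a shuffle option.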
create_table_one
create_table_one (data, treatment_col, features)
Report balance in input features between the treatment and control groups.
References:
- R's tableone at CRAN: https://github.com/kaz-yos/tableone
- Python's tableone at PyPI: https://github.com/tompollard/tableone
Args:
- data (pandas.DataFrame): total or matched sample data
- treatment_col (str): the column name for the treatment
- features (list of str): the column names of features
Returns:
- (pandas.DataFrame): a table with the means and standard deviations in the treatment and control groups, and the SMD between the two groups for the features
smd
smd (feature, treatment)
Calculate the standardized mean difference (SMD) of a feature between the treatment and control groups.
The definition is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/#s11title
Args:
- feature (pandas.Series): a column of a feature to calculate SMD for
- treatment (pandas.Series): a column that indicates whether a row is in the treatment group or not
Returns:
- (float): the SMD of the feature
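For a continuous feature, the referenced definition divides the difference in group means by the square root of the average of the two group variances. The sketch below implements that formula with NumPy arrays instead of pandas Series. The name `smd_sketch` is hypothetical, and using the sample variance (`ddof=1`) is an assumption.

```python
import numpy as np


def smd_sketch(feature, treatment):
    """(mean_t - mean_c) / sqrt((var_t + var_c) / 2) for one feature."""
    feature = np.asarray(feature, dtype=float)
    treated = np.asarray(treatment).astype(bool)
    t, c = feature[treated], feature[~treated]
    pooled_sd = np.sqrt(0.5 * (t.var(ddof=1) + c.var(ddof=1)))
    return (t.mean() - c.mean()) / pooled_sd


# Treated mean is 3, control mean is 2, both sample variances are 1.
print(smd_sketch([1, 2, 3, 2, 3, 4], [0, 0, 0, 1, 1, 1]))
```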