R Package


Please note that this package requires R version 3.3 or higher. If you do not already have an up-to-date version of R installed on your computer, you can find instructions on how to download the latest version here. This package also has R dependencies such as Rcplex, which requires you to install CPLEX first. If you do not already have an up-to-date version of CPLEX installed on your computer, you can find instructions on how to download and install it here.
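As a quick sanity check before installing, the R version requirement can be verified from the console. This is only a minimal sketch using base R; the variable names are illustrative and it does not check for CPLEX.

```r
# Minimum R version stated in the note above.
required <- "3.3.0"
current <- getRversion()

if (current < required) {
  stop("AHB needs R >= ", required, "; found R ", as.character(current))
} else {
  message("R ", as.character(current), " satisfies the AHB requirement.")
}
```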

Installation

AHB, along with its R dependencies, can be installed directly from CRAN via the following command:

install.packages('AHB')

Input Data Format

To begin using the R-AHB algorithm, first ensure that your dataset is stored as an R data frame. Covariates can be categorical, continuous, or mixed (categorical and continuous). In addition to the covariate columns, your dataset should include a binary or logical column that specifies whether each unit is treated (1) or control (0), and a numeric column that specifies each unit's outcome. Below is a sample dataset in the required format:
x_1 (numeric)  x_2 (numeric)  ...  x_m (numeric)  treated (binary or logical)  outcome (numeric)
3              2.0529         ...  4.7905         1                            4.5321
0              3.9932         ...  7.6513         0                            3.3348
...            ...            ...  ...            ...                          ...
1              6.9321         ...  1.5848         1                            6.9320
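For illustration, a data frame in this layout could be assembled and checked with base R as sketched below. The toy values, the column count, and the specific checks are assumptions made for the example, not requirements enforced by the package.

```r
set.seed(1)
n <- 6

# Hypothetical toy data: two covariates, a 0/1 treatment flag and a numeric outcome.
toy <- data.frame(
  x_1     = sample(0:3, n, replace = TRUE),   # discrete covariate
  x_2     = runif(n, min = 0, max = 10),      # continuous covariate
  treated = rbinom(n, size = 1, prob = 0.5),  # 1 = treated, 0 = control
  outcome = rnorm(n, mean = 5)                # numeric outcome
)

# Basic format checks before handing the data to a matching function.
stopifnot(
  is.data.frame(toy),
  all(toy$treated %in% c(0, 1)),
  is.numeric(toy$outcome),
  !anyNA(toy)
)
str(toy)
```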

Usage

To generate sample data for exploring AHB matching functionality, use the function gen_data as shown below. Remember to load the 'AHB' package, as shown in line 1, before calling any of the functions discussed in this section. This example generates a data frame with n = 50 units and p = 5 covariates:
library('AHB')
set.seed(45)
n <- 50
p <- 5
data <- gen_data(n_units = n, p = p)
holdout <- gen_data(n_units = n, p = p)
To run the algorithm, use the AHB_fast_match or AHB_MIP_match function as shown below. The required data parameter can be either a path to a .csv file or a data frame. In this example, the data and holdout data frames generated above are used:
AHB_MIP_out <- AHB_MIP_match(data = data,  holdout = holdout, treated_column_name="treated", outcome_column_name="outcome")
AHB_fast_out <- AHB_fast_match(data = data,  holdout = holdout, treated_column_name="treated", outcome_column_name="outcome")
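Since the data parameter may also be a path to a .csv file, it can help to confirm that a file round-trips with the expected columns before matching on it. The following is a minimal base R sketch; the data values and the temporary file are hypothetical and stand in for your own file.

```r
# Hypothetical example data in the required format.
df <- data.frame(
  x_1     = c(3, 0, 1, 2),
  x_2     = c(2.0529, 3.9932, 6.9321, 5.1204),
  treated = c(1, 0, 1, 0),
  outcome = c(4.5321, 3.3348, 6.9320, 2.8750)
)

# Write without row names so no extra index column appears in the file.
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Read the file back to confirm it has the expected columns and types.
check <- read.csv(csv_path)
stopifnot(identical(names(check), names(df)), is.numeric(check$outcome))

# csv_path could then be passed as the data argument instead of a data frame.
```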
Take AHB_fast_out as an example to illustrate the output of the AHB matching algorithms. The object AHB_fast_out is a list of five entries:
AHB_fast_out$data: The data set that was matched by AHB_fast_match(). If holdout is not a numeric value, AHB_fast_out$data is the same as the data passed to AHB_fast_match(). If holdout is a numeric scalar between 0 and 1, AHB_fast_out$data is the remaining proportion of the data that was matched.
AHB_fast_out$units_id: An integer vector with the unit_id of each test treated unit.
AHB_fast_out$CATE: A numeric vector with the conditional average treatment effect estimates for every test treated unit in its matched group in AHB_fast_out$MGs
AHB_fast_out$bins: The lower and upper bounds of the box (bin) constructed for each test treated unit. The units that fall inside a box form the corresponding matched group in AHB_fast_out$MGs.
AHB_fast_out$MGs: A list of all the matched groups formed by AHB_fast_match(). For each test treated unit, the corresponding row contains the unit_id of every unit that falls into its box, including the test treated unit itself.
To find the average treatment effect (ATE) or average treatment effect on the treated (ATT), use the functions ATE and ATT, respectively, as shown below:
ATE(AHB_out = AHB_fast_out)
ATT(AHB_out = AHB_fast_out)

AHB - Parameters and Defaults

AHB_fast_match(data, holdout = 0.1, treated_column_name = "treated", outcome_column_name = "outcome", black_box = "BART", cv = T, C = 0.1)
AHB_MIP_match(data, holdout = 0.1, treated_column_name = "treated", outcome_column_name = "outcome", black_box = "BART", cv = T, gamma0 = 3, gamma1 = 3, beta = 2, m = 1, M = 1e+05, n_prune = ifelse(is.numeric(holdout), round(0.1 * (1 - holdout) * nrow(data)), round(0.1 * nrow(data))))

Key Parameters for AHB_fast_match

data:
file, Dataframe, required
If holdout is not a numeric value, this is the data to be matched. If holdout is a numeric scalar between 0 and 1, that proportion of data will be made into a holdout set and only the remaining proportion of data will be matched.
holdout:
numeric, file, Dataframe, optional (default = 0.1)
Holdout data used to train the outcome model. If a numeric scalar, that proportion of data will be made into a holdout set and only the remaining proportion of data will be matched. Otherwise, if a file path or dataframe is provided, that dataset will serve as the holdout data.
treated_column_name:
string, optional (default = 'treated')
The name of the column which specifies whether a unit is treated or control.
outcome_column_name:
string, optional (default = 'outcome')
The name of the column which specifies each unit outcome.
black_box:
string, optional (default = 'BART')
Denotes the method used to generate the outcome model Y. If "BART" and cv = F, uses dbarts::bart with keeptrees = TRUE, keepevery = 10, verbose = FALSE, k = 2 and ntree = 200, and then the default predict method to estimate the outcome. If "BART" and cv = T, k and ntree are set to the best values found by cross-validation. Defaults to 'BART'. More choices for black_box will be available in the future.
cv:
logical, optional (default = T)
If TRUE, performs cross-validation on the training set to generate the outcome model Y.
C:
positive numeric scalar, optional (default = 0.1)
Determines the stopping condition for Fast AHB. When the variance in a newly expanded region exceeds C times the variance in the previous expansion region, the algorithm stops. Thus, higher C encourages coarser bins while lower C encourages finer ones. The user should analyze the data with multiple values of C to see how robust results are to its choice.
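One way to carry out the robustness check suggested for C is to re-run the matcher over a small grid and compare the resulting ATE estimates. The sketch below reuses only the calls shown in the Usage section; the grid of C values is arbitrary, and the code assumes ATE() returns a single number. It is skipped entirely if the AHB package is not installed.

```r
# Candidate values for C; these particular values are arbitrary.
C_values <- c(0.05, 0.1, 0.5, 1)

if (requireNamespace("AHB", quietly = TRUE)) {
  library(AHB)
  set.seed(45)
  data    <- gen_data(n_units = 50, p = 5)
  holdout <- gen_data(n_units = 50, p = 5)

  # Re-run the fast matcher once per C and collect the resulting ATE.
  ate_by_C <- sapply(C_values, function(C) {
    out <- AHB_fast_match(data = data, holdout = holdout,
                          treated_column_name = "treated",
                          outcome_column_name = "outcome",
                          C = C)
    ATE(AHB_out = out)  # assumed to be a single numeric value
  })
  print(data.frame(C = C_values, ATE = ate_by_C))
}
```

If the estimates change substantially across the grid, the conclusions are sensitive to C and should be reported with that caveat.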

Key Parameters for AHB_MIP_match

data:
file, Dataframe, required
If holdout is not a numeric value, this is the data to be matched. If holdout is a numeric scalar between 0 and 1, that proportion of data will be made into a holdout set and only the remaining proportion of data will be matched.
holdout:
numeric, file, Dataframe, optional (default = 0.1)
Holdout data used to train the outcome model. If a numeric scalar, that proportion of data will be made into a holdout set and only the remaining proportion of data will be matched. Otherwise, if a file path or dataframe is provided, that dataset will serve as the holdout data.
treated_column_name:
string, optional (default = 'treated')
The name of the column which specifies whether a unit is treated or control.
outcome_column_name:
string, optional (default = 'outcome')
The name of the column which specifies each unit outcome.
black_box:
string, optional (default = 'BART')
Denotes the method used to generate the outcome model Y. If "BART" and cv = F, uses dbarts::bart with keeptrees = TRUE, keepevery = 10, verbose = FALSE, k = 2 and ntree = 200, and then the default predict method to estimate the outcome. If "BART" and cv = T, k and ntree are set to the best values found by cross-validation. Defaults to 'BART'. More choices for black_box will be available in the future.
cv:
logical, optional (default = T)
If TRUE, performs cross-validation on the training set to generate the outcome model Y.
gamma0:
numeric scalar, optional (default = 3)
One of the hyperparameters in the global MIP; controls the weight placed on the outcome function portion of the loss.
gamma1:
numeric scalar, optional (default = 3)
One of the hyperparameters in the global MIP; controls the weight placed on the outcome function portion of the loss.
beta:
numeric scalar, optional (default = 2)
One of the hyperparameters in the global MIP; controls the weight placed on the number of units in the box.
m:
integer scalar, optional (default = 1)
The minimum number of control units that a box must contain when estimating the causal effect for a single treated unit.
M:
positive integer scalar, optional (default = 1e+5)
Controls the weight placed on the decision variable w_ij, which is an indicator for whether a unit is in the box.
n_prune:
positive integer scalar, optional (default = 0.1 * nrow(dataset to be matched))
Determines the number of candidate units selected to run the MIP on when constructing the box. Here, "dataset" refers to the dataset being matched. If the dataset has fewer than 400 units, the MIP is run on the whole dataset for each treated unit. If the dataset is larger and your computer does not have enough memory for the computation, set n_prune to a value below 400, or smaller still. The fewer candidate units the MIP is run on, the faster the program runs.
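To make the n_prune default concrete, the expression from the AHB_MIP_match signature can be evaluated by hand. The sketch below wraps that expression in a helper whose name, default_n_prune, is hypothetical, and applies it to a stand-in data set of 1000 rows.

```r
# Same expression as the n_prune default in the AHB_MIP_match signature,
# wrapped in a helper (the helper name is hypothetical).
default_n_prune <- function(data, holdout) {
  ifelse(is.numeric(holdout),
         round(0.1 * (1 - holdout) * nrow(data)),
         round(0.1 * nrow(data)))
}

data <- data.frame(x = seq_len(1000))  # stand-in for a 1000-unit data set

# holdout given as a proportion: 10% of the 900 units left for matching
n_prune_prop <- default_n_prune(data, holdout = 0.1)

# holdout given as a separate data frame: 10% of all 1000 units
n_prune_df <- default_n_prune(data, holdout = data.frame(x = 1:200))

print(c(proportion = n_prune_prop, data_frame = n_prune_df))
```

In both cases the default works out to one tenth of the units that are actually matched.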

Additional Functions - Parameters and Defaults

# returns average treatment effect
ATE(AHB_out)
# returns average treatment effect on the treated
ATT(AHB_out)

Key Parameters

AHB_out:
list, required
The output of a call to AHB_fast_match or AHB_MIP_match.