Propensity Score Matching in Excel
This tutorial will help you set up and perform a propensity score matching in Excel using the XLSTAT statistical software.
Dataset for running a propensity score matching
This tutorial uses a random subsample of a dataset originally published in Robert Lalonde (1986) and revisited in Dehejia and Wahba (1999).
The original study aimed at studying the effect of the participation in a particular job training program on the earnings of the individuals in 1978. Some confounding variables are suspected to introduce a serious bias in the study results so we wish to quantify this effect and build up a subsample of it that reduces this bias beforehand. This will be achieved by pairing similar individuals in terms of confounding variables using Propensity Score Matching.
There are 10 variables used in the original study. The first one is binary variable entitled Participation in job training that indicates if the participant did participate in the job training program (1) or not (0). This is our group variable. We believe that the probability for one participant to participate in the job training program is not random and that it can be explained to some extent with confounding variables.
The suspected confounding variables are the 8 variables below:
Age: the age of the participant in years;
Years of education: the number of years of education for the participant;
Afro-American: 1 if the participant is of Afro-American origin, 0 otherwise;
Hispanic-American: 1 if the participant is of Hispanic-American origin, 0 otherwise;
Married: 1 if the participant is married, 0 otherwise;
No degree: 1 if the participant did not receive any degree, 0 otherwise;
Earnings in 1974: the total revenue of the participant in 1974 expressed in 1978 dollars;
Earnings in 1975: the total revenue of the participant in 1975 expressed in 1978 dollars;
Age, years of education, earnings in 1974 and 1975 are quantitative data. The rest are binary categorical data.
Finally, the last variable named Earnings in 1978 is the total revenue of the participant in the year 1978. It was the variable of interest in the original study but it won't be used in this tutorial.
The subsample used in this tutorial is made of 200 participants, 82 of them did participate to the job training program.
Setting up a propensity score matching
First, open the downloaded file with Excel and activate XLSTAT. Once XLSTAT is activated, select the XLSTAT / Advanced features / Survival analysis / Propensity score matching (see below).
Once you have clicked on the button, the dialog box appears.
First, select the group variable, Participation in job training, in the column A. You should also select the treatment modality. In our case, this modality is 1. The treatment modality should be automatically detected when you select your data on the spreadsheet. If this is not the case, click on the refresh button located just on the right of the drop-down box.
Then, select the explanatory variables suspected to be the confounding factors. Click on the Quantitative checkbox to activate the quantitative field and select columns B, C, H and I corresponding to variables age, years of education, earnings in 1974 and earnings in 1975 respectively. You can use the Ctrl button to select multiple columns. Similarly, click on the Qualitative checkbox to activate the qualitative field and select columns D, E, F and G (Afro-american, Hispanic-American, Married and No Degree). As there are labels for each variable, the Variable labels checkbox should be ticked.
You are done with the General tab. The dialog box should look like the picture just above.
Let's go now to the Options tab to configure our matching method as shown on the picture below.
We will leave the options on the left side untouched. On the right side, activate the Optimal algorithm and then select the Mahalanobis distance in the drop-down box just below. Select One-to-One match in the number of matches and activate the caliper option. In the drop-down box just below, select 0.10 * sigma. We are done with the Options tab.
In the Missing data tab, we will leave the Remove the observations option activated as shown below.
In the Outputs, activate the option as shown below.
And finally, select normalized coefficients and box plot in the Charts tab as shown below.
Now, you can click on OK to start the computations.
Interpreting the results of the propensity matching
After the basic descriptive statistics of the selected variables, the first result is a table displaying the standardized coefficients with the associated chart as shown in the next figure.
Those are the coefficient of the logistic regression adjusted on the selected dataset. This is the model used to compute the propensity score associated to each participant.
The next table shown below displays a summary of the matching operation.
You can see in this table that 7% of participants of the Treatment group were not matched to any participant of the Control group. This means that there were no candidates found within the caliper radius of 0.10 * sigma. Just below the table, you can see the total cost of the matching operation. This is the sum of distances between participants of each pair. This criterion is useful to evaluate the efficiency of a particular matching operation.
Then, the propensity score and its lower and upper bounds are displayed as shown in the figure below.
Also displayed is the logit of the propensity score with its boundary values. This is the actual value used to compute the distance matrix shown just below.
In the distance matrix, participants of the treatment group are on rows and those of the control group on columns. Pairs of matched participants are indicated within the distance matrix by a bold distance value.
In the box plot shown below, you can see the effect of the matching operation on several parameters of the distribution of the propensity score within each group. Distributions are more comparable after the matching operation than they were before.
Finally, the last table displays the detailed matches between participants of each group as shown in the picture below. The distance between the two participants in term of logit of the propensity score is also given.
Using the XLSTAT statistical software, we were able to compute the propensity score associated to the participants of a study within Excel and perform a matching operation between participants based on the propensity score.
Was this article useful?