Difference in Difference in R: A Complete & Easy Tutorial

Conducting a difference in difference in R allows researchers to gain insight into the impact of a policy, or other outside factors, by taking into consideration two things:

  1. how a group's mean changes before and after the policy or other outside factor was implemented. This group is considered the treatment group.
  2. how that change compares with the change, over the same period, in the mean of a similar group that was not affected by the policy or other outside factor. This group is considered the control group.

[Image: difference in difference graph showing the treatment and control group trends before and after an intervention]

The above image, from the Columbia University School of Public Health, does a great job at visualizing what a difference in difference model explores. In the graph, the red line is considered the treatment group while the green line is considered the control group.

The vertical blue line indicates the introduction of an outside influence (in this case, the terms pre-intervention and post-intervention are used). As you can see, the treatment group was impacted by the introduction of that outside influence while the control group remained steady from its pre-intervention stage.

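The purpose of a difference in difference model in R, therefore, is to estimate the unobserved counterfactual outcome trend for the treatment group in the above graph, meaning the path the treatment group would have followed without the intervention, and to compare it with what actually happened.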
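
To make that idea concrete, here is a minimal sketch on simulated data. Every number and variable name in it (toy, treated, post, y) is made up for illustration and has nothing to do with the SoFi Stadium study. It computes the difference in difference by hand from the four group means and then shows that the interaction coefficient from lm() gives the same number.

```r
set.seed(42)
n <- 400

# Simulated data: two groups, two periods, 100 observations per cell
toy <- data.frame(
  treated = rep(c(0, 1), each = n / 2),
  post    = rep(c(0, 1), times = n / 2)
)

# Assumed data-generating process: baseline 100, group gap 10,
# common time trend 5, and a true treatment effect of 8
toy$y <- 100 + 10 * toy$treated + 5 * toy$post +
  8 * toy$treated * toy$post + rnorm(n, sd = 2)

# Mean outcome in each of the four group-by-period cells
m <- tapply(toy$y, list(treated = toy$treated, post = toy$post), mean)

# (change in treatment group) minus (change in control group)
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same quantity as the interaction term of a regression
fit <- lm(y ~ treated * post, data = toy)
did_from_lm <- unname(coef(fit)["treated:post"])

c(by_hand = did_by_hand, from_lm = did_from_lm)
```

Both numbers should land close to the simulated effect of 8, and they match each other because a model with only the two indicators and their interaction reproduces the four cell means exactly.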

I do understand I just used a lot of technical jargon to explain that, and I am sorry. I have found that lots of websites with tutorials such as this take such a “dive into the weeds” that it is difficult to actually learn how to do the process in R.

So, my main goal here is to show you how to do a difference in difference in R without needing to have a seriously deep understanding of the theoretical concepts that form the foundation of the model.

Difference in Difference in R: My Data and the Cleaning Process

For the purposes of this tutorial, I am going to make use of a research paper I wrote back in the Fall of 2021. The purpose of the paper was to explore the impact that SoFi Stadium had on neighborhoods in the surrounding area of Inglewood, California.

My initial impetus to write the paper occurred when I read a quote from Stan Kroenke, the owner of the Los Angeles Rams, that the construction of SoFi Stadium – at a price tag of $5.5 billion – would create a “ripple effect so profound” that it would “boost the neighborhood’s subpar property values along the way.”

Kroenke makes his point while ignoring what an increase in property value can produce through such a gentrification process: marginal damage to the local education system as neighborhoods skew towards higher-income residents, the depletion in the long-term viability and supply of “low-cost housing,” and the “deepening class polarization” within the neighboring urban housing markets are among just some of the chief concerns.

Because of this, I wanted to explore the impact of SoFi Stadium on the neighborhood’s property values before and after its construction. This naturally lends itself to the use of a difference in difference model in R.

For this specific study, I turned to Zillow’s ZTRAX data. The data provided by Zillow is free to access once approved through the application process. Moreover, and important to the construction of a proper difference in difference in R, the data can be manipulated into a time series, can be used to explore questions at both the macro and micro levels, and is overwhelmingly robust in both its accuracy and totality.

It is important to point out that the process of reading in the Zillow ZTRAX data proved to be extremely time and resource intensive. Because of this, the process of data compilation was moved to the Amazon cloud to make use of its Elastic Compute Cloud (EC2) service. Specifically, an RStudio Server Amazon Machine Image was booted onto an Amazon virtual server with a 64-core CPU and 488GB of RAM.

Even with such computing power, far beyond that of a typical desktop machine, the process of compiling the data from the text files into a data frame within the virtual, cloud-based environment was difficult, taking over two hours to compile and then another two hours to complete the subsequent cleaning and preparation process. I have a pretty powerful computer and am no stranger to cleaning the environment in R, but this was still a struggle.

The cleaning process, in short, entailed removing any transaction with incomplete data. In addition, only those transactions with a sales price greater than $1,000 were kept, which avoids family-based transactions where the house was sold for a nominal amount and would otherwise create extreme outliers in the data. Likewise, only those residences within a 20-mile radius of the stadium were kept to limit the scope of the study, as previous studies suggest that the effect on real estate values drops off sharply outside of that radius.
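
As a rough sketch of those three filters, here is what they could look like in base R on a tiny made-up data frame. The column names and values are assumptions for illustration, not the actual ZTRAX fields, and the sketch assumes a distance column (in miles) already exists.

```r
# Hypothetical toy transactions; names and values are illustrative only
sales <- data.frame(
  SalesPrice = c(450000, 1, NA, 620000, 300000),
  SqFt       = c(1500, 1200, 900, NA, 1100),
  distance   = c(3.2, 7.5, 12.0, 4.1, 25.0)
)

# 1. Drop any transaction with incomplete data
cleaned <- sales[complete.cases(sales), ]

# 2. Drop nominal (for example, family-based) sales of $1,000 or less
cleaned <- cleaned[cleaned$SalesPrice > 1000, ]

# 3. Keep only residences within 20 miles of the stadium
cleaned <- cleaned[cleaned$distance <= 20, ]

nrow(cleaned)
```

Only the first toy row survives all three filters. Printing the row count after each step is a cheap way to see how much data each rule removes.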

Finally, because I was also interested in how a housing unit’s distance from SoFi Stadium impacted the property value, the geosphere package was used to calculate each house’s total distance, in miles, from SoFi Stadium. To do so, the latitude and longitude of each house were compared against the latitude and longitude of SoFi Stadium and, using Vincenty’s ellipsoid formula, the distance was found:

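library(dplyr)
library(data.table)

# Attach the stadium's coordinates to every row
sofi.cleaned <- sofi.cleaned %>%
  mutate(sofilat = 33.95356,
         sofilog = -118.33859)
setDT(sofi.cleaned)  # make sure this is a data.table; the := syntax below requires one

# distVincentyEllipsoid() expects (longitude, latitude) columns and returns meters
meter2mile <- 0.000621371
sofi.cleaned[, distance := meter2mile * geosphere::distVincentyEllipsoid(
  cbind(PropertyAddressLongitude, PropertyAddressLatitude),
  cbind(sofilog, sofilat)) ]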
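
If you want a quick sanity check on distances like these without installing anything, a haversine calculation in base R gets you close. To be clear, this is a different and simpler formula than Vincenty's (it treats the Earth as a sphere), the function name haversine_miles is my own, and the Earth radius of 3,958.8 miles is an assumed average value.

```r
# Haversine great-circle distance in miles (spherical approximation, not Vincenty)
haversine_miles <- function(lon1, lat1, lon2, lat2, radius_miles = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against tiny floating point overshoots above 1
  2 * radius_miles * asin(pmin(1, sqrt(a)))
}

# Stadium coordinates used in this tutorial
sofi_lon <- -118.33859
sofi_lat <- 33.95356

# Hypothetical house exactly one degree of latitude north of the stadium
d <- haversine_miles(sofi_lon, sofi_lat + 1, sofi_lon, sofi_lat)
d
```

One degree of latitude is roughly 69 miles, so a result in that neighborhood suggests the longitude and latitude arguments are in the right order, which is the most common mistake with these functions.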

To explore the impact of SoFi Stadium on sales prices based on proximity to the location in this difference in difference in R model, it was necessary to build several different treatment groups into the data. To do so, I partitioned the homes into different groups based upon their relative distance from the stadium:

# case_when() assigns each house to a distance band; factor() makes it categorical
sofi.cleaned <- sofi.cleaned %>%
  mutate(distance_ord = factor(case_when(
      distance <= 5 ~ "Short",
      distance > 5 & distance <= 10  ~ "Moderate",
      distance > 10 & distance <= 15 ~ "Long",
      distance > 15 & distance <= 20 ~ "Very Long")))

Categorizing the houses in such a fashion in effect creates a series of concentric circles with SoFi Stadium serving as the focal point. All the residences fall into one of the pre-determined groups (five miles or less, more than five and up to ten miles, more than ten and up to 15 miles, and more than 15 and up to 20 miles) without overlap.
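
As an aside, the same kind of banding can be sketched with base R's cut(), which guarantees that the bins share their edges and leave no gaps. The distances below are made up for illustration. One thing to watch: cut() orders the levels by the breaks rather than alphabetically, so I use relevel() to keep 'Long' as the reference level.

```r
# Hypothetical distances in miles, including values right on the bin edges
distance <- c(0.8, 5, 5.00005, 9.9, 10, 14.2, 15, 19.7, 20)

# Bins are [0, 5], (5, 10], (10, 15], (15, 20]
distance_ord <- cut(
  distance,
  breaks = c(0, 5, 10, 15, 20),
  labels = c("Short", "Moderate", "Long", "Very Long"),
  include.lowest = TRUE
)

# Count the houses in each concentric circle
counts <- table(distance_ord)
counts

# Make "Long" the first level so a regression would use it as the reference
distance_ord <- relevel(distance_ord, ref = "Long")
levels(distance_ord)
```

A quick table() like this is also a handy way to confirm that no observation ended up as NA because it fell outside every band.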

After the coding construction of the concentric circles, the number of residences was as follows: short (n = 11,000), moderate (n = 26,585), long (n = 19,426), and very long (n = 41,773). In other words, this difference in difference model in R will draw on nearly 100,000 observations (98,784 in total), which gives the findings a solid footing in both breadth and depth.

One of the last steps in the data preparation process is one of the most important steps. While providing a “distance” variable is helpful in finding deeper meaning in the results, it is not a true example of a treatment and control setup for a difference in difference model. Remember: we are interested in the impact of SoFi Stadium before and after its construction.

Because of this, a variable needed to be created to indicate whether the sale of the property took place prior to 2016 or after 2016, as the land to build SoFi Stadium was purchased in that year:

sofi.cleaned$post16 = as.numeric(sofi.cleaned$year >= 2016)

In this case, houses sold prior to 2016 are the control group whereas houses sold in 2016, or after, are now part of the treatment group. Now that the data is cleaned, organized, and includes both the control and treatment variable, we can move to the construction of the actual model to do a difference in difference in R.
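
Before fitting anything, it can be worth checking that every distance group actually has sales on both sides of 2016. Here is a small sketch on made-up data (the rows are purely illustrative) that builds the indicator the same way and cross-tabulates it against the distance group.

```r
# Hypothetical sales; years and groups are illustrative only
toy <- data.frame(
  year         = c(2012, 2015, 2016, 2019, 2014, 2021),
  distance_ord = c("Short", "Short", "Short", "Long", "Long", "Long")
)

# 1 if the sale happened in 2016 or later, 0 otherwise
toy$post16 <- as.numeric(toy$year >= 2016)

# Rows are distance groups, columns are the pre (0) and post (1) periods
cell_counts <- table(toy$distance_ord, toy$post16)
cell_counts
```

An empty cell in a table like this would mean the model cannot estimate the interaction for that group.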

Difference in Difference in R: Construction of the Model

We are now prepared to construct a difference in difference in R model that compares the periods before and after the announcement of the construction of SoFi Stadium, in order to explore the impact on home values for residences near the site, grouped into concentric circles with SoFi Stadium in the middle.

As well, the difference in difference model framework used in this example accounts for unique property characteristics, including the number of bedrooms, the total square footage, the age of the dwelling, and the total calculated number of bathrooms.

As for the actual model used, there is no need to get fancy here, as a standard interaction model is used:

price = β_0 + δ_0 post16 + β_1 ordinance + δ_1 (ordinance × post16) + ε

To that end, a difference in difference in R is, at its core, a regression model. And, given that the regression intercept is the prediction when all predictors are zero, it is necessary to mean center the covariates, as it is understood that houses do, indeed, have age, bedrooms, bathrooms, and square footage.

Please note that mean centering variables is not always necessary. But doing so is common practice when working with regression models that use real estate data.

Once the linear model is created, the mean centering can be conducted by passing a character vector with the names of the variables to be centered:

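library(rockchalk)  # provides meanCenter()

# The difference in difference model: distance group interacted with the
# post-2016 indicator, plus property characteristics as controls
model <- lm(SalesPrice ~ distance_ord * post16 +
              age + TotalBedrooms + TotalCalculatedBath + SqFt,
            data = sofi.cleaned)

# Names of the covariates to mean center
v.center <- c("TotalBedrooms", "SqFt", "age", "TotalCalculatedBath")

# Refit the model with the listed covariates centered at their means
model.centered <- meanCenter(model, centerOnlyInteractors = TRUE, centerDV = FALSE,
                             standardize = FALSE, terms = v.center)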

The first bit of coding above is the most important part for you to look at, as it is the actual difference in difference in R model.

To explain: we are creating a regression model using SalesPrice and distance_ord, interacted with post16, and then finding the unobserved counterfactual outcome trend (as mentioned at the beginning of this article) for those houses sold post-2016.

The other pieces (v.center and meanCenter) are the additions needed to mean center the variables in this example. As I mentioned, it is not necessary to do this in all circumstances. But it is very helpful when working with real estate data.
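
If you are curious what mean centering actually changes, here is a rough sketch that does it by hand in base R on simulated data. All names and numbers here (homes, group, post, the coefficients) are invented for illustration. The point is that centering a covariate moves the intercept to "a house with average square footage" while leaving the difference in difference estimate untouched.

```r
set.seed(1)
n <- 200

# Simulated homes: one comparison group, one period indicator, one covariate
homes <- data.frame(
  group = rep(c(0, 1), each = n / 2),
  post  = rep(c(0, 1), times = n / 2),
  SqFt  = round(runif(n, 800, 3000))
)
homes$price <- 50000 + 20000 * homes$group + 30000 * homes$post +
  15000 * homes$group * homes$post + 150 * homes$SqFt + rnorm(n, sd = 5000)

# Model on the raw covariate: the intercept describes a home with 0 square feet
raw <- lm(price ~ group * post + SqFt, data = homes)

# Center the covariate by subtracting its mean, then refit
homes$SqFt_c <- homes$SqFt - mean(homes$SqFt)
centered <- lm(price ~ group * post + SqFt_c, data = homes)

# Intercepts differ; the interaction (the difference in difference) does not
rbind(raw = coef(raw)[c("(Intercept)", "group:post")],
      centered = coef(centered)[c("(Intercept)", "group:post")])
```

The centered intercept equals the raw intercept plus the square footage slope times the average square footage, which is why it reads as the expected price of an average-sized home in the reference group before the intervention.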

And that is it. You have officially created your first difference in difference in R model. After, you can start to interpret the results.

Difference in Difference in R: Understanding the Results

The above image presents the results of our just constructed difference in difference in R regression model.

The Intercept and initial Post 16 are based on the ‘Long’ distance group because, by default, R orders factor levels alphabetically and uses the first level as the reference point in a regression.

In this case, the regression indicates that houses in the ‘Long’ distance group, with average age, bedrooms, bathrooms, and square footage, sold at a mean of $378,857.91 before 2016.

Over that same pre-2016 period, houses in the ‘Short’ distance group sold for, on average, $65,002.37 less than the regression’s reference point, ‘Moderate’ houses sold for $5,563.49 more, and ‘Very Long’ houses sold for an estimated $26,231.92 more.

In the post-2016 years, the difference in difference model highlights a $195,313.34 increase in mean sales prices over the intercept.

Relative to that post-2016 change for the reference group, houses in the ‘Short’ circle averaged an additional $33,893.27 increase, houses in the ‘Moderate’ circle averaged $9,959.77 less, and houses in the ‘Very Long’ circle averaged an additional $54,782.33 increase.

In Table 2 in the above image, the Intercept combined with the post-2016 increase gives a post-2016 average sales price of $574,171.25 for the reference group.

The Adjusted Treatment is based off that reference point. For instance, the ‘Short’ post-2016 average of $33,893.27 was added to the post-2016 Intercept average for a ‘Short’ post-2016 total average of $608,064.53.
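
The arithmetic described here is easy to reproduce. This sketch simply plugs in the coefficients quoted in this section. The variable names are my own, and it only covers the sums stated in the text, not the full table.

```r
# Coefficients as reported in the regression output discussed above
intercept   <- 378857.91   # 'Long' reference group, pre-2016
post16_coef <- 195313.34   # post-2016 change for the reference group
interaction <- c("Short" = 33893.27, "Moderate" = -9959.77, "Very Long" = 54782.33)

# Post-2016 average for the reference group
post16_reference <- intercept + post16_coef

# Adjusted Treatment: post-2016 reference average plus each circle's interaction term
adjusted_treatment <- post16_reference + interaction

round(post16_reference, 2)
round(adjusted_treatment, 2)
```

The 'Short' value comes out one cent below the $608,064.53 quoted above, which is most likely down to rounding in the displayed coefficients.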

After calculating the Adjusted Control and Adjusted Treatment, the Percent Difference presents the total growth between the two.

Based on the difference in difference approach, those houses in closest proximity, within a five-mile radius, had the largest increase, by far, between Adjusted Control and Adjusted Treatment.

Additionally, Table 3 presents the results of the difference in difference in R model based on the rate of growth between 1993 and 2016 and then after the announcement of SoFi Stadium in 2016 to 2021. A standard rate of growth equation was used:

PercentChange = 100 × (2016Mean - 1993Mean) / 1993Mean
GrowthRate = PercentChange / NumberOfYears
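
As a small sketch, the two formulas can be wrapped in one R function. The function name and the example prices below are hypothetical and are not taken from the study's tables.

```r
# Average annual growth rate (in percent) between two mean prices
growth_rate <- function(start_mean, end_mean, n_years) {
  percent_change <- 100 * (end_mean - start_mean) / start_mean
  percent_change / n_years
}

# Hypothetical example: mean price rises from $200,000 to $320,000 over 1993-2016
rate <- growth_rate(200000, 320000, 2016 - 1993)
rate
```

Note that this is a simple average rate (total percent change divided by the number of years), not a compound annual growth rate.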

As seen in the table, after the announcement of SoFi Stadium, the rate of growth in average home sale prices for those in the Short distance group was 4.72%, more than double the rate of growth for houses in all other concentric distances.

The results of this difference in difference model are certainly concerning, as they highlight that home values of residences closest to SoFi Stadium, meaning within the five-mile radius, are increasing at a higher percentage and growth rate post-construction of the facility relative to houses in the farther concentric circles.

This is especially concerning as that area of Inglewood has a long history of being home to minority citizens who are now being displaced as an ongoing result of the gentrification process.

Difference in Difference in R: Concluding Thoughts

The difference in difference model in R can be a very powerful tool. As highlighted in this example, the process of constructing the model is quite simple as, in the end, it is just a call to the lm() function.

But, in order for it to be a true difference in difference approach, you have to be sure to provide a variable that places your observations in either a control group or a treatment group.

In this case, since the outside influence I wanted to explore was SoFi Stadium, I determined that the groups would be defined by the year the home was sold, either before or after 2016, to coincide with the construction of SoFi Stadium.
