A Better Way To Work With Zillow ZTRAX Data: A Guide To Wrangling the Data in R

For researchers and academics with any interest in housing data, Zillow’s ZTRAX database is a must. ZTRAX, short for the Zillow Transaction and Assessment Dataset, is unquestionably the largest real estate database ever made available – free of charge – to qualified academic, nonprofit, and governmental researchers.

Previously updated on a quarterly basis, the database includes:

  • More than 400 million detailed public records across 2,750+ U.S. counties;
  • More than 20 years of deed transfers, mortgages, foreclosures, auctions, property tax delinquencies and more, for both commercial and residential properties;
  • Property characteristics, geographic information and prior valuations for approximately 150 million parcels in 3,100+ counties nationwide.


That said, the documentation provided with ZTRAX is unhelpful when it comes to figuring out how to actually get the data into your RStudio environment. Further, the structure of the ZTRAX data itself hampers your ability to quickly locate relevant data and begin analyzing it. The organization is intimidating, overwhelming, and – through no fault of yours – never clearly explained.

Because of that, the point of this post is to provide a clear pathway to understanding ZTRAX and to get you on your way to analyzing it as quickly as possible.

The Layout and Structure of Zillow ZTRAX Data

For each version of ZTRAX data you retrieve from Zillow’s server, you are going to see three initial files that are vitally important to understand if you wish to efficiently pull in the data you are looking for.

[Screenshot: the ZTRAX data folders]

As seen in the above screenshot, the ZTRAX data is broken down into two different data “dumps.” The first is all data concerning assessment and the second relates to all transactions.

While the assessment data certainly has a time and place for its use, I think the majority of people working with ZTRAX data are interested in the transaction information.

Because of that, let’s take a deep dive into how to decipher where information is stored and how to get it into RStudio.

First, you need to take a close look at the “Main Guide” tab within the spreadsheet that lays out the transactions/event data information, as seen in the following screenshot:

[Screenshot: the “Main Guide” tab of the transactions layout spreadsheet]

These spreadsheets will serve as your “bible” as you work with Zillow ZTRAX data. In the above screenshot, you can see that all the individual data points are broken down by broad description, or “GROUP.” In this case, let us look at the “Buyer Info” information within the spreadsheet. Within “Buyer Info,” we can explore any number of things, including:

  1. BuyerMailHouseNumber
  2. BuyerMailZip
  3. BuyerIndividualFullName

… and the list goes on and on.

However, once you have an idea of what you would like to examine, you next need to look at “Table Name,” as that will tell you where specifically the data is stored and what input you need to provide to access it.

For example, if you wanted to gather all the information for the “RecordingDistrict” (as provided in “Field Name”) within “Jurisdiction” (as provided in “Group”), you would need to access information housed in the table name “Region Specific.”

Once you have that information, you can see how the naming conventions carry over to the massive .txt files that ultimately store all of the information:

As you can see from the file sizes, ‘RegionSpecific’ is one of the smaller .txt files in terms of the total amount of data it holds. On the other hand, the “Main” .txt file contains over 28 gigabytes of information.

And that is just for one state (in the case above, I am looking at the .txt files for just California).

Because of that, it is important that you grab data just for the states you are looking to work with. Otherwise, you will be trying to wrangle an unnecessary amount of information.

Once you are into the “backend” of ZTRAX, you first have to find the correct folder. To make things a bit annoying right out of the gate, the ZTRAX folders are not named after specific states but after state-specific FIPS codes.

Please know that some of the larger states – like California – can have a folder that is upwards of 50 gigabytes of data. In other words, either have a high-speed internet connection or be prepared to wait quite some time for it to download.

To make sure you are grabbing data for the correct state, you can see a list of state-specific FIPS codes here.
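As a quick sanity check before downloading, it can also help to keep a small lookup of the FIPS codes you care about inside R itself. The snippet below is a minimal, hand-typed sketch covering only a handful of states (the two-digit codes shown are the standard state FIPS codes); extend it with whichever states you need:

```r
# A minimal, hand-typed lookup of two-digit state FIPS codes
# (only a few states shown; add the ones you need)
state_fips <- c(CA = "06", NY = "36", TX = "48", UT = "49", PA = "42")

# Which ZTRAX folder holds California? The "06" folder.
state_fips["CA"]
```

This way you can double-check a folder number against a state abbreviation without leaving your R session.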

Zillow ZTRAX: Getting The Data into RStudio

##LOAD REQUIRED LIBRARIES

library(tidyverse)
library(data.table)
library(readxl)

###SETTING DIRECTORY NAME
dir <- "D:\\ZTRAX\\Research\\SoFiStadium\\Data"  ###Switching to the correct directory

######################
### Prototyping ###
######################

###CALLING IN LAYOUT STYLESHEET
layoutZAsmt <- read_excel(file.path(dir, 'aug2020', 'layout.xlsx'), sheet = 1)
layoutZTrans <- read_excel(file.path(dir, 'aug2020', 'layout.xlsx'), 
                           sheet = 2,
                           col_types = c("text", "text", "numeric", "text", "text"))

First and foremost, you must be sure to work out of the directory to which you downloaded your data (in the above code, I store my directory path in “dir”).

After that, you need to be sure to call in the “layout.xlsx” stylesheet for either assessment or transaction data.

More often than not, you are likely to be working with the transactions data. But, since it is so easy to bring in both stylesheets as they are housed within the same file (just within different tabs at the bottom), it is worth doing it from the beginning so you may see what each version of the data has and what might be useful to you.

As you can see in the provided code, you use the ‘read_excel’ function from the readxl package to pull in the layouts, being sure to indicate that the assessment layout is housed on sheet 1 and the transaction layout on sheet 2.

As well, the above code shows that we manually provide the column types (numeric, text, etc.) for the transaction layout.
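The column positions you will later feed to ‘fread’ (for example, the ‘select = c(1, 18, 25)’ used below) come straight from the layout spreadsheet, so it is worth looking them up programmatically rather than counting by hand. The snippet below is a sketch using a toy data frame standing in for the real layoutZTrans (the ‘TableName’, ‘FieldName’, and ‘column_id’ names mirror the layout spreadsheet, but the toy rows here are purely illustrative):

```r
# Toy stand-in for the layoutZTrans sheet (illustrative rows only;
# the real sheet comes from layout.xlsx)
layout_toy <- data.frame(
  TableName = c("utMain", "utMain", "utMain"),
  FieldName = c("TransId", "DocumentDate", "SalesPriceAmount"),
  column_id = c(1, 18, 25)
)

# Look up the column positions of the fields you want,
# ready to hand to fread's `select` argument
wanted <- c("TransId", "DocumentDate", "SalesPriceAmount")
idx <- layout_toy$column_id[match(wanted, layout_toy$FieldName)]
idx
```

Against the real layout sheet, the same `match()` pattern saves you from miscounting a column in a 100+ field table.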

Once you pull the layout information in, the fun part starts. You can now retrieve the physical data housed in some of those massive .txt files. The code below is the beginning process.

######################
### ZTrans/Main.txt ###
######################

###CREATING DATA.TABLE OF utMain STYLESHEET
col_namesMainTras.new <- layoutZTrans[layoutZTrans$TableName == 'utMain', 'FieldName']

###PIVOTING STYLESHEET TO WIDE FORMAT
col_namesMainTras.new <- col_namesMainTras.new %>%
  pivot_wider(names_from = FieldName, values_from = FieldName)

sofi.aug2020.transactions <- fread(file.path(dir, "aug2020", "ZTrans", "Main.txt"),
                      select = c(1, 18, 25),
                      sep = '|',
                      header = FALSE,
                      stringsAsFactors = FALSE,       
                      quote = "")

sofi.aug2020.transactions <- sofi.aug2020.transactions %>%
  rename(
    TransID = V1,
    DocumentDate = V18,
    SalesPrice = V25
  )

###REMOVING ALL ROWS WITH NA INFORMATION
sofi.aug2020.transactions <- na.omit(sofi.aug2020.transactions)

At this point, remember that you are working with very large data, so be thoughtful about what you pull in. To avoid catching your computer on fire, or needing to spend money on an AWS instance, it is vital that you make use of the ‘select’ argument within the ‘fread’ function.

As well, you may need to regularly clean the environment in R.
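Cleaning the environment in practice just means removing objects you no longer need and asking R to release the memory. A minimal pattern (the object name here is a stand-in for whatever large table you are done with):

```r
# Stand-in for a large intermediate table you no longer need
big_intermediate <- data.frame(x = 1:10)

# Drop it from the environment ...
rm(big_intermediate)

# ... then ask R's garbage collector to release the freed memory
invisible(gc())

exists("big_intermediate")  # the object is gone
```

Doing this between each large ‘fread’ call keeps your session from accumulating multiple multi-gigabyte tables at once.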

To start, you create a new data frame in your RStudio environment (titled, in this case, col_namesMainTras.new). After that, you specify the “TableName” to pull information from, and indicate that you want the “FieldName” entries from that “TableName.”

As the data is housed in “long” format, we need to do a simple pivot to “wide” format. After that, you can use the ‘select’ argument within the ‘fread’ function to specify the exact columns you want to grab information for.

Once the information is compiled onto your local machine, the next bit of coding is simply supplying detailed names to the columns and, afterward, dropping any ‘TransID’ that is missing any of the sought information.

You can complete the above for any of the associated .TXT files in the ZTRAX database. Once you complete all of the necessary work, merging them together into one complete file is simple:

######################
### MERGING ###
######################

sofi.data <- sofi.data %>%
  left_join(sofi.buildingarea, by = "RowID")

###REMOVING ALL ROWS WITH NA INFORMATION##
sofi.data <- na.omit(sofi.data)

As you can see in the above code, it is vitally important that you maintain the ‘RowID’ variable for each observation, as it is the primary way to match up all the data across the dozens of .txt files at your disposal within the ZTRAX database.
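When you have more than two tables keyed on the same ID, you can fold them together in one step instead of chaining joins by hand. The sketch below uses base R’s `merge()` and `Reduce()` with toy tables (the table names and values are illustrative stand-ins for your ZTRAX extracts); with the tidyverse loaded, `purrr::reduce()` with `left_join()` works the same way:

```r
# Toy tables keyed on RowID (stand-ins for ZTRAX extracts)
main    <- data.frame(RowID = 1:3, SalesPrice = c(500000, 650000, 720000))
area    <- data.frame(RowID = 1:3, SqFt       = c(1500, 1800, 2100))
yearblt <- data.frame(RowID = 1:3, YearBuilt  = c(1965, 1990, 2004))

# Fold every table into one frame, matching on RowID at each step
merged <- Reduce(function(x, y) merge(x, y, by = "RowID"),
                 list(main, area, yearblt))
merged
```

This scales cleanly no matter how many of the ZTRAX .txt extracts you end up joining.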

Lastly, if you have an interest in geocoding the information, you can run the code below. In this case, I wanted to measure the distance of each house – in miles – from SoFi Stadium in Inglewood, California:

######################
### GEO-CODING ###
######################

###CREATING VARIABLES FOR SOFI STADIUM LAT AND LONG
sofi.data <- sofi.data %>%
  mutate(sofilat = 33.953581,
         sofilon = -118.339185)

###CREATING VARIABLE FOR DISTANCE FROM PROPERTY TO SOFI STADIUM
meter2mile <- 0.000621371
sofi.data[, distance := meter2mile * geosphere::distVincentyEllipsoid(
  cbind(Longitude, Latitude),
  cbind(sofilon, sofilat)) ]
sofi.data

In the first chunk of code, I set the coordinates for SoFi Stadium. In the second chunk, I run those coordinates against the coordinates of every house in our merged database to calculate the distance. The ‘meter2mile’ constant lets us transform the distance, output in meters by default, into miles.
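With the distance variable expressed in miles, subsetting to a study radius is a one-liner. A base-R sketch with a toy frame (the distances below are hypothetical; 0.000621371 is the same meters-to-miles constant used above, and 1,609.344 meters is exactly one mile):

```r
meter2mile <- 0.000621371

# Toy frame of properties with distances converted from meters to miles
homes <- data.frame(
  TransID  = 1:4,
  distance = c(804.672, 3218.688, 1609.344, 8046.72) * meter2mile
)  # roughly 0.5, 2, 1, and 5 miles

# Keep only properties within two miles of the stadium
nearby <- subset(homes, distance <= 2)
nrow(nearby)
```

In my own work, narrowing to a radius like this early keeps the later regression data frames small and fast.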

Working with Zillow ZTRAX in RStudio: Final Thoughts

While I could go on and on regarding all kinds of specific things you can do with the data, this is neither the time nor place. My main goal here was to provide a more in-depth look at how to retrieve the Zillow ZTRAX data than anywhere else I looked when I was trying to figure it out.

Hopefully this guide to Zillow ZTRAX data was helpful. That said, if you still have questions or you are running into issues, do not be afraid to reach out. Feel free to use the ‘Contact’ form at the top of my website, or to reach out to me on Twitter (@BradCongelio).
