Thanks to the work of a handful of people (@mrcaseb, @benbbaldwin, @_TanHo, @LeeSharpeNFL, and @thomas_mock … to name a few), getting started with advanced analytics using NFL data is now easier than ever.
Without getting too far into the weeds of the history behind all this, the above-mentioned people are responsible for the creation of the ‘nflverse,’ a superb collection of data and R-based packages that gives anybody access to deeply rich NFL data dating back as far as 1999.
The ‘nflverse’ was ultimately birthed from Ron Yurko’s ‘nflscrapR’ project. At the time of developing ‘nflscrapR,’ Ron was a PhD student in the Statistics and Data Science program at Carnegie Mellon University. While I do not know the entire backstory, I believe ‘nflscrapR’ was more of a hobby for Ron and, as interest in its functions took off, he was more than happy to see the aforementioned Ben Baldwin and Sebastian Carl take over with their creation of ‘nflfastR.’ As ‘nflfastR’ became more robust and feature-rich, Ron formally announced the “end” of ‘nflscrapR’ on September 14, 2020.
As of the writing of this introductory tutorial, the ‘nflfastR’ project has grown by leaps and bounds. The ‘nflfastR’ project itself has merged into the above-mentioned ‘nflverse,’ which includes the core packages required to do everything from introductory NFL analytics to advanced modeling and machine learning projects. Housed within the ‘nflverse’ are:
- the core set of functions to efficiently scrape NFL play-by-play data back to the 1999 season
- a “low-level” package for downloading data from all the repositories hosted within the ‘nflverse’
- this package hosts all of the necessary code that powers Ben Baldwin’s infamous (?) 4th down bot on Twitter
- created by Lee Sharpe, this package allows users to simulate thousands of NFL seasons via user-created models
- contains significant amounts of data, some of which helps power the end result of the ‘nflverse’
For the purposes of this introductory tutorial, we are going to be focusing on ‘nflreadr’ to grab NFL play-by-play data from the 2020 season. Afterward, we will explore the various metrics cooked into the data. And, finally, we will explore several ways to manipulate the data to start finding results and observations that are worth visualizing (which will be the next topic in this ongoing tutorial).
Before diving into this part of the tutorial, I need to forewarn you that I am making the assumption that you – the reader – are somewhat familiar with R and RStudio. In this case, R is the underlying software that performs the actual instructions. It is the “workhorse,” if you will, of all the coding you perform.
RStudio, on the other hand, is the Integrated Development Environment (IDE). In other words, it is the “graphical interface” that provides a clearer picture of what you are doing with your R code.
Oscar Torres-Reyna, from the Data and Statistical Services department at Princeton University, put together a fantastic .pdf file that illustrates all the different components of RStudio. While it is a bit dated (it was created in 2013), it is still an outstanding primer on knowing how R and RStudio work together and what the different panels are within the IDE.
As well, I am assuming you are familiar with the ‘tidyverse,’ which is a collection of packages that share an “underlying design philosophy, grammar, and data structures.” When manipulating NFL data, packages within the ‘tidyverse’ such as ‘dplyr’ and ‘purrr’ are often used.
Moreover, I need to clarify that I still use the piping function built into the ‘magrittr’ package rather than the new base R piping that was introduced in version 4.1 of R. So, for example, the following two bits of code do the exact same thing:
## magrittr piping
magrittr.piping <- data %>%
  filter(play_type == "pass")

## Base R 4.1 piping
baser.piping <- data |>
  filter(play_type == "pass")
As I said, I still use the magrittr piping in all of my code because, frankly, I do not like change.
However, if you look at code elsewhere for examples and/or help during your NFL analytics learning process, don’t be confused if you see the new base R piping being used.
In the end, they do the same thing.
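If you want to convince yourself of that, here is a minimal sketch on a made-up dataframe. The toy_pbp name and its values are hypothetical stand-ins of my own, not real play-by-play data, and the base R pipe needs R 4.1 or newer.

```r
library(dplyr)

# Toy stand-in for play-by-play data (hypothetical values, not real NFL data)
toy_pbp <- tibble::tibble(
  play_type = c("pass", "run", "pass", "punt"),
  yards_gained = c(12, 3, -2, 0)
)

# magrittr pipe (made available by loading dplyr)
magrittr_piping <- toy_pbp %>% filter(play_type == "pass")

# Base R pipe (requires R 4.1 or newer)
baser_piping <- toy_pbp |> filter(play_type == "pass")

# TRUE when both pipes return exactly the same dataframe
identical(magrittr_piping, baser_piping)
```

For simple chains like this one, which pipe you use comes down to taste.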
With all that sorted out, let’s jump into collecting NFL data using ‘nflreadr.’
NFL Analytics: Collecting Data with ‘nflreadr’
Before you dive into pulling NFL play-by-play data into your RStudio environment, you need to make sure you have all the required packages installed.
In this case, we just need two to get started: ‘tidyverse’ and ‘nflreadr.’
Doing so within RStudio is easy by utilizing the install.packages() function and then loading the packages for use with the library() function, as seen below:
install.packages("tidyverse")
install.packages("nflreadr")

library(tidyverse)
library(nflreadr)
With both packages installed and loaded, you can jump into pulling NFL play-by-play data. For this working example, let’s grab all data from the 2020 NFL season. Again, I will show two examples of the code to highlight different ways of “structuring” it for personal preference while achieving the same goal.
data.2020 <- load_pbp(2020)

data.2020 <- nflreadr::load_pbp(2020)
As you can see, in the first example of code, I call load_pbp() directly without indicating the package like I did in the second example of code, where I did nflreadr::load_pbp(2020).
There is typically no need to call the specific package prior to the function (with the caveat that some packages, such as dplyr, sometimes require it to avoid clashes with identically named functions in other packages). As I mentioned, it is a point of personal preference.
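To make that caveat concrete, here is a small sketch of the package prefix at work. The toy_pbp dataframe is a hypothetical stand-in; the point is only that the prefix pins down which filter() gets called.

```r
library(dplyr)

# Toy stand-in for play-by-play data (hypothetical values)
toy_pbp <- tibble::tibble(
  play_type = c("pass", "run", "pass"),
  air_yards = c(7, NA, 15)
)

# Both the stats package and dplyr export a function named filter().
# Prefixing the call with the package name makes it explicit which one is
# meant, regardless of the order in which packages were loaded.
passes <- dplyr::filter(toy_pbp, play_type == "pass")

nrow(passes)
```

If a call such as filter() ever throws an error that makes no sense for your data, adding the prefix is a cheap first thing to try.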
One of the great things about the nflverse, however, is the ability to quickly load play-by-play data all the way back to the 1999 season. Instead of retrieving data for just one season, like in the above example, it is simple to retrieve multiple seasons of data using the same load_pbp function call, like so:
data <- load_pbp(2000:2020)
In this case, we have retrieved all play-by-play data going back to the 2000 season and ending with data from the 2020 season. The end result is 1,001,722 observations over 372 different variables.
Obviously, 372 variables are a ton of information for anybody to wrap their head around. And, in the case of the data pulled using nflreadr, not all the data is information you are going to regularly use.
Because of this, it is important that you know what is included and what everything means. Luckily, the nflfastR team has put together a table of data field descriptions on their website that goes through each variable and explains exactly what each one is.
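In practice, I find it helps to list the column names and then trim the dataframe down to the handful of variables a question needs. Below is a minimal sketch on a hypothetical five-column stand-in (toy_pbp is my own invention for illustration); the same two calls work on the full play-by-play dataframe.

```r
library(dplyr)

# Toy stand-in with a handful of columns; real play-by-play data has far more
toy_pbp <- tibble::tibble(
  game_id = c("g1", "g1", "g2"),
  posteam = c("KC", "KC", "GB"),
  passer = c("P.Mahomes", NA, "A.Rodgers"),
  air_yards = c(9, NA, 14),
  weather = c("clear", "clear", "snow")
)

# List every column name to see what is available
names(toy_pbp)

# Keep only the columns needed for the question at hand
passing_cols <- toy_pbp %>%
  select(game_id, passer, air_yards)
```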
NFL Analytics: Your First Exploration of the Data
Now that you have the ability to pull the data into your RStudio environment, you can start manipulating the data and finding answers for all those burning NFL analytics questions you have.
For the sake of this tutorial, let’s start with exploring air yards from the 2020 season.
If you are not familiar with advanced NFL statistics, air yards is the distance the quarterback passed the ball from the line of scrimmage to where the receiver either caught, or did not catch, the football. Importantly, air yards do not take into account any yards after catch, meaning it is a fantastic way to start quantifying a quarterback’s true impact on a team’s passing offense.
To start, let’s seek to answer this question: which quarterback, in the 2020 season, had the highest average air yards?
Or, in other words, which quarterback during the 2020 season – on average – threw the longest passes without considering any yards after catch by the wide receiver?
The process of answering this question allows us to explore some of the unique processes behind working with advanced NFL data. To start, let’s go with the bare minimum code required to pull the average air yards for the 2020 season:
data <- load_pbp(2020)

air_yards <- data %>%
  group_by(passer) %>%
  summarize(avg.ay = mean(air_yards))
First, we are creating a new dataframe titled air_yards from the original data dataframe that was created when retrieving the play-by-play data from the 2020 season.
In essence, the “<-” in the code is saying “is.”
In this case, we are essentially saying “air_yards is data” and then using the piping function (“%>%”) to pass further arguments into the newly created air_yards dataframe.
Afterwards, we use the group_by function to group quarterbacks together by name, and then use the summarize function to create a new variable titled avg.ay that is the mean of air_yards within data for each quarterback.
The end result will look like this:
> air_yards
# A tibble: 120 x 2
   passer      avg.ay
   <chr>        <dbl>
 1 A.Dalton        NA
 2 A.Erickson      21
 3 A.Humphries     NA
 4 A.Lee           14
 5 A.McCarron      NA
 6 A.Rodgers       NA
 7 A.Smith         NA
 8 B.Allen         NA
 9 B.Colquitt      NA
10 B.DiNucci       NA
# ... with 110 more rows
There are obviously some significant issues with our data so far. First, we are getting “NA” returned for tons of players. And, more importantly, not all of the players returned are actually quarterbacks. We are seeing punters, running backs, and wide receivers included in the list.
That is because the data does not care what position the player actually is if a pass was attempted by that player. The “NA” issue is caused by trying to create an average when there is data missing within the statistics.
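The “NA” behavior is easy to reproduce on a tiny example. The sketch below uses two made-up passers (QB.One and QB.Two are hypothetical names), one of whom has a single missing air_yards value.

```r
library(dplyr)

# Toy passes (hypothetical values): QB.One has one missing air_yards entry
toy_pbp <- tibble::tibble(
  passer = c("QB.One", "QB.One", "QB.Two", "QB.Two"),
  air_yards = c(5, NA, 10, 20)
)

toy_avg <- toy_pbp %>%
  group_by(passer) %>%
  summarize(avg.ay = mean(air_yards))

# QB.One's average is NA because one of its values is missing;
# QB.Two's average is 15
toy_avg
```

A single missing value is enough to turn a passer's whole average into NA, which is why so many rows in the real output are affected.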
To correct these issues, let’s tackle them one at a time. First, let’s correct the “NA” issue as it is easy to correct.
In the vast majority of cases like this, you can correct the issue by simply instructing R to ignore missing values when calculating the mean. To do so, you simply include na.rm = T (T is shorthand for TRUE) in your summarize line, as follows:
air_yards <- data %>%
  group_by(passer) %>%
  summarize(avg.ay = mean(air_yards, na.rm = T))
As you can see, we have now included na.rm = T within our request to average air_yards for all grouped passers in the data. The end result is much better looking:
# A tibble: 120 x 2
   passer      avg.ay
   <chr>        <dbl>
 1 A.Dalton      6.84
 2 A.Erickson   21
 3 A.Humphries NaN
 4 A.Lee        14
 5 A.McCarron   18
 6 A.Rodgers     7.91
 7 A.Smith       5.10
 8 B.Allen       6.62
 9 B.Colquitt  NaN
10 B.DiNucci     8.31
# ... with 110 more rows
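You can see there are still issues with the math taking place, as we have now switched from various NA values to NaN. The difference between the two is that NA represents missing data, whereas NaN (“not a number”) represents an “impossible value.” In this case, the most likely cause is that every air_yards value for those players is missing, so once the missing values are dropped there is nothing left to average and R ends up dividing zero by zero. A look at the data shows the NaN values are associated with non-quarterbacks.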
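The NA versus NaN distinction can be checked directly in the console. This is a small sketch on hypothetical vectors; the variable names are mine.

```r
# Averages on small hypothetical vectors
with_missing <- mean(c(5, NA))                   # NA: the missing value propagates
missing_dropped <- mean(c(5, NA), na.rm = TRUE)  # 5: the missing value is dropped first

# When every value is missing, nothing is left to average after na.rm
all_missing_avg <- mean(c(NA_real_, NA_real_), na.rm = TRUE)  # NaN

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
is.na(all_missing_avg)
is.nan(with_missing)
```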
However, rather than working out how to get rid of NaN values, we can focus on making sure the list includes just quarterbacks because doing so will inherently remove any other passer (i.e., any non-QB).
To manipulate the data at this point to include just quarterbacks, we have one of two options:
- We can include total pass attempts in summarize and then filter by a self-determined minimum number of passes.
- Or we can pull in roster information from the fast_scraper_roster function, left_join the two dataframes together, and then filter for just the QB position.
For the sake of transparency, let’s look at how to do both. Let’s first start with including total pass attempts and then filtering out those players with a minimal number of attempts (as they likely are not quarterbacks):
air_yards <- data %>%
  group_by(passer) %>%
  summarize(avg.ay = mean(air_yards, na.rm = T),
            attempts = sum(pass_attempt))
NFL Analytics: Filtering Data by Minimum Number of Attempts
What we have done here is summarize a new column into our air_yards dataframe that includes the total number of pass attempts for each player during the 2020 season.
It is important to note that the output is not going to represent the actual number of attempts from a website such as Pro Football Reference because, in this case, we are not being careful to filter out plays with penalties, etc. The data now looks like this:
# A tibble: 120 x 3
   passer      avg.ay attempts
   <chr>        <dbl>    <dbl>
 1 A.Dalton      6.84      357
 2 A.Erickson   21           1
 3 A.Humphries NaN           1
 4 A.Lee        14           1
 5 A.McCarron   18           2
 6 A.Rodgers     7.91       NA
 7 A.Smith       5.10       NA
 8 B.Allen       6.62      149
 9 B.Colquitt  NaN           1
10 B.DiNucci     8.31       49
As you can see, Andy Dalton is listed as having 357 attempts in the 2020 season, which is incorrect. According to Pro Football Reference, Dalton attempted 333 passes in 2020. However, since we are simply using attempts as a “quick and dirty” filtering process, we are not too concerned with pinpoint accuracy here.
To that end, if we were concerned with pinpoint accuracy, we’d be using the load_player_stats function that is typically 100% correct in situations such as this.
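One more detail worth noticing in the output above is that a couple of passers show NA for attempts. That is consistent with sum() behaving like mean(): a single missing value makes the whole total NA. Here is a minimal sketch on hypothetical data (the toy_pbp values and the attempts_raw column name are my own) showing the difference na.rm makes.

```r
library(dplyr)

# Toy stand-in (hypothetical values): QB.One has one missing pass_attempt flag
toy_pbp <- tibble::tibble(
  passer = c("QB.One", "QB.One", "QB.One", "QB.Two", "QB.Two"),
  pass_attempt = c(1, 1, NA, 1, 1)
)

toy_attempts <- toy_pbp %>%
  group_by(passer) %>%
  summarize(
    attempts_raw = sum(pass_attempt),           # NA if any value is missing
    attempts = sum(pass_attempt, na.rm = TRUE)  # missing values are ignored
  )

toy_attempts
```

This matters when filtering, because dplyr's filter() drops rows where the condition evaluates to NA. A passer whose total is NA would not survive a cutoff such as attempts >= 100, so it is worth checking for NA totals before trusting a filtered list.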
After including pass attempts into the dataframe, we can see that a good cutoff point for this exploration is a minimum of 100 pass attempts. Now that we have that number, we simply filter out the dataframe based on that minimum number and, for practical purposes, we will arrange the avg.ay in descending order:
air_yards <- data %>%
  group_by(passer) %>%
  summarize(avg.ay = mean(air_yards, na.rm = T),
            attempts = sum(pass_attempt)) %>%
  filter(attempts >= 100) %>%
  arrange(desc(avg.ay))
The end result is a dataframe that includes 30 quarterbacks and allows us to answer our initial question for this tutorial: which quarterback, in the 2020 season, had the highest average air yards?
# A tibble: 30 x 3
   passer      avg.ay attempts
   <chr>        <dbl>    <dbl>
 1 J.Flacco     10.6       141
 2 J.Hurts       9.20      161
 3 D.Watson      8.92      595
 4 R.Wilson      8.91      638
 5 C.Wentz       8.79      492
 6 L.Jackson     8.63      458
 7 J.Burrow      8.58      433
 8 M.Ryan        8.56      669
 9 R.Tannehill   8.31      534
10 B.Mayfield    8.25      580
# ... with 20 more rows
Believe it or not, among quarterbacks with a minimum of 100 pass attempts in the 2020 season, Joe Flacco had the highest average air yards at 10.6 per attempt.
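The filter-then-sort step can be sketched on its own with a tiny summary table. Everything below is hypothetical, including the player names and the min_attempts variable, which I introduce only so the cutoff is easy to change.

```r
library(dplyr)

# Toy summary table (hypothetical values), standing in for the summarized data
toy_summary <- tibble::tibble(
  passer = c("QB.One", "QB.Two", "P.Punter", "QB.Three"),
  avg.ay = c(7.5, 9.1, 21, 8.2),
  attempts = c(450, 120, 1, 300)
)

min_attempts <- 100  # hypothetical cutoff, matching the one used above

qualified <- toy_summary %>%
  filter(attempts >= min_attempts) %>%  # drops the one-attempt punter
  arrange(desc(avg.ay))                 # highest average air yards first

qualified
```

Keeping the cutoff in a variable makes it painless to rerun the same code with, say, 200 attempts and see how the ranking shifts.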
NFL Analytics: Getting Just QBs via Roster Information
As mentioned, there are two ways to go about “cleaning” the data to include just quarterbacks. In the above method, we used total pass attempts to drop any players who were obviously not quarterbacks.
Now, we will go a different route – getting to the same outcome – but by merging two dataframes together.
To get started, we need to load in the roster information for the 2020 season using the following code:
roster <- nflfastR::fast_scraper_roster(2020)
The output includes information for every player that was on an NFL roster during the 2020 season. By using player IDs, we are able to join the two dataframes we’ve created together. However, prior to doing so, we need to use the decode_player_ids() function within nflfastR:
air_yards <- data %>%
  group_by(passer_player_id) %>%
  summarize(avg.ay = mean(air_yards, na.rm = T)) %>%
  nflfastR::decode_player_ids()  # package prefix needed because nflfastR is not loaded with library()
As well, you can see in the above code, we are now grouping the data by passer_player_id as that is what we will be using to join air_yards to our newly created roster dataframe.
To join, we use the left_join function and instruct R to match the passer_player_id in the air_yards dataframe to the equivalent in the roster dataframe, which is the column titled gsis_id, and then filter the data to include just those players with “QB” listed as their position:
air_yards <- air_yards %>%
  left_join(roster, by = c("passer_player_id" = "gsis_id")) %>%
  filter(position == "QB")
Afterward, you have a dataframe that includes just quarterbacks, regardless of the total number of pass attempts during the 2020 season.
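If the join syntax feels opaque, here is a self-contained sketch of the same pattern on two tiny dataframes. The IDs, names, and the full_name column are hypothetical; only the passer_player_id, gsis_id, and position column names come from the code above.

```r
library(dplyr)

# Toy summary keyed by player ID (hypothetical IDs and values)
toy_air_yards <- tibble::tibble(
  passer_player_id = c("00-001", "00-002", "00-003"),
  avg.ay = c(8.1, 21, 7.4)
)

# Toy roster using a differently named ID column
toy_roster <- tibble::tibble(
  gsis_id = c("00-001", "00-002", "00-003"),
  full_name = c("Quarterback One", "Punter Two", "Quarterback Three"),
  position = c("QB", "P", "QB")
)

toy_qbs <- toy_air_yards %>%
  # Left side of "=" is the key in toy_air_yards, right side is the key in toy_roster
  left_join(toy_roster, by = c("passer_player_id" = "gsis_id")) %>%
  filter(position == "QB")

toy_qbs
```

Note that the joined result keeps the key name from the left-hand dataframe (passer_player_id), so gsis_id does not appear as a separate column.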
Getting Started with NFL Analytics: Concluding Thoughts
Once you learn the basic process of gathering data and summarizing it down to the material you want, the process of doing NFL analytics with R and RStudio is easy to grasp.
Because of the breadth and depth of the data provided by nflreadr, there are endless possibilities that you can explore.
In the above example, we explored average air yards for quarterbacks in the 2020 season. However, what if you wanted to explore air yards for just completed passes? It is as simple as making a few adjustments to the code we’ve been using above:
air_yards <- data %>%
  group_by(passer) %>%
  filter(complete_pass == 1) %>%
  summarize(avg.ay = mean(air_yards, na.rm = T),
            completions = n()) %>%
  arrange(desc(avg.ay))
We have now included an additional filter for complete_pass where the numerical “1” indicates that the play was, indeed, a complete pass. As well, we created a new column called “completions” and again arranged the information in descending order. The output is different than our original study:
> air_yards
# A tibble: 37 x 3
   passer        avg.ay completions
   <chr>          <dbl>       <int>
 1 D.Watson        7.45         382
 2 M.Ryan          7.22         407
 3 R.Tannehill     7.16         333
 4 T.Brady         7.13         482
 5 D.Prescott      6.93         151
 6 M.Stafford      6.91         339
 7 B.Mayfield      6.90         349
 8 J.Allen         6.75         473
 9 R.Fitzpatrick   6.75         183
10 K.Cousins       6.67         349
# ... with 27 more rows
You can see Flacco is no longer even in the top 10 and has been replaced by Deshaun Watson as the 2020 leader in average air yards if we include just those pass attempts that were completed.
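If you would rather see both numbers side by side, one option is to compute them in a single summarize by subsetting air_yards inside mean(). This is just a sketch of that idea on hypothetical data; the avg.ay.all and avg.ay.completed column names are my own.

```r
library(dplyr)

# Toy passes (hypothetical values)
toy_pbp <- tibble::tibble(
  passer = c("QB.One", "QB.One", "QB.One", "QB.Two", "QB.Two"),
  air_yards = c(20, 4, 6, 10, 30),
  complete_pass = c(0, 1, 1, 1, 0)
)

toy_compare <- toy_pbp %>%
  group_by(passer) %>%
  summarize(
    # Average over every attempt
    avg.ay.all = mean(air_yards, na.rm = TRUE),
    # Average over completed passes only, via logical subsetting
    avg.ay.completed = mean(air_yards[complete_pass == 1], na.rm = TRUE),
    # Number of completed passes
    completions = sum(complete_pass == 1, na.rm = TRUE)
  )

toy_compare
```

In this toy data, QB.One averages 10 air yards on all attempts but 5 on completions, which is the same kind of gap that pushes a deep thrower down the completed-pass ranking.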
As you can see, the world of NFL analytics can be your personal playground once you understand the basics of R and RStudio. That said, there is a lot more to learn.
As I continue to create more long-form tutorials on using nflreadr to do NFL analytics with R and RStudio, I will continue to link them below.
That said, please feel free to reach out to me if you have any questions or need assistance with any part of getting started on your journey with R and analytics.
More NFL Analytics Tutorials Using R and RStudio
- A Beginner’s Guide to NFL Analytics: Getting Started with nflfastR and RStudio
- A Beginner’s Guide to NFL Analytics: Take NFL Data and Visualizing It Using ggplot2
- Computing Player Performance Percentiles Using Scraped Data