Web Scraping with rvest: Exploring Sport Industry Jobs

Web scraping with rvest is easy and, surprisingly, comes in handy in situations that you may not have thought of.

For example, one of the unique things about academics is the constant need to stay “ahead of the curve,” meaning being nimble enough as a program to shift curriculum around to provide students training and education in those areas that are in demand within the industry.

In my specific case, I teach sport analytics and data science within my department. And it is no secret that sport analytics is an increasingly popular – and growing – field in need of properly trained graduates.

That is why I have both an R Tutorial for Beginners and an Introduction To NFL Analytics series on my website. It is clear they are growing fields of need.

But, what happens when things are not so clear and obvious?

Case in point: I am currently serving on a “task force” within my academic department that is charged with redefining the “future of our program.” In other words: creating a five-year plan that aligns our program with the needs of the industry our graduating students enter.

We want our students to have the necessary skills and education required to be as competitive on the job market as possible. And, as if the job market was not tough enough, the COVID-19 pandemic has only made it more difficult for graduates.

Because of this, we – as a committee – have devised multiple ways to “survey” the industry to see where it is heading in terms of popular and in-demand jobs.

A quick way to get a “broad” view of this is by simply scraping online job posting sites, such as TeamWork Online or the NCAA Job Market, using the rvest package in R.

Fortunately, the rvest package makes scraping these websites relatively easy. And, typically, the code used is easy to edit – meaning you usually just have to change the URL structure and the sequencing around to make it work between sites.
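To sketch that reuse, the only pieces that typically change between sites are the base URL and the page range. The helper function below is ours (not part of rvest), and the second base URL is a made-up placeholder for illustration:

```r
# helper that pastes a page number onto a site's base URL;
# the function name is ours, not part of rvest
build_page_urls <- function(base_url, pages) {
  paste0(base_url, pages)
}

# the real NCAA Job Market structure used in this post
ncaa_urls <- build_page_urls("https://ncaamarket.ncaa.org/jobs/?page=", 1:7)

# a hypothetical second site: only the base URL and page range change
other_urls <- build_page_urls("https://example.com/jobs?p=", 1:3)

ncaa_urls[2]   # "https://ncaamarket.ncaa.org/jobs/?page=2"
```

Everything else in the script – the node selection, the text cleanup, the binding – stays essentially the same from site to site.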

Let’s take a look at how to scrape, for example, the NCAA Job Market website.

Scraping with rvest: Setting Up a Data Frame

The first step in the web scraping process is setting up a data frame where all the information will be stored.

It does take a little bit of forward-thinking in order to do this correctly.

For example, let’s take a look at the NCAA Job Market website:

When first examining the website you are scraping with rvest, you need to consider the exact information you want to collect.

What would be beneficial? What would provide insightful data?

In this case, I see three things I want to scrape:

  1. the title of the job itself
  2. the institution that is hiring
  3. and the location of the job

Because of that, I need to create a data frame that includes those three variables. Doing so in R is simple:

listings <- data.frame(title=character(),
                       school=character(), 
                       location=character(), 
                       stringsAsFactors=FALSE) 

Once you create your dataframe, it is time to start constructing the script that will actually pull the information off the site. The first step is developing the sequencing and ensuring that you provide the correct URL structure.

Scraping with rvest: Sequencing and URL Structure

If you visit the NCAA Job Market at the time I am writing this post, you will see that there are currently seven pages of jobs, with 25 jobs posted per page.

You, of course, want to grab all that information beyond just the first page. In order to do this, you have to instruct rvest on how the URLs are structured on the website.

If you click ahead to page 2 of the NCAA Job Market, you will see in your browser that the URL is structured as such:

https://ncaamarket.ncaa.org/jobs/?page=2

With that in mind, the code starts like this:

for (i in 1:7) {
  url_ds <- paste0("https://ncaamarket.ncaa.org/jobs/?page=", i)

Basically, you are instructing R to continuously “paste and go” the URL structure, running through the numbers 1 through 7 after page= on each pass of the loop.

As it does so, it pulls the information off of all seven pages.

Figuring out the URL structure is, honestly, the trickiest part of scraping with rvest.

It takes a little trial and error sometimes to figure the correct sequencing out. But, once you do it enough times, the process of piecing together the puzzle becomes easier.

Once you get this part sorted out, you can move on to pulling the information for all the variables we listed above (again: title, school, location).

Scraping with rvest: Pulling the Variables

At this point, the last thing you need to do is instruct rvest where exactly the information you are looking for is located on the site.

To better understand this, let’s look at the code:

#read the page once, then pull each piece from it
page <- read_html(url_ds)

#job title
title <- page %>%
  html_nodes('#jobURL') %>%
  html_text() %>%
  str_extract("(\\w+.+)+")   #strips the leading whitespace/newlines

#school
school <- page %>%
  html_nodes('.bti-ui-job-result-detail-employer') %>%
  html_text() %>%
  str_extract("(\\w+).+")

#location
location <- page %>%
  html_nodes('.bti-ui-job-result-detail-location') %>%
  html_text() %>%
  str_extract("(\\w+).+")

As you can see, you are using rvest to read the HTML of the URL you provided.

The most important part here, though, is the html_nodes section.

It is here that you tell rvest where to look for the information.

To get this yourself, you first need to install the Chrome widget called SelectorGadget.

Once you do that, visit the website, turn on SelectorGadget, and click on the information you want to scrape with rvest. In the SelectorGadget toolbar, it will tell you the html_node that you clicked on.

In this case, SelectorGadget is telling me that the title of the job sits in the element with the ID jobURL (the # prefix in a CSS selector denotes an ID rather than a tag name).

I take that information and simply insert it into the html_nodes section of the code.

And then do the same for school and location.
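Before looping over the live pages, it can help to sanity-check that a selector and the str_extract() cleanup behave as expected. This is a minimal sketch against a made-up inline HTML fragment that echoes the node names SelectorGadget reported; it is not the site's actual markup:

```r
library(rvest)
library(stringr)

# made-up HTML fragment mimicking the selectors used in this post
snippet <- '<div>
  <a id="jobURL">
    Assistant Coach</a>
  <div class="bti-ui-job-result-detail-employer">
    Example University</div>
</div>'

page <- read_html(snippet)

# same pipeline as the scraper: select the node, grab its text,
# and strip the leading whitespace with str_extract()
title <- page %>%
  html_nodes('#jobURL') %>%
  html_text() %>%
  str_extract("(\\w+.+)+")

title   # "Assistant Coach"
```

If the selector is wrong, html_nodes() simply returns zero matches, which is usually the first thing to check when a column comes back empty.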

Once you do all of that, the last thing you need to do is use the rbind() function to bind each page’s results into the data frame. After doing so, the complete code looks like this:

library(rvest)
library(stringr)

listings <- data.frame(title=character(),
                       school=character(),
                       location=character(),
                       stringsAsFactors=FALSE)

for (i in 1:7) {
  url_ds <- paste0("https://ncaamarket.ncaa.org/jobs/?page=", i)

  #read each page once, then pull every piece from it
  page <- read_html(url_ds)

  #job title
  title <- page %>%
    html_nodes('#jobURL') %>%
    html_text() %>%
    str_extract("(\\w+.+)+")

  #school
  school <- page %>%
    html_nodes('.bti-ui-job-result-detail-employer') %>%
    html_text() %>%
    str_extract("(\\w+).+")

  #location
  location <- page %>%
    html_nodes('.bti-ui-job-result-detail-location') %>%
    html_text() %>%
    str_extract("(\\w+).+")

  #data.frame() keeps the columns as character vectors;
  #as.data.frame(cbind(...)) would have turned them into factors
  #in older versions of R
  listings <- rbind(listings, data.frame(title, school, location,
                                         stringsAsFactors=FALSE))
}

Lastly, if you want to visualize the information, a word cloud is a good place to start the exploration process:

#note: these arguments appear to match the rquery.wordcloud() helper from
#STHDA's word cloud tutorial rather than the base wordcloud() function from
#the wordcloud package, so that helper needs to be sourced first
wordcloud(paste(listings$title), type="text", 
          lang="english", excludeWords = c("experience","will","work"),
          textStemming = FALSE,  colorPalette="Paired",
          max.words=5000)

Scraping with rvest: Conclusion

As mentioned, the process of web scraping with rvest is not overly difficult.

Once you figure out the URL structure, the rest kind of falls into place. Of course, the use of SelectorGadget makes it even easier since you do not have to manually dig through the HTML to find where the information you want to scrape is nested.

As for the information gathered from the NCAA Job Market, it should not be surprising that coaching is an in-demand job. Specifically, it looks like Assistant Women’s Coach positions are especially in demand.

Doing the above process on other sites, such as Indeed or TeamWork Online, yielded vastly different results, however.

You have to keep in mind there is a limited amount of data on the NCAA Job Market – just under 200 jobs at the time of writing this post.

On the other hand, a search for “sports” on Indeed returns over a thousand results. TeamWork Online has nearly 700 jobs posted.

So, as you can imagine, the word clouds and the information you can produce from scraping those sites with rvest are a bit broader than the NCAA Job Market.

All said, though, the process of web scraping with rvest can quickly lead to some broad, overarching results that can help guide you to a more nuanced discussion.
