This first chapter in my R Tutorial for Beginners takes the first look at the R programming language, and the associated RStudio software. To quickly summarize for absolute beginners, R is an incredibly flexible and easy-to-learn programming environment that is now widely used in both professional and academic settings.
In fact, R was the first programming language I learned and it is still the one that I use most often. That is not to say that Python, Julia, and other programming languages are “bad.” Because they are not. It is likely that, in any data science role, you will find R, Python, and Julia being used in concert with one another.
However, because it is impractical to expect you to learn all three at one time, and because I love the R programming language, we focus solely on R and RStudio on this blog.
Because this is an introductory piece in my comprehensive R tutorial for beginners, be sure to check out the ever-growing list of continuous tutorials at the bottom of this “lesson.” As well, if there is ever a reason to jump ahead to look at a future lesson, I will be sure to link directly to it.
I have found that, often, when beginners are learning the R programming language, being able to put the bigger picture into context is helpful. For example, I eventually discuss the structure of your dataframes and whether they should be in long or wide format.
That might not make sense in the specific context of that R for beginners tutorial. But, if I give you a “sneak peek” ahead to show you why data in long format is vitally important for visualization purposes, it may help you wrap your head around things.
That said: for this introductory piece into the R programming language, we are going to cover several highly important things that are necessary for you to know before we move on to more “exciting” stuff in later tutorials.
Installing R and RStudio
In order to start your journey with the R programming language, you will need to download two different things: R and RStudio.
R is the actual language that is installed onto your computer, while RStudio is the graphical interface that makes the learning process much easier. For you “techies” out there, RStudio is the IDE that operates R.
To download R, head on over to R-Project.org. Once there, you can select the option – in the left toolbar – to download R from CRAN. On the CRAN page, there are dozens of mirrors to pick from. Simply pick a mirror in your country – or the nearest one to you – to begin the download.
Once complete, simply open up the installer and run through the process. No need to do any sort of manual or advanced installation process.
After, you can head to RStudio.com. Once there, select Product from the top menu and select RStudio. When there, scroll down just a tad and select RStudio Desktop and then the Open Source Edition.
It is important to note that there is a paid version of RStudio, but there is no need to worry about that. The open-source edition is absolutely free and is perfectly fine for, like, 99% of all people who get into data science and the R programming language.
Once you run through the installation process for RStudio, you are prepared to officially start your R programming journey.
An Introduction to RStudio
At this point, you are welcome to open RStudio for the first time.
Once you have RStudio open for the first time, you are going to see four different windows (or panes … they mean the same thing):
- The source window/code editor
- The console window
- The environment/history window
- The files/plots/packages/help window
Before we go any further, it is important to discuss what each of these windows is for and how you will be using them as we go forward in this lesson.
The source window in RStudio is where you will write and edit all of your code. If you were to download a chunk of R programming code from somewhere, it would download in a file with a .R extension and, when selected, would open in the source window. Each and every time that you open RStudio, the source window will provide you a blank slate to work with.
All that said, you will be spending a lot of time staring at the source window.
The console window in RStudio is at the bottom left. It is arguably the second most important window in the program (next to the source window) as it is where R ultimately “runs” your code and outputs the information. In other words, when you run anything you typed into the source window, the results will show in the console.
Importantly, the output in the console window will also include any errors that occur when running your code, allowing you to correct problems as you go along.
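To make that concrete, here is a minimal sketch you could type into the source window and run. The vector name x and the misspelled function maen are made up for illustration, and try() is only there so the script keeps going after the error is reported.

```r
# A small numeric vector (the name x is just an example)
x <- c(2, 4, 6)

# The result of this line is printed in the console
mean(x)

# A typo in a function name: the console reports an error instead of a result.
# try() lets the rest of the script continue after the error is shown.
try(maen(x))
```

The error message tells you which function R could not find, which is usually enough to spot the typo.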
The environment/history window in RStudio lists all of your data objects (such as vectors, matrices, and dataframes). For dataframes, it also shows the number of observations (rows) and the number of variables (columns).
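As a rough illustration, running something like the following adds two objects to the environment window. The names ages and people and their values are hypothetical, chosen only for this sketch.

```r
# A numeric vector: it is listed in the environment window once created
ages <- c(23, 35, 41)

# A small dataframe with three observations (rows) and two variables (columns)
people <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age  = ages
)

nrow(people)  # number of observations (rows)
ncol(people)  # number of variables (columns)
ls()          # names of the objects currently in your environment
```

The counts reported by nrow() and ncol() are the same ones RStudio displays next to the dataframe.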
Lastly, the files/plots/packages/help window is located at the bottom right and serves a multitude of purposes. The files panel allows you to access files within directories on your hard drive. The plots panel will be where your visualizations output as you create them. The packages panel will list all of the packages currently installed on your computer and will also indicate whether or not they are currently loaded into your RStudio environment. The help panel is used when you hit the “F1” key when typing a function in the console window.
While that may seem like a bit of a whirlwind, I promise that it will make more sense as you progress through my R Tutorial for Beginners series. In fact, at the end of this tutorial for beginners, we will practice inputting some simple code into the source window so you can see how it spits out the results into the console window.
Installing Your First Package: tidyverse
Now it is time for a little bit of fun … installing your first package in RStudio.
First, packages are collections of R coding functions, data, and/or other types of compiled code that are structured in a well-defined format, with the intent to provide specific functionality as soon as you install them.
For example, when I am doing visualizations, I make use of the scales package to make the process of dealing with scales on the x- and y-axis much easier.
Right now, I am going to have you install the tidyverse package onto your computer and into RStudio.
Why the tidyverse? Because it is an absolutely essential collection of packages for data science. Its core packages include ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
This is not the time nor the place in this R Tutorial for Beginners to go through each one and explain what they do. If you continue on through this series, I promise we will interact with each and every one.
The process of installing packages is actually very simple in RStudio. To install, you want to use the install.packages function with the name of the package inside quotation marks.
However, once you install the package, it isn’t quite ready for use. You also have to load it, which requires the use of the library function but, this time, without the name of the package within quotation marks.
The complete process is as follows below. To “run” the code, you can either place your cursor at the end of a line and then press Ctrl+Enter, or highlight everything and then click “Run” at the top of the source window in RStudio.
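Here is a minimal sketch of those two steps. The requireNamespace() check is my own addition, not a required part of the process: it simply skips the download if the tidyverse is already on your computer.

```r
# Step 1: install the package (name in quotation marks).
# Installing is only needed once per computer, so skip it if already present.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Step 2: load the package (no quotation marks needed).
# Loading has to be done at the start of every new R session.
library(tidyverse)
```

If you prefer, running install.packages("tidyverse") on its own works too; it just downloads the package again each time.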
And just like that, you have installed your first package (and actually done your first bit of “coding”).
As well, take a look at your console screen. You will notice that, after running the library function from above, there is a lot of information in there regarding the installation and loading of the tidyverse package.
Now, let’s move on to a bit of “real” coding to end this first article in my R Tutorial for Beginners.
Introduction to Coding in RStudio
It is at this point that lots of R Tutorials for Beginners have you coding basic arithmetic in the RStudio source window.
That’s super boring.
Instead, we are going to explore an actual dataframe so you can see what coding in R is capable of and to hopefully excite you to keep working on it.
To do so, we are going to use the nycflights13 package. It is a data package that contains information about all flights that departed from New York City to destinations in the United States, Puerto Rico, and the American Virgin Islands in all of 2013.
That is equal to 336,776 flights.
To get started, let’s install and load the package using the same method we used for the tidyverse.
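A minimal sketch of that step, following the same install-then-load pattern. As before, the requireNamespace() check is an optional addition of mine that avoids reinstalling a package you already have.

```r
# Install nycflights13 only if it is not already on this computer
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}

# Load it so that the flights dataframe becomes available
library(nycflights13)
```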
Now that it is installed and ready to go, simply type flights into your source window and run it. Below, in your console, should be the following output:
> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
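What you just printed is called a tibble, which is the tidyverse version of a dataframe. When printed, it shows only the first ten rows and the columns that fit on your screen: a “sneak peek,” if you will, at the data.
Across the top are all of the different variables you can use to explore the data, including:
- departure time
- scheduled departure time
- departure delay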
- arrival time
- scheduled arrival time
- arrival delay
- flight number
- tail number
- total air time
As you can imagine, there is a lot a person can explore across that many variables.
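If you want a fuller look at those variables than the printed tibble gives you, a few standard functions can help. This is just a sketch, and it assumes the tidyverse and nycflights13 packages are already installed.

```r
library(tidyverse)
library(nycflights13)

class(flights)    # includes "tbl_df", which marks the object as a tibble
dim(flights)      # number of rows, then number of columns
names(flights)    # every column name, including the four cut off in the printout
glimpse(flights)  # one line per variable, with its type and first few values
```

Of these, glimpse() is probably the handiest, since it turns the wide printout on its side so that no variable gets hidden.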
So, let’s imagine you want to view just those flights that took to the air on January 1, 2013.
To effectively “pull” those out of the data, we are going to use the filter function, as follows:
filter(flights, month == 1, day == 1)
Afterward, another tibble will output into your console that shows each and every flight that took to the air during that specific month and day.
In the above code, we called the filter function and provided the dataframe name, followed by the conditions to filter on. The filter function is part of the dplyr package, which is housed within the tidyverse, and it works with several comparison and logical operators. For example:
|dplyr package|Operator|
|For “less than”|<|
|For “greater than”|>|
|For “less than or equal to”|<=|
|For “greater than or equal to”|>=|
|For “does not equal”|!=|
|For “and”|& or ,|
In the filter example above, you could have used either “&” or “,” to separate the two conditions you were filtering for.
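To see a few of these operators in action, here is a short sketch. The object names (jan1_comma, jan1_and, very_late, not_jfk) and the 120-minute cutoff are my own choices for illustration, and the code assumes both packages are installed.

```r
library(tidyverse)
library(nycflights13)

# "," and "&" both mean "and", so these two calls keep the same flights
jan1_comma <- filter(flights, month == 1, day == 1)
jan1_and   <- filter(flights, month == 1 & day == 1)
nrow(jan1_comma) == nrow(jan1_and)

# "greater than or equal to": flights that arrived at least 120 minutes late
very_late <- filter(flights, arr_delay >= 120)

# "does not equal": every flight that did not leave from JFK
not_jfk <- filter(flights, origin != "JFK")
```

Saving each result with <- puts the new tibble in your environment window instead of only printing it in the console.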
How about filtering by the destination of the flights? Since Texas has two major international airports in Houston, let’s do a search that filters for just those two locations (IAH and HOU). To do so, I am going to introduce a new item (%in%) as seen below:
filter(flights, dest %in% c("IAH", "HOU"))
The output in your console window will now show all the information for just those flights that landed at either IAH or HOU.
In the above code, the “%in%” piece of the code is essentially saying “is one of.” And, when combining multiple values as we did, you must wrap them in c(), with the values inside the parentheses.
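Here is a small sketch of what %in% and c() are doing on their own, followed by the same Houston filter written the longer way with “or” (the | operator). The names houston, with_in, and with_or are hypothetical, and the code assumes both packages are installed.

```r
library(tidyverse)
library(nycflights13)

# c() combines several values into a single vector
houston <- c("IAH", "HOU")

# %in% asks: is the value on the left one of the values on the right?
"IAH" %in% houston
"ORD" %in% houston

# The Houston filter written two ways: with %in% and with "or" (|)
with_in <- filter(flights, dest %in% houston)
with_or <- filter(flights, dest == "IAH" | dest == "HOU")
nrow(with_in) == nrow(with_or)
```

With only two airports the difference is small, but %in% saves a lot of typing once the list of values gets longer.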
Now, how about a quick little visualization to keep you interested in continuing with this R Tutorial for Beginners?
To get started on it, let’s put together the data:
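# Group the flights by destination
by_dest <- group_by(flights, dest)

# For each destination: number of flights, average air time, average arrival delay
delay <- summarize(by_dest,
  count = n(),
  dist = mean(air_time, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))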
I am not going to get “too into the weeds” here, as we will be going over this stuff in later segments of this R Tutorial for Beginners.
But, in short, several things are happening here:
- we are creating a new dataframe called by_dest that groups the flights together by destination.
- we are creating a new dataframe called delay that counts the flights to each destination, and then finds the average air_time and the average arrival delay for each.
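If it is hard to picture what grouping and summarizing do to 336,776 rows, it may help to try the same two steps on a tiny dataset. The toy tibble below is entirely made up for illustration (five pretend flights to two destinations); only the functions are the same as above.

```r
library(tidyverse)

# Hypothetical mini-dataset: five flights, two destinations, one missing air_time
toy <- tibble(
  dest      = c("IAH", "IAH", "IAH", "MIA", "MIA"),
  air_time  = c(220, 230, NA, 150, 160),
  arr_delay = c(10, 20, 30, -5, 5)
)

# Step 1: group the rows by destination
toy_by_dest <- group_by(toy, dest)

# Step 2: collapse each group down to a single summary row
toy_summary <- summarize(
  toy_by_dest,
  count = n(),                            # number of flights in the group
  dist  = mean(air_time, na.rm = TRUE),   # na.rm = TRUE ignores the missing value
  delay = mean(arr_delay, na.rm = TRUE)   # average arrival delay in the group
)

toy_summary
```

The result has one row per destination. Without na.rm = TRUE, the single missing air_time would turn the IAH average into NA.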
Once we have that put together, we can use the ggplot function from the ggplot2 package to visualize the data:
ggplot(data = delay, aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
And just like that, you have created your first visualization in RStudio!
I think it is safe to say that, if you are a beginner to R and RStudio, you took a pretty big leap in knowledge in this tutorial.
We went from installing the software to quickly plotting out your first visualization.
However, there is a lot more to learn and I hope you will hang around in this R Tutorial for Beginners. As I continue to write this series, the step-by-step tutorials will be linked right below.