In a previous article on linear regression, we went through an example of using it to predict an outcome from an input variable. In this article, we’re going to use linear regression to do the same thing for a more complicated dataset: one detailing the crime rate in New York City between 2003 and 2012, which we’ll use to make a machine-learning-based prediction.

To create a multiple linear regression prediction in RStudio, first open RStudio. If you are new to R (or RStudio), feel free to check out my blog post “An R Tutorial for Beginners: First Steps”.

Once you have opened RStudio, click on “File” and choose “Open…”. Navigate to the folder where your data is saved and open it. You can now begin working with your data as needed.

## Linear Regression Prediction: Splitting The Data

First, let’s begin by splitting our dataset into two parts: one part will be used for training, and the other will be used for testing.

We split the data into these two sets so that we can measure how well the model predicts values it has not seen during training (for example, the crime rate in 2012). A model that is only checked against its own training data will look better than it really is.

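We want our testing set to resemble our training set, which is why the rows are assigned at random, so that the error we measure on the testing set is a fair estimate of how the model will perform on new data. A good rule of thumb is to take about 70% of your dataset and use it for training and the other 30% for testing.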

Our dataset contains a predictor variable called “Population” and a response variable called “Crime”. We will be using this data to predict crime rates in 2012 based on population sizes in 2003.

So, let’s load the data into RStudio by clicking on “File”, then “Import Dataset”, and choosing “From Text (base)…”. Select your .csv file and click “Open”. Note: If you downloaded the data from your browser, make sure that it was saved as a .csv file, or else this import option won’t read it correctly.

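Also, check that the file is not encoded in something unusual (like UTF-16), because such files often fail to import or come in with garbled column names. If you cannot re-save the file as UTF-8, you can tell R which encoding to expect when reading it.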
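If you prefer the console to the menus, the same import can be sketched with read.csv. The file below is a tiny made-up stand-in written to a temporary location, and the “Year” column is an assumption; only “Population” and “Crime” are named in this article. Replace csv_path with the path to your own file.

```r
# Write a tiny illustrative CSV to a temporary file.
# These values are made up for the example; they are NOT the real crime data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,Population,Crime",
             "2003,8100000,5200",
             "2004,8150000,5100",
             "2005,8200000,5000"), csv_path)

# read.csv() is the console equivalent of the import dialog.
# fileEncoding tells R how the bytes in the file are encoded.
crime_data <- read.csv(csv_path, fileEncoding = "UTF-8",
                       stringsAsFactors = FALSE)

str(crime_data)   # column names and types
head(crime_data)  # first few rows
```

For a UTF-16 file you would pass fileEncoding = "UTF-16" instead.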

In our case, we’re going to split our data into training and testing sets by using the createDataPartition function from the caret package.

We’re going to need three pieces of information: the response column of our data frame, the proportion of rows we want in the training set, and a name for each resulting set.

The proportion, passed as the argument p, goes between 0.0 and 1.0, where 0 means none of the rows go to training and 1 means all of them do (you would rarely want either extreme). For the 70/30 split described above, use p = 0.7. The function returns the row numbers of the training set; every remaining row belongs to the testing set.
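Here is a minimal sketch of a 70/30 split. To keep it runnable without installing anything, it uses base R’s sample() rather than caret, and the data frame is made up for illustration. The name training_set is an assumption; testing_set_2 matches the name used later in this article.

```r
set.seed(42)  # make the random split reproducible

# Made-up stand-in for the real dataset (20 rows, two columns).
crime_data <- data.frame(
  Population = seq(8.0e6, 8.4e6, length.out = 20)
)
crime_data$Crime <- 9000 - 0.0005 * crime_data$Population +
  rnorm(20, sd = 50)

# Randomly pick 70% of the row numbers for training.
n <- nrow(crime_data)
train_rows <- sample(seq_len(n), size = round(0.7 * n))

training_set  <- crime_data[train_rows, ]   # rows used to fit the model
testing_set_2 <- crime_data[-train_rows, ]  # all remaining rows

nrow(training_set)
nrow(testing_set_2)

# With caret installed, the row selection could instead be written as:
# train_rows <- caret::createDataPartition(crime_data$Crime, p = 0.7,
#                                          list = FALSE)
```

Setting the seed matters: without it, every run produces a different split and slightly different results.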

Then, let’s train our model by calling the *lm* function from the stats package, which ships with every installation of R (RStudio is only the editor on top of it).

The *lm* function itself only needs the first two arguments below. The last two apply if you fit the same model through caret’s *train* function, which calls *lm* for you and adds resampling:

**formula:** This names our response variable (our crime rates) and our predictor variable (our populations), separated with a ~ symbol. The response goes on the left and the predictor on the right: Crime ~ Population.

**data:** This is the name of our data frame that we loaded into RStudio before. Because the data frame is supplied here, the formula refers to columns by their bare names; you do not need to put a $ in front of each column. You can use str(my_dataframe) or names(my_dataframe) to see what your dataset looks like if you’re unsure which columns there are.

**method:** What kind of model do we want? In *train*, we use “lm”, which stands for linear model. (The *lm* function has its own method argument with a sensible default, so you can leave it alone there.)

**trControl:** This specifies how *train* resamples the training data to estimate accuracy. It is built with the trainControl function, for example trainControl(method = “boot”) for bootstrap samples or trainControl(method = “LOOCV”) for leave-one-out cross-validation. Only one resampling method can be chosen per call, so pick the one that suits the size of your dataset rather than trying to combine several.
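As a minimal sketch, fitting the model with *lm* looks like this. The training data is made up for illustration, and training_set is an assumed name; fit_lm matches the name used in the rest of this article.

```r
set.seed(42)

# Made-up training data standing in for the real training set.
training_set <- data.frame(
  Population = seq(8.0e6, 8.4e6, length.out = 14)
)
training_set$Crime <- 9000 - 0.0005 * training_set$Population +
  rnorm(14, sd = 50)

# Response on the left of ~, predictor on the right.
# Columns are referenced by bare name because data = is supplied.
fit_lm <- lm(Crime ~ Population, data = training_set)

coef(fit_lm)     # intercept and slope
summary(fit_lm)  # standard errors, R-squared, and so on

# With caret installed, a comparable fit with bootstrap resampling
# could be written as:
# fit_caret <- caret::train(Crime ~ Population, data = training_set,
#                           method = "lm",
#                           trControl = caret::trainControl(method = "boot"))
```

With a single predictor, the fitted model has two coefficients: the intercept and the slope for Population.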

Then, let’s call the predict function on the model returned by train_. Unlike the fitting step, predict does not take a formula or any training options. It needs only two things: the fitted model and a newdata data frame that contains the predictor columns.

Now let’s run predict(fit_lm, newdata = testing_set_2) where fit_lm is the name of our model that was returned from calling on train_.

You can then look at our testing set by using testing_set_2$Crime again just to ensure it looks correct.
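Putting the steps so far together, a prediction run might look like the sketch below. The data is made up, and the split uses base R’s sample(); only fit_lm, testing_set_2, and the column names come from this article.

```r
set.seed(42)

# Made-up stand-in for the real dataset.
crime_data <- data.frame(
  Population = seq(8.0e6, 8.4e6, length.out = 20)
)
crime_data$Crime <- 9000 - 0.0005 * crime_data$Population +
  rnorm(20, sd = 50)

# 70/30 split.
train_rows    <- sample(seq_len(nrow(crime_data)), size = 14)
training_set  <- crime_data[train_rows, ]
testing_set_2 <- crime_data[-train_rows, ]

# Fit on the training rows only.
fit_lm <- lm(Crime ~ Population, data = training_set)

# predict() takes the fitted model and the new data, nothing else.
predicted_crime <- predict(fit_lm, newdata = testing_set_2)

predicted_crime       # one prediction per testing row
testing_set_2$Crime   # the actual values for the same rows
```

The result is a numeric vector with one prediction for each row of testing_set_2, in the same order.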

**Note:** We’re going to use these new predictions to compare with actual crime rates for 2012 (which you’ll see later).

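If the model fits the data well, these new predictions should be close to the actual crime rates for 2012. How close depends on how strongly population explains crime, not only on whether the code is free of mistakes.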

Now, let’s test our model by looking at the actual crime rates for 2012 using testing_set_2$Crime. Then, let’s compare our predicted crime rate with the actual crime rate for 2012 by putting testing_set_2$Crime next to predict(fit_lm, newdata = testing_set_2), for example as two columns of a data frame.

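We can then see how accurate our model is by computing the mean error: mean(testing_set_2$Crime - predict(fit_lm, newdata = testing_set_2)). Note that the subtraction goes inside mean(), so that we average the individual errors rather than subtracting each prediction from a single average.

A mean error close to 0 means our predictions are not systematically too high or too low. Because positive and negative errors cancel each other out, it is worth also checking the root mean squared error, sqrt(mean((actual - predicted)^2)). If that value is small relative to the crime rates themselves, the linear regression prediction we created is quite good.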
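The comparison can be sketched end to end as follows. The data is again made up, so the printed numbers say nothing about the real crime dataset; the variable names other than fit_lm and testing_set_2 are assumptions.

```r
set.seed(42)

# Made-up stand-in for the real dataset.
crime_data <- data.frame(
  Population = seq(8.0e6, 8.4e6, length.out = 20)
)
crime_data$Crime <- 9000 - 0.0005 * crime_data$Population +
  rnorm(20, sd = 50)

train_rows    <- sample(seq_len(nrow(crime_data)), size = 14)
training_set  <- crime_data[train_rows, ]
testing_set_2 <- crime_data[-train_rows, ]

fit_lm    <- lm(Crime ~ Population, data = training_set)
actual    <- testing_set_2$Crime
predicted <- predict(fit_lm, newdata = testing_set_2)

# Side-by-side view of actual and predicted values.
comparison <- data.frame(actual = actual, predicted = predicted)
comparison

errors     <- actual - predicted
mean_error <- mean(errors)         # signed: reveals systematic bias
mae        <- mean(abs(errors))    # average size of an error
rmse       <- sqrt(mean(errors^2)) # penalizes large errors more

c(mean_error = mean_error, mae = mae, rmse = rmse)
```

The mean error can sit near 0 even when individual predictions are far off, which is why the absolute and squared versions are reported alongside it.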