How to do multiple regression in R is a pretty commonly asked question (especially by those students attempting to do their first statistical analysis!).
First, let’s discuss what multiple linear regression is. Multiple linear regression is a straightforward generalization of simple linear regression. It lets you model a continuous response variable in terms of more than one predictor, so you can measure the joint effect of several explanatory variables on the response.
Multiple Linear Regression R: A Further Breakdown
In general, let’s remember that regression models are used to describe relationships between variables by fitting a line to the observed data. Regression, therefore, allows you to estimate how the dependent variable changes as the independent variable(s) change.
In this case, multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.
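Written out, the general model is usually expressed as below. The symbols are standard textbook notation rather than anything specific to this article: y is the dependent variable, x_1 through x_p are the independent variables, \beta_0 is the intercept, each \beta_j is the coefficient of x_j, and \varepsilon is the random error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Fitting the model means estimating the \beta values from the observed data.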
Multiple Linear Regression in R: When To Use It?
You most likely use multiple linear regression in R when you want to know the following:
- How strong the relationship is between two or more independent variables and one dependent variable (for example: how rainfall, temperature, and the amount of fertilizer added impact crop growth).
- The value of the dependent variable at certain values of the independent variables (for example: the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
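To make those two questions concrete, here is a minimal sketch using simulated crop data. Everything in it is hypothetical: the data frame crops, the units, and the "true" coefficients are invented purely for illustration. It uses only base R.

```r
# Hypothetical data: every number below is made up for illustration
set.seed(42)
n <- 100
crops <- data.frame(
  rainfall    = runif(n, 200, 800),  # assumed unit: mm per season
  temperature = runif(n, 10, 30),    # assumed unit: degrees Celsius
  fertilizer  = runif(n, 0, 50)      # assumed unit: kg per plot
)

# Simulate yield from known coefficients plus random noise
crops$yield <- 2 + 0.01 * crops$rainfall + 0.3 * crops$temperature +
  0.05 * crops$fertilizer + rnorm(n, sd = 1)

# Question 1: how strong is the relationship?
fit <- lm(yield ~ rainfall + temperature + fertilizer, data = crops)
r_squared <- summary(fit)$r.squared
r_squared

# Question 2: what yield is expected at given predictor values?
new_plot <- data.frame(rainfall = 500, temperature = 20, fertilizer = 25)
predicted_yield <- predict(fit, newdata = new_plot)
predicted_yield
```

Because the data are simulated, the prediction should land close to the value implied by the made-up coefficients (2 + 5 + 6 + 1.25 = 14.25), which is a handy sanity check.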
As a further example, imagine you are a public health researcher interested in social factors that influence heart disease. You run a survey of 500 towns and gather data on the percentage of people in each town who smoke, the percentage of people in each town who bike to work, and the percentage of people in each town who have heart disease.
Because you have two independent variables and one dependent variable, and all of your variables are quantitative in nature, you can use multiple linear regression in R to analyze the relationship between them.
How To Do Multiple Regression in R: A Hands-On Example
To get started, let’s use the marketing dataset that is part of the datarium package. First, if you do not already have it installed, install the package and load the marketing data as follows:
devtools::install_github("kassambara/datarium")
data("marketing", package = "datarium")
Once installed, let’s take a peek at the data:
> head(marketing, 4)
  youtube facebook newspaper sales
1  276.12    45.36     83.04 26.52
2   53.40    47.16     54.12 12.48
3   20.64    55.08     83.16 11.16
4  181.80    49.56     70.20 22.20
The data record the amount of money one organization spent on three advertising media (YouTube, Facebook, and newspapers), along with the impact each respective outlet had on sales. Budgets are recorded in thousands of dollars.
In this case, it is useful to build a multiple linear regression model in R for estimating sales based on the advertising budget invested into YouTube, Facebook, and newspapers.
From a statistical standpoint, the formula for this multiple linear regression model looks like this:
sales = b0 + b1*youtube + b2*facebook + b3*newspaper
However, doing that linear regression model in R is just a tad different because of the coding involved. For R, you want to do the following:
model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(model)
As you can see, you are creating a new object called model that holds the fitted linear regression model (lm is the base-R function for fitting linear models).
Afterward, you can pull up the summary of the model by passing the name of the object you just created to the summary function.
How To Do Multiple Linear Regression in R: Understanding the Results
Once you run the summary function as mentioned above, you should see results similar to this:
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max
-10.5932  -1.0690   0.2902   1.4272   3.3951

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.526667   0.374290   9.422   <2e-16 ***
youtube      0.045765   0.001395  32.809   <2e-16 ***
facebook     0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
The first step in interpreting the multiple regression analysis is to examine the F-statistic and its associated p-value at the bottom of the model summary.
In our example, the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least one of the predictor variables is significantly related to the outcome variable.
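If you are curious where that p-value comes from, the sketch below recomputes it from the three numbers printed on the F-statistic line of the summary, using the base-R function pf. The variable names are just illustrative choices. On a fitted model, the same three numbers are also stored in summary(model)$fstatistic.

```r
# Values copied from the F-statistic line of the summary output
f_value  <- 570.3  # F-statistic
df_model <- 3      # numerator degrees of freedom (number of predictors)
df_resid <- 196    # denominator (residual) degrees of freedom

# The overall p-value is the upper-tail probability of the F distribution
p_overall <- pf(f_value, df_model, df_resid, lower.tail = FALSE)

# R prints "< 2.2e-16" whenever the p-value falls below that threshold
p_overall < 2.2e-16
```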
To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:
summary(model)$coefficient
                Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)  3.526667243 0.374289884  9.4222884 1.267295e-17
youtube      0.045764645 0.001394897 32.8086244 1.509960e-81
facebook     0.188530017 0.008611234 21.8934961 1.505339e-54
newspaper   -0.001037493 0.005871010 -0.1767146 8.599151e-01
For a given predictor, the t-statistic evaluates whether or not there is a significant association between the predictor and the outcome variable, that is whether the beta coefficient of the predictor is significantly different from zero.
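As a rough sketch of what happens behind the scenes, each t value in the table is simply the estimate divided by its standard error, and the p-value is a two-sided tail probability from the t distribution. The snippet below reproduces those columns from the printed numbers using base R only. The variable names are illustrative, and the 196 residual degrees of freedom are taken from the model summary.

```r
# Estimates and standard errors copied from the coefficients table
estimate <- c(youtube = 0.045764645, facebook = 0.188530017,
              newspaper = -0.001037493)
std_error <- c(youtube = 0.001394897, facebook = 0.008611234,
               newspaper = 0.005871010)
df_resid <- 196  # residual degrees of freedom from the model summary

# t-statistic: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 3)
signif(p_value, 3)
```

A large absolute t value (as for youtube and facebook) gives a tiny p-value, while a t value near zero (as for newspaper) gives a p-value close to 1.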
It can be seen that changes in YouTube and Facebook advertising budget are significantly associated with changes in sales, while changes in newspaper budget are not significantly associated with sales.
For a given predictor variable, the coefficient can be interpreted as the average change in the outcome for a one-unit increase in the predictor, holding all other predictors fixed.
For example, for a fixed amount of YouTube and newspaper advertising budget, spending an additional $1,000 on Facebook advertising is associated with an increase in sales of approximately 189 units, on average.
The YouTube coefficient suggests that for every $1,000 increase in YouTube advertising budget, holding all other predictors constant, we can expect an increase in sales of about 46 units, on average.
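To see the "holding all other predictors fixed" idea in action, here is a small sketch that plugs the fitted coefficients into the regression equation by hand. The helper predict_sales and the example budgets are hypothetical, chosen only for illustration. The sketch also assumes that one unit of budget is $1,000 and that one unit of sales stands for 1,000 units sold, which is how a coefficient of about 0.189 lines up with the figure of roughly 189 units.

```r
# Coefficients copied from the model summary
b <- c(intercept = 3.526667, youtube = 0.045765,
       facebook = 0.188530, newspaper = -0.001037)

# Hypothetical helper: evaluate the fitted equation for one set of budgets
predict_sales <- function(youtube, facebook, newspaper) {
  b[["intercept"]] + b[["youtube"]] * youtube +
    b[["facebook"]] * facebook + b[["newspaper"]] * newspaper
}

# Hypothetical budgets, chosen only for illustration
base <- predict_sales(youtube = 100, facebook = 50, newspaper = 20)
more_facebook <- predict_sales(youtube = 100, facebook = 51, newspaper = 20)

# Raising facebook by one unit with the others fixed changes the
# prediction by the facebook coefficient
facebook_effect <- more_facebook - base
facebook_effect
```

With a fitted model object you would normally call predict(model, newdata = ...) instead; writing the equation out by hand just makes the role of each coefficient visible.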
We found that newspaper advertising is not significant in the multiple regression model. This means that, for a fixed amount of YouTube and Facebook advertising budget, changes in the newspaper advertising budget are not significantly associated with changes in sales.