Hello and welcome to this tutorial. We are going to learn how to implement a Multiple Linear Regression model in R. This is a bit more complex than Simple Linear Regression but it’s going to be so practical and fun.
Multiple Linear Regression is a data science technique that uses several explanatory variables to predict the outcome of a response variable. A multiple linear regression model describes the relationship between two or more explanatory (independent) variables and a response (dependent) variable by fitting a linear equation to observed data. Every observation pairs one value of each independent variable x1, x2, ..., xn with one value of the dependent variable y.
We’ll understand this better by using a very practical example.
Dataset + Business Problem description
Our dataset contains data about 50 startups. Each observation records how much the startup spent on Research and Development, administration and marketing, the county in which it operates, and the profit it made. Our challenge is to check whether there is any correlation between the independent variables and the profit, and to build a model that helps a venture capital fund use the independent variables (R&D Spend, Administration, Marketing and Location) to predict the dependent variable (Profit). Beyond that, we want to help the investors see which independent variable has the strongest effect on the profit, and what kind of relationship links the profit to those independent variables.
Just a heads up before we dive into this section; there’s a caveat around building regression models. Linear regressions have assumptions.
Assumptions of Linear Regression
- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Lack of multicollinearity
We won’t focus on the assumptions in this section. However, before you build a linear regression model, always do your research and make sure that these assumptions hold for your data. Only once you have checked them should you go ahead and follow the steps that I’ll show you in this tutorial.
Unlike the Simple Linear Regression model, where we were dealing with one independent variable and one dependent variable, a multiple linear regression model has more than one independent variable. Not every independent variable is a useful predictor, so we may have to remove some columns to end up with a simpler, more reliable model. There are five methods of building a model.
- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score Comparison
We’re only going to focus on Backward Elimination in this tutorial because it is the fastest one, and you’re still going to learn how to build a model step-by-step. Without further ado, let’s begin.
Building a Multiple Linear Regression Model in R (Step-by-Step)
As usual, we’re going to start by setting the working directory. Create a folder and name it, ‘Multiple Linear Regression’. Download the dataset and save it in the folder you created.
Open RStudio and navigate to that folder. Set that folder as the working directory. Create a Multiple_Linear_Regression.R file and save it in the same folder. We are going to use the Data Preprocessing template we created in the first part (Data Preprocessing). Copy all the contents of the Data Preprocessing Template into the new Multiple_Linear_Regression.R file. I’m only going to edit the Data Preprocessing code, so make sure you paste it in the ‘Multiple_Linear_Regression.R’ file.
To import our dataset, we only need to change the file name in the import line to ‘50_Startups.csv’, so that it reads: dataset = read.csv('50_Startups.csv'). Select the line of code and press ‘Ctrl + Enter’.
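Here is a small sketch of what the import does to the column names. The real file contents are not shown on this page, so the header names and the three rows below are made-up stand-ins written to a temporary file. Only the read.csv() call mirrors the tutorial.

```r
# Made-up stand-in for '50_Startups.csv' (header names and values are assumptions)
csv_path = tempfile(fileext = '.csv')
writeLines(c(
  'R&D Spend,Administration,Marketing Spend,County,Profit',
  '1000,2000,3000,Nairobi,5000',
  '1500,2500,1000,Kisumu,5500',
  '500,2200,4000,Mombasa,4200'
), csv_path)

# Same call as in the tutorial, pointed at the temporary file
dataset = read.csv(csv_path)

# read.csv() rewrites headers into valid R names, e.g. 'R&D Spend' becomes 'R.D.Spend'
print(names(dataset))
str(dataset)
```

The rewritten names are the ones you would type whenever you refer to a column by name, for example in a model formula.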
Our dataset contains some categorical data. As you can see, the County column has three categories: Nairobi, Kisumu and Mombasa. If you remember from our ‘Dealing with Categorical Data’ tutorial, we can only have numbers in our equations, so we have to encode the categories in the County column as numbers. Under the ‘Encode categorical data’ comments, we are going to keep the same code. We won’t change anything.
Press ‘Ctrl + Enter’. Take a look at our dataset now. We’ve encoded the County column with the numbers 1, 2, and 3.
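The template code itself is not reproduced in this section, so here is a minimal sketch of what the encoding step could look like with factor(). The five rows are made-up, and the column name County is taken from the text above.

```r
# Made-up rows, for illustration only (not the real 50_Startups data)
dataset = data.frame(
  County = c('Nairobi', 'Kisumu', 'Mombasa', 'Kisumu', 'Nairobi'),
  Profit = c(5000, 5500, 4200, 4800, 5100)
)

# Turn the County names into a factor whose labels are 1, 2 and 3
dataset$County = factor(dataset$County,
                        levels = c('Nairobi', 'Kisumu', 'Mombasa'),
                        labels = c(1, 2, 3))

print(dataset)
print(levels(dataset$County))
```

Because County is now a factor, lm() will treat 1, 2 and 3 as category labels and build dummy variables for them, rather than treating them as ordinary numbers.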
Our next step is going to be splitting our dataset into the training set and the test set. We have 50 observations. A good split ratio would be 40 observations for the training set, and 10 observations for the test set. Under the ‘Splitting the dataset into the Training set and Test set’ comments, we’re going to change the dependent variable, everything else remains the same. Press ‘Ctrl + Enter’. Now we have our training set and our test set.
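As a rough sketch of the 40/10 split, the block below uses base R’s sample() as a stand-in, since the template’s own splitting code is not shown here. The data is simulated, and the seed and column names are assumptions.

```r
# Simulated stand-in for the 50 startups (values are made-up)
set.seed(123)
n = 50
dataset = data.frame(R.D.Spend = runif(n, 0, 160000))
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

# Pick 80% of the row numbers at random for the training set
train_rows = sample(seq_len(nrow(dataset)), size = round(0.8 * nrow(dataset)))

training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

print(dim(training_set))
print(dim(test_set))
```

Setting the seed only makes the random split repeatable, so you get the same training and test rows every time you run the script.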
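Normally, the next step would be feature scaling. However, as with Simple Linear Regression, we can skip it here. The function that we’re going to use to fit the Multiple Linear Regression model to our training set does not need scaled inputs: rescaling an independent variable only rescales its coefficient and leaves the predictions unchanged.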
Fitting the Multiple Linear Regression Model to Our Training Set
First, we create the multiple linear regressor and call it ‘regressor’. We build it with the lm function, ‘lm()’, which takes two arguments: the formula and the training set. The formula is ‘formula = Profit ~ .’, where the ‘.’ stands for all the independent variables. The second argument is the training set, ‘data = training_set’. The full line is: regressor = lm(formula = Profit ~ ., data = training_set)
Press ‘Ctrl + Enter’. To see our regressor, go to the console and type, ‘summary(regressor)’.
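Put together, the fitting step could look like the sketch below. It runs on simulated data rather than the real 50 startups, so the numbers in the summary will not match yours. The column names, the seed and the base R split are assumptions.

```r
# Simulated stand-in data (made-up values; Profit is driven by R.D.Spend only)
set.seed(42)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  County          = factor(sample(c(1, 2, 3), n, replace = TRUE))
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

# 40 rows for training, 10 for testing
train_rows = sample(seq_len(n), size = 40)
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

# Fit Profit against all the other columns of the training set
regressor = lm(formula = Profit ~ ., data = training_set)

# Coefficients, standard errors, p-values and significance stars
print(summary(regressor))
```

Notice that County shows up as two rows in the summary. lm() creates one dummy variable per category and leaves one category out as the baseline.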
If you take a look at the summary of our regressor, you will see that some independent variables are more strongly associated with the dependent variable than others. We can tell by looking at the p-value column, ‘Pr(>|t|)’, and the significance stars next to it. That’s why we need backward elimination: to keep only the statistically significant independent variables and end up with a simpler, more reliable model.
All we have to do now is predict the test results. We just need one line for this: y_pred = predict(regressor, newdata = test_set)
Press ‘Ctrl + Enter’. In the console section in RStudio, type y_pred and press Enter. You’ll see that our model’s predictions are not too far from the real observations. It shows that this is not a bad model.
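To make the comparison easier than reading y_pred on its own, you could put the predictions next to the real profits. The sketch below does that on simulated data, with made-up values and assumed column names, so the numbers will not match the real dataset.

```r
# Simulated stand-in data (made-up values)
set.seed(42)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000)
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

train_rows = sample(seq_len(n), size = 40)
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

regressor = lm(formula = Profit ~ ., data = training_set)

# Predict the profit of the 10 startups the model has not seen
y_pred = predict(regressor, newdata = test_set)

# Real and predicted profit side by side
comparison = data.frame(actual = test_set$Profit, predicted = y_pred)
print(comparison)
```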
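The most important thing to understand here is that the lower the p-value, the more statistically significant an independent variable is. A low p-value means there is strong evidence that the variable really is related to the dependent variable. It does not, by itself, tell you how large that effect is. For that, look at the coefficient estimate.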
When we look at our regressor, we can see that only one variable is highly significant: the R&D Spend (shown by the three stars). This means that we could actually turn this into a simple linear regression and express the profit as a linear function of the R&D spend only. However, it is better to remove the less significant variables one by one and check the model after each removal.
Backward Elimination in Multiple Linear Regression
We’re going to use the same regressor as before, but we’re going to change a few things. First, we need to write out each independent variable in our formula instead of using ‘.’, because backward elimination involves removing them one by one. The second thing we’re going to change is the data, from the training set to the whole dataset.
We’re also going to add ‘summary(regressor)’ after the model, so that we can always see the summary of our regressor.
Generally, the most common threshold is 5 percent. If the p-value (the ‘Pr(>|t|)’ column) is lower than 5 percent, the independent variable is statistically significant. The further the p-value rises above 5 percent, the less statistically significant the variable is. We’ll remove the variables whose p-values are higher than 5 percent one by one, starting with the highest, and refit the model after each removal.
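To get you started, here is a sketch of a single elimination round. It runs on simulated data in which only R.D.Spend drives the profit, so which variable gets removed here says nothing about the real dataset. The column names are assumptions, and the update() shortcut is just one way to refit without retyping the formula.

```r
# Simulated stand-in data (made-up values; Profit is driven by R.D.Spend only)
set.seed(42)
n = 50
dataset = data.frame(
  R.D.Spend       = runif(n, 0, 160000),
  Administration  = runif(n, 50000, 180000),
  Marketing.Spend = runif(n, 0, 450000),
  County          = factor(sample(c(1, 2, 3), n, replace = TRUE))
)
dataset$Profit = 50000 + 0.8 * dataset$R.D.Spend + rnorm(n, sd = 9000)

# Every independent variable written out, fitted on the whole dataset
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + County,
               data = dataset)

# p-values of all coefficients except the intercept
p_values = summary(regressor)$coefficients[-1, 'Pr(>|t|)']
print(p_values)

# Coefficient with the highest p-value; dummy rows such as 'County2'
# belong to the single variable 'County'
worst = names(which.max(p_values))
term = if (startsWith(worst, 'County')) 'County' else worst

# Remove that variable only if it is above the 5 percent threshold, then refit
if (max(p_values) > 0.05) {
  regressor = update(regressor, as.formula(paste('. ~ . -', term)))
}
print(summary(regressor))
```

Repeating this round until every remaining p-value is below 5 percent completes the elimination.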
That is going to be your homework. You can check out the solution in the picture below if you get stuck.
That’s it for now. I hope you found this article to be useful. Also check out more hands-on tutorials on Lituptech. I’ll see you in the next one.