Hello and welcome to this tutorial. We have learnt how to create Single and Multiple linear regression models. Now, let’s learn how to create Polynomial regression models in R and where we would apply them to solve real life problems. According to Wikipedia, Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y. In this tutorial we are going to be building a nonlinear regression model (nonlinear in x, although the model is still fitted as a linear regression on the powers of x). For datasets where there’s no linear relationship between the independent variable and the dependent variable, nonlinear regression models are very useful.
As usual we start by setting the working directory. If you have followed previous tutorials, you should already know how to do this. Download the dataset using the download button below, and save it in your working folder (Preferably an empty folder). Open RStudio and navigate to the folder where you saved the dataset. Create a new R file and call it ‘Polynomial_Regression’. Again, we are going to be using the data preprocessing template we created in the data preprocessing part. You can download this file and copy the contents of the file into the ‘Polynomial_Regression.R’ file you created.
Your Polynomial_Regression.R file should look like this:
Now we just need to change a few things. The first thing we’re going to change is the name of the dataset we want to import. Change ‘Data.csv’ to ‘Position_Salaries.csv’.
Let’s take a look at our dataset.
Our Business Problem
You’re on the Human Resources team of a big company, and you’re about to hire a new employee. You’ve found someone who seems to be great and a very good fit for the job. You are about to make an offer, and it’s time to negotiate what his/her salary is going to be. The interviewee tells us that he/she has had 19 years of experience, was receiving a salary of one hundred and sixty thousand in his/her previous job, and is asking for nothing less than one hundred and sixty thousand. One of your colleagues on the HR team decides to call the interviewee’s previous company and ask if the information the interviewee has provided is true. Unfortunately, the only information the team member gets is this:
The HR team member also finds out that our interviewee has been a regional manager in the previous company for two years, and that it takes an average of four years to move from regional manager to partner. This means that our interviewee was halfway to becoming partner. He/she was halfway between level six and level seven, so we can say level 6.5. The HR team member says that he can build a bluffing detector using regression to detect whether the interviewee is bluffing or not. Let’s build a polynomial regression model that will predict whether it’s the truth or a bluff.
We’ll start by seeing if we need the whole dataset to train our machine learning model. From the dataset, we can see that the position column is strictly equivalent to the level column. Since we only need numbers for our equations, we are going to omit the Position column and work with the Level and Salary columns only. Our independent variable is going to be the Level column and the dependent variable is going to be the Salary column. Let’s select the two columns that we are interested in from our dataset. To do this we are going to reset the dataset. Code: dataset = dataset[2:3]
Indexes in R start at one, and that is why we used two and three to select the second (Level) and third (Salary) columns. If we take a look at our dataset now, it only has two columns, ‘Level’ and ‘Salary’:
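As a minimal sketch of that column selection: the data frame below is only a stand-in for what read.csv('Position_Salaries.csv') would return. The position labels and salary figures are assumptions made for illustration, since the actual table is not reproduced in this text.

```r
# Stand-in for: dataset = read.csv('Position_Salaries.csv')
# NOTE: the position labels and salary values are assumed, not taken from the tutorial.
dataset = data.frame(
  Position = paste('Position', 1:10),
  Level    = 1:10,
  Salary   = c(45000, 50000, 60000, 80000, 110000,
               150000, 200000, 300000, 500000, 1000000)
)

# Keep only the second (Level) and third (Salary) columns.
dataset = dataset[2:3]

# Inspect the result: two columns remain.
print(names(dataset))
print(head(dataset))
```

Selecting with a single index range, as in dataset[2:3], keeps the result a data frame, which is what lm() expects later on.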
Just a reminder, ‘Level’ is the independent variable, while ‘Salary’ is our dependent variable. We are going to use the relationship between the two to train our nonlinear machine learning model and predict salaries, for example the salary for an employee at level six and a half.
The next step would be to split the dataset into the training and test sets. However, this time we won’t do that. We are dealing with a very small dataset of only ten observations, so that we can best understand how machine learning models work. Let’s just comment that whole part out. Select the code and press ‘Command/Ctrl + Shift + C’ to comment it all out at once (Pro Tip).
Linear Regression Vs Polynomial Regression in R
The next step would be feature scaling, but we won’t need to do it either. Next, we are going to fit a polynomial regression model to our dataset. However, to best understand how a polynomial regression model is more powerful in our situation, we are going to compare it to a baseline model, a linear regression model. We are going to build two models, the linear regression model and the polynomial regression model, and compare the graphic results and the predictions. You will be more convinced that the polynomial regression model is more appropriate for this kind of problem. The main reason for that is that this is a nonlinear problem.
We begin by creating our regressor, lin_reg, and assigning it the result of the lm() function. The lm() function takes two arguments here. The first argument is the formula, ‘formula = Salary ~ .’, where the dot stands for all the other columns in the dataset. The second argument is the data, ‘data = dataset’.
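Put together, the baseline model can be sketched as follows. The data frame is a stand-in for the two-column dataset prepared earlier; its salary values are assumed for illustration and are not taken from the tutorial's table.

```r
# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Fit the baseline linear model: Salary explained by every other column,
# which at this point is just Level.
lin_reg = lm(formula = Salary ~ ., data = dataset)

# Show the fitted intercept and slope.
print(coef(lin_reg))
```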
Select the code and press ‘Ctrl + Enter’. We have built our model. Type summary(lin_reg) inside the console area to view our model.
Create a regressor and call it ‘poly_reg’. As with linear regression, we assign it the result of the lm() function, which again takes two arguments, the formula and the data. To turn this from a linear regression into a polynomial regression model, we need to add some polynomial features. These features are additional independent variables: the values in the Level column raised to different powers.
These new independent variables make up the matrix of features. Fitting a multiple linear regression model on them is what gives us a polynomial regression model. We start by adding another column to our dataset, which holds the values of the Level column squared. To do this, we add the following line above the poly_reg regressor: dataset$Level2 = dataset$Level^2
Let’s take a look at our dataset now.
Let’s add two more columns: Level cubed (Level3) and Level to the power of four (Level4). The whole code should look like this.
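A sketch of the whole step is shown here. The column names Level3 and Level4 follow the Level2 naming used above and are an assumption, as are the salary values in the stand-in data frame.

```r
# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Add the polynomial features: Level squared, cubed and to the fourth power.
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4

# Same lm() call as before; the extra columns make it a degree-4 polynomial fit.
poly_reg = lm(formula = Salary ~ ., data = dataset)

# One coefficient per power of Level, plus the intercept.
print(coef(poly_reg))
```

Because the formula is ‘Salary ~ .’, every column added to the dataset before the lm() call automatically becomes a predictor.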
Select the code and press ‘Ctrl + Enter’ to execute.
Visualizing our Models
We are going to use the ‘ggplot2’ package. If you have not installed this package, check out our ‘Simple Linear Regression’ tutorial to see how we did it. Alternatively, you can use the line install.packages('ggplot2'). We also need to load the library with library(ggplot2) so that we can use it.
To visualize our data, copy the code below into your polynomial regression file. I explained the code in a previous tutorial, so I’m not going to do it again.
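One way the linear regression plot might be written is sketched below. The colours, title, and the helper data frame plot_data are illustrative choices, and the salary values in the stand-in dataset are assumed rather than taken from the tutorial.

```r
library(ggplot2)

# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Baseline linear model on Level only.
lin_reg = lm(formula = Salary ~ Level, data = dataset)

# Collect observed and predicted salaries in one data frame for plotting.
plot_data = data.frame(
  Level     = dataset$Level,
  Salary    = dataset$Salary,
  Predicted = predict(lin_reg, newdata = dataset)
)

# Red points are the observations, the blue line is the model's prediction.
p_lin = ggplot(plot_data, aes(x = Level)) +
  geom_point(aes(y = Salary), colour = 'red') +
  geom_line(aes(y = Predicted), colour = 'blue') +
  ggtitle('Linear Regression') +
  xlab('Level') +
  ylab('Salary')

print(p_lin)
```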
Select the code and execute.
If you look at our graph, you will see that there is no linear relationship between the level and the salary. Some of our observation points are below the line, while others are way above it. For most of our observations, the predicted results are way off. Let’s take the CEO for example. If we used our linear regression model, we see that the predicted result for the CEO is about 690k. You can imagine how furious someone we were about to hire as a CEO would be if we told him/her that he/she was bluffing by asking for a 1 million salary.
On the other hand, if we use our linear regression model, we would overpay someone who is in the six and a half level. The predicted result is way higher than the actual observation. That’s why we need to apply a polynomial regression model for this situation. We need a model with a curved line that will help us make some more accurate predictions.
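The polynomial version of the plot can be sketched in the same way. Again, the salary values, the Level3 and Level4 column names, the helper data frame plot_data, and the plot styling are assumptions made for illustration.

```r
library(ggplot2)

# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

# Polynomial features up to the fourth power.
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4

poly_reg = lm(formula = Salary ~ ., data = dataset)

# Observed salaries next to the polynomial model's predictions.
plot_data = data.frame(
  Level     = dataset$Level,
  Salary    = dataset$Salary,
  Predicted = predict(poly_reg, newdata = dataset)
)

# Red points are the observations, the blue line follows the polynomial fit.
p_poly = ggplot(plot_data, aes(x = Level)) +
  geom_point(aes(y = Salary), colour = 'red') +
  geom_line(aes(y = Predicted), colour = 'blue') +
  ggtitle('Polynomial Regression') +
  xlab('Level') +
  ylab('Salary')

print(p_poly)
```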
Select the code and execute.
As you can see, we don’t have a straight line anymore. We have a curve that fits our observations much more closely. We can make the curve smoother by predicting salaries on a finer sequence of levels instead of only the ten levels in the dataset. We do it by building a vector of imaginary levels from one to ten, with increments of 0.1. That is, 1, 1.1, 1.2, 1.3… all the way to 10. Instead of ten levels, we’ll have ninety-one. We do that by using the following code.
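A sketch of the smoother curve follows, assuming the grid is built with seq() from the smallest to the largest level. Note that seq(1, 10, 0.1) includes both endpoints, so it produces 91 values. The names x_grid and grid_data, like the salary values, are illustrative assumptions.

```r
library(ggplot2)

# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ ., data = dataset)

# Finer grid of "imaginary" levels between the lowest and highest level.
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)

# The model needs the same polynomial columns for the new levels.
grid_data = data.frame(
  Level  = x_grid,
  Level2 = x_grid^2,
  Level3 = x_grid^3,
  Level4 = x_grid^4
)
grid_data$Salary = predict(poly_reg, newdata = grid_data)

# Observations as red points, smooth prediction curve in blue.
p_smooth = ggplot() +
  geom_point(data = dataset, aes(x = Level, y = Salary), colour = 'red') +
  geom_line(data = grid_data, aes(x = Level, y = Salary), colour = 'blue') +
  ggtitle('Polynomial Regression (smooth curve)') +
  xlab('Level') +
  ylab('Salary')

print(p_smooth)
```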
Select the code and execute.
Predicting the results
Let’s use our models to predict the results for our potential employee in the six and a half level and see if he/she was bluffing.
Let’s start with the linear regression model. To predict the result, use the following line of code:
Select the code and execute.
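A minimal sketch of that prediction, using predict() with a one-row data frame for level 6.5. The variable name y_pred_lin and the stand-in salary values are assumptions for illustration.

```r
# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)

lin_reg = lm(formula = Salary ~ Level, data = dataset)

# predict() expects a data frame whose column names match the model's predictors.
y_pred_lin = predict(lin_reg, newdata = data.frame(Level = 6.5))

# With this stand-in data the result is roughly 330k.
print(y_pred_lin)
```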
If we take a look at the predicted result in the console part of RStudio, we see that our model predicted a salary of 330k, which is way above what our potential employee said he/she used to receive.
Let’s now use the polynomial regression model to predict the result. To do this, use the following lines of code:
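The polynomial prediction can be sketched the same way. The one extra requirement is that the new observation carries every polynomial column the model was trained on. The names y_pred_poly, Level3 and Level4, and the stand-in salary values, are assumptions for illustration.

```r
# Stand-in for the Level/Salary dataset (values are assumed for illustration).
dataset = data.frame(
  Level  = 1:10,
  Salary = c(45000, 50000, 60000, 80000, 110000,
             150000, 200000, 300000, 500000, 1000000)
)
dataset$Level2 = dataset$Level^2
dataset$Level3 = dataset$Level^3
dataset$Level4 = dataset$Level^4
poly_reg = lm(formula = Salary ~ ., data = dataset)

# The new observation needs Level and all of its powers, computed for 6.5.
new_level = data.frame(
  Level  = 6.5,
  Level2 = 6.5^2,
  Level3 = 6.5^3,
  Level4 = 6.5^4
)

y_pred_poly = predict(poly_reg, newdata = new_level)
print(y_pred_poly)
```

Leaving out any of the power columns makes predict() fail, because the model looks up each predictor by name in the new data.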
Select the code and execute.
We can see that the polynomial regression model made a more accurate prediction of 158k, which is close to the 160k that our potential employee gave us. We can see that our model works.
Congratulations, you have created your first nonlinear regression model. We’ll be creating a lot more of these in the future. See you in the next one.