Simple Linear Regression in R

Hello, and welcome to this tutorial. We’ve finished the Data Preprocessing part, and now it’s time to start making Machine Learning models. We’re going to start with the Simple Linear Regression model, and I will show you how to do it in R. Before we begin, we need to understand our data and the problem we are trying to solve.

Dataset and Business Problem description

First, you will need to download the dataset using the download link above. The dataset contains 30 observations, taken from 30 randomly chosen employees in a company. Each employee was asked for their total years of experience in the workforce (not just in that company) and the salary they receive. The company has hired you as a Data Scientist to find out whether there is any correlation between Years of Experience and Salary and, if there is, what type of correlation it is. The company knows that experience matters, and they don’t want to keep assigning salaries randomly.

Your job as the Data Scientist is to create a model that shows the best-fitting line for the relationship between Years of Experience and Salary. You will show the company how they are currently setting salaries and also give them a more accurate model (a set of rules) for setting the salaries of new employees in the future.

We’ll begin by setting the working directory. In your Machine Learning folder, create another folder and give it a name like, ‘Simple Linear Regression’. Move the dataset to that folder and set the folder as the working directory in RStudio. We’re going to use the data in the Salary_Data.csv file to build a Simple Linear Regression Model.


Data Preprocessing

As always, the first step in making a Machine Learning model is Data Preprocessing. We’re going to use the data preprocessing template we created in the previous tutorial. Create a ‘simple_linear_regression.R’ file using RStudio. Copy the code from the data preprocessing template and paste it into the R file you created. We’re just going to change a few things to adapt the code to our current dataset. Change the name of the dataset at the top from Data.csv to Salary_Data.csv.

dataset = read.csv('Salary_Data.csv')

 Highlight the line of code and press ‘Ctrl + Enter’. Let’s take a look at our dataset in RStudio.

[Figure: part of our dataset]

Just a reminder of what this dataset is about. The dataset contains information about employees in a company: the number of years of experience each employee has, and the salary each employee receives. We want to see if there is any correlation between the salary and the number of years of experience. In particular, we’re trying to see if there is a linear dependency between the two variables.

Before making any machine learning model, you have to know which variables are independent and which are dependent. In our case, the independent variable is the number of years of experience, while the dependent variable is the salary. We’re trying to predict the dependent variable from the independent variable.
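Before building any model, a quick way to check for a linear relationship is to compute the correlation between the two variables. Here is a minimal sketch of that check. The numbers are made up for illustration and are not the tutorial’s dataset; only the column names YearsExperience and Salary are taken from it.

```r
# Made-up numbers for illustration only; they are NOT the tutorial's dataset.
dataset = data.frame(
  YearsExperience = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  Salary          = c(40000, 44500, 52000, 55000, 61500, 66000)
)

# Independent variable (what we know) and dependent variable (what we predict)
x = dataset$YearsExperience
y = dataset$Salary

# Pearson correlation: values close to +1 or -1 suggest a strong linear relationship
r = cor(x, y)
print(r)
```

A value close to 1 tells us that salary tends to rise with experience in a roughly straight-line fashion, which is the situation simple linear regression is designed for.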

Split the Dataset into the Training set and the Test set

We’re going to take twenty observations for the training set and ten observations for the test set. Inside the code snippet that we pasted from our data preprocessing template, we’re only going to change the split ratio to 2/3 and the name of the dependent variable.

But first, we have to install a package called caTools. To do this, type install.packages('caTools'), highlight the code and press ‘Ctrl + Enter’.


After the package has been installed, it will appear in the Packages list at the bottom right of RStudio’s interface. You can load it by checking the box next to its name, or by calling library(caTools) as in the code below.

Once it is loaded, we’re going to use the code below to split the dataset into the training set and the test set.

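library(caTools)
set.seed(123)  # make the random split reproducible
# sample.split returns a logical vector: TRUE marks training rows, FALSE marks test rows
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)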


Select the lines of code and run them. We have now separated the Training set from the Test set.
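If you want to see what a 2/3 split amounts to without installing anything, here is a rough sketch that uses base R’s sample() instead of caTools. It is not the same function as sample.split and it will pick different rows, so treat it only as an illustration of the idea. The 30-row data frame is generated on the spot and stands in for the real dataset.

```r
# Made-up data frame with 30 rows, standing in for the tutorial's dataset
set.seed(123)
dataset = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
dataset$Salary = 30000 + 9000 * dataset$YearsExperience + rnorm(30, sd = 3000)

# Pick 2/3 of the row numbers at random for the training set
n = nrow(dataset)
train_rows = sample(seq_len(n), size = round(2/3 * n))

# Rows that were picked go to training, all the others go to test
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

print(nrow(training_set))
print(nrow(test_set))
```

With 30 observations and a ratio of 2/3, this gives twenty rows for training and ten for testing, and no row ends up in both sets.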

[Figure: our Training set]

[Figure: our Test set]

We’re going to use our training set to train our simple linear regression model. Our model will learn correlations between the Years of experience and the Salary using the training set. Then, we’re going to test the model’s power of prediction on the test set.

The next part would be Feature Scaling, but we won’t need to apply it here. A simple linear regression fitted with R’s lm() function, which we are going to use, does not need it: rescaling the predictor changes the units of the slope, but not the predictions of the fitted line. The data preprocessing phase is done. We are ready to start building the Linear Regression Model.
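If you would like to convince yourself that scaling makes no difference to the fitted line, here is a small sketch. It uses made-up data and a hypothetical extra column, YearsScaled, that is not part of the tutorial’s dataset.

```r
# Made-up data for illustration only
set.seed(1)
toy = data.frame(YearsExperience = 1:10)
toy$Salary = 30000 + 9000 * toy$YearsExperience + rnorm(10, sd = 2000)

# A standardized copy of the predictor (mean 0, standard deviation 1)
toy$YearsScaled = as.numeric(scale(toy$YearsExperience))

# Fit once on the raw predictor and once on the scaled copy
fit_raw = lm(Salary ~ YearsExperience, data = toy)
fit_scaled = lm(Salary ~ YearsScaled, data = toy)

# The coefficients differ, but the fitted salaries agree
same_fit = all.equal(unname(fitted(fit_raw)), unname(fitted(fit_scaled)))
print(same_fit)
```

The two models report different slopes because the predictor is measured in different units, yet they describe the same line through the data.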

Building a Simple Linear Regression Model In R

Fitting Simple Linear Regression to the Training Set

We’re going to use the lm() function. Let’s create a new variable that is going to be the simple linear regressor and call it regressor: regressor = lm(). We are going to pass two arguments to lm().

  • First, the formula, which expresses the dependent variable as a linear function of the independent variable: formula = Salary ~ YearsExperience
  • Second, the data. In this case, we want the training set: data = training_set

The whole code should look like this:

regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
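To get a feel for what the fitted regressor contains, here is a minimal sketch on a small made-up training set (the numbers are invented, only the column names match the tutorial). The names line_coefs, intercept and slope are just illustrative.

```r
# Made-up training data for illustration only
training_set = data.frame(
  YearsExperience = c(1, 2, 3, 4, 5, 6, 7, 8),
  Salary          = c(41000, 47500, 58000, 66000, 74500, 85000, 92000, 99500)
)

# Fit the line: Salary = intercept + slope * YearsExperience
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# coef() returns the intercept and the slope of the fitted line
line_coefs = coef(regressor)
intercept = line_coefs[["(Intercept)"]]
slope = line_coefs[["YearsExperience"]]

print(line_coefs)
# summary(regressor) prints more detail, such as R-squared and p-values
```

The slope is the estimated salary increase for one extra year of experience, and the intercept is the estimated salary at zero years of experience.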

Predicting the Test set results

We’ve trained our model, and now we want to see how well it predicts new observations. To do this, we are going to create our vector of predictions, y_pred. This is the vector that will contain the predicted values for the test set observations, and we are going to build it with the predict() function. We are going to pass two arguments to predict().

  • First, the regressor,
  • Second, newdata = test_set

The whole line of code should be:

y_pred = predict(regressor, newdata = test_set)

Select the line of code and run it. If you type y_pred and press ‘Enter’ in the console, you will see the predicted salaries for the ten observations in the test set. For presentation purposes, you will need to display these results on a graph. Let’s do that.
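Before moving on to graphs, it can help to put the predictions next to the real salaries. The sketch below does that on made-up training and test sets (three test rows instead of ten, all values invented). The comparison data frame is a hypothetical helper, not something from the original code.

```r
# Made-up data for illustration only
training_set = data.frame(
  YearsExperience = c(1, 2, 3, 4, 5, 6, 7, 8),
  Salary          = c(41000, 47500, 58000, 66000, 74500, 85000, 92000, 99500)
)
test_set = data.frame(
  YearsExperience = c(2.5, 5.5, 9),
  Salary          = c(53000, 79000, 110000)
)

regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# One predicted salary per row of the test set
y_pred = predict(regressor, newdata = test_set)

# Actual and predicted salaries side by side
comparison = data.frame(
  YearsExperience = test_set$YearsExperience,
  Actual          = test_set$Salary,
  Predicted       = unname(y_pred)
)
print(comparison)
```

Each predicted value is simply the intercept plus the slope times the years of experience for that row.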

Visualizing the training set results in Graphs

The first thing we need to do is install and load the ggplot2 package, which is a really good tool for plotting in R. To install it, run install.packages('ggplot2'). After the package has been installed, you can comment out that line of code, as we won’t need to install it again. We then load it with library(ggplot2).

We’re going to take a step-by-step approach to plotting our graph. First we’re going to plot all the observation points in the training set, then the regression line, then we add the title, and finally the labels for the x and y axes.

The different components we’re going to plot are going to be separated by a Plus (+) sign.

[Image: code for visualizing the Training set results]
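The steps described above could be sketched roughly as follows. This is one possible way to write it rather than the exact original code: the data is made up, the colours and the title text are arbitrary choices, and line_data and training_plot are helper names introduced here.

```r
library(ggplot2)

# Made-up training data for illustration only
training_set = data.frame(
  YearsExperience = c(1, 2, 3, 4, 5, 6, 7, 8),
  Salary          = c(41000, 47500, 58000, 66000, 74500, 85000, 92000, 99500)
)
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# Points on the fitted line, one per training observation
line_data = data.frame(
  YearsExperience = training_set$YearsExperience,
  Salary          = unname(predict(regressor, newdata = training_set))
)

# Each component is added to the plot with a plus sign
training_plot = ggplot() +
  geom_point(data = training_set, aes(x = YearsExperience, y = Salary), colour = 'red') +
  geom_line(data = line_data, aes(x = YearsExperience, y = Salary), colour = 'blue') +
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')

print(training_plot)
```

The red points are the real observations and the blue line shows the salaries the model predicts for the same years of experience.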

We can now see our graph:

[Figure: Training set observations with the regression line]

Let’s do the same for the test set results. Just copy the code above and edit the first line (the one that plots the observation points) to change it from training_set to test_set. The block of code should look like this:

[Image: code for visualizing the Test set results]
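A possible version of the test set plot is sketched below, again on made-up data with arbitrary colours and hypothetical helper names. It assumes that only the observation points switch to test_set, while the regression line is still drawn from the model fitted on the training set.

```r
library(ggplot2)

# Made-up data for illustration only
training_set = data.frame(
  YearsExperience = c(1, 2, 3, 4, 5, 6, 7, 8),
  Salary          = c(41000, 47500, 58000, 66000, 74500, 85000, 92000, 99500)
)
test_set = data.frame(
  YearsExperience = c(2.5, 5.5, 9),
  Salary          = c(53000, 79000, 110000)
)
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# The line still comes from the model trained on the training set
line_data = data.frame(
  YearsExperience = training_set$YearsExperience,
  Salary          = unname(predict(regressor, newdata = training_set))
)

test_plot = ggplot() +
  # Only this layer changes: the points now come from the test set
  geom_point(data = test_set, aes(x = YearsExperience, y = Salary), colour = 'red') +
  geom_line(data = line_data, aes(x = YearsExperience, y = Salary), colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')

print(test_plot)
```

If the test points fall close to the line, the model is predicting observations it has never seen reasonably well.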

Now we can see the Test set results on a graph.

We have seen the correlation. Generally, the more years of experience an employee has, the higher the salary. We’ve also seen that in some cases employees received less or more than the model says they should be getting. We’ve given the company the best-fitting line and the model they should use to set salaries in future. Mission accomplished.
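To put numbers on who is paid more or less than the line suggests, one option is to look at the difference between each actual salary and the predicted one. Here is a small sketch on made-up data. The Difference and Position columns are hypothetical names added for illustration.

```r
# Made-up data for illustration only
training_set = data.frame(
  YearsExperience = c(1, 2, 3, 4, 5, 6, 7, 8),
  Salary          = c(41000, 47500, 58000, 66000, 74500, 85000, 92000, 99500)
)
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# Actual salary minus the salary predicted by the line
predicted = predict(regressor, newdata = training_set)
difference = training_set$Salary - unname(predicted)

# Positive means paid more than the line suggests, negative means paid less
training_set$Difference = round(difference)
training_set$Position = ifelse(difference > 0, 'above the line', 'below the line')

print(training_set)
```

On the data the line was fitted to, these differences always cancel out overall, so the interesting part is which individual employees sit far above or far below the line.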

Congratulations, now you know how to create a Simple Linear Regression Model in R. In the next tutorial, we are going to learn how to do Multiple Linear Regression in R. See you then.

