
# Simple Linear Regression in R for Data Science and Machine Learning

Hello, and welcome to this tutorial. We’ve finished the Data Preprocessing part, and now it’s time to start building Machine Learning models. We’re going to start with the Simple Linear Regression model, and I will show you how to build it in R. Before we begin, we need to understand our data and the problem we are trying to solve.

### Dataset and Business Problem description

First, download the dataset using the download link above. The dataset contains 30 observations, taken from 30 random employees in a company. Each employee was asked how many years of experience they have (overall in the workforce, not just in that company) and what salary they receive. The company has hired you as a Data Scientist to find out whether there is any correlation between Years of Experience and Salary and, if so, what type of correlation it is. The company knows that experience matters, and they don’t want to keep assigning salaries randomly.

Your job as the Data Scientist is to create a model, which will show the best-fitting line for the relationship between the Years of Experience and the Salary. You will show the company how they are currently setting salaries and also give them a more accurate model/set-of-rules on how to set salaries for new employees in the future.

We’ll begin by setting the working directory. In your Machine Learning folder, create another folder and give it a name like ‘Simple Linear Regression’. Move the dataset to that folder and set the folder as the working directory in RStudio. We’re going to use the data in the Salary_Data.csv file to build a Simple Linear Regression model.
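The working directory can also be set from code instead of through RStudio’s menus. Here is a minimal sketch, assuming a hypothetical folder location (project_dir is a made-up name and path; replace it with wherever you created the folder):

```r
# Hypothetical path: adjust it to the folder you created for this tutorial.
project_dir = file.path("~", "Machine Learning", "Simple Linear Regression")

# Only switch directories if the folder actually exists.
if (dir.exists(project_dir)) {
  setwd(project_dir)
}

# Show the directory R will look in for Salary_Data.csv.
print(getwd())
```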

### Data Preprocessing

As always, the first step in making a Machine Learning model is Data Preprocessing. We’re going to use the data preprocessing template we created in the previous tutorial. Create a ‘simple_linear_regression.R’ file using RStudio. Copy the code from the data preprocessing template and paste it into the R file you created. We’re just going to change a few things to adapt the code to our current dataset. Change the name of the dataset at the top from Data.csv to Salary_Data.csv.

Importing the dataset
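The import line would look something like the sketch below. It assumes the file is named Salary_Data.csv and sits in the working directory; the file.exists() guard and the csv_path variable are additions for illustration, so the snippet fails gently if the file is somewhere else.

```r
# Name of the CSV file, expected in the current working directory.
csv_path = "Salary_Data.csv"

if (file.exists(csv_path)) {
  # Read the CSV into a data frame called dataset.
  dataset = read.csv(csv_path)
  # Quick look at the structure: the tutorial expects 30 rows and the
  # columns YearsExperience and Salary.
  str(dataset)
} else {
  message("Salary_Data.csv not found in ", getwd())
}
```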

Highlight the line of code and press ‘Ctrl + Enter’. Let’s take a look at our dataset in RStudio.

Part of our Dataset

Just a reminder of what this dataset is about. The dataset contains information about employees in a company: the number of years of experience each employee has, and the salary each employee receives. We want to see if there is any correlation between the salary and the number of years of experience. We’re trying to see if there is a linear dependency between the two variables.

Before making any machine learning model, you have to know which variables are independent and which are dependent. In our case, the independent variable is the number of years of experience, while the dependent variable is the salary. We’re trying to predict the dependent variable based on the information in the independent variable.

#### Split the Dataset into the Training set and the Test set.

We’re going to take twenty observations for the training set and ten observations for the test set. Inside the code snippet that we pasted from our data preprocessing template, we’re only going to change the split ratio to 2/3 and the name of the dependent variable.

But first, we have to install a package called caTools. To do this, type install.packages('caTools'), highlight the code and press ‘Ctrl + Enter’.

Install caTools

After the package has been installed, it will appear in the Packages pane at the bottom right of RStudio’s interface. You activate it by checking the box next to it, which has the same effect as running library(caTools).
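If you prefer to do both steps from the script, a sketch along these lines should work. The requireNamespace() check is an addition for illustration; it simply avoids an error when the package has not been installed yet.

```r
# Install once; afterwards this line can stay commented out.
# install.packages("caTools")

# Loading the package from code is equivalent to ticking its box
# in RStudio's Packages pane.
if (requireNamespace("caTools", quietly = TRUE)) {
  library(caTools)
} else {
  message("caTools is not installed yet; run install.packages('caTools') first.")
}
```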

Once it is activated, we’re going to use the code below to split the dataset into the training set and the test set.

library(caTools)

set.seed(123)

split = sample.split(dataset$Salary, SplitRatio = 2/3)

training_set = subset(dataset, split == TRUE)

test_set = subset(dataset, split == FALSE)

Select the lines of code and run them. We have separated the Training set from the Test set.
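To check what a 2/3 split produces, you can count the rows in each subset. The sketch below is self-contained, so it uses made-up stand-in data and base R’s sample() rather than caTools. Both the data and the use of sample() are assumptions for illustration, not the tutorial’s dataset or its sample.split() call, and the names demo, demo_training and demo_test are hypothetical.

```r
# Made-up stand-in for Salary_Data.csv: 30 rows, same column names.
set.seed(123)
demo = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
demo$Salary = 25000 + 9500 * demo$YearsExperience + rnorm(30, sd = 5000)

# Base-R version of a 2/3 split: pick 20 of the 30 row numbers at random.
train_rows = sample(seq_len(nrow(demo)), size = 20)
demo_training = demo[train_rows, ]
demo_test = demo[-train_rows, ]

# Number of rows that ended up in each subset.
print(nrow(demo_training))
print(nrow(demo_test))
```

With the real objects from the tutorial, nrow(training_set) and nrow(test_set) give the same check.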

We’re going to use our training set to train our simple linear regression model. Our model will learn the correlation between the Years of Experience and the Salary using the training set. Then, we’re going to test the model’s power of prediction on the test set.

The next step would normally be Feature Scaling, but we won’t need to apply it here: a least-squares fit with the lm() function we are going to use gives the same predictions whether or not the variables are rescaled. The data preprocessing phase is done. We are ready to start building the Linear Regression model.

### Building a Simple Linear Regression Model In R

#### Fitting Simple Linear Regression to the Training Set

We’re going to use what is called the lm() function.

Just type lm and then press F1 to get info about the lm() function and its arguments.

• The first one is the formula: the dependent variable, expressed as a linear combination of the independent variable. In our case, formula = Salary ~ YearsExperience.
• The second one is the data: the data on which we want to train our Simple Linear Regression model. In our case this is the training set that we created earlier.

There are some other arguments that could go into the lm() function, but they are optional and we don’t need them in this case.

Let’s create a new variable that is going to be the simple linear regressor and call it regressor: regressor = lm(). The lm() function is going to take our two arguments.

• First, the formula: formula = Salary ~ YearsExperience
• Second, the data. In this case, we want the training set: data = training_set

The whole code should look like this;

Fitting Simple Linear Regression to the Training Set
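Putting the two arguments together gives the call sketched below. So that the snippet runs on its own, it first creates a made-up stand-in training_set when none exists; those numbers are an assumption for illustration only and are not the tutorial’s data.

```r
# Stand-in training data so the snippet runs on its own. The numbers are
# made up; this step is skipped if training_set already exists.
if (!exists("training_set")) {
  set.seed(123)
  training_set = data.frame(YearsExperience = seq(1, 10.5, length.out = 20))
  training_set$Salary = 25000 + 9500 * training_set$YearsExperience +
    rnorm(20, sd = 5000)
}

# Fit Salary as a linear function of YearsExperience on the training set.
regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
```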

Select the line of code and execute it.

Our regressor is ready.

If you want to get any information about our regressor, the best way to do it is to go to the console and type summary(regressor), then press Enter. You’ll see some really good info about our Simple Linear Regression model.

###### Explanation
• First, it shows the call with our formula: Salary modeled as a linear function of Years of Experience. It also tells you that the model was built on the training set.
• Then we have info about the residuals. We won’t discuss that for now.
• The most important section is the Coefficients section. Not only does it tell us the value of the coefficients in the Simple Linear Regression model, but also their statistical significance. The number of stars ranges from none, which means no statistical significance, to three, which means high statistical significance. We have three stars, which means the YearsExperience independent variable is highly statistically significant. That’s our first hint of what is going to happen: there will be a strong linear relationship between the Salary and the Years of Experience.
• The last is the P-Value. This is another indication of statistical significance. The lower the P-Value, the more significant the independent variable is, i.e. the stronger the evidence that the independent variable has an effect on the dependent variable.

Normally, a good threshold for the P-Value is 5%. Below 5%, the independent variable is considered statistically significant; above 5%, it is considered not significant.
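The same information can be pulled out of the summary in code. The sketch below is self-contained: if no regressor exists yet, it fits one on made-up stand-in data (an assumption for illustration, so the printed numbers will not match the tutorial’s). The names model_summary, coef_table and p_value are hypothetical.

```r
# Stand-in model so the snippet runs on its own; skipped if regressor exists.
if (!exists("regressor")) {
  set.seed(123)
  demo = data.frame(YearsExperience = seq(1, 10.5, length.out = 20))
  demo$Salary = 25000 + 9500 * demo$YearsExperience + rnorm(20, sd = 5000)
  regressor = lm(Salary ~ YearsExperience, data = demo)
}

# Full report: call, residuals, coefficients with significance stars.
model_summary = summary(regressor)
print(model_summary)

# Coefficient table as a matrix: estimate, standard error, t value, p-value.
coef_table = coef(model_summary)

# P-value of the YearsExperience coefficient, checked against the 5% threshold.
p_value = coef_table["YearsExperience", "Pr(>|t|)"]
print(p_value < 0.05)
```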

That’s how you get the information.

We’re done fitting our Simple Linear Regression to our training set. It’s now time to predict the Test set results, to see how our Simple Linear Regression model behaves on a new set of data.

#### Predicting the Test set results

We’ve trained our model and now we want to see how well it predicts new observations. To do this, we are going to create our vector of predictions, y_pred.

We call it y_pred because it will contain the predicted values of the dependent variable, which is Salary (it’s on the Y-axis). We are going to use the predict() function.

We’re only going to have two arguments in our function. The first one is our regressor (the simple linear regression model we fitted earlier). So: y_pred = predict(regressor)

Our second argument is newdata. That’s the name of the argument, and it is the data containing the observations for which we want to predict results, i.e. the Test set. So: y_pred = predict(regressor, newdata = test_set)

The whole line of code should be;

Predicting the Test set results
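In script form, the prediction step is the single predict() call described above. The sketch below adds made-up stand-in data only when regressor or test_set is missing, so that it runs on its own; that stand-in block is an assumption for illustration and not part of the tutorial’s code.

```r
# Stand-in objects so the snippet runs on its own; skipped if both exist.
if (!exists("regressor") || !exists("test_set")) {
  set.seed(123)
  demo = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
  demo$Salary = 25000 + 9500 * demo$YearsExperience + rnorm(30, sd = 5000)
  train_rows = sample(seq_len(30), 20)
  training_set = demo[train_rows, ]
  test_set = demo[-train_rows, ]
  regressor = lm(Salary ~ YearsExperience, data = training_set)
}

# Predict the salary for every observation in the test set.
y_pred = predict(regressor, newdata = test_set)
print(y_pred)
```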

Highlight the line of code and execute it. The vector of predictions has been created. Inside the console, type y_pred and press ‘Enter’.

The Predicted Results

Our simple linear regression has predicted the salary for each of the Test set observations. The predicted salaries are not exactly the same as the actual ones in the test set. However, since we saw a strong linear dependency between the Years of Experience and the Salary, most of the predictions are pretty close to the real salaries.
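One way to see how close the predictions are is to put them next to the actual salaries. This is a sketch rather than part of the original code; the comparison data frame and its column names are hypothetical, and the stand-in data at the top is made up and only used when the tutorial’s objects are missing.

```r
# Stand-in objects so the snippet runs on its own; skipped if both exist.
if (!exists("regressor") || !exists("test_set")) {
  set.seed(123)
  demo = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
  demo$Salary = 25000 + 9500 * demo$YearsExperience + rnorm(30, sd = 5000)
  train_rows = sample(seq_len(30), 20)
  training_set = demo[train_rows, ]
  test_set = demo[-train_rows, ]
  regressor = lm(Salary ~ YearsExperience, data = training_set)
}

y_pred = predict(regressor, newdata = test_set)

# Actual and predicted salary side by side, plus the difference.
comparison = data.frame(
  YearsExperience = test_set$YearsExperience,
  Actual = test_set$Salary,
  Predicted = y_pred,
  Error = test_set$Salary - y_pred
)
print(comparison)

# Average size of the miss, in the same units as Salary.
print(mean(abs(comparison$Error)))
```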

For presentation purposes, you will need to display these results on a graph. Let’s do that.

#### Visualizing the training set results in Graphs

The first thing we need to do is install and load the ggplot2 package. It is a really good way of plotting in R. To install it, run install.packages('ggplot2'). After the package has been installed, you can comment out that line of code, as we won’t need to install it again. We just load it using the line library(ggplot2).

We’re going to take a step-by-step approach to plotting our graph. First we’re going to plot all the observation points in the training set, then the regression line, then we add the title and finally the labels for the x and y axes.

The different components we’re going to plot are going to be separated by a Plus (+) sign.

Visualizing the Training Set Results
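The steps described above (points, regression line, title, axis labels) can be sketched as follows. The colours, title text, and the names training_fit and p_train are assumptions for illustration; the stand-in data is made up and only used when the tutorial’s objects do not exist.

```r
library(ggplot2)  # run install.packages("ggplot2") once beforehand

# Stand-in objects so the snippet runs on its own; skipped if both exist.
if (!exists("training_set") || !exists("regressor")) {
  set.seed(123)
  training_set = data.frame(YearsExperience = seq(1, 10.5, length.out = 20))
  training_set$Salary = 25000 + 9500 * training_set$YearsExperience +
    rnorm(20, sd = 5000)
  regressor = lm(Salary ~ YearsExperience, data = training_set)
}

# Salaries predicted by the model for the training observations;
# these points lie on the regression line.
training_fit = data.frame(
  YearsExperience = training_set$YearsExperience,
  Salary = predict(regressor, newdata = training_set)
)

# Each component is added to the plot with a plus sign.
p_train = ggplot() +
  geom_point(data = training_set,
             aes(x = YearsExperience, y = Salary), colour = "red") +
  geom_line(data = training_fit,
            aes(x = YearsExperience, y = Salary), colour = "blue") +
  ggtitle("Salary vs Experience (Training set)") +
  xlab("Years of experience") +
  ylab("Salary")
print(p_train)
```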

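We can now see our graph.

Let’s do the same for the test set results. Just copy the code above and, in the first layer (the observation points), change training_set to test_set. The regression line stays the same, since it comes from the model trained on the training set. The block of code should look like this: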

Visualizing the Test set results
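For the test set, only the observation points change; the line is still drawn from the model’s predictions on the training set. A sketch under the same assumptions as before (hypothetical colours, title and object names, made-up stand-in data when the tutorial’s objects are missing):

```r
library(ggplot2)  # run install.packages("ggplot2") once beforehand

# Stand-in objects so the snippet runs on its own; skipped if all exist.
if (!exists("training_set") || !exists("test_set") || !exists("regressor")) {
  set.seed(123)
  demo = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
  demo$Salary = 25000 + 9500 * demo$YearsExperience + rnorm(30, sd = 5000)
  train_rows = sample(seq_len(30), 20)
  training_set = demo[train_rows, ]
  test_set = demo[-train_rows, ]
  regressor = lm(Salary ~ YearsExperience, data = training_set)
}

# The regression line still comes from the training set predictions.
training_fit = data.frame(
  YearsExperience = training_set$YearsExperience,
  Salary = predict(regressor, newdata = training_set)
)

p_test = ggplot() +
  # Observation points now come from the test set.
  geom_point(data = test_set,
             aes(x = YearsExperience, y = Salary), colour = "red") +
  geom_line(data = training_fit,
            aes(x = YearsExperience, y = Salary), colour = "blue") +
  ggtitle("Salary vs Experience (Test set)") +
  xlab("Years of experience") +
  ylab("Salary")
print(p_test)
```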

Now we can see the Test set results on a graph.

We have seen the correlation: generally, the more years of experience, the higher the salary. We’ve also seen that in some cases employees receive less or more than the model suggests. We’ve given the company the best-fitting line and the model they should use to set salaries in the future. Mission accomplished.

Congratulations, now you know how to create a Simple Linear Regression Model in R. In the next tutorial, we are going to learn how to do Multiple Linear Regression in R. See you then.

