Hello, and welcome to this tutorial. We’ve finished the **Data Preprocessing** part and now it’s time to start making Machine Learning models. We’re going to start with the Simple Linear Regression model, and I will show you how to do it in R. Before we begin, we need to understand our data and the problem we are trying to solve.

### Dataset and Business Problem description

First, you will need to download the dataset using the download link above. If we take a look at our dataset, it’s basically 30 observations, taken from 30 random employees in a company. Each employee was asked two things: how many years of experience they have in the workforce overall (not just in that company), and the salary they receive. The company has hired you as a Data Scientist to find out if there is any sort of correlation between the Years of Experience and the Salary and, if there is, what type of correlation it is. The company knows that experience matters, and they don’t want to keep assigning salaries randomly.

Your job as the Data Scientist is to create a model, which will show the best-fitting line for the relationship between the Years of Experience and the Salary. You will show the company how they are currently setting salaries and also give them a more accurate model/set-of-rules on how to set salaries for new employees in the future.

We’ll begin by setting the working directory. In your Machine Learning folder, create another folder and give it a name like ‘*Simple Linear Regression*’. Move the dataset to that folder and set the folder as the working directory in **RStudio**. We’re going to use the data in the Salary_Data.csv file to build a Simple Linear Regression Model.


### Data Preprocessing

As always, the first step in making a Machine Learning model is Data Preprocessing. We’re going to use the **data preprocessing template** we created in the previous tutorial. Create a ‘simple_linear_regression.R’ file using **RStudio**. Copy the code from the data preprocessing template and paste it into the R file you created. We’re just going to change a few things to adapt the code to our current dataset. Change the name of the dataset at the top from Data.csv to Salary_Data.csv.

Highlight the line of code and press ‘*Ctrl + Enter*’. Let’s take a look at our dataset in RStudio.

Just a reminder of what this dataset is about. The dataset contains information about employees in a company: the number of years of experience each employee has, and the salary each employee receives. We want to see if there is any correlation between the salary and the number of years of experience. We’re trying to see if there is a linear dependency between the two variables.

Before making any machine learning model, you have to know which variables are independent and which are dependent. In our case, the independent variable is the number of years of experience, while the dependent variable is the salary. We’re trying to predict the dependent variable based on the information in the independent variable.

#### Split the Dataset into the Training set and the Test set.

We’re going to take twenty observations for the training set and ten observations for the test set. Inside the code snippet that we pasted from our data preprocessing template, we’re only going to change the split ratio to 2/3 and the name of the dependent variable.

But first, we have to install a library known as ‘**caTools**’. To do this, just type **install.packages('caTools')**, highlight the code and press ‘Ctrl + Enter’.

After the package has been installed, it’ll appear on the bottom right of RStudio’s interface. You activate it by checking the box next to it.

Once it is activated, we’re going to use the code below to split the dataset into the training set and the test set.

**library(caTools)**

**set.seed(123)**

**split = sample.split(dataset$Salary, SplitRatio = 2/3)**

**training_set = subset(dataset, split == TRUE)**

**test_set = subset(dataset, split == FALSE)**

Select the lines of code and run them. We have now separated the training set from the test set.

We’re going to use our training set to train our simple linear regression model. Our model will learn the correlation between the Years of Experience and the Salary using the training set. Then, we’re going to test the model’s power of prediction on the test set.

The next part would be Feature Scaling, but the linear regression function we are going to use here in R gives the same fitted line whether or not the data is scaled, so we won’t need to apply feature scaling manually. The data preprocessing phase is done. We are ready to start building the Linear Regression Model.
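To illustrate why manual feature scaling can be skipped here, the sketch below fits the same regression twice, once on the raw predictor and once on a standardised copy, and compares the fitted values. The data is a synthetic stand-in with the same column names as Salary_Data.csv (an assumption made so the snippet runs on its own), and `YearsScaled`, `fit_raw` and `fit_scaled` are hypothetical names.

```r
# Hypothetical stand-in for Salary_Data.csv: 30 synthetic rows, same column names.
# In the tutorial you would use dataset = read.csv('Salary_Data.csv') instead.
set.seed(123)
dataset = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
dataset$Salary = 26000 + 9500 * dataset$YearsExperience + rnorm(30, sd = 5000)

# Fit on the raw predictor
fit_raw = lm(formula = Salary ~ YearsExperience, data = dataset)

# Fit on a standardised copy of the predictor (mean 0, standard deviation 1)
dataset$YearsScaled = as.numeric(scale(dataset$YearsExperience))
fit_scaled = lm(formula = Salary ~ YearsScaled, data = dataset)

# The coefficients differ, but the fitted salaries are the same
coef(fit_raw)
coef(fit_scaled)
all.equal(fitted(fit_raw), fitted(fit_scaled))
```

The coefficients change because they are expressed in different units, while the fitted line itself stays the same.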

### Building a Simple Linear Regression Model In R

#### Fitting Simple Linear Regression to the Training Set

We’re going to use what is called the ‘**lm()**’ function. Let’s create a new variable that is going to be the simple linear regressor and call it regressor: **regressor = lm()**. The lm function is going to take two arguments.

- First, the **formula**, which is going to be the dependent variable expressed as a linear combination of the independent variable: **formula = Salary ~ YearsExperience**
- Second, the **data**. In this case, we want the training set: **data = training_set**

The whole code should look like this:
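A minimal sketch of that call, built from the two arguments described above. So that the snippet runs on its own, it first creates a synthetic stand-in for the dataset and a base-R 20/10 split; both are assumptions, and in the tutorial `training_set` comes from the caTools split of Salary_Data.csv.

```r
# Hypothetical stand-in for Salary_Data.csv: 30 synthetic rows, same column names.
set.seed(123)
dataset = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
dataset$Salary = 26000 + 9500 * dataset$YearsExperience + rnorm(30, sd = 5000)

# Base-R 20/10 split, used here only as a stand-in for the caTools split
train_rows = sample(nrow(dataset), 20)
training_set = dataset[train_rows, ]

# Fit Salary as a linear function of YearsExperience on the training set
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# Inspect the fitted intercept and slope, with their significance
summary(regressor)
```

Running `summary(regressor)` in the console prints the estimated intercept and the slope for YearsExperience, which is a quick way to check that the model was fitted.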

#### Predicting the Test set results

We’ve trained our model and now we want to see how well it would predict new observations. To do this, we are going to create our vector of predictions, **y_pred**. This is the vector that will contain the predicted values of the test set observations, and we are going to use the predict function, ‘**predict()**’. The predict function is going to take two arguments.

- First, the **regressor**
- Second, **newdata = test_set**

The whole line of code should be:
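A minimal sketch of that line, using the two arguments listed above. The synthetic stand-in data and the base-R split are assumptions added so the snippet runs on its own; in the tutorial, `regressor` and `test_set` already exist from the previous steps.

```r
# Hypothetical stand-in for Salary_Data.csv: 30 synthetic rows, same column names.
set.seed(123)
dataset = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
dataset$Salary = 26000 + 9500 * dataset$YearsExperience + rnorm(30, sd = 5000)

# Base-R 20/10 split, used here only as a stand-in for the caTools split
train_rows = sample(nrow(dataset), 20)
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

# Fit on the training set, as in the previous step
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

# Predict the salaries of the test set observations
y_pred = predict(regressor, newdata = test_set)

# Compare the predictions with the actual salaries side by side
data.frame(YearsExperience = test_set$YearsExperience,
           Salary = test_set$Salary,
           Predicted = y_pred)
```

The side-by-side data frame is optional; it just makes it easier to see how far each prediction is from the real salary.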

Select the line of code and run it. If you type **y_pred** and press ‘Enter’ inside the console, you will see the predicted salaries for the ten observations in the test set. For presentation purposes, you will need to display these results on a graph. Let’s do that.

#### Visualizing the training set results in Graphs

The first thing we need to do is install and import the **ggplot2** package. It is a really good way of plotting in R. To install it, write the code **install.packages('ggplot2')**. After the package has been installed, you can comment out that line of code, as we won’t need to install it again. We just import it using the line of code **library(ggplot2)**.

We’re going to take a step-by-step approach to plotting our graph. First we’re going to plot all the observation points in the training set, then we’re going to plot the regression line, then we add the title and finally the labels for the x and y axes.

The different components we’re going to plot are going to be separated by a Plus (+) sign.
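One possible version of the plot, following the steps just described: points, regression line, title, then axis labels. The colours, title and label text are assumptions, as are the synthetic stand-in data and the base-R split that make the snippet runnable on its own.

```r
library(ggplot2)

# Hypothetical stand-in for Salary_Data.csv: 30 synthetic rows, same column names.
set.seed(123)
dataset = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
dataset$Salary = 26000 + 9500 * dataset$YearsExperience + rnorm(30, sd = 5000)

# Base-R 20/10 split, used here only as a stand-in for the caTools split
train_rows = sample(nrow(dataset), 20)
training_set = dataset[train_rows, ]

regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

training_plot = ggplot() +
  # 1. Observation points of the training set
  geom_point(data = training_set,
             aes(x = YearsExperience, y = Salary),
             colour = 'red') +
  # 2. Regression line: predicted salaries for the training set
  geom_line(data = training_set,
            aes(x = YearsExperience,
                y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  # 3. Title, then 4. axis labels
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')

print(training_plot)
```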

We can now see our graph.

Let’s do the same for the test set results. Just copy the code above and edit the first component (the observation points) to change it from training_set to test_set. The block of code should look like this:
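A sketch of the test set version under the same assumptions as before (synthetic stand-in data, base-R split, assumed colours and labels). Only the points switch to `test_set`; the regression line is still drawn from the model fitted on the training set, since that is the line whose predictions we want to check.

```r
library(ggplot2)

# Hypothetical stand-in for Salary_Data.csv: 30 synthetic rows, same column names.
set.seed(123)
dataset = data.frame(YearsExperience = seq(1, 10.5, length.out = 30))
dataset$Salary = 26000 + 9500 * dataset$YearsExperience + rnorm(30, sd = 5000)

# Base-R 20/10 split, used here only as a stand-in for the caTools split
train_rows = sample(nrow(dataset), 20)
training_set = dataset[train_rows, ]
test_set = dataset[-train_rows, ]

regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

test_plot = ggplot() +
  # Observation points now come from the test set
  geom_point(data = test_set,
             aes(x = YearsExperience, y = Salary),
             colour = 'red') +
  # Same regression line as before, learned from the training set
  geom_line(data = training_set,
            aes(x = YearsExperience,
                y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')

print(test_plot)
```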

Now we can see the test set results on a graph.

We have seen the correlation. Generally, the more the years of experience, the more the salary. We’ve seen that in some cases employees received less/more than they should be getting. We’ve also given the company the best-fitting-line and the model they should use to set salaries in future. Mission Accomplished.

Congratulations, now you know how to create a Simple Linear Regression Model in **R**. In the next tutorial, we are going to learn how to do Multiple Linear Regression in R. See you then.

