Splitting the Dataset into the Training Set and the Test Set

We’re almost done with Data Preprocessing. We’re about to begin making Machine Learning Models. We just need three more steps to make our dataset perfectly prepared for making Machine Learning Models.

One very important thing we need to do is split the dataset into a Training Set and a Test Set. Why do we need to do this?

If you take a look at our dataset, we have ten observations. With any Machine Learning model, we should split our data into two parts: the Training set and the Test set.

If you take a look at the name itself, Machine Learning is all about teaching a machine how to do something. Your algorithm is going to learn from your data, to make predictions and perform other machine learning objectives.

Your machine learning model is going to learn to do something from your dataset by understanding the correlations that might be in that dataset. Now, imagine the machine learning model learns too much about the correlations in one dataset, a problem known as overfitting. How would it behave with another dataset with slightly different correlations?

We’re going to build our Machine Learning model on one dataset, but then we have to test it on a new set, which is going to be slightly different from the dataset on which we built the Machine Learning Model.

(Okay, No more ‘too lengthy’ sentences)

That’s why we need to make two different sets: a Training set on which we build and train the Machine Learning model, and a Test set on which we test its performance.

Preferably, the performance on the Test set shouldn’t be that different from the performance on the Training set. This would mean that the Machine Learning (ML) model understood the correlations well and can adapt to new sets and situations.

That’s the whole idea behind Splitting the Dataset into the Training Set and the Test Set. Now, we’re going to learn how to do it in R.

We have to install a package that is going to make a good split of the dataset. This package is called ‘caTools’. To install it, just type: install.packages('caTools'). After you’re done installing, just delete the line or leave it as a comment, as you won’t need to install it again. It will appear listed in the Packages section of R Studio.


We’ve installed the ‘caTools’ package but we still need to load it before we can use it. You can load it by checking the box next to it in the Packages section. Alternatively, if you feel like flexing your script-skills, just type: library(caTools)
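
If you’d rather not comment the install line in and out by hand, here is a minimal sketch of one way to handle both steps in a script. It installs ‘caTools’ only when it is missing, then loads it. The CRAN mirror URL is an assumption for this example; use whichever mirror you prefer.

```r
# Install caTools only if it is not already available.
# The 'repos' mirror below is an assumption; any CRAN mirror works.
if (!requireNamespace("caTools", quietly = TRUE)) {
  install.packages("caTools", repos = "https://cloud.r-project.org")
}

# Load the package so that sample.split() can be called directly.
library(caTools)
```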

Finally, we’re good to go.

In Python, we set the ‘Random State’ to zero to get the same results on every run. Well, here it’s going to be the same idea. We’re going to set a seed to get the same results. In R Studio type: set.seed(123)

123 is the seed that we planted so that the random split comes out the same every time we run the code.
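
To see what the seed actually does, here is a small sketch in base R. It uses sample() as a stand-in for any random operation, and the variable names are made up for this illustration.

```r
# Draw a random ordering of 1..10 twice, resetting the seed before each draw.
set.seed(123)
first_draw = sample(1:10)

set.seed(123)
second_draw = sample(1:10)

# Same seed, same sequence of random numbers, so the two draws match.
print(identical(first_draw, second_draw))  # TRUE
```

Without the second set.seed(123), the two draws would generally come out different.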

It’s not as simple as in Python, where we did it in one line. Here we first prepare a variable that we’re going to call ‘split’. It will hold the result of the function that splits your dataset into the Training set and the Test set. In R Studio type: split = sample.split()

We’re going to pass a few arguments. The first one is ‘Y’. Unlike in Python, we pass only the dependent variable vector Y. So: split = sample.split(dataset$Purchased)

The second parameter is the split ratio, ‘SplitRatio’. This is the fraction of the observations that you want to put into your Training set. We want this to be 80%, which is 0.8. So: split = sample.split(dataset$Purchased, SplitRatio = 0.8)

This will return TRUE or FALSE for each of your observations. It will be TRUE if an observation was assigned to the Training set, and FALSE if it was assigned to the Test set. Run the code.

Now go to the console, write ‘split’ and press ‘Enter’. You’ll see that you have ten values, some ‘TRUE’ and others ‘FALSE’.
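
Here is a minimal sketch of that step that you can run on its own. The data frame below is a made-up stand-in with ten rows: the ‘Age’ values and the five Yes / five No labels are assumptions, not the real dataset. Only the ‘Purchased’ column name comes from the text above.

```r
library(caTools)

# Hypothetical stand-in for the tutorial dataset: 10 rows, 5 "Yes" and 5 "No".
dataset = data.frame(
  Age = c(44, 27, 30, 38, 40, 35, 52, 48, 50, 37),
  Purchased = factor(c("No", "Yes", "No", "No", "Yes",
                       "Yes", "No", "Yes", "No", "Yes"))
)

set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)

print(split)         # one TRUE/FALSE per observation
print(table(split))  # how many rows go to each set
```

With these made-up labels, sample.split() takes 80% of the ‘Yes’ rows and 80% of the ‘No’ rows, so eight values come back TRUE and two come back FALSE. Which rows get picked depends on the seed.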

All we need to do now is create the Training set and the Test set separately.

To create the Training set type: training_set = subset(dataset, split == TRUE)

To create the Test set type: test_set = subset(dataset, split == FALSE)


Run the code. You’ll see that we have now created both sets of data.
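
Putting the steps together, a sketch of the whole split might look like this. Again, the data frame is a hypothetical stand-in with made-up ‘Salary’ values and labels, so your own sets will contain different rows.

```r
library(caTools)

# Hypothetical stand-in data: 10 rows with a balanced Purchased column.
dataset = data.frame(
  Salary = c(72000, 48000, 54000, 61000, 63000,
             58000, 52000, 79000, 83000, 67000),
  Purchased = factor(rep(c("No", "Yes"), times = 5))
)

set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)

# Rows flagged TRUE form the Training set, rows flagged FALSE the Test set.
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

print(nrow(training_set))  # 8
print(nrow(test_set))      # 2
```

Every row lands in exactly one of the two sets, so the row counts always add up to the size of the original dataset.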


Training Set

Test Set

Congratulations, you’re almost there. Next we perform Feature Scaling. I’ll tell you why it is important and show you how to do it.
