Hello, and welcome to this tutorial. We are almost done with the first part, Data Preprocessing. I’m looking forward to having our data well prepared so that we can start building our machine learning models in the second part. Just this one tutorial and we’re ready. In this tutorial we’re going to learn about Feature Scaling in R, and why it is so important in Data Science and Machine Learning.
What is Feature Scaling and why do we need to do it?
If you take a look at our dataset, you’ll see that the Age and Salary variables are not on the same scale: ‘Age’ goes from 27 to 50, while ‘Salary’ goes from 48,000 to 83,000. This causes issues in our machine learning models. Why’s that? It’s because many machine learning models are based on what is called the Euclidean Distance. If you remember it from back in high school, the Euclidean Distance between two points is the square root of the sum of the squared differences of their coordinates. Did you catch that? Maybe a diagram will help.
Picture ‘Age’ as the ‘x’ coordinate and the ‘Salary’ as the ‘y’ coordinate. In some machine learning models, the Euclidean Distance between observation points, is computed using the two x and y coordinates. Since the salary column has a wide range of values, it would dominate the ‘Age’ column. Let’s use an example from our dataset so that we can understand this more.
Take France in the 2nd row and Netherlands in the 8th row. The Euclidean Distance between those two observation points would be:
The difference between the x-coordinates is 48 − 27 = 21. The square of 21 is 441.
The difference between the y-coordinates is 79,000 − 48,000 = 31,000. The square of 31,000 is 961,000,000.
You can see very clearly how the squared difference of the ‘Salary’ dominates the squared difference of the ‘Age’. In the machine learning equations, it will be as if ‘Age’ doesn’t exist, because it is dominated by ‘Salary’.
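The domination above is easy to verify in R. A minimal sketch, using the France and Netherlands values from the example (the variable names here are just illustrative):

```r
# Hypothetical Age/Salary values taken from the example above
france      <- c(age = 48, salary = 79000)
netherlands <- c(age = 27, salary = 48000)

# Squared differences for each coordinate
squared_diffs <- (france - netherlands)^2
squared_diffs
# age: 441, salary: 961000000

# Euclidean distance: the salary term completely dwarfs the age term
distance <- sqrt(sum(squared_diffs))
```

Notice that `distance` is essentially just the salary difference: removing ‘Age’ from the calculation would barely change it.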
This is why we need to put the variables on the same scale. We’re going to transform those variables so they take values in the same range. For example, they might all take values from −1 to +1, which eliminates huge numbers dominating small ones.
Feature Scaling in R for Data Science and Machine Learning.
There are several ways of scaling your data. One of the most common is standardization. Standardization means that for each observation of each feature, you subtract the mean of the feature and divide by its standard deviation.
The other common way of scaling your data is normalization. Normalization means that you subtract the minimum of all the feature’s values from each observation ‘x’ and divide by the difference between the maximum and the minimum of the feature’s values.
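The two formulas above can be written out directly in R. A quick sketch on a toy vector (the values here are made up for illustration):

```r
# A toy numeric vector standing in for one feature (hypothetical values)
x <- c(27, 30, 38, 48, 50)

# Standardization: (x - mean(x)) / sd(x)
x_stand <- (x - mean(x)) / sd(x)

# Normalization (min-max): (x - min(x)) / (max(x) - min(x))
x_norm <- (x - min(x)) / (max(x) - min(x))

# R's built-in scale() reproduces the standardized values
all.equal(as.vector(scale(x)), x_stand)
```

After normalization every value lands between 0 and 1, while standardization centers the feature at 0 with a standard deviation of 1. The `scale()` function we use later in this tutorial does standardization by default.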
Don’t worry. The theory and mathematics might be a little confusing, but generally what we are doing is putting our variables on the same scale so that no variable is dominated by another.
Now that you understand why feature scaling is important, let’s apply it to our training set and test set. Unlike in Python, we need just two lines of code to feature scale in R. These are:
training_set = scale(training_set)
test_set = scale(test_set)
That is the feature scaling block of code that we will use in our template. However, if you run that code, you will get an error.
The Country and Purchased columns are not numeric; they are factors (see Encoding Categorical Data), and we cannot scale factors, only numeric values. This means we have to specify which columns we want to scale. Indexes in R start at 1, so counting from the Country column, ‘Age’ and ‘Salary’ are at indexes 2 and 3 respectively.
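You can check this yourself before scaling. A small sketch with a made-up stand-in for our dataset (the real one comes from the CSV file we loaded earlier):

```r
# A tiny stand-in for the dataset (hypothetical values for illustration)
training_set <- data.frame(
  Country   = factor(c("France", "Spain", "Germany")),
  Age       = c(44, 27, 30),
  Salary    = c(72000, 48000, 54000),
  Purchased = factor(c("No", "Yes", "No"))
)

# Only Age and Salary are numeric; Country and Purchased are factors
sapply(training_set, is.numeric)

# scale(training_set) would fail here with "'x' must be numeric",
# so we restrict scaling to columns 2 and 3
training_set[, 2:3] <- scale(training_set[, 2:3])
```

Running `sapply(training_set, is.numeric)` shows FALSE for the two factor columns, which is exactly why the unrestricted `scale(training_set)` call errors out.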
To specify those two columns in the block of code:
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
The code should look like this in RStudio:
Press ‘Ctrl + Enter’ to run the code now.
Take a look at our training set and test set now.
Perfect. Now you know how to do Feature Scaling in R and why it is important to Data Science and Machine Learning. Our data is ready to deliver good precision and accuracy, and our machine learning models will converge faster.
Congratulations, you now know how to Feature Scale in R. Even more congratulations on finishing all the required steps in Preprocessing our data. Next, I’m going to explain how we are going to use our Data Preprocessing template in our Machine Learning Models. You’ve passed the most boring part and now it’s time to have fun. It’s time to start making the models. See you at the next one.