Encoding Categorical Data in R

Hello and welcome to this tutorial. We’re halfway through our data preprocessing phase. We’ve learned how to install R and RStudio, import the dataset, and take care of missing data using the R language. Now I’m going to show you how to encode categorical data in R.

If you take a look at our dataset, you’ll see that we have two categorical variables: the County variable (Nairobi, Kisumu, and Mombasa) and the Purchased variable (Yes and No). They’re categorical because their values are categories rather than numbers. Since machine learning models are based on mathematical equations, keeping text in the categorical variables would cause us problems: we want numbers only in our equations. That is why we need to encode the text as numbers so that our machine learning models can work with it.

Encoding categorical data in R

We are going to use the factor function. It transforms your categorical variables into numeric categories that R still treats as factors. On top of that, the factor function lets you choose the labels (names) of those factors. Let’s take a look at our dataset, then get straight to encoding our categories.
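To make the idea concrete, here is a minimal sketch of the factor function on a plain character vector. The sample values are made up for illustration and are not the tutorial's dataset.

```r
# Hypothetical sample values; the tutorial's real dataset is not reproduced here.
county <- c("Nairobi", "Kisumu", "Mombasa", "Nairobi")

# levels: the categories as they appear in the data
# labels: the names those categories get in the resulting factor
county_factor <- factor(county,
                        levels = c("Nairobi", "Mombasa", "Kisumu"),
                        labels = c(1, 2, 3))

print(county_factor)             # values shown using the new labels
print(levels(county_factor))     # the labels are stored as text
print(is.factor(county_factor))  # the result is still a factor
```

Note that the result is a factor, not a plain numeric vector: the numbers you pass as labels become the names of the categories.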

Figure: Our dataset before we encode it

County column

We will transform the county column into a column of factors, and specify what those factors are.

We just take the County column (dataset$County), then we use the factor function, factor(), and in it we specify three things:

  • First, the column we want to transform: dataset$County
  • Second, the levels, which are the names of the categories in the County column: levels = c('Nairobi', 'Mombasa', 'Kisumu')
  • Third, the labels, which are the numbers we assign to Nairobi, Mombasa, and Kisumu respectively (you can use any numbers you want): labels = c(1, 2, 3)

That’s it. The whole function should look something like this:

dataset$County = factor(dataset$County,
                        levels = c('Nairobi', 'Mombasa', 'Kisumu'),
                        labels = c(1, 2, 3))

Figure: Encoding the County column

Now, if you take a look at our dataset, the names Nairobi, Mombasa, and Kisumu have been encoded as 1, 2, and 3 respectively.
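If you would rather check the result in the console than by eye, a sketch like the one below may help. The data frame is a hypothetical stand-in for the tutorial's dataset, and the extra value "Nakuru" is added only to show what happens to a category that is not listed in levels.

```r
# Hypothetical rows standing in for the tutorial's dataset.
dataset <- data.frame(
  County = c("Nairobi", "Kisumu", "Mombasa", "Kisumu", "Nakuru"),
  stringsAsFactors = FALSE
)

dataset$County <- factor(dataset$County,
                         levels = c("Nairobi", "Mombasa", "Kisumu"),
                         labels = c(1, 2, 3))

# Count how many rows received each label, including any missing values.
print(table(dataset$County, useNA = "ifany"))

# "Nakuru" was not listed in levels, so it becomes NA.
print(sum(is.na(dataset$County)))
```

A count of NA values greater than zero after encoding is a hint that a category was misspelled or left out of levels.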

Purchased Column

We are going to do the same for the Purchased column. Just copy everything from the function above. Replace dataset$County with dataset$Purchased, the levels with levels = c('No', 'Yes'), and the labels with labels = c(0, 1). The whole function should look like this:

dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

Figure: Encoding the Purchased column
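One detail worth knowing: the labels 0 and 1 are stored as the factor's category names, so converting the factor to numbers needs a little care. The sketch below uses made-up values to show the difference between the internal codes and the labels.

```r
# Hypothetical values for the Purchased column.
purchased <- factor(c("No", "Yes", "Yes", "No"),
                    levels = c("No", "Yes"),
                    labels = c(0, 1))

# as.numeric() on a factor returns the internal codes (1, 2), not the labels.
codes <- as.numeric(purchased)

# Going through as.character() first recovers the labels as numbers (0, 1).
values <- as.numeric(as.character(purchased))

print(codes)
print(values)
```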

That’s it. Select the code and press Ctrl + Enter to run it. If you look at the dataset again, the values No and Yes have been replaced with 0 and 1 respectively. Let’s take a look at our dataset now.

Figure: Both categories encoded
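Putting both steps together, the whole encoding can be sketched as one short script. The data frame here is hypothetical and contains only the two categorical columns. In your own project, dataset would come from the import step of the earlier tutorial.

```r
# Hypothetical dataset with only the two categorical columns.
dataset <- data.frame(
  County    = c("Nairobi", "Kisumu", "Mombasa", "Nairobi"),
  Purchased = c("No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Encode the County column.
dataset$County <- factor(dataset$County,
                         levels = c("Nairobi", "Mombasa", "Kisumu"),
                         labels = c(1, 2, 3))

# Encode the Purchased column.
dataset$Purchased <- factor(dataset$Purchased,
                            levels = c("No", "Yes"),
                            labels = c(0, 1))

# str() shows the type of each column, so you can confirm both are factors.
str(dataset)
```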

You’ve seen how you can encode categorical data in R. Next, we are going to separate the Training dataset from the Testing dataset. See you then.

If you found this post useful, please share. Thank you
