Taking Care of Missing Data in R

Hello, and welcome to this tutorial. In the previous tutorial, we learned how to import the dataset and the libraries. Now we are finally going to start preparing the data so that our machine learning models run correctly. In most cases, you will have to deal with missing data. In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. It happens really often, so you need to know how to take care of it.

In our dataset, we have two missing entries: one in the ‘age’ column for Kisumu and another in the ‘salary’ column for Mombasa.
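Before fixing anything, it can help to confirm where the gaps are. The following is a minimal sketch, assuming a data frame shaped like ours: the values below are invented for illustration and the `City` column name is an assumption, so in practice you would run the same calls on the dataset you imported.

```r
# Hypothetical stand-in for the imported dataset (all values are made up).
dataset <- data.frame(
  City   = c("Nairobi", "Kisumu", "Mombasa", "Nakuru"),
  Age    = c(44, NA, 30, 38),
  Salary = c(72000, 48000, NA, 61000)
)

# is.na() flags every missing cell; colSums() counts the flags per column.
missing_per_column <- colSums(is.na(dataset))
print(missing_per_column)

# complete.cases() is TRUE for rows with no missing values,
# so negating it selects the rows that need attention.
rows_with_missing <- dataset[!complete.cases(dataset), ]
print(rows_with_missing)
```

Counting per column first tells you which columns need a replacement value at all.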

How can we handle this problem? Basically, we have two main options.

We have two missing data values denoted by ‘N/A’

One, we could remove the rows with the missing data. However, this is a dangerous practice because the rows we remove could contain crucial information in their other columns. It would not make sense to discard a whole observation because of one missing value.
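For comparison, the first option would look roughly like the sketch below. The data frame is again invented illustration data (the values and the `City` column are assumptions), and `na.omit` is the base R function that drops every row containing a missing value.

```r
# Hypothetical stand-in for the imported dataset (all values are made up).
dataset <- data.frame(
  City   = c("Nairobi", "Kisumu", "Mombasa", "Nakuru"),
  Age    = c(44, NA, 30, 38),
  Salary = c(72000, 48000, NA, 61000)
)

# Drop every row that has at least one missing value.
cleaned <- na.omit(dataset)

# Compare the number of observations before and after.
rows_before <- nrow(dataset)
rows_after  <- nrow(cleaned)
cat("Rows before:", rows_before, "- rows after:", rows_after, "\n")
```

In this toy example, two missing cells cost two whole observations, including the values in those rows that were not missing.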

Two, and this is one of the most common ways to handle missing data, we could replace each missing value with the mean of its column. This is the method we are going to use in this tutorial.

Taking care of Missing Data in R using the Mean/Average

In R, we are going to compute the mean of each of the two columns separately and use it to fill in that column’s missing entry.

Age Column

We take the ‘age’ column of the dataset, dataset$Age. Then, we use the ifelse function, which takes three parameters. The first parameter is your condition. It checks, for each value in the column, whether that value is missing or not. The start of the statement is going to be:

dataset$Age = ifelse(is.na(dataset$Age),

The second is the value you want to be returned if the condition above is true. If the condition is true, it means we have a missing value and we have to replace it with the average/mean of the column. To compute the average, we use the mean function in R with na.rm = TRUE so that the missing value is ignored, wrapped in ave so that the result has the same length as the column:

 ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),

The third parameter is the value you want to be returned if the condition is not true. If the condition is not true, the value is not missing, so we simply return the existing value from the ‘age’ column:

dataset$Age
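To make the three parameters concrete, here is a small sketch that builds each one separately on a plain vector. The vector `ages` and its values are made up for illustration; only `is.na`, `ave`, `mean` and `ifelse` come from base R.

```r
# Made-up example values with one missing entry.
ages <- c(44, NA, 30, 38)

# Parameter 1: the condition, one TRUE/FALSE per element.
condition <- is.na(ages)   # FALSE TRUE FALSE FALSE

# Parameter 2: the value to use where the condition is TRUE.
# ave() with no grouping applies FUN to the whole vector and
# repeats the result so it has the same length as the input.
replacement <- ave(ages, FUN = function(x) mean(x, na.rm = TRUE))

# Parameter 3: the value to use where the condition is FALSE.
original <- ages

# ifelse() picks element by element from 'replacement' or 'original'.
result <- ifelse(condition, replacement, original)
print(result)
```

Because ifelse works element by element, only the missing position is filled with the mean and every other value stays as it was.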

That’s done. Our complete statement should look like this:

dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)

Function to find the Mean in the ‘Age’ column

Select all the lines of code we just added and press ‘Ctrl + Enter’. If you take a look at our dataset, you’ll see that the missing value has been replaced with the mean of the values in the ‘age’ column.
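If you would rather check the result in the console than by eye, something like the following sketch could be used. It runs the statement above on an invented data frame (the values and the `City` column are assumptions, not our actual file) and compares the filled value with the mean of the values that were present.

```r
# Hypothetical stand-in for the imported dataset (all values are made up).
dataset <- data.frame(
  City   = c("Nairobi", "Kisumu", "Mombasa", "Nakuru"),
  Age    = c(44, NA, 30, 38),
  Salary = c(72000, 48000, NA, 61000)
)

# Remember which row was missing and what the mean of the known ages is.
missing_row   <- which(is.na(dataset$Age))
expected_mean <- mean(dataset$Age, na.rm = TRUE)

# The statement from the tutorial.
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)

# The previously missing row should now hold the column mean.
print(dataset$Age[missing_row])
print(expected_mean)
```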

Salary Column

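We are going to do the same for the ‘salary’ column. Just replace dataset$Age with dataset$Salary everywhere it appears. Keeping the lines of code aligned makes them easier to read. The complete statement should look like this:

dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)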

Select the lines of code and press ‘Ctrl + Enter’. Our missing value has been replaced by the mean of the values in the ‘Salary’ column.
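Since both columns get the same treatment, the two statements could also be wrapped in a small helper. This is only a sketch: `fill_with_mean` is a hypothetical name, not a base R function, and the data frame holds invented values. The helper passes a single mean to ifelse, which recycles it across the vector, so ave is not needed here.

```r
# Hypothetical stand-in for the imported dataset (all values are made up).
dataset <- data.frame(
  City   = c("Nairobi", "Kisumu", "Mombasa", "Nakuru"),
  Age    = c(44, NA, 30, 38),
  Salary = c(72000, 48000, NA, 61000)
)

# Hypothetical helper: replace missing values in a numeric vector
# with the mean of the values that are present.
fill_with_mean <- function(x) {
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

# Apply the helper to each numeric column that has gaps.
for (column in c("Age", "Salary")) {
  dataset[[column]] <- fill_with_mean(dataset[[column]])
}

# Count the missing values that remain in the whole data frame.
remaining_missing <- sum(is.na(dataset))
print(remaining_missing)
```

A remaining count of zero confirms that both columns have been filled.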


Congratulations, now you know how to take care of missing data in R. I look forward to seeing you in the next tutorial, where we will talk about dealing with categorical data.
