Machine Learning Basics: Handling Missing Values in a Dataset

Ansuman Bhujabal
4 min read · Sep 11, 2023


How to handle missing values in datasets from an ML point of view

Panda Thumbnail By Ansuman

Datasets are an essential element for training and testing any machine learning model. Before we feed data to a model, it has to go through pre-processing so that the model can reach its best possible accuracy and make more reliable predictions.

What Are Missing Values:

Generally we import datasets from a data-collecting organization, a data-hosting site, or a database.

Before training, we have to check whether the dataset is complete or contains any missing values.

Missing Value- When some features or labels in a dataset do not have a value, they can cause miscalculations in our ML model.

Example- Suppose there is a dataset about rocks and mines. It has some features, and the label at the end says whether the object is a Rock or a Mine. Suppose the dataset contains 50 Rock samples and 50 Mine samples.

Now suppose some feature fields are empty in the Rock samples, while their labels are still Rock. If we train on this data, the model may treat "empty" as a special value associated with Rock (and many algorithms will not accept missing values at all). This leads to wrong predictions and ultimately a failing model.

How to handle Missing Values:

step-1: Find Missing Values

# dataset = your dataset loaded as a pandas DataFrame
dataset.isnull().sum()

This line counts the missing values in each column of the dataset.
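As a minimal sketch of this step, the snippet below builds a small, made-up DataFrame (the column names and values are hypothetical, not from any real dataset) and shows three views of the missing data: the count per column, the total count, and the rows that are affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan and None mark the missing entries.
dataset = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "color": ["red", "blue", None, "red", "red"],
    "income": [30000, 42000, 38000, 55000, 61000],
})

# Number of missing values in each column.
missing_per_column = dataset.isnull().sum()

# Total number of missing values in the whole dataset.
total_missing = int(missing_per_column.sum())

# Rows that contain at least one missing value.
rows_with_missing = dataset[dataset.isnull().any(axis=1)]

print(missing_per_column)
print("total missing:", total_missing)
print(rows_with_missing)
```

Looking at the affected rows, and not only the counts, helps when deciding later whether dropping them would throw away too much data.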

step-2: Handle

Handling Missing Values

There are two major ways to handle missing values once we have confirmed that they exist.

  1. Removing (Dropping):

Simply remove the rows in which any feature is null or missing.

This works better for larger datasets, where dropping a few rows has little effect on the dataset overall.

Not recommended for smaller datasets, where every row counts.

code#

# drops every row that has at least one missing value
dataset = dataset.dropna(how='any')
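To see why dropping can hurt a small dataset, here is a rough sketch on made-up data (the columns and values are hypothetical) that compares the row count before and after `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values spread over different rows.
dataset = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "income": [30000, 42000, np.nan, 55000, 61000],
})

rows_before = len(dataset)

# how='any' drops a row if at least one of its values is missing.
cleaned = dataset.dropna(how="any")
rows_after = len(cleaned)

# Share of the data that was lost by dropping.
fraction_dropped = (rows_before - rows_after) / rows_before

# how='all' is stricter: it drops a row only if every value in it is missing.
rows_after_all = len(dataset.dropna(how="all"))

print(rows_before, rows_after, fraction_dropped, rows_after_all)
```

In this toy case three of the five rows disappear, which is the kind of loss that makes imputation the safer option for small datasets.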

  2. Imputation:

Replacing the missing values with a statistical value (mathematically and logically relevant) that won’t disturb or change the pattern of the dataset.

Generally we consider three statistical values for imputation:

Mean

Median

Mode

Now let’s recall all these concepts from high school with the help of these images.

Mean By Ansuman
Median By Ansuman
Mode By Ansuman

To decide which value to use as the replacement, first look at the distribution of the feature that has missing values.

code#

import seaborn as sns
sns.displot(dataset['feature_name'])

  1. Mean:
  • When to Use: The mean is used to impute missing values when dealing with continuous or numeric data, such as age or temperature, whose distribution is roughly symmetric and free of extreme outliers.
  • How it Works: Calculate the mean (average) of the available data points in the feature and use this value to fill in the missing entries.
  • Example: If you have a dataset of ages and some entries are missing, you can calculate the mean age of the individuals with known ages and use that value to impute the missing ages.
  2. Mode:
  • When to Use: The mode is used for imputation when dealing with categorical data or data with a limited set of distinct categories, like car types, colors, or city names.
  • How it Works: Find the mode, which is the most frequently occurring value in the available data, and use it to replace the missing values.
  • Example: If you have a dataset of car colors and some entries are missing, you can find the most common color among the known entries and use that color to impute the missing values.
  3. Median:
  • When to Use: The median is a good choice for imputation when dealing with skewed data or outliers in continuous or numeric data. It’s less sensitive to extreme values than the mean.
  • How it Works: Arrange the available data in ascending order and find the middle value (or the average of the two middle values when the count is even). Use this value as the imputed value for missing entries.
  • Example: If you have a dataset of incomes, and there are some extreme outliers (very high or low incomes), the median can provide a more robust imputation method than the mean.
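The three strategies above can be sketched on a small, made-up DataFrame. The column names and values here are hypothetical and chosen only so that each column suits one strategy: `age` is roughly symmetric, `income` has one extreme outlier, and `color` is categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing entry per column.
dataset = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0, 30.0],
    "income": [30000.0, 35000.0, np.nan, 40000.0, 1000000.0],
    "color": ["red", "blue", "red", None, "red"],
})

imputed = dataset.copy()

# Mean for a numeric column without strong outliers.
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

# Median for a numeric column with an extreme outlier.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Mode for a categorical column; mode() returns a Series, so take its first entry.
imputed["color"] = imputed["color"].fillna(imputed["color"].mode()[0])

print(imputed)
```

Note how different the two numeric statistics are for `income`: the median of the known values is 37,500, while the mean is 276,250 because of the single very large income.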

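If the plot shows a long tail on one side, with outliers (values far from the general pattern) pulling the distribution in that direction, the data is said to be skewed.

In that case the mean is no longer a good representative of a typical value, because it is pulled strongly toward the outliers.

So the median (for numeric features) and the mode (for categorical features) are the choices we are left with.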

code#

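# assign the result back; inplace=True on a selected column may not update the DataFrame in recent pandas
dataset['feature_name'] = dataset['feature_name'].fillna(dataset['feature_name'].median())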
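Instead of judging skew only by eye, one option is to measure it with pandas' `Series.skew()` and pick the fill value from that. The sketch below is only an illustration: the helper name `choose_fill_value` and the threshold of 1.0 are assumptions made for this example, not a fixed rule, and the data is made up.

```python
import numpy as np
import pandas as pd

def choose_fill_value(series, skew_threshold=1.0):
    """Return (strategy, value) for a numeric Series.

    Hypothetical helper: the threshold of 1.0 is an illustrative assumption.
    """
    skewness = series.skew()  # missing values are skipped by default
    if abs(skewness) > skew_threshold:
        return "median", series.median()
    return "mean", series.mean()

# Made-up examples: one symmetric feature, one with a large outlier.
symmetric = pd.Series([10.0, 20.0, np.nan, 30.0, 40.0, 50.0])
skewed = pd.Series([30000.0, 35000.0, np.nan, 40000.0, 45000.0, 1000000.0])

sym_strategy, sym_value = choose_fill_value(symmetric)
skw_strategy, skw_value = choose_fill_value(skewed)

# Fill each Series with the value chosen for it.
symmetric_filled = symmetric.fillna(sym_value)
skewed_filled = skewed.fillna(skw_value)

print(sym_strategy, sym_value)
print(skw_strategy, skw_value)
```

A check like this is best used alongside the plot, not as a replacement for it, since a single number can hide things such as two separate peaks in the distribution.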

Conclusion:

🤖🤣 In conclusion, handling missing values in your datasets is like trying to find your car keys in a pile of laundry — you know they’re in there somewhere, but it can be a real challenge! But fear not, brave data explorer, armed with the power of mean, mode, and median, you can turn those missing values into found treasures for your machine learning models.

If you enjoyed this data-driven comedy show, don’t forget to give it a star on my GitHub repository and connect with me on LinkedIn to stay updated with more hilarious data adventures. Remember, in the world of machine learning, a good sense of humor can be just as important as a well-imputed dataset! 😄

Follow me on GitHub: GitHub:

Connect with me on LinkedIn: LinkedIn:
