CloudyML

DATA SCIENCE WORLD

Outlier Detection and Treatment

Data manipulation is not an easy task, especially when you have outliers to deal with. It is important to clean the data sample in the modeling process to ensure that the observations best represent the problem. The presence of outliers in a dataset can result in a poor fit and lower predictive modeling performance.

What are Outliers?

Outliers are nothing but data points that differ significantly from other observations. They are the points that lie outside the overall distribution of the dataset. Outliers, if not treated, can cause serious problems in statistical analyses.

Types of Outliers

Outliers are generally classified into two types: Univariate and Multivariate.

Univariate Outliers – These outliers are found in the distribution of values in a single feature space.

Multivariate Outliers – These outliers are found in the distribution of values in an n-dimensional space (n features).

Why do outliers exist?

  • Variability in data (natural variation that produces a few exceptional readings)
  • Data entry errors (human errors)
  • Experimental errors (execution errors)
  • Measurement errors (instrument errors)
  • Sampling errors (extracting or mixing data from wrong or different sources)
  • Intentional (dummy outliers made to test detection models)

What impact do outliers have on a dataset?

  • Cause problems during statistical analysis.
  • Significantly shift the mean and standard deviation of the data.
  • A non-random distribution of outliers can reduce normality.
  • Can violate the basic assumptions of Regression, ANOVA and other statistical models.
  • Can bias estimates of substantive interest.
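
To see the effect on the mean and standard deviation concretely, here is a minimal sketch using Python's built-in statistics module. The readings are made-up values chosen only for illustration.

```python
import statistics

# Hypothetical readings (illustrative values, not from a real dataset)
clean = [10, 12, 11, 13, 12]
with_outlier = clean + [100]  # one extreme reading added

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean={statistics.mean(data):.2f}, "
        f"median={statistics.median(data):.2f}, "
        f"std={statistics.stdev(data):.2f}"
    )
```

The mean and standard deviation jump as soon as the extreme value is added, while the median stays at 12. This is why the median is often preferred as a summary when outliers are suspected.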

Outlier Sensitive Machine Learning Models

Not all machine learning models have issues with outliers. Only some of them are sensitive to outliers, and that sensitivity is intuitive once you look at the statistics behind each model. The outlier-sensitive machine learning models are listed below:

  • Linear Regression
  • Logistic Regression
  • KNN (for small value of K, KNN is sensitive to outliers)
  • K Means
  • Hierarchical Clustering
  • PCA
  • Neural Networks
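
As a rough illustration of this sensitivity, the sketch below fits a straight line by least squares with and without a single extreme target value. The data is synthetic and np.polyfit is used here only as a convenient stand-in for linear regression.

```python
import numpy as np

# Synthetic data lying exactly on the line y = 2x + 1 (illustrative values)
x = np.arange(10, dtype=float)
y = 2 * x + 1

# Same data, but the last target is replaced by an extreme value
y_outlier = y.copy()
y_outlier[-1] = 100.0

# Ordinary least-squares fit of a straight line (degree-1 polynomial)
slope_clean, intercept_clean = np.polyfit(x, y, 1)
slope_out, intercept_out = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

One corrupted point out of ten more than triples the fitted slope here, because least squares penalizes squared errors and so gives extreme points a large pull on the fit.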

Outlier Detection Methods

Outliers can be detected by implementing mathematical formulas, using statistical approaches or by using visualization tools.

Some of the most popular methods for outlier detection are mentioned below:

  • Scatter plots – Scatter plots often show a pattern. There is no special rule in scatter plots that tells us whether a point is an outlier or not, but we call a data point an outlier if it doesn’t fit the pattern.
  • Histograms – Outliers are often easy to spot in histograms. The data points that lie far away from the majority of data points are termed outliers.
  • Box plots – Box plots help us visualize outliers using the interquartile range (IQR). They are a standardized way of displaying the distribution of data based on “minimum”, “first quartile (Q1)”, “median (Q2)”, “third quartile (Q3)” and “maximum”. Here “minimum” and “maximum” are the whisker boundaries defined below, not the smallest and largest values in the data. All the data points below “minimum” and above “maximum” are considered outliers.

Where,

IQR = Q3 – Q1
minimum = Q1 – 1.5*IQR
maximum = Q3 + 1.5*IQR

  • Z-score – The z-score indicates how many standard deviations a data point lies from the mean: z = (x – mean) / standard deviation. All the observations whose absolute z-score is greater than 3, i.e. |z| > 3, are considered outliers.
  • IQR range – The interquartile range is just a mathematical way to find outliers. Box plots are based on these calculations. All the data points from the first quartile to the third quartile are said to lie in the interquartile range. We subtract 1.5*IQR from Q1 to find the minimum value below which all data points are considered outliers, and we add 1.5*IQR to Q3 to find the maximum value above which all data points are considered outliers.
  • DBSCAN (Density Based Spatial Clustering of Applications with Noise) – This method is very intuitive and effective when the distribution of values cannot be assumed in the feature space. It works well with multidimensional feature space (3D or more). Visualizing the results is pretty easy with this method.
  • Isolation Forests – The concept of isolation forests is based on binary decision trees. Its basic principle is that outliers are few and far from the rest of the observations. This method is fairly robust, with only a few parameters and easy optimization. It is very effective when distributions cannot be assumed.
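
The last two methods can be tried with scikit-learn. The sketch below is only an illustration on synthetic 2-D data: the grid of points, the planted outlier, and the eps, min_samples and random_state values are all assumptions picked for this toy example, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data: a tight 5x5 grid of points plus one far-away point
grid = np.array([[i * 0.1, j * 0.1] for i in range(5) for j in range(5)])
X = np.vstack([grid, [[5.0, 5.0]]])  # the last row is the planted outlier

# DBSCAN labels points that belong to no dense region as -1 (noise).
# eps and min_samples are hand-picked for this toy data.
db_labels = DBSCAN(eps=0.15, min_samples=4).fit_predict(X)
dbscan_outliers = np.where(db_labels == -1)[0]

# Isolation Forest: lower score_samples values mean a point is easier
# to isolate, i.e. more anomalous. random_state is fixed for reproducibility.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)
most_anomalous = int(np.argmin(scores))

print("DBSCAN noise indices:", dbscan_outliers.tolist())
print("Isolation Forest most anomalous index:", most_anomalous)
```

Both methods single out the planted point (index 25) without assuming any particular distribution for the data. On real data, eps and min_samples for DBSCAN usually need tuning to the scale of the features.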

Some general cases you should know…

Case 1: When the data is Normal/Gaussian Distributed

This is the case where we use the standard deviation to find the outliers. When the data is normally distributed, there is no need to reach for the IQR. In this method, we calculate the “upper” and “lower” boundaries, and all the data points outside them are considered outliers.

upper = mean + 3 * standard deviation

lower = mean – 3 * standard deviation

For example, let’s consider a feature ‘X’ and now calculate its boundaries in Python.

upper = df['X'].mean() + 3*df['X'].std()

lower = df['X'].mean() - 3*df['X'].std()

About 99.7% of the data is expected to fall inside these boundaries when the distribution is normal, so only a small fraction is flagged. We then remove the flagged data points (outliers) or treat them using an appropriate method.
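
Putting the pieces together, here is a small runnable sketch of the three-standard-deviation rule with pandas. The DataFrame and its values are made up for illustration; in practice df would be your own data.

```python
import pandas as pd

# Toy feature 'X': twenty ordinary values plus one extreme value.
# (Illustrative numbers; real data would come from your own DataFrame.)
df = pd.DataFrame({"X": list(range(40, 60)) + [500]})

mean, std = df["X"].mean(), df["X"].std()
upper = mean + 3 * std
lower = mean - 3 * std

# Rows outside [lower, upper] are flagged as outliers
is_outlier = (df["X"] < lower) | (df["X"] > upper)
outliers = df.loc[is_outlier, "X"].tolist()
df_clean = df[~is_outlier]

print("outliers:", outliers)
print("rows kept:", len(df_clean))
```

Keep in mind that the extreme value itself inflates the mean and standard deviation, so this rule can miss outliers in very small samples.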

Case 2: When the data is skewed (left/right)

In this case, the distribution is asymmetric, so the mean and standard deviation are less reliable. We use the IQR to find the “upper” and “lower” boundaries, outside which all the data points are considered outliers.

upper = Third quartile(Q3) + 1.5 * IQR

lower = First quartile(Q1) – 1.5 * IQR

But in practice, the above formula often flags many points as outliers that generally are not. So we usually eliminate or treat just the extreme outliers by widening the boundaries in the above formula.

upper = Third quartile(Q3) + 3 * IQR

lower = First quartile(Q1) – 3 * IQR

For example, let’s consider a feature ‘Y’ and now calculate its boundaries in Python.

IQR = df['Y'].quantile(0.75) - df['Y'].quantile(0.25)

upper = df['Y'].quantile(0.75) + 1.5*IQR

lower = df['Y'].quantile(0.25) - 1.5*IQR

OR

upper = df['Y'].quantile(0.75) + 3*IQR

lower = df['Y'].quantile(0.25) - 3*IQR

The outliers in a skewed dataset are thus easily found using the above method and can then be treated appropriately.
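
Here is a minimal runnable sketch comparing the 1.5*IQR and 3*IQR boundaries with pandas. The values in ‘Y’ and the helper iqr_outliers are hypothetical, written only for this illustration.

```python
import pandas as pd

# Toy right-skewed feature 'Y' (illustrative values)
df = pd.DataFrame({"Y": [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 40]})


def iqr_outliers(series, k):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)].tolist()


mild = iqr_outliers(df["Y"], k=1.5)     # standard box-plot rule
extreme = iqr_outliers(df["Y"], k=3.0)  # wider boundaries, extreme outliers only

print("1.5*IQR rule:", mild)
print("3*IQR rule:  ", extreme)
```

With these values, the 1.5*IQR rule flags both 12 and 40, while the 3*IQR rule keeps 12 and flags only 40.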

Outlier Treatment methods

There are several methods to treat outliers. A few of them are listed below:

  1. Deletion – We delete outlier values if they are due to a data entry error or a data preprocessing error, or if the outlier observations are very few in number. We can also trim at both ends to remove outliers from the dataset.
  2. Transformation – Transforming variables is also one of the ways to reduce the effect of outliers. Taking the natural log of a value reduces the variation caused by extreme values.
  3. Mean/Median/Mode Imputation – Imputing outliers is similar to handling missing values. We can use mean, median or mode imputation to treat the outliers. But before any imputation, we must check whether the outlier is artificial or natural. If it is artificial, we can impute it easily, but if it is natural, we must not impute the value. We can also use statistical models to predict the values of outliers and then impute them with the predicted values.
  4. Quantile based flooring & capping – In this technique, the outliers are capped at an upper percentile and floored at a lower percentile, commonly the 90th and the 10th. The data points smaller than the 10th percentile are replaced with the 10th percentile value, and the data points greater than the 90th percentile are replaced with the 90th percentile value.
  5. Separate treatment – When there is a significant number of outliers, we can model them separately. The approach is to treat outliers and non-outliers as two different groups, build an individual model for each group, and then combine their outputs.
  6. Insensitive Machine Learning Models – We can also use outlier-insensitive machine learning models to deal with outliers. Such models are listed below:
  • Naive Bayes
  • SVM
  • Decision Tree
  • RandomForest
  • XGBoost
  • Gradient Boosting
  • KNN (for large value of K, KNN is robust to outliers)
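
A few of these treatments can be sketched in a handful of pandas lines. The series below is made up for illustration, and the 1.5*IQR rule is just one possible way (an assumption here) to decide which points count as outliers before deleting or imputing them.

```python
import numpy as np
import pandas as pd

# Toy feature with one extreme value (illustrative numbers)
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

# Flag outliers with the 1.5*IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

deleted = s[~is_outlier]                  # deletion: drop flagged rows
logged = np.log1p(s)                      # transformation: log(1 + x)
imputed = s.mask(is_outlier, s.median())  # imputation: replace with the median
capped = s.clip(lower=s.quantile(0.10),   # flooring and capping at the
                upper=s.quantile(0.90))   # 10th and 90th percentiles

print("deleted:", deleted.tolist())
print("imputed:", imputed.tolist())
print("capped: ", capped.tolist())
```

Which treatment fits depends on why the outlier exists: deletion or imputation suits artificial errors, while transformation or capping keeps natural extreme values in the data with reduced influence.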

Ahh! That was quite informative, right? I hope your regular headaches of handling outliers are now reduced, and that you can enjoy your work by exploring deeper in the domain. Outliers are hard to handle when you just memorize the process, and a cakewalk once you understand the concept and build intuition.

For better understanding and to see how things actually work, check out the code.

I hope the blog was helpful to you and served its purpose as an at-a-glance reference for better understanding. Stay tuned for more such blogs.

Happy Reading😊
