Missed your data?
Let's retrieve it
We often see a lot of missing values in real-world data. The cause may be insufficient information, failure to record the data, or data corruption. Handling missing values is therefore one of the first and most important steps of data pre-processing; skipping it may lead to inaccurate training of the model and false predictions. Missingness may occur in any column, be it an independent feature (X) or the dependent feature (Y).
Missingness is broadly grouped into 3 categories:
1) Missing completely at Random (MCAR) – Missingness has no relation to any dataset variable, X or Y.
2) Missing at Random (MAR) – Missingness is related to the observed variables X but not to the missing values themselves.
3) Missing not at Random (MNAR) – Missingness is directly related to the unobserved values themselves (for example, Y is missing more often when Y is high).
We’ll see what missing data looks like and the ways to handle it.
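As a minimal sketch of what missing data looks like in practice, the snippet below builds a small made-up pandas DataFrame and counts the missing entries per column. The column names and values are purely illustrative and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN / None mark the missing entries.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "city":   ["Pune", "Delhi", None, "Mumbai", "Delhi"],
    "salary": [50000, 62000, np.nan, 91000, 58000],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Share of missing values in each column, as a percentage.
missing_percent = df.isna().mean() * 100

print(missing_per_column)
print(missing_percent)
```

Looking at the share of missing values per column first is a cheap way to decide which of the methods described here is worth trying.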
There are generally two types of methods for handling missing values, listed below:
A) Discard Data
1. List-wise (Complete Case Analysis) deletion
- Removes all data for a case that has one or more missing values.
- Often disadvantageous, as the MCAR assumption it relies on rarely holds in practice.
- Produces biased parameters and estimates when that assumption fails.
2. Pairwise (Available Case Analysis) deletion
- Minimizes the loss that occurs in list-wise deletion.
- Increases power in analyses by using all available data on an analysis-by-analysis basis.
- Disadvantageous at times, as it can produce standard errors that are underestimated or overestimated.
3. Dropping Variables
- Favorable when more than 60% of the data is missing from the row/column.
- Should only be preferred when the variable is insignificant.
- Imputation must be preferred over dropping variables in all other cases.
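The three ways of discarding data described above can be sketched with pandas as follows. The DataFrame is made up for illustration, and the 60% cut-off is simply the rule of thumb quoted above, not a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "x2": [2.0, np.nan, 6.0, 8.0, 10.0],
    "x3": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
})

# 1. List-wise deletion: keep only the rows with no missing value at all.
listwise = df.dropna()

# 2. Pairwise deletion: each statistic uses every row that is complete
#    for that particular pair of columns. DataFrame.corr() works this way.
pairwise_corr = df[["x1", "x2"]].corr()

# 3. Dropping variables: remove columns with more than 60% missing values.
missing_share = df.isna().mean()
reduced = df.loc[:, missing_share <= 0.60]

print(len(listwise), "complete row(s) out of", len(df))
print(pairwise_corr)
print(list(reduced.columns))
```

Note how list-wise deletion keeps a single row here, while the pairwise correlation of x1 and x2 can still use three rows.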
B) Retain All Data
1. Mean, Median and Mode
- Most common method of data imputation, where all the missing values are simply replaced by the mean, median or mode of the column.
- Best avoided when many values are missing, as it shrinks the variance of the column and will warp your results.
- Beware of using it when your data is MNAR.
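A minimal pandas sketch of mean, median and mode imputation on a made-up column. It also compares the variance before and after, to illustrate the warping mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values and one large outlier.
s = pd.Series([10.0, 12.0, np.nan, 12.0, 100.0, np.nan])

mean_filled = s.fillna(s.mean())        # mean of the observed values
median_filled = s.fillna(s.median())    # median is less sensitive to the outlier
mode_filled = s.fillna(s.mode()[0])     # most frequent observed value

# Filling with a constant shrinks the spread of the column.
variance_before = s.var()               # NaNs are skipped by default
variance_after = mean_filled.var()

print(mean_filled.tolist())
print(median_filled.tolist())
print(variance_before, variance_after)
```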
2. Last Observation Carried Forward (LOCF)
- Imputes data column-wise: each missing value is filled with the last observed value above it in the same column.
- LOCF is used to maintain the sample size; note that it can introduce bias of its own when the variable changes over time.
- A familiar example is carrying the previous semester's exam results forward to the current semester in order to avoid holding examinations during the COVID pandemic.
3. Next Observation Carried Backward (NOCB)
- This is a similar approach to LOCF but works in the opposite direction.
- It takes the first observation after the missing value in the same column and carries it backward.
- NOCB may be useful in handling Real World Data (RWD), Electronic Health Records (EHR), etc., where the outcome data collection is usually not structured.
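LOCF and NOCB map directly onto the pandas forward-fill and backward-fill methods. A short sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical series ordered in time, with gaps.
s = pd.Series([np.nan, 3.0, np.nan, np.nan, 7.0, np.nan])

# LOCF: carry the last observed value forward (down the column).
locf = s.ffill()

# NOCB: carry the next observed value backward (up the column).
nocb = s.bfill()

print(locf.tolist())
print(nocb.tolist())
```

A leading gap stays missing under LOCF and a trailing gap stays missing under NOCB, because there is nothing to carry from.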
4. Linear Interpolation
- This is the default method of the pandas interpolate() function and assumes a linear relationship between the data points.
- Utilizes non-missing values from adjacent data points to compute a value for the missing data point.
- Estimates each unknown value so that it follows the same progression as the values around it.
- Interpolates values by connecting points with a straight line, treating them as equally spaced rather than using the index.
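A small sketch of linear interpolation with pandas on made-up values. The second series uses an unevenly spaced index to show the difference between the default "linear" method, which ignores the index, and the "index" method, which uses it.

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced series with gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 8.0])
filled = s.interpolate(method="linear")

# Hypothetical series whose index is NOT evenly spaced.
uneven = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])
by_position = uneven.interpolate(method="linear")  # treats points as equally spaced
by_index = uneven.interpolate(method="index")      # uses the index values as x

print(filled.tolist())
print(by_position.tolist())
print(by_index.tolist())
```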
5. Common Point Imputation
- Takes the middle point or the most commonly chosen value for handling missing values.
- Brings less distortion compared to zero-fill (missing data replaced with zeros).
- The elbow method of the K-means algorithm can be used to check the distortion.
6. Adding a variable to capture NaN
- Works with all types of categorical columns, with no prior assumptions.
- Replaces NaN categories with the most frequent value in the actual column.
- Adds a new feature that flags which observations were missing, giving the model a way to weigh them.
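One way to sketch this with pandas is shown below. The column name "colour" and the indicator name "colour_was_missing" are hypothetical, chosen only for the illustration.

```python
import pandas as pd

# Hypothetical categorical column with missing entries.
df = pd.DataFrame({"colour": ["red", None, "blue", "red", None]})

# New feature: 1 where the original value was missing, 0 otherwise.
# It must be created BEFORE the column is filled.
df["colour_was_missing"] = df["colour"].isna().astype(int)

# Replace the NaN categories with the most frequent observed value.
most_frequent = df["colour"].mode()[0]
df["colour"] = df["colour"].fillna(most_frequent)

print(df)
```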
7. Frequent category Imputation
- Replaces all occurrences of missing values within a variable with the mode (the most frequent value).
- Mostly preferred with categorical variables, but suitable for both numerical and categorical variables.
- Used when the data is MCAR and no more than 5% of the variable's values are missing.
8. Arbitrary Value Imputation
- Replaces all occurrences of missing values within a variable with an arbitrary value.
- The arbitrary value should be different from the mean or median and shouldn’t lie within the normal range of the variable.
- Suitable for both numerical and categorical variables.
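A short sketch of arbitrary value imputation on a made-up age column. The sentinel -1 is an assumption chosen for this example (ages are never negative); any value outside the normal range of the variable would do.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column.
age = pd.Series([23.0, 35.0, np.nan, 41.0, np.nan])

ARBITRARY_VALUE = -1.0  # hypothetical sentinel, outside the range of real ages

# Sanity check: the sentinel must not fall inside the observed range.
if age.min() <= ARBITRARY_VALUE <= age.max():
    raise ValueError("arbitrary value lies within the normal values")

age_filled = age.fillna(ARBITRARY_VALUE)
print(age_filled.tolist())
```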
9. Missing Category Imputation
- Most widely used method of missing data imputation for categorical variables.
- Treats missing data as additional label or category of the variable.
- Creates a new label or category by filling the missing observations with a Missing category.
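In pandas this amounts to a single fill with a new label. A minimal sketch on a made-up column:

```python
import pandas as pd

# Hypothetical categorical column with missing entries.
city = pd.Series(["Pune", None, "Delhi", "Pune", None])

# Treat missingness as its own category.
city_filled = city.fillna("Missing")

# The new label now shows up alongside the real categories.
counts = city_filled.value_counts()
print(counts)
```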
10. Random Sampling Imputation
- Takes random observations from the available observations of the variable.
- Uses the randomly selected values to fill in the missing values.
- Suitable for both numerical and categorical variables.
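A minimal sketch of random sampling imputation. The helper name random_sample_impute and its seed parameter are hypothetical; the seed is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, seed=0):
    """Fill missing entries with values drawn at random (with replacement)
    from the observed entries of the same series. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    n_missing = int(missing.sum())
    filled[missing] = rng.choice(observed, size=n_missing, replace=True)
    return filled

# Hypothetical numerical column.
s = pd.Series([1.0, np.nan, 3.0, 5.0, np.nan, 7.0])
filled = random_sample_impute(s)
print(filled.tolist())
```

Because the fill values come from the observed data, the distribution of the column is roughly preserved, unlike with a constant fill.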
11. Multiple Imputation (MI)
- MI is an advanced missing data handling method.
- Each missing value is replaced by several different values and consequently several different completed datasets are generated.
- It is performed in three steps.
- Firstly, the incomplete dataset is copied several times.
- Then, the missing values are replaced with the imputed values in each copy of the dataset resulting in multiple imputed datasets.
- Finally, the imputed datasets are each analyzed and the study results are then pooled into the final study result.
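The three steps above can be sketched with NumPy on simulated data. This is a deliberately simplified illustration: the data, the number of copies m and the choice of a straight-line regression with random noise as the imputation model are all assumptions, and a proper implementation would also redraw the regression parameters for every copy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: x is fully observed, y has missing entries.
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=n)
y_obs = y.copy()
y_obs[rng.random(n) < 0.3] = np.nan          # roughly 30% missing
missing = np.isnan(y_obs)
n_missing = int(missing.sum())

# Imputation model: regression of y on x, fitted on the complete cases.
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)
residuals = y_obs[~missing] - (intercept + slope * x[~missing])
residual_sd = residuals.std(ddof=2)

m = 5                                         # number of imputed copies
estimates = []
for _ in range(m):
    completed = y_obs.copy()                  # step 1: copy the dataset
    noise = rng.normal(scale=residual_sd, size=n_missing)
    completed[missing] = intercept + slope * x[missing] + noise  # step 2: impute
    estimates.append(completed.mean())        # step 3a: analyse each copy

pooled_mean = float(np.mean(estimates))       # step 3b: pool the results
between_var = float(np.var(estimates, ddof=1))

print(pooled_mean, between_var)
```

The spread between the m estimates (between_var) is what lets multiple imputation reflect the uncertainty caused by the missing values, which a single imputation cannot do.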
Some Statistical/Predictive models that impute the missing data
1. Linear Regression
- Estimates the missing values by regression, using the other variables as predictors.
- Commonly, the regression model is first estimated on the observed data; the regression weights are then used to predict and replace the missing values.
- Stochastic regression imputation is an improvement to regression imputation where a random error is added to the predicted value from the regression.
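A small sketch of both variants with NumPy. The helper regression_impute is hypothetical and handles only one predictor; the noise in the stochastic variant is drawn from a normal distribution scaled by the spread of the residuals, which is one common choice rather than the only one.

```python
import numpy as np

def regression_impute(x, y, stochastic=False, seed=0):
    """Impute NaNs in y from a straight-line fit of y on x.
    Hypothetical helper for illustration only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)

    # Fit the regression on the observed pairs only.
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)

    filled = y.copy()
    filled[missing] = intercept + slope * x[missing]

    if stochastic:
        # Add a random error with the spread of the observed residuals.
        residuals = y[~missing] - (intercept + slope * x[~missing])
        rng = np.random.default_rng(seed)
        filled[missing] += rng.normal(scale=residuals.std(),
                                      size=int(missing.sum()))
    return filled

# Made-up data: y is roughly 2 * x, with two missing values.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, np.nan, 8.2, 9.8, np.nan]

deterministic = regression_impute(x, y)
stochastic = regression_impute(x, y, stochastic=True)
print(deterministic)
print(stochastic)
```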
2. Neighbour based imputation
- KNN imputer is simple, flexible and easy to interpret.
- KNN imputer gives much better estimates than ad-hoc approaches like mode imputation.
- For a discrete variable, KNN imputer uses the most frequent value among the K nearest neighbours; for a continuous variable, it uses the mean or the median of those neighbours.
- For large datasets, using a KNN imputer could be slow.
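A minimal sketch using the KNNImputer class from scikit-learn (assuming scikit-learn is installed). The small matrix is made up, and n_neighbors=2 is chosen only to keep the example easy to follow by hand. Note that this imputer averages the neighbours, so it is meant for numerical columns.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up numerical data; rows are observations, columns are features.
X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```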
3. Random Forest
- Unlike standard imputation approaches, Random Forest based imputation methods do not assume normality or require specification of parametric models.
- It cannot be conclusively said that these methods perform well for non-normally distributed data.
- They should not be indiscriminately recommended for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR.
- For correct analysis, a careful critique of the missing data mechanism and the inter-relationships between the variables in the data is required.
4. Maximum Likelihood Multiple Imputation
- Similar to multiple imputation, but with some advantages over it.
- Simpler to implement than multiple imputation.
- Unlike multiple imputation, maximum likelihood has no potential incompatibility between an imputation model and an analysis model.
- It produces a deterministic result rather than a different result every time.
- Its statistical inference is based on t-statistics, which adjust for small sample sizes.
5. Expectation – Maximization
- Likelihood based procedure.
- Most likely values of the regression coefficients are estimated given the data and subsequently used to impute the missing value.
- Gives the same results as first performing a simple regression analysis on the dataset and subsequently estimating the missing values from the regression equation.
6. Sensitivity Analysis
- Imputations are generated according to one or more scenarios (not necessarily equally likely).
- The goal of the sensitivity analysis is to explore the result of the analysis under alternative scenarios for the missing data.
- An educated guess based on external information beyond the data is made about both the direction and magnitude of the missing data had they been observed.
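One simple way to sketch a sensitivity analysis is to impute with a baseline value and then shift the imputed values by a fixed amount in each scenario. Everything below is an assumption made for illustration: the data, the use of the observed mean as the baseline and the three shift values (deltas) standing in for the educated guesses.

```python
import numpy as np
import pandas as pd

# Made-up outcome with two missing values.
y = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])
missing = y.isna()

# Baseline imputation: the mean of the observed values (an assumption).
base_fill = y.mean()

# Hypothetical scenarios: the unobserved values are lower than, equal to,
# or higher than the baseline by this amount.
deltas = [-2.0, 0.0, 2.0]

results = {}
for delta in deltas:
    completed = y.copy()
    completed[missing] = base_fill + delta
    results[delta] = completed.mean()   # the analysis of interest

for delta, estimate in results.items():
    print(f"delta={delta:+.1f} -> estimated mean {estimate:.3f}")
```

If the conclusion of the analysis barely moves across the scenarios, it is robust to the assumptions made about the missing data; if it changes a lot, those assumptions deserve a closer look.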
We have come a long way in handling missing values, and now I can bet that they won’t trouble you anymore. Missing a few points, values or pieces of information is a part of our day-to-day lives, and it happens more often when you handle data at a large scale. Hence, it is necessary for everyone to know how to deal with those missing values. Different methods can be used on different features to overcome the issue, depending on what the data is about and how it was collected. All we need is a little domain knowledge about the dataset, which can give an insight into how to preprocess the data and handle missing values.