
One of the major challenges in training a machine learning model is avoiding overfitting. In machine learning, regularization is a technique for tackling overfitting by adding a penalty term to the cost function. When we use regression models to fit a training dataset, there is a good chance that the model will overfit it. Regularization overcomes this by constraining large weights/coefficients of the features: it restricts the effective degrees of freedom of the model by shrinking the corresponding weights. Before going into the details, let's first understand what overfitting is and why it is a problem.
Overfitting and its repercussions
Overfitting refers to the scenario where a machine learning model cannot generalize well to unseen data. It is a modelling error in statistics that occurs when a function is aligned too closely to a limited set of data points. A clear sign of overfitting is that the model's error on the testing or validation dataset is much greater than its error on the training dataset.
Overfitting generally takes the form of an overly complex model built to explain idiosyncrasies in the data under study. Such a model learns the detail and noise in the training dataset to the extent that it harms performance on new data: the random fluctuations in the training set are picked up and learned as concepts. In reality, the data being studied usually contains some degree of error or random noise, so these learned concepts do not carry over to new datasets and degrade the model's ability to generalize. Thus, forcing the model to conform too closely to slightly inaccurate data introduces substantial error and reduces its predictive power: an overfit model fails to fit additional data and predicts future observations poorly.
Why use Regularization?
- It reduces the variance of the model, without a substantial increase in the bias.
- The tuning parameter λ controls the bias and variance trade-off.
- Increasing the value of λ up to a certain limit reduces the variance without losing any important properties of the data.
- Beyond that limit, the model starts losing important properties, which increases the bias.
- The value of λ is selected using cross-validation.
- A set of candidate λ values is chosen and the cross-validation error is computed for each.
- The value of λ with the minimum cross-validation error is finally chosen (a short sketch of this procedure follows this list).
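Below is a minimal sketch of this selection procedure, assuming scikit-learn and a synthetic dataset; note that scikit-learn calls the shrinkage factor `alpha` rather than λ, and the grid of candidate values here is purely illustrative.

```python
# Choosing the shrinkage factor λ by cross-validation (sketch, synthetic data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]    # candidate λ values
cv_errors = []
for lam in lambdas:
    model = Ridge(alpha=lam)                       # alpha plays the role of λ
    # cross_val_score returns negative MSE, so flip the sign to get the CV error
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_errors.append(-scores.mean())

best_lambda = lambdas[int(np.argmin(cv_errors))]
print("Cross-validation error per lambda:", dict(zip(lambdas, cv_errors)))
print("Chosen lambda:", best_lambda)
```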
Types of Regularization
1) LASSO (Least Absolute Shrinkage and Selection Operator) Regression
- It is also termed the L1 norm (L1 regularization).
- It penalizes the model based on the sum of the absolute values (magnitudes) of the coefficients.
- The regularization term is given by λ ∑ |βⱼ|, where λ is the shrinkage factor.
- It reduces some coefficients exactly to zero when a sufficiently large tuning parameter λ is used.
- Hence, in addition to regularization, it also performs feature selection (illustrated in the sketch below).
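The following small sketch, assuming scikit-learn and synthetic data in which only a handful of features carry signal, shows how Lasso sets some coefficients exactly to zero:

```python
# Lasso as feature selection: uninformative coefficients are driven exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of the 20 features carry signal; the rest are pure noise.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0)          # alpha plays the role of the shrinkage factor λ
lasso.fit(X, y)

n_zero = np.sum(lasso.coef_ == 0)
print("Coefficients:", np.round(lasso.coef_, 2))
print(f"{n_zero} of {lasso.coef_.size} coefficients were set exactly to zero")
```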

2) Ridge Regression
- It is also termed the L2 norm (L2 regularization).
- It penalizes the model based on the sum of the squares of the coefficients.
- The regularization term is given by λ ∑ βⱼ², where λ is the shrinkage factor.
- It shrinks the coefficients of predictors that contribute very little to the model but carry large weights to values very close to zero, but it never makes them exactly zero (see the sketch below).
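A comparable sketch for Ridge, again assuming scikit-learn and synthetic data, shows the coefficients shrinking towards zero as λ grows without ever becoming exactly zero:

```python
# Ridge shrinks coefficients towards zero as λ (alpha) grows, but never to exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

for lam in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    print(f"lambda={lam:>7}: "
          f"max |coef| = {np.abs(ridge.coef_).max():.2f}, "
          f"coefficients exactly zero = {np.sum(ridge.coef_ == 0)}")
```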

Let β₁ and β₂ be the coefficients of a linear regression and let λ = 1.

For Lasso, the constraint region is |β₁| + |β₂| ≤ Max.

For Ridge, the constraint region is β₁² + β₂² ≤ Max,

where Max is the maximum value the penalty term can achieve.
If we plot the above equations, we get the following graph:

The red ellipse represents the contours of the model's cost function, whereas the diamond-shaped square (left side) represents the Lasso constraint region and the circle (right side) represents the Ridge constraint region.
Linear regression is the standard algorithm for regression; it assumes a linear relationship between the inputs and the target variable. An extension of linear regression involves adding penalties to the loss function during training that encourage simpler models with smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression. Besides the L1 and L2 norms, here is an important concept that you can't miss.
Elastic Net
- Elastic net is a popular type of regularized linear regression that combines two popular regularization penalties, specifically the L1 and L2 penalty functions.
- It is an extension of linear regression that adds regularization penalties to the loss function during training.
- It is a middle ground between the L1 norm and the L2 norm.
- The regularization term is a simple mix of both Ridge and Lasso’s regularization terms, and you can control the mix ratio α.
- α is the mixing parameter between Ridge (α = 0) and Lasso (α = 1); a short example follows this list.
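Here is a minimal Elastic Net sketch, assuming scikit-learn and synthetic data; in scikit-learn the mixing parameter corresponds to `l1_ratio` (values near 0 behave like Ridge, 1 is pure Lasso), while `alpha` sets the overall regularization strength λ:

```python
# Elastic Net: a mix of L1 and L2 penalties controlled by l1_ratio (the mixing parameter).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=42)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5)   # equal mix of L1 and L2 penalties
enet.fit(X, y)

print("Non-zero coefficients:", np.sum(enet.coef_ != 0), "of", enet.coef_.size)
```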

Generally, it is almost always preferable to have at least a little bit of regularization. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net, since they tend to reduce the useless features' weights to zero, as discussed earlier. In general, Elastic Net is preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
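The rough sketch below, again assuming scikit-learn and a deliberately contrived synthetic dataset, illustrates that last point: with three almost identical (strongly correlated) features, Lasso tends to concentrate the weight on one of them somewhat arbitrarily, while Elastic Net tends to spread it across the group.

```python
# Lasso vs. Elastic Net on strongly correlated features (illustrative synthetic data).
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))
# Three nearly identical copies of the same underlying signal.
X = np.hstack([signal + 0.01 * rng.normal(size=(200, 1)) for _ in range(3)])
y = 3.0 * signal.ravel() + rng.normal(scale=0.5, size=200)

print("Lasso coefficients:      ", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
print("Elastic Net coefficients:", np.round(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))
```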