Boosting Technique in Data Science
YOU MUST KNOW
To follow this article you should be familiar with regression tasks, model training, bias and variance, and basic machine learning concepts.
Boosting is an ensemble technique; in English, ensemble means a group of things. An ML ensemble is a set of multiple models that are combined and used together to build a more robust model.
In an ensemble, we take several models and combine them. In boosting, we learn how to train these models and how to combine them.
What is boosting?
Boosting is an iterative technique that adjusts the weight of an observation based on the last classification. If an observation was classified incorrectly, boosting increases the weight of that observation, and vice versa.
Boosting algorithms seek to improve prediction power by training a sequence of weak models. In other words, boosting converts a set of weak models (learners) into a single strong model (learner).
For example, if a dog-identifying model has been trained only on images of black dogs, it may occasionally misidentify a white dog. Boosting tries to overcome this issue by training multiple models sequentially to improve the accuracy of the overall model.
Why do we need boosting?
- Boosting improves the prediction accuracy of machine learning models. Boosting algorithms are among the most widely used algorithms in data science, and many Kaggle competitors use them to improve their performance.
- To combine a set of weak learners into a strong learner that minimizes the training error, we need a technique, and boosting is that technique.
- A weak learner is a model with low prediction accuracy, typically a high-bias (under-fitted) model such as a very shallow decision tree.
- Sometimes your model is under-fitted, i.e. it has high bias. If you want to reduce the bias without increasing the variance much, boosting is a good choice, because it reduces the bias stage by stage and leads to a well-fitted model.
In this section, we will go through the stages behind boosting.
In the section above, we said boosting reduces the bias without impacting the variance much; here we will see how this works.
Stage 0: Train a model on the whole dataset. This model should have high bias and low variance; a DT with shallow depth is such a model. Then compute the error between the predicted and the actual values.
Now we have our dataset (X), target values (Y) and error (E).
Stage 1: We train a model on the error (E) of the previous model from stage 0.
$F_1(x) = \alpha_0 h_0(x) + \alpha_1 h_1(x)$
$h_0(x), h_1(x)$ ⇒ predictions of the models trained at stage 0 and stage 1
$\alpha_0, \alpha_1$ ⇒ constants (alpha values)
We repeat the same step K times and train K models.
Stage k: $F_k(x) = \sum_{i=0}^{k} \alpha_i h_i(x)$, an additive weighted model
Each of the models at each stage is trained to fit the residual error at the end of the previous stage. By doing this we reduce the error and reducing the error means reducing the bias.
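The stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the synthetic sine data, the number of stages K = 20, and the single shared weight alpha = 0.5 are all assumptions made for the example, and scikit-learn's DecisionTreeRegressor with max_depth=1 stands in for the high-bias base model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption: any small dataset would do here).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

K = 20        # number of stages (assumed value)
alpha = 0.5   # assumed: the same weight alpha_i for every stage

prediction = np.zeros_like(y)   # running value of F_k(x) on the training set
errors = []                     # training MSE after each stage

for k in range(K):
    residual = y - prediction                 # error left by the previous stages
    h_k = DecisionTreeRegressor(max_depth=1)  # shallow tree: high bias, low variance
    h_k.fit(X, residual)                      # stage k fits the residual, not y
    prediction += alpha * h_k.predict(X)      # F_k = F_{k-1} + alpha * h_k
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE after stage 0: {errors[0]:.4f}")
print(f"training MSE after stage {K - 1}: {errors[-1]:.4f}")
```

At stage 0 the residual equals y itself, so the first tree is trained on the targets, exactly as in the description above. Every later tree only sees what the earlier ones got wrong, which is why the training error keeps shrinking.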
Types of boosting
Boosting is a method for combining weak learners into a strong learner. There are various boosting algorithms, and they differ in how they combine the weak learners.
We mainly use two types of boosting:
- Gradient Boosting
- AdaBoost (Adaptive Boosting)
Here we look at how gradient boosting works; it follows the same stage-wise scheme as the boosting described above.
The output after m stages is the model Fm(x).
Gradient boosting is a general idea. A widely used variant is the gradient boosting decision tree (GBDT), where each weak learner or base model is a decision tree (DT) with a very small depth.
To train gradient boosting decision trees, Python has a good library called XGBoost.
We will see how to use this library later in this article.
AdaBoost - Geometric Intuition
AdaBoost is not a very popular method in real-world applications, but it is used in computer vision for face detection.
Adaptive boosting: at every stage, you adapt to the errors that were made before; more weight is given to misclassified points.
Let’s suppose Box 1 is my dataset, and at stage 0 I train a DT of depth 1.
To separate the classes, I draw an axis-parallel plane, D1, and measure the error. In Box 1 there are three misclassified points (+). Before going to the next stage, I increase the weight of these misclassified points (+).
At the next stage, in Box 2, I have three misclassified points (-), so I give more weight to these three points (-).
After this, my model looks like Box 3.
Now I have three planes, D1, D2 and D3. We combine these three planes, and our final model looks like Box 4.
So here you give more importance to the points that have been misclassified. This is the core idea of AdaBoost.
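The reweighting idea can be sketched with depth-1 trees, each of which draws one axis-parallel split like D1, D2 and D3. This is a minimal sketch under stated assumptions: the synthetic "band" dataset, the 20 stages, and the classic discrete AdaBoost weight update with labels in {-1, +1} are choices made for the illustration, not details taken from the figure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Assumed toy data: points are +1 inside a vertical band, -1 outside.
# No single axis-parallel split can separate these classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.abs(X[:, 0]) < 1.0, 1, -1)

n_stages = 20                        # assumed number of boosting stages
w = np.full(len(y), 1.0 / len(y))    # start with equal weight on every point
stumps, alphas = [], []

for _ in range(n_stages):
    # Depth-1 tree = one axis-parallel split, trained on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    # Weighted error of this stage (clipped to avoid division by zero).
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # vote of this stump in the final model
    # Misclassified points (y * pred = -1) get more weight, the rest get less.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)


def ensemble_predict(X_new):
    """Weighted vote of all stumps (the combined model, like Box 4)."""
    score = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(score)


stump_accuracy = np.mean(stumps[0].predict(X) == y)
ensemble_accuracy = np.mean(ensemble_predict(X) == y)
print(f"first stump alone: {stump_accuracy:.3f}")
print(f"combined ensemble: {ensemble_accuracy:.3f}")
```

The first stump alone cannot isolate the band, but the weighted combination of several splits can, because each new stump is pushed toward the points the earlier ones got wrong.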
Code snippet of Gradient Boosting
Here we follow the three steps as shown in the image.
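Our input is a training set $\{(x_i, y_i)\}_{i=1}^{n}$, a differentiable loss function $L(y, F(x))$, and the number of iterations M.
Initialize the model with a constant value:
$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$
That is, find the constant gamma which minimizes the loss function. The loss function could be any differentiable loss function, e.g. squared loss.
For m = 1 to M
- Compute the pseudo-residuals (errors) with the following formula: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$ for $i = 1, \dots, n$
- Fit a base learner (or weak learner, e.g. a tree) $h_m(x)$, closed under scaling, to the pseudo-residuals, that is, train it using the training set $\{(x_i, r_{im})\}_{i=1}^{n}$
- Compute the multiplier $\gamma_m$ by solving: $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$
- Update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$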
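To show that the loop works for a loss other than squared loss, the steps above can be sketched with the absolute loss L(y, F) = |y - F|. Everything specific here is an assumption for illustration: the function names, the synthetic data, M = 30, trees of depth 2, and the use of SciPy's `minimize_scalar` as a one-dimensional line search for the multiplier. For absolute loss the best constant is the median, and the pseudo-residual is sign(y - F).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor


def absolute_loss(y, f):
    """Total absolute loss: sum over i of |y_i - f_i|."""
    return np.abs(y - f).sum()


def pseudo_residuals(y, f):
    """Negative gradient of |y - F| with respect to F, i.e. sign(y - F)."""
    return np.sign(y - f)


def gradient_boost(X, y, M=30, max_depth=2):
    # Initialise with the constant that minimises the absolute loss: the median.
    f0 = float(np.median(y))
    F = np.full(len(y), f0)
    learners, gammas = [], []
    for _ in range(M):
        r = pseudo_residuals(y, F)                                # pseudo-residuals
        h = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        h.fit(X, r)                                               # fit the base learner
        hx = h.predict(X)
        # One-dimensional line search for the multiplier gamma_m.
        gamma = minimize_scalar(lambda g: absolute_loss(y, F + g * hx)).x
        F = F + gamma * hx                                        # update the model
        learners.append(h)
        gammas.append(gamma)
    return f0, learners, gammas


def predict(model, X):
    f0, learners, gammas = model
    F = np.full(X.shape[0], f0)
    for h, g in zip(learners, gammas):
        F += g * h.predict(X)
    return F


# Hypothetical demo data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

model = gradient_boost(X, y)
baseline_loss = absolute_loss(y, np.full(len(y), np.median(y)))
final_loss = absolute_loss(y, predict(model, X))
print(f"loss of the constant model F_0: {baseline_loss:.2f}")
print(f"loss after boosting:            {final_loss:.2f}")
```

Only the two small loss functions would need to change to use a different differentiable loss; the loop itself stays the same.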
In this gradient boosting code, we use a DT as the weak learner (base model).
To perform gradient boosting with decision trees, Python has a very good library, XGBoost (XGB). With the help of XGB, we train our model.
To understand this better, we take a dataset and build a model on it. Here we choose the Boston house prices dataset.
The task is to predict the price of a house by training an XGB model.
Overview of dataset
Our dataset has 13 features, and we use all of them to predict the price of a house. The features, followed by the target (MEDV), are as follows.
- CRIM ⇒ per capita crime rate by town
- ZN ⇒ proportion of residential land zoned for lots over 25,000 square feet
- INDUS ⇒ proportion of non-retail business acres per town
- CHAS ⇒ Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX ⇒ nitric oxides concentration (parts per 10 million)
- RM ⇒ average number of rooms per dwelling
- AGE ⇒ proportion of owner-occupied units built prior to 1940
- DIS ⇒ weighted distances to five Boston employment centers
- RAD ⇒ index of accessibility to radial highways
- TAX ⇒ full-value property-tax rate per $10,000
- PTRATIO ⇒ pupil-teacher ratio by town
- B ⇒ 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
- LSTAT ⇒ % lower status of the population
- MEDV ⇒ median value of owner-occupied homes in $1000’s (the target we predict)
Now that we have an idea about the dataset, we can build the XGB model.
Import the libraries
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
Load the dataset from scikit-learn. Note that load_boston was removed in scikit-learn 1.2, so this step needs an older release.
Boston = load_boston()
X = pd.DataFrame(Boston.data, columns=Boston.feature_names)
y = pd.Series(Boston.target)
Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
Next initialize an instance of the XGBRegressor class
regressor = xgb.XGBRegressor(
Here n_estimators is the number of base models and max_depth is the depth of each decision tree; both are hyperparameters.
If we increase n_estimators, we increase the number of base models, and as the number of base models grows the bias of the overall model decreases. If we reduce n_estimators, the bias does not decrease as much as we expect. max_depth is the depth of each individual DT. Because a GBDT model needs high-bias base models, each DT should be shallow, so we keep max_depth low.
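The effect of n_estimators can be checked on synthetic data. The sketch below deliberately swaps in scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor, because it also takes n_estimators and max_depth and offers `staged_predict` to inspect the model after each added tree. The dataset from `make_regression` and the values n_estimators=200 and max_depth=2 are assumptions for illustration, not the settings used in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 13 features (assumed, not the Boston dataset).
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumed hyperparameters: many shallow (high-bias) trees.
model = GradientBoostingRegressor(n_estimators=200, max_depth=2, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after 1, 2, ..., 200 trees.
train_mse = [mean_squared_error(y_train, p) for p in model.staged_predict(X_train)]
test_mse = [mean_squared_error(y_test, p) for p in model.staged_predict(X_test)]

for n in (1, 10, 50, 200):
    print(f"{n:>3} trees: train MSE {train_mse[n - 1]:.1f}, test MSE {test_mse[n - 1]:.1f}")
```

The training error falls as trees are added, which is the bias reduction described above. The test error is printed alongside it so you can see whether the extra trees still help on unseen data.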
Now fit the XGBRegressor instance on the training data to train the model
regressor.fit(X_train, y_train)
Now the model is ready to use, and we can predict the price of the houses in the test data.
y_pred = regressor.predict(X_test)
We use the mean squared error to measure the model performance. The mean squared error is the average of the squared differences between the predictions and the actual values.
print(mean_squared_error(y_test, y_pred))
Output: 8.71858977
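To make the metric concrete, here is a small worked example that computes the mean squared error by hand and compares it with scikit-learn's `mean_squared_error`. The four prices below are made-up values for illustration and have nothing to do with the result above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted prices (in $1000's), for illustration only.
y_true = np.array([24.0, 21.0, 30.0, 18.0])
y_hat = np.array([23.0, 22.0, 28.0, 18.0])

squared_diff = (y_true - y_hat) ** 2     # differences 1, -1, 2, 0 -> squares 1, 1, 4, 0
mse_manual = squared_diff.mean()         # (1 + 1 + 4 + 0) / 4 = 1.5
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```

Because the differences are squared, one large miss (here the house predicted at 28 instead of 30) contributes more to the score than several small ones.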
To see the output of every cell and to play with the code, follow this notebook… Click here