CloudyML

Decision Tree Pruning Techniques In Python

The decision tree is one of the most versatile algorithms in machine learning and can perform both classification and regression. Coupled with ensemble techniques, it performs even better. The algorithm works by recursively dividing the dataset into a tree-like structure based on a set of rules and conditions, and then makes predictions from those conditions. It supports predictive modeling with high accuracy and good stability, and its results are easy to interpret.

We will use the Iris dataset to get a better understanding of the concept and the process.
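As a minimal sketch of the setup, the snippet below loads Iris with scikit-learn and holds out a test set. The 70/30 split, the stratification and `random_state=42` are our own assumptions for illustration, not values taken from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris data: 150 samples, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing.
# The split ratio and the seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("train:", X_train.shape, "test:", X_test.shape)
```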

Generally, decision trees have a high chance of overfitting, because the model gets very complex as its depth and number of splits grow, which increases its variance. This reduces the training error, but predictions on new data points are relatively poor. This is where pruning comes to the rescue.

When the model is fitted as an unpruned decision tree, it overfits: there is a significant gap between the training accuracy and the testing accuracy. For this reason, the tree needs to be pruned.
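A rough way to check this is to fit a tree with no limits and compare the two accuracies. This is only a sketch: the split and the seed are our assumptions, so the exact numbers will differ from the ones in the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# No depth or leaf limits: the tree keeps splitting until the leaves
# are pure or cannot be split any further.
unpruned = DecisionTreeClassifier(random_state=42)
unpruned.fit(X_train, y_train)

train_acc = unpruned.score(X_train, y_train)
test_acc = unpruned.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"depth: {unpruned.get_depth()}, leaves: {unpruned.get_n_leaves()}")
```

A training accuracy well above the test accuracy is the gap described above.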

In general, pruning is the process of removing selected parts of a plant, such as buds, branches or roots. Similarly, decision tree pruning trims down a full tree to reduce the complexity and variance of the model. This lets the tree generalize better to new data, thereby addressing the problem of overfitting. Pruning reduces the size of the decision tree, which might slightly increase the training error but can considerably decrease the testing error. As a result, the model becomes more adaptable.

Comparing an unpruned tree with a pruned one makes the difference clear. The unpruned tree is denser and more complex, with high variance, and hence overfits. The pruned tree is less dense and less complex, with reduced variance and better prediction accuracy on unseen data points.

Tree pruning is generally performed in two ways: pre-pruning or post-pruning.

    Pre-pruning

 

Pre-pruning, also known as forward pruning, stops non-significant branches from being generated. We set its constraints before the tree is constructed, and they take effect while the tree is grown: a stopping condition decides when splitting should terminate early on a given branch.
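To illustrate such stopping conditions, here is a small sketch that grows a tree with fixed limits. The values `max_depth=3` and `min_samples_leaf=5`, as well as the split and the seed, are arbitrary choices we made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Stopping conditions are fixed up front and applied while the tree grows:
# no branch deeper than 3 levels, no leaf with fewer than 5 samples.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
)
pre_pruned.fit(X_train, y_train)

# Leaves are the nodes without children (children_left == -1).
is_leaf = pre_pruned.tree_.children_left == -1
leaf_sizes = pre_pruned.tree_.n_node_samples[is_leaf]

print("depth:", pre_pruned.get_depth())
print("smallest leaf:", leaf_sizes.min())
print("test accuracy:", pre_pruned.score(X_test, y_test))
```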

Hyperparameter tuning can be used to find the best values for parameters like 'max_depth', 'min_samples_leaf', 'min_samples_split', etc.
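One common way to do this tuning is a grid search with cross-validation. The sketch below uses scikit-learn's `GridSearchCV`; the candidate values in `param_grid`, the 5 folds, the split and the seed are our assumptions and should be adapted to the data at hand.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Candidate values for the pre-pruning parameters (illustrative only).
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 8],
}

# Try every combination with 5-fold cross-validation on the training set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(best_tree.score(X_test, y_test), 3))
```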

    Post-pruning

 

Post-pruning, also known as backward pruning, is the process where the decision tree is generated first and the non-significant branches are removed afterwards. We use this technique after the construction of the decision tree. It is useful when the tree has been grown to a very large or unrestricted depth and overfits. In pre-pruning, we use parameters like 'max_depth' and 'min_samples_split'. Here, instead, we prune the branches of the decision tree using cost complexity pruning. ccp_alpha, the cost complexity parameter, controls this pruning technique.
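In scikit-learn, the candidate values of ccp_alpha can be obtained from the tree itself with `cost_complexity_pruning_path`. The following is a sketch under our own assumptions about the split and the seed; it only lists the candidates and does not prune anything yet.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Compute the pruning path: the effective alphas at which subtrees get
# pruned, and the total leaf impurity of the tree at each of those steps.
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

for alpha, impurity in zip(ccp_alphas, impurities):
    print(f"alpha = {alpha:.4f}  total leaf impurity = {impurity:.4f}")
```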

ccp_alpha sets the penalty on the complexity of the tree: a subtree is pruned when its reduction in impurity does not justify the extra leaves it adds. Each value of ccp_alpha produces a different classifier, and we choose the best one among them. The greater the value of ccp_alpha, the more nodes are pruned.
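The effect of ccp_alpha on tree size can be checked by fitting one classifier per candidate value and counting its nodes. This is a sketch; the split, the seed and the clipping of tiny negative alphas (a defensive guard against floating-point round-off) are our assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# ccp_alpha must be non-negative, so clip any round-off below zero.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Fit one tree per alpha and record how many nodes survive the pruning.
clfs = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

node_counts = [clf.tree_.node_count for clf in clfs]
for alpha, n_nodes in zip(ccp_alphas, node_counts):
    print(f"alpha = {alpha:.4f}  nodes = {n_nodes}")
```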

After fitting a model for each alpha and appending its train and test accuracy to lists, we plot an accuracy vs alpha graph. This shows the value of alpha for which we get the maximum test accuracy.
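The accuracy lists behind such a graph could be built as follows. This is a sketch under our own assumptions (split, seed, clipping of negative alphas); it prints the values instead of plotting them, so the exact figures will differ from those in the original post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# ccp_alpha must be non-negative, so clip any round-off below zero.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# One train/test accuracy pair per alpha.
train_scores, test_scores = [], []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

# Alpha with the highest test accuracy (first one in case of ties).
best_idx = int(np.argmax(test_scores))
best_alpha = float(ccp_alphas[best_idx])

for a, tr, te in zip(ccp_alphas, train_scores, test_scores):
    print(f"alpha = {a:.4f}  train = {tr:.3f}  test = {te:.3f}")
print("best alpha by test accuracy:", round(best_alpha, 4))
```

The two lists can be passed to a plotting library with `ccp_alphas` on the x-axis to get the graph. Picking alpha on the test set keeps the example short; a separate validation set or cross-validation would be a safer choice in practice.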

 

We can choose ccp_alpha = 0.05, as it gives the maximum test accuracy of 0.93 along with an optimum train accuracy. Although the train accuracy has decreased to 0.96, the model is now more generalized and will perform better on unseen data.
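Fitting the final tree with the chosen value might look like the sketch below. The value `ccp_alpha=0.05` comes from the text above, while the split and the seed are our assumptions, so the accuracies printed here need not match the 0.96 and 0.93 reported for the original run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Split ratio and seed are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Reference tree without pruning.
unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Same tree, post-pruned with the alpha chosen in the text.
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=0.05)
pruned.fit(X_train, y_train)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    print(
        f"{name}: nodes = {model.tree_.node_count}, "
        f"train = {model.score(X_train, y_train):.3f}, "
        f"test = {model.score(X_test, y_test):.3f}"
    )
```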

We can now see that our model is no longer overfitting and that its performance on test data has improved considerably.
Also, it can be inferred that:
  • Pruning plays an important role in fitting models using the Decision Tree algorithm.
  • Post-pruning is often more effective than pre-pruning, because it evaluates the fully grown tree before removing branches.
  • Selecting the correct value of ccp_alpha is the key factor in the post-pruning process.
  • Hyperparameter tuning is an important step in the pre-pruning process.

Code:  Get code on Github

Date : 16/05/2021
