Decision Tree Pruning Techniques in Python
The decision tree algorithm is one of the most versatile algorithms in machine learning: it can perform both classification and regression analysis. When coupled with ensemble techniques it performs even better. The algorithm works by dividing the entire dataset into a tree-like structure based on a set of rules and conditions, and then makes predictions according to those conditions. It offers predictive modeling with high accuracy and good stability, and its results are easy to interpret. We will use the Iris dataset to get a better understanding of the concept and the process.
Generally, decision trees carry a high risk of overfitting: the model becomes very complex with greater depth and a greater number of splits, which increases its variance. This reduces the training error, but predictions on new data points are relatively poor. This is where the pruning process comes to the rescue.
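A minimal sketch of this behaviour, assuming a scikit-learn setup and a 70/30 train/test split (the split ratio and random seed are illustrative, and the exact accuracies depend on them):

```python
# Fit an unpruned decision tree on the Iris dataset and compare
# train vs. test accuracy. A fully grown tree memorizes the
# training set, so its training accuracy is perfect.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# No depth or split limits: the tree grows until every leaf is pure
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

print("Train accuracy:", clf.score(X_train, y_train))
print("Test accuracy:", clf.score(X_test, y_test))
```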
We can clearly see that the model overfits when an unpruned decision tree is fitted: there is a significant gap between the training and the testing accuracy, and hence the tree needs to be pruned.
In general, pruning is the process of removing selected parts of a plant, such as buds, branches, or roots. Similarly, decision tree pruning trims down a full tree to reduce the complexity and variance of the model. It makes the decision tree versatile enough to adapt to any new data fed to it, thereby fixing the problem of overfitting. Pruning reduces the size of the tree, which might slightly increase the training error but drastically decrease the testing error, making the model more adaptable.
The above example clearly depicts the difference between an unpruned and a pruned tree. The unpruned tree is denser and more complex, with high variance, and overfits the model; the pruned tree is optimally dense and less complex, with reduced variance and better accuracy on unseen data points.
Tree pruning is generally performed in two ways – by Pre-pruning or by Post-pruning.
Pre-pruning, also known as forward pruning, stops non-significant branches from being generated. This technique is applied during the construction of a decision tree: a stopping condition decides when the splitting of some branches should terminate prematurely as the tree is grown.
Hyperparameter tuning can be used to find the best values for parameters like 'max_depth', 'min_samples_leaf', 'min_samples_split', etc.
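A sketch of such tuning with scikit-learn's GridSearchCV; the grid values below are illustrative choices, not a definitive recipe:

```python
# Pre-pruning via hyperparameter tuning: cross-validated grid search
# over depth and split/leaf limits for a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Illustrative grid: each parameter caps tree growth in a different way
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 2, 5],
    "min_samples_split": [2, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```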
Post-pruning, also known as backward pruning, generates the decision tree first and then removes the non-significant branches. This technique is applied after the construction of the decision tree, and is used when the tree has very large or unbounded depth and overfits the model. In pre-pruning we used parameters like 'max_depth' and 'min_samples_split', but here we prune the branches of the tree using cost-complexity pruning, which is parameterized by ccp_alpha, the cost-complexity parameter. Each value of ccp_alpha produces a different pruned classifier, and we choose the best among them; greater values of ccp_alpha prune more nodes.
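The procedure can be sketched as follows, using scikit-learn's cost_complexity_pruning_path to enumerate the candidate alphas (the split and random seed below are assumptions):

```python
# Post-pruning: obtain the effective ccp_alpha values of the subtrees
# produced during cost-complexity pruning, then fit one classifier
# per alpha.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Effective alphas at which nodes would be pruned, in increasing order
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas = path.ccp_alphas

clfs = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

# The largest alpha prunes the tree all the way down to the root
print("Number of candidate alphas:", len(ccp_alphas))
print("Nodes in the most-pruned tree:", clfs[-1].tree_.node_count)
```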
After fitting one classifier for each alpha, we plot an accuracy-vs-alpha graph to find the value of alpha that gives the maximum test accuracy alongside a good training accuracy.
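A sketch of producing that plot, assuming the same split and seed as above (the output filename is illustrative):

```python
# Plot train and test accuracy against ccp_alpha.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

ccp_alphas = DecisionTreeClassifier(
    random_state=42).cost_complexity_pruning_path(X_train, y_train).ccp_alphas

train_scores, test_scores = [], []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    train_scores.append(clf.score(X_train, y_train))
    test_scores.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
ax.plot(ccp_alphas, train_scores, marker="o", label="train")
ax.plot(ccp_alphas, test_scores, marker="o", label="test")
ax.set_xlabel("ccp_alpha")
ax.set_ylabel("accuracy")
ax.legend()
fig.savefig("accuracy_vs_alpha.png")
```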
We can choose ccp_alpha = 0.05, as it gives the maximum test accuracy of 0.93 along with an optimal train accuracy. Although our train accuracy has decreased to 0.96, our model is now more generalized and will perform better on unseen data.
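Fitting the final pruned tree might look like this; 0.05 is the value chosen from the article's run, and the best alpha (and the resulting accuracies) will vary with your own split:

```python
# Fit the final pruned tree with the chosen alpha and compare its
# size against the unpruned tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# ccp_alpha=0.05 taken from the accuracy-vs-alpha analysis above
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=0.05).fit(
    X_train, y_train)

print("Unpruned nodes:", unpruned.tree_.node_count)
print("Pruned nodes:", pruned.tree_.node_count)
print("Pruned test accuracy:", pruned.score(X_test, y_test))
```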
Also, it can be inferred that:
- Pruning plays an important role in fitting models using the Decision Tree algorithm.
- Post-pruning is generally more effective than pre-pruning, since it operates on the fully grown tree rather than guessing stopping conditions in advance.
- Selecting the correct value of ccp_alpha is the key factor in the post-pruning process.
- Hyperparameter tuning is an important step in the Pre-pruning process.