Decision Tree
Created on Aug 03, 2025, Last Updated on Sep 17, 2025, By a Developer
Basic Concept
A decision tree is quite different from most other machine learning models. It constructs a tree in which each internal node splits the dataset into two parts based on one feature, and each leaf represents a set of possible outcomes with their corresponding probabilities.
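As a quick illustration (a minimal sketch assuming scikit-learn is available; the dataset and parameters are arbitrary), a fitted tree reports the class probabilities of the leaf each sample falls into:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(clf.predict(X[:2]))        # class predicted by the leaf each sample lands in
print(clf.predict_proba(X[:2]))  # class probabilities stored at that leaf
```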
During Training:
- For each node, iterate over the available features and try to split the dataset into two parts.
- For each candidate feature, calculate the purity of the resulting children and the information gain of the split.
- Select the feature with the greatest positive information gain.
To calculate the purity of each node, where p is the fraction of positive samples in the current node:
- Entropy measures uncertainty: $H(p) = -p\log_2(p) - (1-p)\log_2(1-p)$
- Gini measures impurity: $G(p) = 2p(1-p)$
- Information gain is the parent's impurity minus the weighted impurity of the two children: $IG = H(p_{parent}) - w_{left}H(p_{left}) - w_{right}H(p_{right})$
- Stop splitting when:
  - The maximum depth is reached.
  - A node is 100% one class.
  - The improvement in purity score is below a threshold.
  - The number of examples in a node is below a threshold.
- For continuous-valued features:
  - Treat several (or all) possible values of the feature as candidate split points, e.g. the midpoints between consecutive sorted values, as sketched below.
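The purity measures and the continuous-feature split search above can be sketched in plain NumPy as follows (the function names and the binary 0/1 label assumption are illustrative, not from the original):

```python
import numpy as np

def entropy(p):
    """Entropy of a binary node, where p is the fraction of positive samples."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def gini(p):
    """Gini impurity of a binary node with positive fraction p."""
    return 2 * p * (1 - p)

def information_gain(y_parent, y_left, y_right, impurity=entropy):
    """Parent impurity minus the weighted impurity of the two children."""
    w_left = len(y_left) / len(y_parent)
    w_right = len(y_right) / len(y_parent)
    return (impurity(np.mean(y_parent))
            - w_left * impurity(np.mean(y_left))
            - w_right * impurity(np.mean(y_right)))

def best_threshold(x, y):
    """For one continuous feature x (1-D array) and 0/1 labels y, use midpoints
    between consecutive sorted values as candidate split points and return the
    threshold with the highest information gain."""
    values = np.sort(np.unique(x))
    candidates = (values[:-1] + values[1:]) / 2
    best_gain, best_t = -np.inf, None
    for t in candidates:
        gain = information_gain(y, y[x <= t], y[x > t])
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain
```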
Decision Tree Regression
The major difference between using a decision tree for classification and for regression is the metric used to measure split quality while growing the tree. For regression tasks, Mean Squared Error (MSE) is used instead of entropy or Gini, and each leaf predicts the mean target value of its training samples.
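For example, a minimal sketch with scikit-learn's `DecisionTreeRegressor` (toy data and tree depth are arbitrary), which uses squared error as its split criterion:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(10).reshape(-1, 1).astype(float)
y = np.sin(X).ravel()

# "squared_error" (MSE) is the split criterion; each leaf predicts the
# mean target value of the training samples that fall into it.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)
print(reg.predict([[2.5]]))
```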
Prevent Overfitting
Decision trees are very prone to overfitting. There are several tricks that help mitigate this.
- Early Stopping: There are multiple hyperparameters available to stop the tree from splitting: `max_depth`, `min_samples_split`, `min_samples_leaf`, `min_weight_fraction_leaf`, `min_impurity_decrease`, `max_features`, etc.
- Pruning: Instead of stopping the tree from growing, another approach is to let it grow at will and then prune some branches (both approaches are sketched in code after this list).
  - Minimal Cost-Complexity Pruning: Each node has an effective $\alpha_{eff}(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$, which measures how much impurity the subtree below the node removes in proportion to its size ($R(t)$ is the node's impurity, $R(T_t)$ the total impurity of its subtree, and $|T_t|$ the number of leaves in that subtree). All nodes with an $\alpha_{eff}$ smaller than a certain threshold are pruned.
- Ensembling (covered in the next section).
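Both early stopping and cost-complexity pruning can be sketched with scikit-learn (the dataset and parameter values, including `ccp_alpha=0.01`, are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Early stopping: cap the depth and require a minimum number of samples per leaf.
shallow = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10).fit(X, y)

# Cost-complexity pruning: compute the candidate effective alphas of a fully
# grown tree, then refit with a chosen ccp_alpha so weak branches are pruned.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas[:5])
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)
```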
Tree Ensemble
Ensembling is not limited to the decision tree algorithm; it is a general modeling strategy. It refers to using multiple stumps (model instances) that work together to give the prediction. Each instance is trained on a different set of training samples, and possibly even a different set of features.
There are two ensemble strategies in general:
- Bagging: all stumps are equally important, and the final result is derived from the pool of individual results (e.g. by majority vote or averaging).
- Boosting: the stumps are built sequentially, with each one focusing on the errors made by the previous models:
  - The base stump is trained as normal, and for each sample we initialize and keep a residual value $r_i = y_i$.
  - In each later iteration $b$, fit the tree $h_b$ to the residuals instead of to $y$, then reduce the residuals by its prediction shrunk by a constant factor $\lambda$: $r_i \leftarrow r_i - \lambda h_b(x_i)$.
  - Inference utilizes all stumps, where $h_b$ refers to an individual instance: $\hat{y}(x) = \sum_b \lambda h_b(x)$ (a sketch of this loop follows the list).
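A sketch of this residual-boosting loop with shallow scikit-learn regression trees (the function names, $\lambda = 0.1$, and the tree depth are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=100, lam=0.1, max_depth=2):
    """Fit trees sequentially to the remaining residual, shrunk by lambda."""
    stumps, residual = [], y.astype(float)
    for _ in range(n_rounds):
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        residual = residual - lam * h.predict(X)  # r_i <- r_i - lambda * h_b(x_i)
        stumps.append(h)
    return stumps

def predict(stumps, X, lam=0.1):
    """Inference sums the shrunken contribution of every stump."""
    return lam * np.sum([h.predict(X) for h in stumps], axis=0)
```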
Random Forest
Random Forest is an ensemble bagging model whose base model is the decision tree. As the name indicates, the "forest" is built out of multiple trees, each trained on a different random subset of samples (bootstrap sampling) and a different random subset of features.
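With scikit-learn this corresponds roughly to the following sketch (parameter values are arbitrary; `bootstrap=True` gives each tree a random resample of the data, `max_features="sqrt"` gives each split a random subset of features):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)
print(forest.predict_proba(X[:1]))  # averaged vote across all trees
```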
AdaBoost
Different from general ensemble methods, where every stump contributes with the same weight, AdaBoost assigns each stump its own factor $\alpha_b$ and maintains per-sample weights:
- Initialize sample weights $w_i = 1/N$.
- Repeat for a certain number of iterations; in each iteration $b$:
  - Fit a tree $h_b$ to the training data (different from general ensemble methods fitting residuals) with sample weights $w_i$.
  - Calculate the weighted error $err_b = \frac{\sum_i w_i \cdot \mathbb{1}(y_i \neq h_b(x_i))}{\sum_i w_i}$.
  - Assign the stump the factor $\alpha_b = \ln\frac{1 - err_b}{err_b}$.
  - Update sample weights $w_i \leftarrow w_i \cdot e^{\alpha_b \cdot \mathbb{1}(y_i \neq h_b(x_i))}$ and renormalize, so misclassified samples get more weight in the next round.
- Output $H(x) = \mathrm{sign}\left(\sum_b \alpha_b h_b(x)\right)$, where sign literally outputs 1 or -1 (a from-scratch sketch follows the list).
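A from-scratch sketch of this loop for binary labels in {-1, +1} (decision stumps via scikit-learn; the `1e-10` guard against division by zero and the number of rounds are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """y must contain labels in {-1, +1}. Returns the stumps and their factors."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initialize sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (h.predict(X) != y)
        err = np.sum(w * miss) / np.sum(w)         # weighted error
        alpha = np.log((1 - err) / (err + 1e-10))  # stump factor
        w = w * np.exp(alpha * miss)               # up-weight misclassified samples
        w /= np.sum(w)
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    agg = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(agg)                            # sign outputs +1 or -1
```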
Gradient Boosting
Gradient Boosting is a generalized boosting algorithm. Instead of fitting each new model to a residual that is kept and tracked for each sample, it fits the model to the negative gradient of the loss function. For squared-error loss the negative gradient is exactly the residual, so the residual-fitting procedure above is a special case; other losses (e.g. log loss for classification) follow the same recipe.
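As a sketch, one boosting round for binary classification under log loss fits the next tree to $y - \sigma(F(x))$, the negative gradient, rather than to a plain residual (the function names and tree depth are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_boost_round(X, y01, F, lr=0.1):
    """One round of gradient boosting with log loss.
    y01 holds labels in {0, 1}; F holds the current raw (log-odds) predictions."""
    neg_grad = y01 - sigmoid(F)                  # negative gradient of log loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
    return F + lr * h.predict(X), h              # updated scores and the new tree
```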
Which to Pick?
When the number of features is large (more than a hundred or so), Random Forest tends to perform better than the boosting methods, while Gradient Boosting almost always performs better than AdaBoost.