Decision Tree

Created on Aug 03, 2025, Last Updated on Sep 17, 2025, By a Developer

Basic Concept


A decision tree is quite different from most machine learning models. It constructs a tree in which each internal node splits the dataset into two parts based on one feature, and each leaf represents a set of possible results and their corresponding probabilities.

During Training:

  • For each node, iterate over the available features and try to split the dataset into two parts (see the sketch after this list).
    • For each candidate split, calculate the purity of the children and the information gain: $\text{Gain} = H(\text{parent}) - \big(w_{left}\, H(\text{left}) + w_{right}\, H(\text{right})\big)$, where $w_{left}$ and $w_{right}$ are the fractions of samples sent to each child.

    • Select the feature (and threshold) with the greatest positive information gain.

    • To calculate the purity of each node, where p is the fraction of the node's samples belonging to one class:

      • Entropy measures uncertainty: $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
      • Gini measures impurity: $G(p) = 2p(1 - p)$, or more generally $1 - \sum_i p_i^2$
  • Stop splitting when:
    • The maximum depth is reached.
    • A node is 100% one class.
    • The improvement in the purity score is below a threshold.
    • The number of examples in a node is below a threshold.
  • For continuous-valued features:
    • Treat several (or all) observed values of the feature as potential splitting points, typically the midpoints between consecutive sorted values.
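
A minimal sketch of the split search described above, assuming binary 0/1 labels and entropy as the purity measure; the function names (entropy, information_gain, best_split) are illustrative, not from any library:

```python
import numpy as np

def entropy(y):
    """Entropy of a binary 0/1 label array: H(p) = -p*log2(p) - (1-p)*log2(1-p)."""
    p = y.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y, left_mask):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n, n_left = len(y), left_mask.sum()
    if n_left == 0 or n_left == n:
        return 0.0
    w_left = n_left / n
    return entropy(y) - (w_left * entropy(y[left_mask])
                         + (1 - w_left) * entropy(y[~left_mask]))

def best_split(X, y):
    """Scan every feature and every midpoint between consecutive observed values;
    return (gain, feature_index, threshold) for the split with the greatest
    positive information gain, or None if no split improves purity."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            gain = information_gain(y, X[:, j] <= t)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best if best is not None and best[0] > 0 else None

# Tiny illustration: the first feature separates the classes perfectly.
X = np.array([[2.0, 1.0], [3.0, 0.0], [10.0, 1.0], [11.0, 0.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # best split: feature 0 at threshold 6.5, gain 1.0
```

A full tree would apply best_split recursively to the left and right subsets until one of the stopping conditions above is met.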

Decision Tree Regression


The major difference between using a decision tree for classification and for regression is the metric that measures split quality while the tree is growing. For regression tasks, Mean Squared Error (MSE) is used: a split is scored by how much it reduces the weighted MSE of the children, and each leaf predicts the mean of its samples.
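
A small sketch of the regression criterion, under the same assumptions as the classification sketch above; only the purity function changes, the split search itself stays the same:

```python
import numpy as np

def mse(y):
    """MSE of predicting the node mean for every sample in the node."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def mse_reduction(y, left_mask):
    """Drop in weighted MSE from a split: the regression analogue of information gain."""
    n, n_left = len(y), left_mask.sum()
    if n_left == 0 or n_left == n:
        return 0.0
    w_left = n_left / n
    return mse(y) - (w_left * mse(y[left_mask]) + (1 - w_left) * mse(y[~left_mask]))
```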

Prevent Overfitting


Decision trees overfit very easily. There are several tricks that help mitigate this.

  • Early Stopping: Multiple hyperparameters are available to stop the tree from splitting: max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, min_impurity_decrease, max_features, etc. (see the sketch after this list).
  • Pruning: Instead of stopping the tree from growing, another approach is letting it grow at will, and prune some branches.
    • Minimal Cost-Complexity Pruning: Each node has an effective complexity parameter $\alpha$, which measures the purity gained by keeping the split at that node in proportion to the size of the subtree it creates. All nodes whose $\alpha$ is smaller than a certain threshold are then pruned.
  • Ensembling
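
As an illustration with scikit-learn (whose hyperparameter names are the ones listed above), one tree is limited by early-stopping parameters and another is grown fully and then pruned via ccp_alpha; the dataset and the particular alpha picked are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Early stopping: cap the depth and require a minimum leaf size / impurity decrease.
stopped = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                 min_impurity_decrease=1e-3, random_state=0)
stopped.fit(X_train, y_train)

# Minimal cost-complexity pruning: compute the effective alphas of a fully grown
# tree, then refit with a chosen ccp_alpha so weaker branches are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # arbitrary middle value for illustration
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

print("early stopping:", stopped.score(X_test, y_test))
print("pruned:", pruned.score(X_test, y_test))
```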

Tree Ensemble


Ensembling is not limited to the decision tree algorithm; it is a general modeling strategy. It refers to using multiple stumps (model instances) working together to produce the prediction. Each model instance is trained on a different set of training samples, and possibly even a different set of features.

There are two ensemble strategies in general:

  • Bagging: all stumps are equally important; the final result is derived from the pool of individual results.
  • Boosting: all stumps are built sequentially, with each one focusing on the errors made by the previous models (see the sketch after this list).
    • The base stump is trained as normal, and for each sample we initialize and keep a residual value $r_i$.
    • In each later iteration, fit the tree to the residuals instead of to $y$, and reduce the residuals by a constant shrinkage factor $\lambda$: $r_i \leftarrow r_i - \lambda\, h_b(x_i)$.
    • Inference uses all stumps, where $h_b$ refers to an individual instance: $\hat{y}(x) = \sum_b \lambda\, h_b(x)$.
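
A minimal boosting sketch in the spirit of the steps above, using scikit-learn depth-1 regression trees as stumps; the function names and the choice of lam = 0.1 are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_stumps(X, y, n_stumps=100, lam=0.1):
    """Boosting by residual fitting: every stump is trained on the current
    residuals, and the residuals shrink by a factor lam after each round."""
    residual = y.astype(float).copy()          # r_i starts as the raw target
    stumps = []
    for _ in range(n_stumps):
        h = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        residual -= lam * h.predict(X)         # r_i <- r_i - lam * h_b(x_i)
        stumps.append(h)
    return stumps

def predict_boosted(stumps, X, lam=0.1):
    """Inference sums the shrunken contribution of every stump h_b."""
    return lam * sum(h.predict(X) for h in stumps)
```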

Random Forest


Random Forest is a bagging ensemble whose base model is a decision tree. As the name indicates, the “forest” is built from multiple trees, each trained on a different random set of samples (a bootstrap sample) and a different random subset of the features.
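
For example, scikit-learn's RandomForestClassifier exposes both sources of randomness directly: bootstrap sampling of the rows and a random feature subset considered at each split (the dataset here is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample of the rows, and each split considers only
# a random subset of the features (sqrt of the feature count here).
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```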

AdaBoost


Unlike general boosting methods that use the same shrinkage factor for all stumps, AdaBoost calculates a different one for each stump during training. It also gives different weights to different samples: if the previous stump cannot predict a sample correctly, that sample gets a heavier weight when fitting the next stump. Likewise, each stump receives a factor $\alpha_b$ in the final vote, and the higher its accuracy, the larger its $\alpha_b$.

  • Initialize sample weights $w_i = 1/N$.
  • Repeat for a certain number of iterations; in each iteration $b$ (see the sketch after this list):
    • Fit a tree to the training data (different from general boosting methods, which fit the residuals) with sample weights $w_i$.
    • Calculate the weighted error $err_b = \sum_i w_i\, \mathbb{1}(y_i \ne h_b(x_i)) \,/\, \sum_i w_i$.
    • Assign the stump the factor $\alpha_b = \tfrac{1}{2} \ln \frac{1 - err_b}{err_b}$.
    • Update the sample weights $w_i \leftarrow w_i\, e^{-\alpha_b y_i h_b(x_i)}$ and normalize them.
  • Output $G(x) = \mathrm{sign}\big(\sum_b \alpha_b h_b(x)\big)$, where sign literally outputs 1 or -1.
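
A sketch of the loop above, assuming labels in {-1, +1} and depth-1 scikit-learn trees as stumps; it uses the common $\tfrac{1}{2}\ln\frac{1-err}{err}$ form of the stump factor:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_stumps=50):
    """AdaBoost sketch for labels y in {-1, +1}, following the steps above."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # w_i = 1/N
    stumps, alphas = [], []
    for _ in range(n_stumps):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # accurate stump -> larger say
        w = w * np.exp(-alpha * y * pred)       # misclassified samples get heavier
        w = w / w.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the alpha-weighted vote over all stumps."""
    votes = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(votes)
```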

Gradient Boosting


Gradient Boosting is a generalized boosting algorithm. Instead of fitting each tree to a residual kept per sample, it fits the tree to the negative gradient of the loss with respect to the current prediction. For squared-error loss the negative gradient is exactly the residual, so plain residual boosting is a special case.
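
A sketch under the same assumptions as the earlier boosting example; swapping the loss from squared to absolute error changes only the negative gradient the trees are fit to:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, lam=0.1, loss="squared"):
    """Gradient boosting sketch: each tree fits the negative gradient of the loss
    w.r.t. the current prediction F. For squared error that gradient is the
    residual y - F; for absolute error it is sign(y - F)."""
    base = y.mean()                     # start from a constant prediction
    F = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        neg_grad = (y - F) if loss == "squared" else np.sign(y - F)
        h = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
        F = F + lam * h.predict(X)
        trees.append(h)
    return base, trees

def gradient_boost_predict(base, trees, X, lam=0.1):
    return base + lam * sum(h.predict(X) for h in trees)
```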

Which to Pick?


When the number of features is large (more than a hundred or so), Random Forest tends to perform better, while Gradient Boosting almost always performs better than AdaBoost.

© 2024-present Zane Chen. All Rights Reserved.