2. Ensemble Learning Bagging

4. Ensemble Learning Adaboost

5. Ensemble Learning GBDT

1. AdaBoost Vs GBDT

  • Similarities
  1. Both AdaBoost and GBDT repeatedly fit a mediocre (weak) model, adjusting each round based on the performance of the previous model.
  • Differences
  1. AdaBoost locates the model's shortcomings by increasing the weights of misclassified data points.
  2. GBDT locates them through gradients: each iteration fits the residuals (the negative gradient of the loss).
  3. GBDT is the more general algorithm and can use more kinds of objective functions.
  4. AdaBoost is generally used for classification, GBDT for regression.
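
To make the contrast concrete, here is a minimal NumPy sketch of one round of each method. It assumes labels in {-1, +1} for AdaBoost and squared loss for GBDT; the function names `adaboost_reweight` and `gbdt_residuals` are hypothetical helpers written for this illustration, not part of any library.

```python
import numpy as np

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}: return (alpha, new weights)."""
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)
    # Clip so a perfect or useless weak learner does not divide by zero.
    err = np.clip(err, 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified points (y_true * y_pred = -1) get their weight increased.
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_w / new_w.sum()

def gbdt_residuals(y_true, current_pred):
    """Negative gradient of 0.5 * (y - F)^2 with respect to F, i.e. the residual."""
    return y_true - current_pred

# AdaBoost: the second sample is misclassified, so its weight grows.
y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])
w0 = np.full(5, 0.2)
alpha, w1 = adaboost_reweight(w0, y, pred)

# GBDT: the next tree would be fitted to these residuals.
y_reg = np.array([3.0, 5.0, 8.0])
f0 = np.full(3, y_reg.mean())
res = gbdt_residuals(y_reg, f0)
```

AdaBoost changes how much each sample counts, while GBDT changes the target that the next tree is fitted to.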


2. GBDT Vs XGBoost

  1. Selection of base classifier: traditional GBDT uses CART as the base classifier, while XGBoost also supports linear classifiers. In that case XGBoost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.
  2. Second-order Taylor expansion: traditional GBDT uses only first-order derivative information in optimization, while XGBoost applies a second-order Taylor expansion to the cost function and uses the first and second derivatives together. Incidentally, the XGBoost tool supports custom loss functions, as long as their first and second derivatives exist.
  3. Regularization: XGBoost adds regularization terms to the objective function to control the complexity of the model.
  4. Column subsampling: XGBoost borrows column sampling from random forests, which reduces both over-fitting and computation. This is another feature that distinguishes XGBoost from traditional GBDT.
  5. Missing value processing: XGBoost takes into account that the training data may be sparse. It can assign a default branch direction to missing values (or to a specified value), which can greatly improve the efficiency of the algorithm; the paper mentions a 50-fold speedup. In other words, for samples with missing feature values, XGBoost learns the split direction automatically.
  6. Parallelism: the XGBoost tool supports parallel training. Isn't boosting a serial structure? How can it run in parallel? Note that XGBoost's parallelism is not at tree granularity: XGBoost still finishes one iteration before starting the next (the loss function of iteration t contains the predictions from the previous t-1 iterations). The parallelism is at feature granularity. One of the most time-consuming steps in decision tree learning is sorting the feature values (needed to determine the best split point). Before training, XGBoost pre-sorts the data and saves it in a block structure, which is reused in subsequent iterations and greatly reduces the amount of computation. This block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the largest gain is chosen, so the gain calculations for the different features can run in multiple threads.
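
Points 2, 3 and 5 above can be sketched together. The code below is a simplified illustration, not XGBoost's actual implementation. It uses the second-order statistics (sums of gradients G and Hessians H) with an L2 penalty `lam` and a per-leaf penalty `gamma` to score a split, and tries sending the missing-value rows to each side to pick a default direction. The names `leaf_weight`, `split_gain` and `best_split_with_default` are hypothetical, and squared loss is assumed for the example data.

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal leaf value under the second-order approximation: -G / (H + lam)."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction of a split, minus the penalty gamma for the extra leaf."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

def best_split_with_default(x, g, h, lam=1.0, gamma=0.0):
    """Scan thresholds on one feature; NaN rows are tried on each side."""
    missing = np.isnan(x)
    G_miss, H_miss = g[missing].sum(), h[missing].sum()
    xs, gs, hs = x[~missing], g[~missing], h[~missing]
    order = np.argsort(xs)  # the sort that XGBoost pre-computes once per feature
    xs, gs, hs = xs[order], gs[order], hs[order]
    G_tot, H_tot = g.sum(), h.sum()
    best = (-np.inf, None, None)  # (gain, threshold, default direction)
    GL = HL = 0.0
    for i in range(len(xs) - 1):
        GL += gs[i]
        HL += hs[i]
        if xs[i] == xs[i + 1]:
            continue  # no threshold can separate equal values
        thr = 0.5 * (xs[i] + xs[i + 1])
        gain_left = split_gain(GL + G_miss, HL + H_miss,
                               G_tot - GL - G_miss, H_tot - HL - H_miss, lam, gamma)
        gain_right = split_gain(GL, HL, G_tot - GL, H_tot - HL, lam, gamma)
        for gain, direction in ((gain_left, "left"), (gain_right, "right")):
            if gain > best[0]:
                best = (gain, thr, direction)
    return best

# Squared loss with current prediction 0: gradient = pred - y, Hessian = 1.
x = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
g, h = 0.0 - y, np.ones_like(y)
gain, thr, direction = best_split_with_default(x, g, h)
```

In this toy data the rows with missing values have the same labels as the large-x rows, so the scan sends them to the right branch at the threshold between 2 and 3.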

3. XGBoost Vs lightGBM

  1. XGBoost adopts a level-wise splitting strategy, while lightGBM adopts a leaf-wise strategy. The difference is that XGBoost splits all nodes in each layer indiscriminately. Some nodes may have very small gain and little influence on the result, yet XGBoost still splits them, which brings unnecessary overhead. The leaf-wise method instead picks, among all current leaf nodes, the one with the greatest split gain and splits it. Leaf-wise growth over-fits easily because it tends to grow the tree very deep, so the maximum depth must be limited to avoid over-fitting.
  2. lightGBM uses a histogram-based decision tree algorithm, which differs from the exact algorithm in XGBoost. The histogram algorithm has large advantages in memory and computing cost. Histogram algorithm: https://blog.csdn.net/jasonwang_/article/details/80833001
  3. Histogram subtraction acceleration: the histogram of a child node can be obtained by subtracting the histogram of its sibling node from the histogram of the parent node, which speeds up the calculation.
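
The histogram subtraction trick can be checked with a few lines of NumPy. This is a minimal sketch on random data, assuming the feature has already been bucketed into `n_bins` integer bins. The helper `build_histogram` and the random split assignment `go_left` are made up for the illustration and are not lightGBM APIs.

```python
import numpy as np

def build_histogram(bin_idx, grad, n_bins):
    """Sum the gradients that fall into each feature bin."""
    return np.bincount(bin_idx, weights=grad, minlength=n_bins)

rng = np.random.default_rng(0)
n_bins = 8
bin_idx = rng.integers(0, n_bins, size=1000)  # pre-bucketed feature values
grad = rng.normal(size=1000)                  # per-sample gradients
go_left = rng.random(1000) < 0.3              # hypothetical split assignment

parent_hist = build_histogram(bin_idx, grad, n_bins)
# Only the smaller child is scanned directly...
left_hist = build_histogram(bin_idx[go_left], grad[go_left], n_bins)
# ...and the sibling comes from a subtraction over n_bins entries.
right_hist = parent_hist - left_hist

# Reference result, computed the slow way, for comparison only.
right_direct = build_histogram(bin_idx[~go_left], grad[~go_left], n_bins)
```

The subtraction costs time proportional to the number of bins rather than the number of samples in the larger child, which is where the saving comes from.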
