Gradient boosting: Difference between revisions

Content deleted Content added

Inline

Revision as of 18:49, 9 August 2012

Gradient boosting is a machine learning technique for regression problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. Gradient boosting method can be also used for classification problems by reducing them to regression with a suitable loss function.

The method was invented by Jerome H. Friedman in 1999 and was published in a series of two papers, the first of which^[1] introduced the method, and the second one^[2] described an important tweak to the algorithm, which improves its accuracy and performance.

Gradient boosting

In many supervised learning problems one has an output variable y and a vector of input variables x connected together via a joint probability distribution P(x, y). Using a training set $\!(x_{1},y_{1}),\ldots ,(x_{n},y_{n})$ of known values of x and corresponding values of y, the goal is to find an approximation ${\hat {F}}(x)$ to a function $\!F^{*}(x)$ that minimizes the expected value of some specified loss function L(y, F(x)):

F^{*}={\underset {F}{\operatorname {arg\,min} }}E_{x,y}L(y,F(x)).

Gradient boosting method assumes a real-valued y and seeks an approximation ${\hat {F}}(x)$ in the form of a weighted sum of functions $\!h_{i}(x)$ from some class $\!{\mathcal {H}}$ , called base (or weak) learners:

F(x)=\sum _{i=1}^{M}\gamma _{i}h_{i}(x)+{\mbox{const}}.

In accordance with the empirical risk minimization principle, the method tries to find an approximation ${\hat {F}}(x)$ that minimizes the average value of the loss function on the training set. It does so by starting with a model, consisting of a constant function $\!F_{0}(x)$ , and incrementally expanding it in a greedy fashion:

F_{0}(x)={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},\gamma ),

F_{m}(x)=F_{m-1}(x)+{\underset {f\in {\mathcal {H}}}{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},F_{m-1}(x_{i})+f(x_{i})),

where f is restricted to be a function from the class ${\mathcal {H}}$ of base learner functions.

However, the problem of choosing at each step the best f for an arbitrary loss function L is a hard optimization problem in general, and so we'll "cheat" by solving a much easier problem instead.

The idea is to apply a steepest descent step to this minimization problem. If we only cared about predictions at the points of the training set, and f were unrestricted, we'd update the model per the following equation, where we view L(y, f) not as a functional of f, but as a function of a vector of values $\!f(x_{1}),\ldots ,f(x_{n})$ :

F_{m}(x)=F_{m-1}(x)-\gamma _{m}\sum _{i=1}^{n}\nabla _{f}L(y_{i},F_{m-1}(x_{i})),

\gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L\left(y_{i},F_{m-1}(x_{i})-\gamma {\frac {\partial L(y_{i},F_{m-1}(x_{i}))}{\partial f(x_{i})}}\right).

But as f must come from a restricted class of functions (that's what allows us to generalize), we'll just choose the one that most closely approximates the gradient of L. Having chosen f, the multiplier γ is then selected using line search just as shown in the second equation above.

In pseudocode, the generic gradient boosting method is:^[1]^[3]

Input: training set $\!\{(x_{i},y_{i})\}_{i=1}^{n},$ a differentiable loss function $\!L(y,F(x)),$ number of iterations $\!M.$

Algorithm:

Initialize model with a constant value:
$F_{0}(x)={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},\gamma ).$
For m = 1 to M:
1. Compute so-called pseudo-residuals:
  $r_{im}=-\left[{\frac {\partial L(y_{i},F(x_{i}))}{\partial F(x_{i})}}\right]_{F(x)=F_{m-1}(x)}\quad {\mbox{for }}i=1,\ldots ,n.$
2. Fit a base learner $\!h_{m}(x)$ to pseudo-residuals, i.e. train it using the training set $\{(x_{i},r_{im})\}_{i=1}^{n}$ .
3. Compute multiplier $\!\gamma _{m}$ by solving the following one-dimensional optimization problem:
  $\gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L\left(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})\right).$
4. Update the model:
  $\!F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x).$
Output $\!F_{M}(x).$

Gradient tree boosting

Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case Friedman proposes a modification to gradient boosting method which improves the quality of fit of each base learner.

Generic gradient boosting at the m-th step would fit a decision tree $\!h_{m}(x)$ to pseudo-residuals. Let $\!J$ be the number of its leaves. The tree partitions the input space into $\!J$ disjoint regions $\!R_{1m},\ldots ,R_{Jm}$ and predicts a constant value in each region. Using the indicator notation, the output of $\!h_{m}(x)$ for input x can be written as the sum:

h_{m}(x)=\sum _{j=1}^{J}b_{jm}I(x\in R_{jm}),

where $\!b_{jm}$ is the value predicted in the region $\!R_{jm}$ .^[4]

Then the coefficients $\!b_{jm}$ are multiplied by some value $\!\gamma _{m}$ , chosen using line search so as to minimize the loss function, and the model is updated as follows:

F_{m}(x)=F_{m-1}(x)+\gamma _{m}h_{m}(x),\quad \gamma _{m}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{i=1}^{n}L(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})).

Friedman proposes to modify this algorithm so that it chooses a separate optimal value $\!\gamma _{jm}$ for each of the tree's regions, instead of a single $\!\gamma _{m}$ for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients $\!b_{jm}$ from the tree-fitting procedure can be then simply discarded and the model update rule becomes:

F_{m}(x)=F_{m-1}(x)+\sum _{j=1}^{J}\gamma _{jm}I(x\in R_{jm}),\quad \gamma _{jm}={\underset {\gamma }{\operatorname {arg\,min} }}\sum _{x_{i}\in R_{jm}}L(y_{i},F_{m-1}(x_{i})+\gamma h_{m}(x_{i})).

Size of trees

$\!J$ , the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model. With $\!J=2$ (decision stumps), no interaction between variables is allowed. With $\!J=3$ the model may include effects of the interaction between up to two variables, and so on.

Hastie et al.^[3] comment that typically $\!4\leq J\leq 8$ work well for boosting and results are fairly insensitive to the choice of $\!J$ in this range, $\!J=2$ is insufficient for many applications, and $\!J>10$ is unlikely to be required.

Regularization

Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.

Shrinkage

An important part of gradient boosting method is regularization by shrinkage which consists in modifying the update rule as follows:

F_{m}(x)=F_{m-1}(x)+\nu \cdot \gamma _{m}h_{m}(x),\quad 0<\nu \leq 1,

where parameter $\nu$ is called the "learning rate".

Empirically it has been found that using small learning rates (such as $\nu <0.1$ ) yields dramatic improvements in model's generalization ability over gradient boosting without shrinking ( $\nu =1$ ).^[3] However, it comes at the price of increasing computational time both during training and querying: lower learning rate requires more iterations.

Stochastic gradient boosting

Soon after the introduction of gradient boosting Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method.^[2] Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement.^[5] Friedman observed a substantional improvement in gradient boosting's accuracy with this modification.

Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Typically, f is set to 0.5, meaning that one half of the training set is used to build each base learner.

Also, like in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.^[6]

Number of observations in leaves

Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in gbm package^[6]). It's used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances.

Imposing this limit helps to reduce variance in predictions at leaves.

Usage

Recently, gradient boosting method has gained some popularity in learning to rank field. Commercial web search engines Yahoo^[7] and Yandex^[8] use variants of gradient boosting method in their machine-learned ranking engines.

Names

The method goes by a wide variety of names. Title of the original publication^[1] refers to it as a "Gradient Boosting Machine" (GBM). That same publication and a later one^[2] by J. Friedman also use the names "Gradient Boost", "Stochastic Gradient Boosting" (emphasizing the random subsampling technique), "Gradient Tree Boosting" and "TreeBoost" (for specialization of the method to the case of decision trees as base learners.)

A popular open-source implementation^[6] calls it "Generalized Boosting Model". Sometimes the method is referred to as "functional gradient boosting", "Gradient Boosted Models" and its tree version is also called "Gradient Boosted Decision Trees" (GBDT) or "Gradient Boosted Regression Trees" (GBRT). Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.

Software

Open-source

Several open-source R packages are available: gbm,^[6] mboost, gbev.
Weka has an incomplete implementation in weka.classifiers.meta.AdditiveRegression.
elf-project
TMVA: Toolkit for Multivariate Data Analysis
RT-Rank: Large Scale Multi-Threaded C++ Implementation of Gradient Boosting
OpenCV contains the implementation since the version 2.2^[9]
pGBRT: C++/MPI-based parallel/distributed implementation, released September 2011
scikit-learn

Proprietary

TreeNet is a commercial implementation from Salford Systems, possibly "equipped with patent-pending extensions."
DTREG TreeBoost
An implementation of stochastic gradient boosting is available in STATISTICA.
Yahoo and Google have published papers describing an MPI- and a MapReduce-based parallel implementations of gradient boosting.^[10]^[11] However, they have not made the code publicly available.

References

^ ^a ^b ^c Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (February 1999)
^ ^a ^b ^c Friedman, J. H. "Stochastic Gradient Boosting." (March 1999)
^ ^a ^b ^c Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). "10. Boosting and Additive Trees". The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384. ISBN 0-387-84857-6.
^ Note: in case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient $b_{jm}$ for the region $R_{jm}$ is equal to just the value of output variable, averaged over all training instances in $R_{jm}$ .
^ Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.
^ ^a ^b ^c ^d Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package.
^ Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking, page 14.
^ Yandex corporate blog entry about new ranking model "Snezhinsk" (in Russian)
^ OpenCV change logs
^ B. Panda, et al. (2009). PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce.
^ Jerry Ye, et al. (2009). Stochastic gradient boosted distributed decision trees.

[Friedman1999a-1] Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (February 1999)

[Friedman1999b-2] Friedman, J. H. "Stochastic Gradient Boosting." (March 1999)

[hastie-3] Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). "10. Boosting and Additive Trees". The Elements of Statistical Learning (2nd ed.). New York: Springer. pp. 337–384. ISBN 0-387-84857-6.

[4] Note: in case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient $b_{jm}$ for the region $R_{jm}$ is equal to just the value of output variable, averaged over all training instances in $R_{jm}$ .

[5] Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.

[gbm-vignette-6] Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package.

[7] Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking, page 14.

[snezhinsk-8] Yandex corporate blog entry about new ranking model "Snezhinsk" (in Russian)

[9] OpenCV change logs

[10] B. Panda, et al. (2009). PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce.

[11] Jerry Ye, et al. (2009). Stochastic gradient boosted distributed decision trees.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

@@ Line 63: / Line 63: @@
 : <math>
     F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} I(x \in R_{jm}), \quad
-    \gamma_{jm} = \underset{\gamma}{\operatorname{arg\,min}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i).
+    \gamma_{jm} = \underset{\gamma}{\operatorname{arg\,min}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)).
   </math>