Greedy Function Approximation: A Gradient Boosting Machine
1. AdaBoost.M1
| Algorithm: AdaBoost.M1 |
|---|
| 1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$. |
| 2. For $m = 1$ to $M$: |
| &emsp;(a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$. |
| &emsp;(b) Compute $\mathrm{err}_m = \dfrac{\sum_{i=1}^{N} w_i\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$. |
| &emsp;(c) Compute $\alpha_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$. |
| &emsp;(d) Set $w_i \leftarrow w_i \cdot \exp\big[\alpha_m \cdot I(y_i \neq G_m(x_i))\big]$, $i = 1, 2, \ldots, N$. |
| 3. Output $G(x) = \operatorname{sign}\big[\sum_{m=1}^{M} \alpha_m G_m(x)\big]$. |
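To make the steps above concrete, here is a minimal Python sketch of AdaBoost.M1. The choice of scikit-learn decision stumps as the base classifier $G_m$, the default of $M = 50$ rounds, and the early-stopping guard are illustrative assumptions rather than part of the algorithm; labels are assumed to be coded as $\{-1, +1\}$.

```python
# Minimal AdaBoost.M1 sketch; decision stumps, M, and the early-stop
# guard are assumptions for illustration. Labels must be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_m1(X, y, M=50):
    """Return the fitted weak classifiers G_m and their weights alpha_m."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        # (a) fit a weak classifier to the weighted training data
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = G.predict(X) != y
        # (b) weighted misclassification error
        err = np.sum(w * miss) / np.sum(w)
        if err >= 0.5:                           # weak learner must beat chance
            break
        err = max(err, 1e-10)                    # guard against a perfect stump
        # (c) classifier weight
        alpha = np.log((1 - err) / err)
        # (d) up-weight the misclassified instances, then renormalize
        w *= np.exp(alpha * miss)
        w /= np.sum(w)
        stumps.append(G)
        alphas.append(alpha)
    return stumps, alphas


def adaboost_predict(stumps, alphas, X):
    """Step 3: sign of the alpha-weighted vote of the weak classifiers."""
    return np.sign(sum(a * G.predict(X) for a, G in zip(alphas, stumps)))
```

Given arrays `X, y` with `y` in $\{-1, +1\}$, `adaboost_predict(*adaboost_m1(X, y), X_new)` reproduces the weighted vote of step 3.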
Step 2(d) of the algorithm increases the weights of the misclassified instances and leaves the correctly classified ones unchanged. However, once the weights are renormalized to sum to one, the weights of the correctly classified instances do decrease, as the quick example below shows.
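A quick numerical check (toy numbers, not from the paper): take $N = 4$ instances with uniform weights $w_i = 1/4$ and suppose exactly one is misclassified, so $\mathrm{err}_m = 1/4$ and $\alpha_m = \log 3$. Step 2(d) raises the misclassified weight to $3/4$ and leaves the three correct weights at $1/4$, giving a total of $3/2$; after normalization the misclassified weight becomes $\frac{3/4}{3/2} = \frac{1}{2}$, while each correct weight falls from $1/4$ to $\frac{1/4}{3/2} = \frac{1}{6}$.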
See LCY Couple's Exponential Loss: AdaBoost.M1 for more information about AdaBoost.M1.
See chapter 10 of The Elements of Statistical Learning for more on the equivalence between AdaBoost and forward stage-wise additive modeling.
The insight lies in the exponential loss function and one-half the log-odds: AdaBoost.M1 minimizes the exponential loss criterion via a forward-stagewise additive modeling procedure, and the additive expansion it produces estimates one-half the log-odds of $P(Y=1 \mid x)$.
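A one-line sketch of the identity behind that claim, following chapter 10 of The Elements of Statistical Learning: the conditional expected exponential loss is $E\big[e^{-Y f(x)} \mid x\big] = P(Y=1 \mid x)\, e^{-f(x)} + P(Y=-1 \mid x)\, e^{f(x)}$, and setting its derivative with respect to $f(x)$ to zero gives

$$f^*(x) = \arg\min_{f(x)} E\big[e^{-Y f(x)} \mid x\big] = \frac{1}{2} \log \frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)},$$

i.e. one-half the log-odds.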
2. Function estimation
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space.
Using a "training" sample {yi, xi}Ni of known (y, x)-values, the goal is to obtain an estimate or approximation F(x), of the true function F∗(x) mapping x to y, that minimizes the expected value of some specified loss function L(y,F(x)) over the jiont distribution of all (y, x) values:
This paper focuses on the additive form

$$F(x; \{\beta_m, \gamma_m\}_1^M) = \sum_{m=1}^{M} \beta_m\, h(x; \gamma_m),$$

where $h(x; \gamma)$ is a simple parameterized function (base learner) with parameters $\gamma$. Optimizing over the training data, this becomes

$$\{\beta_m, \gamma_m\}_1^M = \arg\min_{\{\beta'_m, \gamma'_m\}_1^M} \sum_{i=1}^{N} L\!\left(y_i, \sum_{m=1}^{M} \beta'_m\, h(x_i; \gamma'_m)\right).$$

However, it is computationally expensive to optimize the empirical loss $\sum_{i=1}^{N} L\big(y_i, \sum_{m=1}^{M} \beta_m h(x_i; \gamma_m)\big)$ over all $M$ terms of the additive expansion $\sum_{m=1}^{M} \beta_m h(x_i; \gamma_m)$ simultaneously. Alternatively, one can optimize via a greedy stage-wise strategy, fitting one base learner at a time: for $m = 1, 2, \ldots, M$,

$$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; \gamma)\big),$$

$$F_m(x) = F_{m-1}(x) + \beta_m\, h(x; \gamma_m).$$
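As an illustration of the stage-wise idea, here is a minimal Python sketch under squared-error loss, for which fitting the $m$-th base learner to the current residuals $y_i - F_{m-1}(x_i)$ solves the stage-wise subproblem (with $\beta_m$ absorbed into the stump's fitted values); the regression stumps, the constant $F_0$, and the number of stages are assumptions for illustration.

```python
# Forward stage-wise additive modeling sketch under squared-error loss.
# Regression stumps, the constant F_0, and M are illustrative choices.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def stagewise_fit(X, y, M=100):
    """Greedily add one base learner per stage; earlier stages stay frozen."""
    f0 = float(np.mean(y))
    F = np.full(len(y), f0)              # F_0: constant initial model
    learners = []
    for _ in range(M):
        residual = y - F                 # what the current model still misses
        h = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        F += h.predict(X)                # F_m = F_{m-1} + h_m
        learners.append(h)
    return f0, learners


def stagewise_predict(f0, learners, X):
    return f0 + sum(h.predict(X) for h in learners)
```

Note that only the newest $(\beta_m, \gamma_m)$ is fit at each stage; the earlier terms are never revisited, which is exactly what separates the greedy stage-wise strategy from jointly optimizing all $M$ terms.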
2.1 Numerical optimization
In general, choosing a parameterized model $F(x; P)$ changes the function optimization problem to one of parameter optimization:

$$P^* = \arg\min_{P} \Phi(P), \qquad \Phi(P) = E_{y,x}\, L\big(y, F(x; P)\big),$$

so that $F^*(x) = F(x; P^*)$.
The goal is to minimize $L(f)$ with respect to $f$, where here $f(x)$ is constrained to be a weighted sum of base learners. Ignoring this constraint, minimizing $L(f)$ can be viewed as a numerical optimization

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}), \qquad L(\mathbf{f}) = \sum_{i=1}^{N} L\big(y_i, f(x_i)\big),$$

where the "parameters" are the values of the approximating function at the $N$ training points, $\mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}$.
Numerical optimization procedures solve for $\hat{\mathbf{f}}$ as a sum of component vectors

$$\mathbf{f}_M = \sum_{m=0}^{M} \mathbf{h}_m, \qquad \mathbf{h}_m \in \mathbb{R}^N,$$

where $\mathbf{f}_0 = \mathbf{h}_0$ is an initial guess and each subsequent $\mathbf{f}_m$ is induced based on the current $\mathbf{f}_{m-1}$.
For most $F(x; P)$ and $L$, numerical optimization methods must be applied to solve the parameter optimization above. This often involves expressing the solution in the form

$$P^* = \sum_{m=0}^{M} p_m,$$

where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments ("steps" or "boosts"), each based on the sequence of preceding steps. The prescription for computing each step $p_m$ is defined by the optimization method.
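As a toy illustration of this "initial guess plus increments" view (not from the paper), the sketch below runs steepest descent on an assumed quadratic $\Phi(P)$, so each increment is $p_m = -\rho\, g_m$ with $g_m$ the gradient at the current point; the matrix `A`, vector `b`, fixed step length `rho`, and iteration count are all made up for the example. Gradient boosting applies the same idea in function space rather than parameter space.

```python
# Toy "sum of increments" illustration: steepest descent on an assumed
# quadratic Phi(P) = 0.5 P'AP - b'P, whose exact minimizer is A^{-1} b.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # assumed positive-definite matrix
b = np.array([1.0, -1.0])                # assumed linear term


def phi(P):
    """The parameter-space objective Phi(P)."""
    return 0.5 * P @ A @ P - b @ P


def grad(P):
    """Gradient of Phi at P; it determines the next increment."""
    return A @ P - b


p0 = np.zeros(2)       # initial guess
P = p0.copy()
steps = []             # the successive increments ("steps" or "boosts") p_m
rho = 0.2              # fixed step length, small enough to converge here
for _ in range(50):
    p_m = -rho * grad(P)       # steepest-descent increment
    steps.append(p_m)
    P = P + p_m                # P_m = P_{m-1} + p_m

# The final estimate is the initial guess plus the sum of all increments,
# and it approaches the exact minimizer A^{-1} b.
assert np.allclose(P, p0 + np.sum(steps, axis=0))
print("estimate:", P, " exact:", np.linalg.solve(A, b), " Phi:", phi(P))
```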
References
Exponential Loss: AdaBoost.M1, by LCY Couple.
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.