Greedy Function Approximation: A Gradient Boosting Machine


1. AdaBoost.M1

Algorithm: AdaBoost.M1
1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. For $m$ = 1 to $M$:
(a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
(b) Compute $err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$.
(c) Compute $a_m = \log((1 - err_m)/err_m)$.
(d) Set $w_i \leftarrow w_i \cdot \exp[a_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \ldots, N$.
3. Output $G(x) = \mathrm{sign}\left[\sum_{m=1}^{M} a_m G_m(x)\right]$.

Step 2(d) increases the weights of the misclassified instances and leaves the weights of the correctly classified instances unchanged. However, once the weights are normalized over all instances, the relative weights of the correctly classified instances do decrease.
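
As a concrete illustration, here is a minimal Python sketch of the algorithm above. It assumes binary labels coded in {-1, +1} and uses decision stumps (scikit-learn's DecisionTreeClassifier with max_depth=1) as the weak learners; the function names are illustrative, not from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """Minimal AdaBoost.M1 sketch; assumes labels y are coded in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    classifiers, alphas = [], []
    for _ in range(M):
        G = DecisionTreeClassifier(max_depth=1)  # weak learner: a decision stump
        G.fit(X, y, sample_weight=w)             # step 2(a): fit using current weights
        miss = (G.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()          # step 2(b): weighted error rate
        if err <= 0.0 or err >= 0.5:             # degenerate weak learner: stop early
            break
        a = np.log((1.0 - err) / err)            # step 2(c): classifier weight
        w = w * np.exp(a * miss)                 # step 2(d): up-weight the mistakes
        w = w / w.sum()                          # normalize; correct cases shrink in relative terms
        classifiers.append(G)
        alphas.append(a)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Step 3: sign of the weighted committee vote."""
    votes = sum(a * G.predict(X) for a, G in zip(alphas, classifiers))
    return np.sign(votes)
```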

See LCY Couple's Exponential Loss: AdaBoost.M1 for more information about AdaBoost.M1.
See chapter 10 of The Elements of Statistical Learning for more information about the equivalence between AdaBoost and forward-stagewise additive modeling.
The insight lies in the exponential loss function and the one-half log-odds: AdaBoost.M1 minimizes the exponential loss criterion via a forward-stagewise additive modeling method, and the additive expansion it produces estimates one-half the log-odds of $P(Y=1\mid x)$.
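
For reference, the key identity from ESL chapter 10 is that the population minimizer of the exponential loss is exactly one-half the log-odds, so its sign recovers the Bayes classifier:

$$
f^*(x) = \arg\min_{f(x)} E_{Y\mid x}\big[e^{-Y f(x)}\big] = \frac{1}{2}\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}
$$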

2. Function estimation

Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space.

Using a "training" sample $\{y_i, x_i\}_1^N$ of known $(y, x)$-values, the goal is to obtain an estimate or approximation $F(x)$ of the true function $F^*(x)$ mapping $x$ to $y$, one that minimizes the expected value of some specified loss function $L(y, F(x))$ over the joint distribution of all $(y, x)$ values:

$$
F^* = \arg\min_{F(x)} E_{y,x}\big[L(y, F(x))\big] = \arg\min_{F(x)} E_x\Big[E_y\big(L(y, F(x)) \mid x\big)\Big]
$$

This paper focuses on the additive form:

$$
F(x; \{\beta_m, \gamma_m\}_1^M) = \sum_{m=1}^{M} \beta_m h(x; \gamma_m)
$$

Optimizing on the training data, this amounts to:

$$
F^* = \arg\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M} \beta_m h(x_i; \gamma_m)\Big)
$$

However, it is computationally infeasible to optimize all of the coefficients and parameters of the additive expansion $\sum_{m=1}^{M}\beta_m h(x_i;\gamma_m)$ against the empirical loss $\sum_{i=1}^{N} L\big(y_i, \sum_{m=1}^{M}\beta_m h(x_i;\gamma_m)\big)$ simultaneously.

Alternatively, one can optimize with a greedy (forward-stagewise) strategy, adding one base learner at a time to the previous expansion $F_{m-1}$:

$$
(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; \gamma)\big), \qquad
F_m(x) = F_{m-1}(x) + \beta_m h(x; \gamma_m)
$$
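
For intuition, here is a minimal sketch of this greedy strategy under the assumption of squared-error loss, with regression stumps standing in for $h(x; \gamma)$; the loss choice and names are illustrative, not the paper's general algorithm. Under squared-error loss, each stage simply fits the current residuals, and the line search for $\beta_m$ has a closed form.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100):
    """Forward-stagewise additive fit, assuming squared-error loss and
    regression stumps as the base learners h(x; gamma)."""
    N = X.shape[0]
    F = np.zeros(N)                              # F_0(x_i) = 0 on the training points
    learners, betas = [], []
    for _ in range(M):
        residual = y - F                         # stagewise target under squared loss
        h = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        h_pred = h.predict(X)
        # closed-form line search for beta_m under squared-error loss
        beta = (h_pred @ residual) / (h_pred @ h_pred + 1e-12)
        F = F + beta * h_pred                    # F_m = F_{m-1} + beta_m * h(x; gamma_m)
        learners.append(h)
        betas.append(beta)
    return learners, betas
```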

2.1 Numerical optimization

In general, choosing a parameterized model $F(x; \gamma)$ changes the function optimization problem to one of parameter optimization. On the training data, the objective to minimize becomes:

$$
L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))
$$

The goal is to minimize $L(f)$ with respect to $f$, where here $f(x)$ is constrained to be a weighted sum of base learners. Ignoring this constraint, minimizing $L(f)$ can be viewed as a numerical optimization:

$$
f^* = \arg\min_{f} L(f), \qquad f \in \mathbb{R}^N, \qquad f = \{f(x_1), f(x_2), \ldots, f(x_N)\}
$$

Numerical optimization procedures solve for $f^*$ as a sum of component vectors:

$$
f_M = \sum_{m=0}^{M} h_m, \qquad h_m \in \mathbb{R}^N
$$

For most $F(x; \gamma)$ and $L$, numerical optimization methods must be applied to solve the above formula. This often involves expressing the solution for the parameters in the following form:

$$
P^* = \sum_{m=0}^{M} p_m
$$

where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments ("steps" or "boosts"), each based on the sequence of preceding steps. The prescription for computing each step $p_m$ is defined by the optimization method.
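
For example (this is the steepest-descent instance the paper goes on to use), each increment is the negative gradient of the objective evaluated at the current parameter estimate, scaled by a line-searched step length $\rho_m$; writing $\Phi(P)$ for the objective as a function of the parameters:

$$
g_m = \left\{\frac{\partial \Phi(P)}{\partial P}\right\}_{P = P_{m-1}}, \qquad
p_m = -\rho_m\, g_m, \qquad
\rho_m = \arg\min_{\rho} \Phi(P_{m-1} - \rho\, g_m),
$$

where $P_{m-1} = \sum_{i=0}^{m-1} p_i$.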

References

Exponential Loss: AdaBoost.M1 (http://www.lovecaoying.com/blog/?p=105)
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (http://book.douban.com/subject/3294335/)


