Greedy Function Approximation: A Gradient Boosting Machine


1. AdaBoost.M1

Algorithm: AdaBoost.M1
1. Initialize the observation weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. For $m$ = 1 to $M$:
(a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
(b) Compute $err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$.
(c) Compute $a_m = \log((1 - err_m)/err_m)$.
(d) Set $w_i \leftarrow w_i \cdot \exp[a_m \cdot I(y_i \neq G_m(x_i))]$, $i = 1, 2, \ldots, N$.
3. Output $G(x) = \mathrm{sign}\left[\sum_{m=1}^{M} a_m G_m(x)\right]$.

Step 2(d) increases the weights of the misclassified instances and leaves the weights of the correctly classified instances unchanged. However, once the weights are normalized over all instances, the relative weights of the correctly classified instances do decrease.
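
As a concrete illustration, here is a minimal Python sketch of the algorithm above. It assumes binary labels coded in {-1, +1} and uses decision stumps (scikit-learn's DecisionTreeClassifier with max_depth=1) as the weak learners; the function names are illustrative, not from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    """Minimal AdaBoost.M1 sketch; assumes labels y are coded in {-1, +1}."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    classifiers, alphas = [], []
    for _ in range(M):
        G = DecisionTreeClassifier(max_depth=1)  # weak learner: a decision stump
        G.fit(X, y, sample_weight=w)             # step 2(a): fit using current weights
        miss = (G.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()          # step 2(b): weighted error rate
        if err <= 0.0 or err >= 0.5:             # degenerate weak learner: stop early
            break
        a = np.log((1.0 - err) / err)            # step 2(c): classifier weight
        w = w * np.exp(a * miss)                 # step 2(d): up-weight the mistakes
        w = w / w.sum()                          # normalize; correct cases shrink in relative terms
        classifiers.append(G)
        alphas.append(a)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Step 3: sign of the weighted committee vote."""
    votes = sum(a * G.predict(X) for a, G in zip(alphas, classifiers))
    return np.sign(votes)
```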

See LCY Couple's Exponential Loss: AdaBoost.M1 for more information about AdaBoost.M1.
See chapter 10 of The Elements of Statistical Learning for more information about the equivalence between AdaBoost and forward-stagewise additive modeling.
The insight lies in the exponential loss function and the one-half log-odds: AdaBoost.M1 minimizes the exponential loss criterion via a forward-stagewise additive modeling method, and the additive expansion it produces estimates one-half the log-odds of $P(Y=1\mid x)$.
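
For reference, the key identity from ESL chapter 10 is that the population minimizer of the exponential loss is exactly one-half the log-odds, so its sign recovers the Bayes classifier:

$$
f^*(x) = \arg\min_{f(x)} E_{Y\mid x}\big[e^{-Y f(x)}\big] = \frac{1}{2}\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}
$$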

2. Function estimation

Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space.

Using a "training" sample $\{y_i, x_i\}_1^N$ of known $(y, x)$-values, the goal is to obtain an estimate or approximation $F(x)$ of the true function $F^*(x)$ mapping $x$ to $y$, one that minimizes the expected value of some specified loss function $L(y, F(x))$ over the joint distribution of all $(y, x)$ values:

$$
F^* = \arg\min_{F(x)} E_{y,x}\big[L(y, F(x))\big] = \arg\min_{F(x)} E_x\Big[E_y\big(L(y, F(x)) \mid x\big)\Big]
$$

This paper focuses on the additive form:

$$
F(x; \{\beta_m, \gamma_m\}_1^M) = \sum_{m=1}^{M} \beta_m h(x; \gamma_m)
$$

Optimizing on the training data, this amounts to:

$$
F^* = \arg\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M} \beta_m h(x_i; \gamma_m)\Big)
$$

However, it is computationally infeasible to optimize all of the coefficients and parameters of the additive expansion $\sum_{m=1}^{M}\beta_m h(x_i;\gamma_m)$ against the empirical loss $\sum_{i=1}^{N} L\big(y_i, \sum_{m=1}^{M}\beta_m h(x_i;\gamma_m)\big)$ simultaneously.

Alternatively, one can optimize with a greedy (forward-stagewise) strategy, adding one base learner at a time to the previous expansion $F_{m-1}$:

$$
(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; \gamma)\big), \qquad
F_m(x) = F_{m-1}(x) + \beta_m h(x; \gamma_m)
$$
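
For intuition, here is a minimal sketch of this greedy strategy under the assumption of squared-error loss, with regression stumps standing in for $h(x; \gamma)$; the loss choice and names are illustrative, not the paper's general algorithm. Under squared-error loss, each stage simply fits the current residuals, and the line search for $\beta_m$ has a closed form.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100):
    """Forward-stagewise additive fit, assuming squared-error loss and
    regression stumps as the base learners h(x; gamma)."""
    N = X.shape[0]
    F = np.zeros(N)                              # F_0(x_i) = 0 on the training points
    learners, betas = [], []
    for _ in range(M):
        residual = y - F                         # stagewise target under squared loss
        h = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        h_pred = h.predict(X)
        # closed-form line search for beta_m under squared-error loss
        beta = (h_pred @ residual) / (h_pred @ h_pred + 1e-12)
        F = F + beta * h_pred                    # F_m = F_{m-1} + beta_m * h(x; gamma_m)
        learners.append(h)
        betas.append(beta)
    return learners, betas
```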

2.1 Numerical optimization

In general, choosing a parameterized model $F(x; \gamma)$ changes the function optimization problem to one of parameter optimization. On the training data, the objective to minimize becomes:

$$
L(f) = \sum_{i=1}^{N} L(y_i, f(x_i))
$$

The goal is to minimize $L(f)$ with respect to $f$, where here $f(x)$ is constrained to be a weighted sum of base learners. Ignoring this constraint, minimizing $L(f)$ can be viewed as a numerical optimization:

$$
f^* = \arg\min_{f} L(f), \qquad f \in \mathbb{R}^N, \qquad f = \{f(x_1), f(x_2), \ldots, f(x_N)\}
$$

Numerical optimization procedures solve for $f^*$ as a sum of component vectors:

$$
f_M = \sum_{m=0}^{M} h_m, \qquad h_m \in \mathbb{R}^N
$$

For most $F(x; \gamma)$ and $L$, numerical optimization methods must be applied to solve the above formula. This often involves expressing the solution for the parameters in the following form:

$$
P^* = \sum_{m=0}^{M} p_m
$$

where $p_0$ is an initial guess and $\{p_m\}_1^M$ are successive increments ("steps" or "boosts"), each based on the sequence of preceding steps. The prescription for computing each step $p_m$ is defined by the optimization method.
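
For example (this is the steepest-descent instance the paper goes on to use), each increment is the negative gradient of the objective evaluated at the current parameter estimate, scaled by a line-searched step length $\rho_m$; writing $\Phi(P)$ for the objective as a function of the parameters:

$$
g_m = \left\{\frac{\partial \Phi(P)}{\partial P}\right\}_{P = P_{m-1}}, \qquad
p_m = -\rho_m\, g_m, \qquad
\rho_m = \arg\min_{\rho} \Phi(P_{m-1} - \rho\, g_m),
$$

where $P_{m-1} = \sum_{i=0}^{m-1} p_i$.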

References

Exponential Loss: AdaBoost.M1 (http://www.lovecaoying.com/blog/?p=105)
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (http://book.douban.com/subject/3294335/)


