The Formula of FTRL-proximal
=====


#### 1. The online framework of LR with SGD

On round $t$, LR is used to predict an instance described by a feature vector $x_t \in \mathbb{R}^d$, given the model parameter vector $w_t$: $p_t = \sigma(w_t \cdot x_t)$, where $\sigma(a)=1/(1+\exp(-a))$ is the sigmoid function. With the label $y_t \in \{0, 1\}$, the loss on round $t$ is given as:

$$
    l_t(w_t) = -y_t\log p_t - (1-y_t)\log(1-p_t)
$$

This is the negative log-likelihood of $y_t$ given $p_t$. It is convex in $w$, and its gradient is $\nabla l_t(w)=(\sigma(w^Tx_t)-y_t)\,x_t=(p_t-y_t)\,x_t$.
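As a quick illustration, here is a minimal NumPy sketch of the prediction, the per-round loss, and its gradient (the function names and the dense-vector representation are illustrative assumptions, not part of any reference implementation):

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def logistic_loss_and_grad(w, x_t, y_t):
    """Per-round logistic loss l_t(w) and its gradient g_t = (p_t - y_t) * x_t."""
    p_t = sigmoid(np.dot(w, x_t))
    loss = -y_t * np.log(p_t) - (1.0 - y_t) * np.log(1.0 - p_t)
    grad = (p_t - y_t) * x_t
    return loss, grad
```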

With a common learning rate for all features, e.g. $\eta_t=\frac{1}{\sqrt{t}}$, online gradient descent (OGD) performs the update step:

$$
    w_{t+1} = w_t - \eta_tg_t
$$

where $g_t = \nabla l_t(w_t)$.
By expanding the recursion, it is straightforward to show that this update is equivalent to $w_{t+1} = -\sum_{s=1}^t\eta_sg_s$ when the initial weight vector is $0$. In other words, $w_{t+1}$ is a linear combination of the past gradients $g_s$.
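As a sanity check, the short sketch below (the gradients are random placeholders, purely for illustration) runs the OGD recursion and verifies that the final iterate matches the cumulative-sum form above:

```python
import numpy as np

def ogd(grads, eta=lambda t: 1.0 / np.sqrt(t)):
    """Run w_{t+1} = w_t - eta_t * g_t over a list of gradients, starting from 0."""
    w = np.zeros_like(grads[0])
    for t, g_t in enumerate(grads, start=1):
        w = w - eta(t) * g_t
    return w

grads = [np.random.randn(3) for _ in range(5)]
w_loop = ogd(grads)
w_sum = -sum((1.0 / np.sqrt(s)) * g for s, g in enumerate(grads, start=1))
assert np.allclose(w_loop, w_sum)   # the loop and the closed form agree
```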
The OGD update step also has an equivalent unconstrained-minimization form:

$$
    w_{t+1} = \text{argmin}_w \left(g_t^Tw+\frac{1}{2\eta_t}\parallel w-w_t\parallel_2^2\right)
$$

Setting the gradient with respect to $w$ to zero gives $g_t+\frac{1}{\eta_t}(w-w_t)=0$, i.e. $w_{t+1}=w_t-\eta_tg_t$, so the minimization form and the update step coincide.

Once the update is written as a minimization problem, it is natural to add regularization terms.

In $w_{t+1} = \text{argmin}_w \left(g_t^Tw+\frac{1}{2\eta_t}\parallel w-w_t\parallel_2^2\right)$, the term $g_t^Tw$ is a linear approximation of the loss, and $\frac{1}{2\eta_t}\parallel w-w_t\parallel_2^2$ is an $L_2$ proximal term that keeps $w$ close to the current point $w_t$. FTRL-proximal (below) replaces this single quadratic with $\frac{1}{2}\sum_{s=1}^t\sigma_s\parallel w-w_s\parallel_2^2$, whose minimizer is a weighted mean of the past points $w_s$, so each new iterate stays close to the center of the sequence $w_1,\dots,w_t$. This is what ***Proximal*** means in ***FTRL-proximal***.
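To make the "center of the sequence" remark concrete: minimizing the proximal term alone (using the weights $\sigma_s \ge 0$ with $\sum_{s=1}^t\sigma_s=\frac{1}{\eta_t}$, as defined in the next section) gives a weighted mean of the past iterates,

$$
    \sum_{s=1}^t\sigma_s(w-w_s)=0 \quad\Longrightarrow\quad w=\frac{\sum_{s=1}^t\sigma_sw_s}{\sum_{s=1}^t\sigma_s}=\eta_t\sum_{s=1}^t\sigma_sw_s
$$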
*Intuitively, one could also replace the $L_2$ proximal term with a more general [Bregman divergence](http://en.wikipedia.org/wiki/Bregman_divergence).*

#### 2. FTRL-proximal

The FTRL-proximal algorithm uses the update:

$$
    w_{t+1} = \text{argmin}_w \left(g_{1:t}^Tw+\frac{1}{2}\sum_{s=1}^t\sigma_s\parallel w-w_s\parallel_2^2 + \lambda_1 \parallel w \parallel_1\right)
$$

where the weights $\sigma_s$ are chosen so that $\sum_{s=1}^t\sigma_s=\sigma_{1:t}=\frac{1}{\eta_t}$, i.e. $\sigma_s=\frac{1}{\eta_s}-\frac{1}{\eta_{s-1}}$.
When $\lambda_1=0$, this produces exactly the same sequence of points as OGD. However, with $\lambda_1>0$ the FTRL-proximal update does an excellent job of inducing sparsity in $w$.
Expanding the quadratic terms, it is straightforward to rewrite the update as:

$$
    w_{t+1} = \text{argmin}_w \left(g_{1:t} - \sum_{s=1}^t\sigma_sw_s\right)^Tw+\frac{1}{2\eta_t}\parallel w \parallel_2^2 + \lambda_1\parallel w \parallel_1 + \frac{1}{2}\sum_{s=1}^t\sigma_sw_s^Tw_s
$$

where the last term $\frac{1}{2}\sum_{s=1}^t\sigma_sw_s^Tw_s$ is a constant with respect to $w$ ($w_s$ and $\sigma_s$ are already known at round $t+1$), so it can be dropped from the argmin.
If one defines $z_t = g_{1:t}-\sum_{s=1}^{t}\sigma_sw_s$ and has stored $z_{t-1}$, it can be updated incrementally at the end of round $t$ via $z_t = z_{t-1}+g_t-\left(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\right)w_t$, and the update becomes:

$$
    w_{t+1} = \text{argmin}_w\ z_{t}^Tw+\frac{1}{2\eta_t}\parallel w \parallel_2^2 + \lambda_1\parallel w \parallel_1 \\
    = \text{argmin}_w\ \frac{1}{2\eta_t}(2\eta_tz_{t}+w)^Tw + \lambda_1\parallel w \parallel_1 \\
    = \text{argmin}_w\ \frac{1}{2\eta_t}\parallel\eta_tz_{t}+w\parallel_2^2 + \lambda_1\parallel w \parallel_1 \\
    = \text{argmin}_w\ \frac{1}{2}\parallel\eta_tz_{t}+w\parallel_2^2 + \eta_t\lambda_1\parallel w \parallel_1
$$

(the second and third lines differ only by the constant $\frac{\eta_t}{2}\parallel z_t\parallel_2^2$, which does not affect the argmin).

For a convex function $g$, the proximal operator is defined as

$$
    \text{prox}_g(x) = \text{argmin}_u\ g(u) + \frac{1}{2}\parallel u-x \parallel_2^2
$$

When $g(u)=\alpha\parallel u\parallel_1$, the proximal operator has a closed-form solution known as the soft-thresholding (or shrinkage) operator:

$$
    \text{soft}(x,\alpha)_i = \text{sgn}(x_i)\,\text{max}\{|x_i|-\alpha,0\}
$$
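This closed form can be verified coordinate-wise: each coordinate minimizes $\alpha|u_i|+\frac{1}{2}(u_i-x_i)^2$, and the subgradient optimality condition

$$
    0 \in \alpha\,\partial|u_i| + (u_i-x_i)
$$

gives $u_i=x_i-\alpha$ for $x_i>\alpha$, $u_i=x_i+\alpha$ for $x_i<-\alpha$, and $u_i=0$ otherwise, which is exactly $\text{sgn}(x_i)\,\text{max}\{|x_i|-\alpha,0\}$.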

Applying it with $x=-\eta_tz_t$ and $\alpha=\eta_t\lambda_1$ yields the closed-form FTRL-proximal update:

$$
    w_{t+1,i}=\text{sgn}(-\eta_tz_{t,i})\,\text{max}\{|\eta_tz_{t,i}|-\eta_t\lambda_1,0\} \\
    = -\eta_t\,\text{sgn}(z_{t,i})\,\text{max}\{|z_{t,i}|-\lambda_1,0\}
$$
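Putting the pieces together, here is a minimal NumPy sketch of one FTRL-proximal round under the simplifying assumptions of this note: dense vectors, a single global learning rate $\eta_t=1/\sqrt{t}$, and only the $L_1$ term (the paper in the reference additionally uses per-coordinate learning rates); the function name and the driver loop are purely illustrative.

```python
import numpy as np

def ftrl_proximal_round(z, w, x_t, y_t, t, lambda1):
    """One FTRL-proximal round with a single global learning rate eta_t = 1/sqrt(t).

    z accumulates g_{1:t} - sum_s sigma_s * w_s; w is the current iterate w_t.
    Returns the updated state (z_t, w_{t+1}).
    """
    eta_t = 1.0 / np.sqrt(t)
    sigma_t = np.sqrt(t) - np.sqrt(t - 1)   # 1/eta_t - 1/eta_{t-1}, with 1/eta_0 = 0

    # Gradient of the logistic loss at w_t: g_t = (p_t - y_t) * x_t.
    p_t = 1.0 / (1.0 + np.exp(-np.dot(w, x_t)))
    g_t = (p_t - y_t) * x_t

    # Incremental update z_t = z_{t-1} + g_t - sigma_t * w_t.
    z = z + g_t - sigma_t * w

    # Closed form: w_{t+1,i} = -eta_t * sgn(z_{t,i}) * max(|z_{t,i}| - lambda1, 0).
    w_next = -eta_t * np.sign(z) * np.maximum(np.abs(z) - lambda1, 0.0)
    return z, w_next

# Illustrative driver on random placeholder data (not real CTR features).
d, lambda1 = 5, 0.1
z, w = np.zeros(d), np.zeros(d)
for t in range(1, 101):
    x_t, y_t = np.random.randn(d), np.random.randint(0, 2)
    z, w = ftrl_proximal_round(z, w, x_t, y_t, t, lambda1)
print(w)   # coordinates with |z_i| <= lambda1 are exactly zero
```

Setting `lambda1 = 0` in this sketch reproduces the plain OGD iterates from Section 1, which is a convenient sanity check of the equivalence claimed above.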

#### Reference:

1. [McMahan H B, Holt G, Sculley D, et al. Ad click prediction: a view from the trenches[C]//Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013: 1222-1230.](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41159.pdf)

