The Formula of FTRL-proximal
1. The online framework of LR with SGD
On round $t$, LR is used to predict on an instance described by a feature vector $x_t \in \mathbb{R}^d$, given the model parameter vector $w_t$: $p_t = \sigma(w_t \cdot x_t)$, where $\sigma(a) = 1/(1+\exp(-a))$ is the sigmoid function. With labels $y_t \in \{0, 1\}$, the loss on round $t$ is given as

$$\ell_t(w_t) = -y_t \log p_t - (1 - y_t) \log(1 - p_t)$$
This is the negative log-likelihood of $y_t$ given $p_t$. The loss is convex in $w$, and its gradient is

$$\nabla \ell_t(w) = (\sigma(w \cdot x_t) - y_t)\, x_t = (p_t - y_t)\, x_t$$
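As a quick sanity check, the analytic gradient $(p_t - y_t)\,x_t$ can be compared against a finite-difference approximation of the logloss. A minimal sketch (the variable values here are made up for illustration):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logloss(w, x, y):
    # negative log-likelihood of label y under p = sigmoid(w . x)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def grad(w, x, y):
    # analytic gradient: (p - y) * x
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

w, x, y = [0.5, -0.2], [1.0, 2.0], 1
g = grad(w, x, y)
eps = 1e-6
for i in range(len(w)):
    w_hi = list(w); w_hi[i] += eps
    w_lo = list(w); w_lo[i] -= eps
    numeric = (logloss(w_hi, x, y) - logloss(w_lo, x, y)) / (2 * eps)
    assert abs(numeric - g[i]) < 1e-6  # analytic and numeric gradients agree
```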
Using a common learning rate for all features, such as $\eta_t = \frac{1}{\sqrt{t}}$, online gradient descent (OGD) performs the update step

$$w_{t+1} = w_t - \eta_t g_t$$

where $g_t = \nabla \ell_t(w_t)$.
If $w_1$ is initialized to $0$, expanding the recursion shows this is equivalent to $w_{t+1} = -\sum_{s=1}^{t} \eta_s g_s$. In other words, the updated $w_{t+1}$ is a linear combination of the successive gradients $g_s$.
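The equivalence is easy to verify numerically. A minimal one-dimensional sketch with made-up gradients, comparing the iterative OGD updates against the closed-form sum:

```python
import random

def ogd_steps(grads, etas):
    # iterative OGD: w_{t+1} = w_t - eta_t * g_t, starting from w_1 = 0
    w = 0.0
    for g, eta in zip(grads, etas):
        w -= eta * g
    return w

random.seed(0)
T = 10
grads = [random.uniform(-1, 1) for _ in range(T)]
etas = [1.0 / (t + 1) ** 0.5 for t in range(T)]  # eta_t = 1/sqrt(t)

iterative = ogd_steps(grads, etas)
closed_form = -sum(eta * g for g, eta in zip(grads, etas))
assert abs(iterative - closed_form) < 1e-12  # same result either way
```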
The above closed-form update has an equivalent unconstrained optimization form:

$$w_{t+1} = \arg\min_w \left( g_t^T w + \frac{1}{2\eta_t} \|w - w_t\|_2^2 \right)$$

Setting the gradient with respect to $w$ to zero gives $g_t + \frac{1}{\eta_t}(w - w_t) = 0$, i.e. $w = w_t - \eta_t g_t$, so the two forms are equivalent.
Once this optimization form is obtained, it is natural to add regularization terms.
Summing the linearized losses over all rounds and splitting the quadratic term across rounds (with weights $\sigma_s > 0$ defined below) gives

$$w_{t+1} = \arg\min_w \left( g_{1:t}^T w + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \|w - w_s\|_2^2 \right)$$

where $g_{1:t} = \sum_{s=1}^{t} g_s$. Here $g_{1:t}^T w$ is the (linearized) loss, and the second term is an L2 regularizer that keeps $w$ close to the center of the sequence $w_1, \dots, w_t$: minimizing the regularizer alone is achieved at a $\sigma$-weighted average of the $w_s$ (the plain mean when all $\sigma_s$ are equal). This is what "proximal" means in FTRL-proximal.
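As a small numeric check (with made-up weights $\sigma_s$ and iterates $w_s$), the minimizer of the proximal term alone is indeed the $\sigma$-weighted average of the $w_s$:

```python
# Toy values (assumed for illustration only)
sigmas = [1.0, 0.5, 0.25]
ws = [0.3, -0.1, 0.8]

# sigma-weighted average of the iterates
weighted_mean = sum(s * w for s, w in zip(sigmas, ws)) / sum(sigmas)

# brute-force minimization of (1/2) * sum_s sigma_s * (w - w_s)^2 over a fine grid
obj = lambda w: 0.5 * sum(s * (w - w_i) ** 2 for s, w_i in zip(sigmas, ws))
grid = [i / 10000.0 for i in range(-20000, 20001)]
best = min(grid, key=obj)
assert abs(best - weighted_mean) < 1e-3  # grid minimizer matches the weighted mean
```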
More generally, one could replace the L2 regularizer with a Bregman divergence in such optimization problems.
2. FTRL-proximal
The FTRL-proximal algorithm uses the update

$$w_{t+1} = \arg\min_w \left( g_{1:t}^T w + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \|w - w_s\|_2^2 + \lambda_1 \|w\|_1 \right)$$

where the $\sigma_s$ are chosen so that $\sigma_{1:t} = \sum_{s=1}^{t} \sigma_s = \frac{1}{\eta_t}$, i.e. $\sigma_t = \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}$.
When $\lambda_1 = 0$, this produces exactly the same sequence of weights as OGD. With $\lambda_1 > 0$, however, FTRL-proximal does an excellent job of inducing sparsity in $w$.
Expanding the quadratic terms, it is straightforward to rewrite the update as

$$w_{t+1} = \arg\min_w \left( \Big( g_{1:t} - \sum_{s=1}^{t} \sigma_s w_s \Big)^T w + \frac{1}{2\eta_t} \|w\|_2^2 + \lambda_1 \|w\|_1 + \frac{1}{2} \sum_{s=1}^{t} \sigma_s w_s^T w_s \right)$$

where the last term $\frac{1}{2} \sum_{s=1}^{t} \sigma_s w_s^T w_s$ is a constant, since $w_s$ and $\sigma_s$ are known quantities at round $t+1$; it does not affect the argmin and can be dropped.
If one has stored $z_{t-1} = g_{1:t-1} - \sum_{s=1}^{t-1} \sigma_s w_s$, then at round $t$ one updates $z_t = z_{t-1} + g_t - \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) w_t$, and the update becomes

$$w_{t+1} = \arg\min_w \left( z_t^T w + \frac{1}{2\eta_t} \|w\|_2^2 + \lambda_1 \|w\|_1 \right)$$
For a convex function $g$, the proximal operator is defined as

$$\operatorname{prox}_g(v) = \arg\min_x \left( g(x) + \frac{1}{2} \|x - v\|_2^2 \right)$$
When $g(x) = \lambda \|x\|_1$, the proximal operator has a closed-form solution called the soft-thresholding (or shrinkage) operator:

$$\operatorname{prox}_{\lambda \|\cdot\|_1}(v)_i = \operatorname{sgn}(v_i) \max(|v_i| - \lambda, 0)$$
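The soft-thresholding operator is simple to implement coordinate-wise; a minimal sketch:

```python
def soft_threshold(v, lam):
    # prox of lam * ||.||_1: shrink each coordinate toward 0 by lam,
    # setting coordinates with |v_i| <= lam exactly to zero
    out = []
    for vi in v:
        if vi > lam:
            out.append(vi - lam)
        elif vi < -lam:
            out.append(vi + lam)
        else:
            out.append(0.0)
    return out
```

For example, `soft_threshold([3.0, -0.5], 1.0)` shrinks the first coordinate to `2.0` and zeros out the second, since `|-0.5|` is below the threshold.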
Rescaling the objective by $\eta_t$ and completing the square gives $w_{t+1} = \operatorname{prox}_{\eta_t \lambda_1 \|\cdot\|_1}(-\eta_t z_t)$, i.e. the per-coordinate closed form:

$$w_{t+1,i} = \begin{cases} 0 & \text{if } |z_{t,i}| \le \lambda_1 \\ -\eta_t \left( z_{t,i} - \operatorname{sgn}(z_{t,i}) \lambda_1 \right) & \text{otherwise} \end{cases}$$
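Putting the pieces together, the update can be sketched as follows. This is a minimal dense-vector sketch with a single shared learning rate $\eta_t = 1/\sqrt{t}$ (the paper uses per-coordinate adaptive rates $\eta_{t,i}$); the class and variable names are illustrative:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class FTRLProximal:
    """Minimal FTRL-proximal for logistic regression (shared eta_t = 1/sqrt(t))."""

    def __init__(self, dim, lambda1):
        self.lambda1 = lambda1
        self.z = [0.0] * dim    # z_t = g_{1:t} - sum_s sigma_s * w_s
        self.w = [0.0] * dim
        self.t = 0

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, y):
        self.t += 1
        eta = 1.0 / math.sqrt(self.t)
        # sigma_t = 1/eta_t - 1/eta_{t-1}, with 1/eta_0 taken as 0
        sigma = 1.0 / eta - (math.sqrt(self.t - 1) if self.t > 1 else 0.0)
        p = self.predict(x)
        for i, xi in enumerate(x):
            g = (p - y) * xi                    # gradient of logloss
            self.z[i] += g - sigma * self.w[i]  # z_t = z_{t-1} + g_t - sigma_t * w_t
        # closed-form argmin: soft-thresholding on z (L1 induces exact zeros)
        for i, zi in enumerate(self.z):
            if abs(zi) <= self.lambda1:
                self.w[i] = 0.0
            else:
                self.w[i] = -eta * (zi - math.copysign(self.lambda1, zi))
```

With `lambda1 = 0` this reproduces OGD exactly, as noted above; with a large `lambda1` the weights stay at exactly zero, illustrating the induced sparsity.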
Reference:
- McMahan H B, Holt G, Sculley D, et al. Ad click prediction: a view from the trenches[C]//Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013: 1222-1230.