The Formula of FTRL-proximal
=====


#### 1. The online framework of LR with SGD

On round $t$, LR is used to predict an instance described by a feature vector $x_t \in \mathbb{R}^d$, given the model parameter vector $w_t$: $p_t = \sigma(w_t \cdot x_t)$, where $\sigma(a)=1/(1+\exp(-a))$ is the sigmoid function. With the label $y_t \in \{0, 1\}$, the loss on round $t$ is given as:

$$
    l_t(w_t) = -y_t\log p_t - (1-y_t)\log(1-p_t)
$$

This is the negative log-likelihood of $y_t$ given $p_t$. It is convex in $w$, and its gradient is $\nabla l_t(w)=(\sigma(w^Tx_t)-y_t)\,x_t=(p_t-y_t)\,x_t$.
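As a quick illustration, here is a minimal NumPy sketch of the prediction, the per-round loss, and its gradient (the function names and the dense-vector representation are illustrative assumptions, not part of any reference implementation):

```python
import numpy as np

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def logistic_loss_and_grad(w, x_t, y_t):
    """Per-round logistic loss l_t(w) and its gradient g_t = (p_t - y_t) * x_t."""
    p_t = sigmoid(np.dot(w, x_t))
    loss = -y_t * np.log(p_t) - (1.0 - y_t) * np.log(1.0 - p_t)
    grad = (p_t - y_t) * x_t
    return loss, grad
```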

With a common learning rate for all features, e.g. $\eta_t=\frac{1}{\sqrt{t}}$, online gradient descent (OGD) performs the update step:

$$
    w_{t+1} = w_t - \eta_tg_t
$$

where $g_t = \nabla l_t(w_t)$.
By expanding the recursion, it is straightforward to show that this update is equivalent to $w_{t+1} = -\sum_{s=1}^t\eta_sg_s$ when the initial weight vector is $0$. In other words, $w_{t+1}$ is a linear combination of the past gradients $g_s$.
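As a sanity check, the short sketch below (the gradients are random placeholders, purely for illustration) runs the OGD recursion and verifies that the final iterate matches the cumulative-sum form above:

```python
import numpy as np

def ogd(grads, eta=lambda t: 1.0 / np.sqrt(t)):
    """Run w_{t+1} = w_t - eta_t * g_t over a list of gradients, starting from 0."""
    w = np.zeros_like(grads[0])
    for t, g_t in enumerate(grads, start=1):
        w = w - eta(t) * g_t
    return w

grads = [np.random.randn(3) for _ in range(5)]
w_loop = ogd(grads)
w_sum = -sum((1.0 / np.sqrt(s)) * g for s, g in enumerate(grads, start=1))
assert np.allclose(w_loop, w_sum)   # the loop and the closed form agree
```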
The OGD update step also has an equivalent unconstrained-minimization form:

$$
    w_{t+1} = \text{argmin}_w \left(g_t^Tw+\frac{1}{2\eta_t}\parallel w-w_t\parallel_2^2\right)
$$

Setting the gradient with respect to $w$ to zero gives $g_t+\frac{1}{\eta_t}(w-w_t)=0$, i.e. $w_{t+1}=w_t-\eta_tg_t$, so the minimization form and the update step coincide.

Once the update is written as a minimization problem, it is natural to add regularization terms.

In $w_{t+1} = \text{argmin}_w \left(g_t^Tw+\frac{1}{2\eta_t}\parallel w-w_t\parallel_2^2\right)$, the term $g_t^Tw$ is a linear approximation of the loss, and $\frac{1}{2\eta_t}\parallel w-w_t\parallel_2^2$ is an $L_2$ proximal term that keeps $w$ close to the current point $w_t$. FTRL-proximal (below) replaces this single quadratic with $\frac{1}{2}\sum_{s=1}^t\sigma_s\parallel w-w_s\parallel_2^2$, whose minimizer is a weighted mean of the past points $w_s$, so each new iterate stays close to the center of the sequence $w_1,\dots,w_t$. This is what ***Proximal*** means in ***FTRL-proximal***.
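To make the "center of the sequence" remark concrete: minimizing the proximal term alone (using the weights $\sigma_s \ge 0$ with $\sum_{s=1}^t\sigma_s=\frac{1}{\eta_t}$, as defined in the next section) gives a weighted mean of the past iterates,

$$
    \sum_{s=1}^t\sigma_s(w-w_s)=0 \quad\Longrightarrow\quad w=\frac{\sum_{s=1}^t\sigma_sw_s}{\sum_{s=1}^t\sigma_s}=\eta_t\sum_{s=1}^t\sigma_sw_s
$$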
*Intuitively, one could also replace the $L_2$ proximal term with a more general [Bregman divergence](http://en.wikipedia.org/wiki/Bregman_divergence).*

#### 2. FTRL-proximal

The FTRL-proximal algorithm uses the update:

$$
    w_{t+1} = \text{argmin}_w \left(g_{1:t}^Tw+\frac{1}{2}\sum_{s=1}^t\sigma_s\parallel w-w_s\parallel_2^2 + \lambda_1 \parallel w \parallel_1\right)
$$

where the weights $\sigma_s$ are chosen so that $\sum_{s=1}^t\sigma_s=\sigma_{1:t}=\frac{1}{\eta_t}$, i.e. $\sigma_s=\frac{1}{\eta_s}-\frac{1}{\eta_{s-1}}$.
When $\lambda_1=0$, this produces exactly the same sequence of points as OGD. However, with $\lambda_1>0$ the FTRL-proximal update does an excellent job of inducing sparsity in $w$.
Expanding the quadratic terms, it is straightforward to rewrite the update as:

$$
    w_{t+1} = \text{argmin}_w \left(g_{1:t} - \sum_{s=1}^t\sigma_sw_s\right)^Tw+\frac{1}{2\eta_t}\parallel w \parallel_2^2 + \lambda_1\parallel w \parallel_1 + \frac{1}{2}\sum_{s=1}^t\sigma_sw_s^Tw_s
$$

where the last term $\frac{1}{2}\sum_{s=1}^t\sigma_sw_s^Tw_s$ is a constant with respect to $w$ ($w_s$ and $\sigma_s$ are already known at round $t+1$), so it can be dropped from the argmin.
If one defines $z_t = g_{1:t}-\sum_{s=1}^{t}\sigma_sw_s$ and has stored $z_{t-1}$, it can be updated incrementally at the end of round $t$ via $z_t = z_{t-1}+g_t-\left(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\right)w_t$, and the update becomes:

$$
    w_{t+1} = \text{argmin}_w\ z_{t}^Tw+\frac{1}{2\eta_t}\parallel w \parallel_2^2 + \lambda_1\parallel w \parallel_1 \\
    = \text{argmin}_w\ \frac{1}{2\eta_t}(2\eta_tz_{t}+w)^Tw + \lambda_1\parallel w \parallel_1 \\
    = \text{argmin}_w\ \frac{1}{2\eta_t}\parallel\eta_tz_{t}+w\parallel_2^2 + \lambda_1\parallel w \parallel_1 \\
    = \text{argmin}_w\ \frac{1}{2}\parallel\eta_tz_{t}+w\parallel_2^2 + \eta_t\lambda_1\parallel w \parallel_1
$$

(the second and third lines differ only by the constant $\frac{\eta_t}{2}\parallel z_t\parallel_2^2$, which does not affect the argmin).

For a convex function $g$, the proximal operator is defined as

$$
    \text{prox}_g(x) = \text{argmin}_u\ g(u) + \frac{1}{2}\parallel u-x \parallel_2^2
$$

When $g(u)=\alpha\parallel u\parallel_1$, the proximal operator has a closed-form solution known as the soft-thresholding (or shrinkage) operator:

$$
    \text{soft}(x,\alpha)_i = \text{sgn}(x_i)\,\text{max}\{|x_i|-\alpha,0\}
$$
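This closed form can be verified coordinate-wise: each coordinate minimizes $\alpha|u_i|+\frac{1}{2}(u_i-x_i)^2$, and the subgradient optimality condition

$$
    0 \in \alpha\,\partial|u_i| + (u_i-x_i)
$$

gives $u_i=x_i-\alpha$ for $x_i>\alpha$, $u_i=x_i+\alpha$ for $x_i<-\alpha$, and $u_i=0$ otherwise, which is exactly $\text{sgn}(x_i)\,\text{max}\{|x_i|-\alpha,0\}$.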

Applying it with $x=-\eta_tz_t$ and $\alpha=\eta_t\lambda_1$ yields the closed-form FTRL-proximal update:

$$
    w_{t+1,i}=\text{sgn}(-\eta_tz_{t,i})\,\text{max}\{|\eta_tz_{t,i}|-\eta_t\lambda_1,0\} \\
    = -\eta_t\,\text{sgn}(z_{t,i})\,\text{max}\{|z_{t,i}|-\lambda_1,0\}
$$
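Putting the pieces together, here is a minimal NumPy sketch of one FTRL-proximal round under the simplifying assumptions of this note: dense vectors, a single global learning rate $\eta_t=1/\sqrt{t}$, and only the $L_1$ term (the paper in the reference additionally uses per-coordinate learning rates); the function name and the driver loop are purely illustrative.

```python
import numpy as np

def ftrl_proximal_round(z, w, x_t, y_t, t, lambda1):
    """One FTRL-proximal round with a single global learning rate eta_t = 1/sqrt(t).

    z accumulates g_{1:t} - sum_s sigma_s * w_s; w is the current iterate w_t.
    Returns the updated state (z_t, w_{t+1}).
    """
    eta_t = 1.0 / np.sqrt(t)
    sigma_t = np.sqrt(t) - np.sqrt(t - 1)   # 1/eta_t - 1/eta_{t-1}, with 1/eta_0 = 0

    # Gradient of the logistic loss at w_t: g_t = (p_t - y_t) * x_t.
    p_t = 1.0 / (1.0 + np.exp(-np.dot(w, x_t)))
    g_t = (p_t - y_t) * x_t

    # Incremental update z_t = z_{t-1} + g_t - sigma_t * w_t.
    z = z + g_t - sigma_t * w

    # Closed form: w_{t+1,i} = -eta_t * sgn(z_{t,i}) * max(|z_{t,i}| - lambda1, 0).
    w_next = -eta_t * np.sign(z) * np.maximum(np.abs(z) - lambda1, 0.0)
    return z, w_next

# Illustrative driver on random placeholder data (not real CTR features).
d, lambda1 = 5, 0.1
z, w = np.zeros(d), np.zeros(d)
for t in range(1, 101):
    x_t, y_t = np.random.randn(d), np.random.randint(0, 2)
    z, w = ftrl_proximal_round(z, w, x_t, y_t, t, lambda1)
print(w)   # coordinates with |z_i| <= lambda1 are exactly zero
```

Setting `lambda1 = 0` in this sketch reproduces the plain OGD iterates from Section 1, which is a convenient sanity check of the equivalence claimed above.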

#### Reference:

1. [McMahan H B, Holt G, Sculley D, et al. Ad click prediction: a view from the trenches[C]//Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013: 1222-1230.](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/41159.pdf)

