Stochastic Gradient Descent (SGD) vs Alternating Least Squares (ALS) for Matrix Factorization

@袁全V recommended some open-source tools for recommender systems on Weibo. I later picked the code of LibFM and GraphLab for a close reading.

While reading up on matrix factorization with Alternating Least Squares and Stochastic Gradient Descent, I ran into the same question:

What is the difference between ALS, SGD, and biased SGD?

Danny Bickson answered it on the GraphLab Google Users group (original post):

ALS, SGD, and bias-SGD are described in the following papers:

  • Alternating Least Squares (ALS)

    Yunhong Zhou, Dennis Wilkinson, Robert Schreiber and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. Proceedings of the 4th international conference on Algorithmic Aspects in Information and Management. Shanghai, China pp. 337-348, 2008.

  • Stochastic gradient descent (SGD)

    1. Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer, Vol. 42, No. 8 (August 2009), pp. 30-37.
    2. Takács, G., Pilászy, I., Németh, B., and Tikk, D. (2009). Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 623-656.
  • Bias stochastic gradient descent (Bias-SGD)

    Y. Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. ACM SIGKDD 2008. Equation (5).


Both ALS and SGD solve a similar quadratic cost function.
ALS starts with a random guess for the user (or movie) factors and then solves a least-squares problem to compute the movie (or user) factors, alternating between the two.
SGD starts with a random guess, computes the gradient of the cost function for each observed rating, and takes a step in the direction of the negative gradient.
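
For concreteness, here is a sketch of that cost function and of the two update styles, in the standard notation of the papers above (p_u and q_i are the user and movie factor vectors, K the set of observed ratings, λ the regularization weight, γ the SGD step size); the exact regularization weighting varies between papers:

```latex
% Regularized squared-error objective shared by ALS and SGD
\min_{P,\,Q}\;\sum_{(u,i)\in\mathcal{K}} \bigl(r_{ui} - p_u^{\top} q_i\bigr)^2
  \;+\; \lambda\bigl(\lVert p_u\rVert^2 + \lVert q_i\rVert^2\bigr)

% ALS: with Q fixed, each p_u has a closed-form least-squares solution
% (Q_u stacks the factors of the movies rated by user u, r_u the corresponding ratings)
p_u \;\leftarrow\; \bigl(Q_u^{\top} Q_u + \lambda I\bigr)^{-1} Q_u^{\top} r_u

% SGD: for each observed rating, step against the gradient of its error term
e_{ui} = r_{ui} - p_u^{\top} q_i,\qquad
p_u \leftarrow p_u + \gamma\,(e_{ui}\,q_i - \lambda\,p_u),\qquad
q_i \leftarrow q_i + \gamma\,(e_{ui}\,p_u - \lambda\,q_i)
```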

Typically, an SGD iteration is faster, since each ALS iteration involves solving least-squares problems, which is heavier.
However, fewer ALS iterations are typically needed to reach the same level of accuracy.
ALS is also more parallel, since all movie (or user) factors can be computed independently in parallel. SGD may have slower parallel performance when two different users have seen the same movie: this results in concurrent updates to the factors of that movie, which slows down the operation.

ALS benefits heavily from linear algebra packages, specifically Intel MKL, which can give a speedup of 5x-10x.
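
A minimal NumPy sketch of the two inner loops, assuming a small dense ratings matrix R with zeros marking missing entries; the function names and the dense representation are illustrative only, not GraphLab's actual implementation. It shows why ALS parallelizes over users (each solve is independent, and each `np.linalg.solve` call is exactly where a tuned BLAS/LAPACK such as MKL helps), and why naive parallel SGD can collide when two users rated the same movie:

```python
import numpy as np

def als_user_sweep(R, P, Q, lam):
    """Hypothetical ALS half-sweep: with movie factors Q fixed, each user's
    factor vector has a closed-form ridge-regression solution; every user
    can be solved independently, which is why ALS parallelizes well."""
    k = Q.shape[1]
    for u in range(R.shape[0]):
        rated = R[u] > 0                     # movies this user has rated
        Qu = Q[rated]                        # their factor vectors
        A = Qu.T @ Qu + lam * np.eye(k)      # k x k normal equations
        P[u] = np.linalg.solve(A, Qu.T @ R[u, rated])  # heavy step, BLAS/LAPACK-bound
    return P

def sgd_sweep(R, P, Q, lam, gamma):
    """Hypothetical SGD sweep: one cheap update per observed rating, but two
    users who rated the same movie both write to that movie's row of Q."""
    for u, i in zip(*np.nonzero(R)):
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()                     # keep pre-update value for Q's step
        P[u] += gamma * (err * Q[i] - lam * P[u])
        Q[i] += gamma * (err * pu - lam * Q[i])
    return P, Q
```

Running both on the same random R for a few sweeps typically shows the pattern described above: fewer, heavier ALS sweeps versus more, cheaper SGD sweeps.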

Bias-SGD is a slightly improved version of SGD, which also computes a bias for each movie and user, thus slightly improving accuracy at the cost of additional computation.
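
The prediction rule behind bias-SGD, Equation (5) of the Koren 2008 paper cited above, has essentially this form (μ is the global mean rating, b_u and b_i the learned user and movie biases); the bias updates shown are the standard per-rating SGD steps, not a quote from the paper:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

% extra per-rating updates on top of plain SGD
b_u \leftarrow b_u + \gamma\,(e_{ui} - \lambda\,b_u),\qquad
b_i \leftarrow b_i + \gamma\,(e_{ui} - \lambda\,b_i)
```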

Regarding parameter tuning, SGD has more parameters to tune, since you need to carefully select the gradient step size. Both SGD and ALS also require tuning the regularization weight, the number of iterations, and the feature-vector width (the rank of the factorization).
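
Purely as an illustration of that tuning surface, the shared and SGD-only knobs could be laid out as below; the values are hypothetical starting points, not recommendations from the post:

```python
# Knobs shared by ALS and SGD (values are hypothetical placeholders)
shared = {
    "rank": 20,        # feature-vector width
    "lambda": 0.05,    # regularization weight
    "max_iter": 30,    # number of sweeps over the ratings
}

# SGD additionally needs a carefully chosen step size (often with a decay schedule)
sgd_only = {"gamma": 0.005, "gamma_decay": 0.9}
```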

Anyway, it is hard to predict which method will work better on which dataset. That is why it is very convenient to use GraphLab/GraphChi: once you prepare your data in the right format, you can immediately test multiple methods.

Recommended reading:

  1. Stochastic Gradient Descent (SGD) vs Alternating Least Squares (ALS) for Matrix Factorization
  2. Baum J., Cook C., Curtis M., et al. Parallelization of Matrix Factorization for Recommender Systems [R]. Technical Report HPCF-2010-22, UMBC High Performance Computing Facility, University of Maryland, Baltimore County, 2010 (graduate assistants).