Evaluation of Recommender Systems

Course Content for This Section

I. Prediction vs. Top-N

1. Key distinction between Prediction and Top-N:

- Prediction is mostly about accuracy, possibly decision support; focused locally (on individual items).
- Top-N is mostly about ranking and decision support; focused comparatively (across the list).

2. Dead vs. Live Recs

- Retrospective (dead data) evaluation looks at how the recommender would have predicted or recommended for items already consumed/rated.
- Prospective (live experiment) evaluation looks at how recommendations are actually received by users.

II. Basic Accuracy Metrics

1. MAE (Mean Absolute Error):

MAE = \frac{\sum_{ratings} |P - R|}{\#ratings}

where P is the prediction score and R is the ground-truth rating score.

2. MSE (Mean Squared Error):

MSE = \frac{\sum_{ratings} (P - R)^2}{\#ratings}

MSE penalizes large errors more than small ones.

3. RMSE (Root Mean Squared Error):

RMSE = \sqrt{\frac{\sum_{ratings} (P - R)^2}{\#ratings}}
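
As a concrete reference, here is a minimal Python sketch of the three accuracy metrics above; the paired lists of predictions and ground-truth ratings are hypothetical examples.

```python
import math

def mae(predictions, ratings):
    """Mean Absolute Error over paired (prediction, ground-truth) scores."""
    return sum(abs(p - r) for p, r in zip(predictions, ratings)) / len(ratings)

def mse(predictions, ratings):
    """Mean Squared Error: penalizes large errors more than small ones."""
    return sum((p - r) ** 2 for p, r in zip(predictions, ratings)) / len(ratings)

def rmse(predictions, ratings):
    """Root Mean Squared Error: MSE mapped back onto the rating scale."""
    return math.sqrt(mse(predictions, ratings))

# Example: predictions vs. the ratings the user actually gave
print(mae([4.0, 2.5, 3.0], [5.0, 2.0, 3.0]))   # 0.5
print(rmse([4.0, 2.5, 3.0], [5.0, 2.0, 3.0]))  # ~0.65
```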

III. Basic Decision Support Metrics

Decision support metrics measure how well a recommender helps users make good decisions.
Examples:

  • For predictions: predicting 4 for an item the user would rate 2.5 is worse than predicting 2.5 for an item the user would rate 1, because the first error pushes the item over the "recommend" threshold.
  • For recommendations, the top of the list is what matters most.

1. Precision, Recall, and F-score:

Precision is the percentage of selected items that are relevant:

Precision = \frac{N_{rs}}{N_s}

Recall is the percentage of relevant items that are selected:

Recall = \frac{N_{rs}}{N_r}

where N_{rs} is the number of items that are both relevant and selected, N_s is the number of selected items, and N_r is the number of relevant items.

F-score is the harmonic mean of precision and recall, trading the two off against each other:

F_1 = \frac{2PR}{P + R}
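
A minimal sketch of these metrics in Python, assuming a hypothetical top-N list `recommended` and a set `relevant` of items the user actually liked:

```python
def precision_recall_f1(recommended, relevant):
    """recommended: list of item ids shown to the user (the 'selected' set);
    relevant: set of item ids the user actually liked."""
    selected = set(recommended)
    hits = len(selected & set(relevant))   # N_rs: items both relevant and selected
    precision = hits / len(selected)       # N_rs / N_s
    recall = hits / len(relevant)          # N_rs / N_r
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Example: a top-5 list in which 2 of the user's 4 relevant items appear
print(precision_recall_f1(["a", "b", "c", "d", "e"], {"a", "d", "x", "y"}))
# (0.4, 0.5, 0.444...)
```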

2. ROC (Receiver Operating Characteristic):

The ROC curve plots the performance of a classifier or filter at different thresholds: the true-positive rate against the false-positive rate (see http://en.wikipedia.org/wiki/Receiver_operating_characteristic for details).
In recommender systems, the curve reflects the trade-offs as you vary the prediction cut-off for recommending (vs. not recommending).
The area under the curve (AUC) is often used as a single-number measure of recommender effectiveness.
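
One equivalent way to compute the area under the ROC curve, instead of sweeping thresholds explicitly, is as the probability that a randomly chosen relevant item is scored above a randomly chosen non-relevant one (ties counting half). A minimal sketch, with hypothetical `scores` and `labels`:

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen relevant item is scored above a randomly chosen non-relevant one
    (ties count as 0.5). scores: predicted scores; labels: 1 = relevant, 0 = not."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: higher scores mostly go to relevant items -> AUC closer to 1
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```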

3. Rank Metrics

Metric families:

  • Prediction accuracy: how well does the recommender estimate preference?
  • Decision support: how well does the recommender do at finding good things?
  • Rank accuracy: how well does the recommender estimate relative preference?

1. MRR (Mean Reciprocal Rank)

Reciprocal rank: 1/i, where i is the rank of the first 'good' item.

  • Similar to precision/recall:
    – P/R measure how good the recommender is at recommending only relevant items (precision) and at finding the relevant items (recall).
    – RR measures how far down the list you have to go to find something good.

MRR is simply the reciprocal rank averaged over all test queries.
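
A minimal sketch of reciprocal rank and MRR, assuming hypothetical recommendation lists and per-user sets of relevant items:

```python
def reciprocal_rank(recommended, relevant):
    """1/i where i is the 1-based rank of the first relevant ('good') item,
    or 0.0 if no relevant item appears in the list."""
    for i, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(rec_lists, relevant_sets):
    """Average reciprocal rank over all test users/queries."""
    rrs = [reciprocal_rank(recs, rel) for recs, rel in zip(rec_lists, relevant_sets)]
    return sum(rrs) / len(rrs)

# Example: first good item at rank 2 for user 1 and rank 1 for user 2 -> (0.5 + 1.0) / 2
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}]))  # 0.75
```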

2. Spearman Rank Correlation

SRC = \frac{\sum_i (r_1(i) - \mu_1)(r_2(i) - \mu_2)}{\sqrt{\sum_i (r_1(i) - \mu_1)^2}\sqrt{\sum_i (r_2(i) - \mu_2)^2}}

where r_1(i) and r_2(i) are the ranks of item i in the two rankings and \mu_1, \mu_2 are the mean ranks.

Spearman correlation punishes misplacement, and it punishes every misplacement equally. Our goal, however, is to weight items at the top of the list more heavily.
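
A minimal sketch that computes Spearman rank correlation by ranking both score lists and applying the formula above to the ranks; it assumes no tied scores, and the example lists are hypothetical:

```python
def spearman(scores1, scores2):
    """Spearman rank correlation: Pearson correlation computed on the ranks.
    Assumes no tied scores, for simplicity."""
    def ranks(scores):
        # Rank 1 goes to the highest score
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        r = [0] * len(scores)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    r1, r2 = ranks(scores1), ranks(scores2)
    n = len(r1)
    mu1, mu2 = sum(r1) / n, sum(r2) / n
    num = sum((a - mu1) * (b - mu2) for a, b in zip(r1, r2))
    den = (sum((a - mu1) ** 2 for a in r1) * sum((b - mu2) ** 2 for b in r2)) ** 0.5
    return num / den

# Example: the system's ordering almost matches the user's true ordering
print(spearman([0.9, 0.7, 0.3, 0.1], [5, 3, 4, 1]))  # 0.8
```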

3. Discounted Cumulative Gain

  • Measures the utility of the item at each position in the top-N list.
  • Discounts by position, so items at the front count more.

Normalizing by the total achievable utility (the DCG of a perfect ranking) gives the Normalized Discounted Cumulative Gain (nDCG):

DCG(r) = \sum_i disc(r(i)) \, u(i)
nDCG(r) = \frac{DCG(r)}{DCG(r_{perfect})}

DCG is widely used as an objective in learning to rank, and nDCG is increasingly common in recommender systems; MRR is also used.
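
A minimal sketch of DCG and nDCG; since the notes leave disc() unspecified, it assumes the common 1/log2(rank + 1) discount, and the utility list is a hypothetical example:

```python
import math

def dcg(utilities):
    """Discounted cumulative gain of a ranked list of per-item utilities,
    using the common 1/log2(rank + 1) discount (an assumed choice of disc())."""
    return sum(u / math.log2(rank + 1) for rank, u in enumerate(utilities, start=1))

def ndcg(utilities):
    """Normalize by the DCG of the perfect (utility-sorted) ranking."""
    ideal = dcg(sorted(utilities, reverse=True))
    return dcg(utilities) / ideal if ideal > 0 else 0.0

# Example: relevant items at ranks 1 and 3, nothing relevant at rank 2
print(ndcg([1, 0, 1]))  # ~0.92
```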


Course Materials for This Section

[video](http://pan.baidu.com/s/1gdzHL3l)
[slide](http://pan.baidu.com/s/1jGiaJNC)
[code](http://pan.baidu.com/s/1kThYETT)


