Evaluation of Recommender Systems
Contents of this lecture
I. Prediction vs. Top-N
1. Key distinction between Prediction and Top-N:
- Prediction is mostly about accuracy, and possibly decision support; the focus is local (how close each predicted score is to the true rating).
- Top-N is mostly about ranking and decision support; the focus is comparative (how items rank relative to each other).
2. Dead vs. Live Recs
- Retrospective ("dead data") evaluation looks at how the recommender would have predicted or recommended items the user has already consumed/rated.
- Prospective (live experiment) evaluation looks at how recommendations are actually received by users.
II. Basic Accuracy Metrics
1. MAE (Mean Absolute Error):
MAE = (1/n) · Σ |P_i − R_i|, where P_i is the predicted score and R_i is the ground-truth rating.
2. MSE (Mean Squared Error):
MSE = (1/n) · Σ (P_i − R_i)²
MSE penalizes large errors more than small ones.
3. RMSE (Root Mean Squared Error):
RMSE = √MSE, which puts the error back on the original rating scale.
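A minimal Python sketch of these three metrics, assuming `preds` and `truth` are parallel lists of predicted scores (P) and ground-truth ratings (R); the names and sample data are illustrative, not from the lecture:

```python
import math

def mae(preds, truth):
    # Mean Absolute Error: average of |P - R|
    return sum(abs(p - r) for p, r in zip(preds, truth)) / len(truth)

def mse(preds, truth):
    # Mean Squared Error: average of (P - R)^2; penalizes large errors more
    return sum((p - r) ** 2 for p, r in zip(preds, truth)) / len(truth)

def rmse(preds, truth):
    # Root Mean Squared Error: square root of MSE, back on the rating scale
    return math.sqrt(mse(preds, truth))

preds = [4.0, 3.5, 2.0, 5.0]
truth = [3.0, 4.0, 2.5, 4.5]
print(mae(preds, truth), mse(preds, truth), rmse(preds, truth))
```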
III. Basic Decision Support Metrics
Decision support metrics measure how well a recommender helps users make good decisions.
Examples:
- For predictions: predicting 4 when the true rating is 2.5 is worse than predicting 2.5 when the true rating is 1, even though the absolute error is the same, because only the first mistake pushes a mediocre item over the "worth consuming" threshold.
- For recommendations, the top of the list is what matters most.
1. Precision, Recall, and F-score:
- Precision is the percentage of selected (recommended) items that are relevant: precision = (# selected and relevant) / (# selected).
- Recall is the percentage of relevant items that are selected: recall = (# selected and relevant) / (# relevant).
- The F-score trades off precision and recall; F1 is their harmonic mean: F1 = 2 · precision · recall / (precision + recall).
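A small sketch of precision, recall, and F1 for a single Top-N list; `recommended` and `relevant` are illustrative sets of item ids, not part of the lecture material:

```python
def precision_recall_f1(recommended, relevant):
    # Items that were both recommended and actually relevant
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# precision = 2/4, recall = 2/3, F1 = their harmonic mean
print(precision_recall_f1({"a", "b", "c", "d"}, {"b", "d", "e"}))
```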
2. ROC (Receiver Operating Characteristic):
The ROC curve plots the performance of a classifier or filter at different thresholds.
It plots the true positive rate against the false positive rate.
See http://en.wikipedia.org/wiki/Receiver_operating_characteristic for details.
In recommender systems, the curve reflects trade-offs as you vary the prediction cut-off for recommending (vs. not).
The area under the curve (AUC) is often used as a measure of recommender effectiveness.
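A rough sketch of how the ROC curve and AUC can be computed by sweeping the recommendation cut-off over the predicted scores; the scores and relevance labels below are made up for illustration, and the sketch assumes at least one relevant and one non-relevant item:

```python
def roc_points(scores, labels):
    # Sweep the cut-off over every observed score and record, at each
    # threshold, the (false positive rate, true positive rate) pair.
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return [(0.0, 0.0)] + points

def auc(points):
    # Area under the ROC curve via the trapezoid rule.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,   1,   0,   1,   0,    0]
print(auc(roc_points(scores, labels)))  # 8/9: one mis-ordered relevant/non-relevant pair
```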
IV. Rank Metrics
Metric families:
- Prediction accuracy: how well does the recommender estimate preference?
- Decision support: how well does the recommender do at finding good things?
- Rank accuracy: how well does the recommender estimate relative preference?
1. MRR (Mean Reciprocal Rank)
Reciprocal rank: 1/i, where i is the rank of the first ‘good’ item
- Similar to precision/recall:
– P/R measure how good the recommender is at recommending only relevant things (precision) and at finding all the relevant things (recall)
– RR measures how far down the list you have to go to find something good
MRR is just the average reciprocal rank over all test queries (or users).
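A small sketch of reciprocal rank and MRR, assuming each test case is a ranked recommendation list plus a set of 'good' items; the structure and data are illustrative:

```python
def reciprocal_rank(ranked_items, good_items):
    # 1/i for the rank i of the first good item; 0 if none appears.
    for i, item in enumerate(ranked_items, start=1):
        if item in good_items:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(test_cases):
    # Average the reciprocal rank over all test queries/users.
    return sum(reciprocal_rank(r, g) for r, g in test_cases) / len(test_cases)

cases = [
    (["a", "b", "c"], {"b"}),   # first good item at rank 2 -> 1/2
    (["d", "e", "f"], {"d"}),   # first good item at rank 1 -> 1
]
print(mean_reciprocal_rank(cases))  # (0.5 + 1.0) / 2 = 0.75
```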
2. Spearman Rank Correlation
- It punishes misplacement, and punishes all misplacements equally, regardless of position in the list.
- But our goal is usually to weight items at the top of the list more heavily.
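A quick sketch that also illustrates the "all misplacements count equally" point: with the standard tie-free Spearman formula, swapping two items at the top of the list costs exactly as much as swapping two at the bottom (item names are illustrative), which motivates the position-discounted metrics below.

```python
def spearman(system_order, reference_order):
    # Both arguments list the same items, ordered best-to-worst; assumes no ties.
    n = len(system_order)
    ref_rank = {item: i for i, item in enumerate(reference_order)}
    # Sum of squared rank differences; every misplacement is weighted equally.
    d2 = sum((i - ref_rank[item]) ** 2 for i, item in enumerate(system_order))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman(["b", "a", "c", "d"], ["a", "b", "c", "d"]))  # swap at the top: 0.8
print(spearman(["a", "b", "d", "c"], ["a", "b", "c", "d"]))  # swap at the bottom: 0.8
```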
3. Discounted Cumulative Gain (DCG)
- Measures the utility of the item at each position in the Top-N list.
- Discounts by position, so items near the front of the list count more.
- Normalizing by the best achievable utility (the ideal ordering of the same items) gives Normalized Discounted Cumulative Gain (nDCG).
DCG comes from the information retrieval / learning-to-rank literature; nDCG is increasingly common in recsys evaluation, and MRR is also used.
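A short sketch of DCG and nDCG, using the common log2 position discount (other discount variants exist); `gains` is an illustrative list of per-position utilities, e.g. ratings or relevance grades:

```python
import math

def dcg(gains):
    # Discount each position by log2(position + 1), so the front of the list
    # contributes most (one common formulation of DCG).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    # Normalize by the best achievable DCG: the same gains sorted descending.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(dcg([3, 2, 3, 0, 1]))
print(ndcg([3, 2, 3, 0, 1]))  # 1.0 only if the list is already ideally ordered
```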