Evaluation of Recommender Systems
Contents of this lecture
I. Prediction vs. Top-N
1. Key distinction between Prediction and Top-N:
- Prediction is mostly about accuracy, and possibly decision support; the focus is local (how close each predicted score is to the true rating).
- Top-N is mostly about ranking and decision support; the focus is comparative (how items rank relative to each other).
2. Dead vs. Live Recs
- Retrospective ("dead data") evaluation looks at how the recommender would have predicted or recommended items the user has already consumed/rated.
- Prospective (live experiment) evaluation looks at how recommendations are actually received by users.
II. Basic Accuracy Metrics
1. MAE (Mean Absolute Error):
MAE = (1/n) · Σ |P_i − R_i|, where P_i is the predicted score and R_i is the ground-truth rating.
2. MSE (Mean Squared Error):
MSE = (1/n) · Σ (P_i − R_i)²
MSE penalizes large errors more than small ones.
3. RMSE (Root Mean Squared Error):
RMSE = √MSE, which puts the error back on the original rating scale.
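A minimal Python sketch of these three metrics, assuming `preds` and `truth` are parallel lists of predicted scores (P) and ground-truth ratings (R); the names and sample data are illustrative, not from the lecture:

```python
import math

def mae(preds, truth):
    # Mean Absolute Error: average of |P - R|
    return sum(abs(p - r) for p, r in zip(preds, truth)) / len(truth)

def mse(preds, truth):
    # Mean Squared Error: average of (P - R)^2; penalizes large errors more
    return sum((p - r) ** 2 for p, r in zip(preds, truth)) / len(truth)

def rmse(preds, truth):
    # Root Mean Squared Error: square root of MSE, back on the rating scale
    return math.sqrt(mse(preds, truth))

preds = [4.0, 3.5, 2.0, 5.0]
truth = [3.0, 4.0, 2.5, 4.5]
print(mae(preds, truth), mse(preds, truth), rmse(preds, truth))
```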
III. Basic Decision Support Metrics
Decision support metrics measure how well a recommender helps users make good decisions.
Examples:
- For predictions: predicting 4 when the true rating is 2.5 is worse than predicting 2.5 when the true rating is 1, even though the absolute error is the same, because only the first mistake pushes a mediocre item over the "worth consuming" threshold.
- For recommendations, the top of the list is what matters most.
1. Precision, Recall, and F-score:
- Precision is the percentage of selected (recommended) items that are relevant: precision = (# selected and relevant) / (# selected).
- Recall is the percentage of relevant items that are selected: recall = (# selected and relevant) / (# relevant).
- The F-score trades off precision and recall; F1 is their harmonic mean: F1 = 2 · precision · recall / (precision + recall).
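A small sketch of precision, recall, and F1 for a single Top-N list; `recommended` and `relevant` are illustrative sets of item ids, not part of the lecture material:

```python
def precision_recall_f1(recommended, relevant):
    # Items that were both recommended and actually relevant
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# precision = 2/4, recall = 2/3, F1 = their harmonic mean
print(precision_recall_f1({"a", "b", "c", "d"}, {"b", "d", "e"}))
```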
2. ROC (Receiver Operating Characteristic):
The ROC curve plots the performance of a classifier or filter at different thresholds.
It plots the true positive rate against the false positive rate.
See http://en.wikipedia.org/wiki/Receiver_operating_characteristic for details.
In recommender systems, the curve reflects trade-offs as you vary the prediction cut-off for recommending (vs. not).
The area under the curve (AUC) is often used as a measure of recommender effectiveness.
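A rough sketch of how the ROC curve and AUC can be computed by sweeping the recommendation cut-off over the predicted scores; the scores and relevance labels below are made up for illustration, and the sketch assumes at least one relevant and one non-relevant item:

```python
def roc_points(scores, labels):
    # Sweep the cut-off over every observed score and record, at each
    # threshold, the (false positive rate, true positive rate) pair.
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return [(0.0, 0.0)] + points

def auc(points):
    # Area under the ROC curve via the trapezoid rule.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,   1,   0,   1,   0,    0]
print(auc(roc_points(scores, labels)))  # 8/9: one mis-ordered relevant/non-relevant pair
```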
IV. Rank Metrics
Metric families:
- Prediction accuracy: how well does the recommender estimate preference?
- Decision support: how well does the recommender do at finding good things?
- Rank accuracy: how well does the recommender estimate relative preference?
1. MRR (Mean Reciprocal Rank)
Reciprocal rank: 1/i, where i is the rank of the first ‘good’ item
- Similar to precision/recall:
– P/R measure how good the recommender is at recommending only relevant things (precision) and at finding all the relevant things (recall)
– RR measures how far down the list you have to go to find something good
MRR is just the average reciprocal rank over all test queries (or users).
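A small sketch of reciprocal rank and MRR, assuming each test case is a ranked recommendation list plus a set of 'good' items; the structure and data are illustrative:

```python
def reciprocal_rank(ranked_items, good_items):
    # 1/i for the rank i of the first good item; 0 if none appears.
    for i, item in enumerate(ranked_items, start=1):
        if item in good_items:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(test_cases):
    # Average the reciprocal rank over all test queries/users.
    return sum(reciprocal_rank(r, g) for r, g in test_cases) / len(test_cases)

cases = [
    (["a", "b", "c"], {"b"}),   # first good item at rank 2 -> 1/2
    (["d", "e", "f"], {"d"}),   # first good item at rank 1 -> 1
]
print(mean_reciprocal_rank(cases))  # (0.5 + 1.0) / 2 = 0.75
```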
2. Spearman Rank Correlation
- It punishes misplacement, and punishes all misplacements equally, regardless of position in the list.
- But our goal is usually to weight items at the top of the list more heavily.
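A quick sketch that also illustrates the "all misplacements count equally" point: with the standard tie-free Spearman formula, swapping two items at the top of the list costs exactly as much as swapping two at the bottom (item names are illustrative), which motivates the position-discounted metrics below.

```python
def spearman(system_order, reference_order):
    # Both arguments list the same items, ordered best-to-worst; assumes no ties.
    n = len(system_order)
    ref_rank = {item: i for i, item in enumerate(reference_order)}
    # Sum of squared rank differences; every misplacement is weighted equally.
    d2 = sum((i - ref_rank[item]) ** 2 for i, item in enumerate(system_order))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman(["b", "a", "c", "d"], ["a", "b", "c", "d"]))  # swap at the top: 0.8
print(spearman(["a", "b", "d", "c"], ["a", "b", "c", "d"]))  # swap at the bottom: 0.8
```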
3. Discounted Cumulative Gain (DCG)
- Measures the utility of the item at each position in the Top-N list.
- Discounts by position, so items near the front of the list count more.
- Normalizing by the best achievable utility (the ideal ordering of the same items) gives Normalized Discounted Cumulative Gain (nDCG).
DCG comes from the information retrieval / learning-to-rank literature; nDCG is increasingly common in recsys evaluation, and MRR is also used.
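A short sketch of DCG and nDCG, using the common log2 position discount (other discount variants exist); `gains` is an illustrative list of per-position utilities, e.g. ratings or relevance grades:

```python
import math

def dcg(gains):
    # Discount each position by log2(position + 1), so the front of the list
    # contributes most (one common formulation of DCG).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    # Normalize by the best achievable DCG: the same gains sorted descending.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(dcg([3, 2, 3, 0, 1]))
print(ndcg([3, 2, 3, 0, 1]))  # 1.0 only if the list is already ideally ordered
```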