From the course: Recommendation Systems: A Practical Hands-On Introduction

Evaluation of recommendation systems - Python Tutorial

- Okay, so far we've seen an overview of the two big families of recommendation systems, content-based filtering and collaborative filtering, and we discussed the two algorithms that you can try. Now, how do we know which one to use? In this video, we're going to talk about evaluation of recommendation systems. In all machine learning solutions, we need evaluation metrics to measure the performance of the algorithms and decide which one is best for our use case. In reco, we have three families of evaluation metrics: rating, ranking, and diversity. Rating metrics measure how accurate a recommender is at predicting the exact interaction provided by the user. Sometimes rating metrics are not a good indicator of the performance of a recommender, because in a real production system we recommend a carousel of items. Think about the recommendations you see in media providers or in e-commerce: they tend to be rows of five to 20 items. In this case, we are interested not only in providing the right content, but also in providing it in the right order. That's what ranking metrics are for. Ranking metrics measure how relevant the recommendations are for users. Finally, there is another family of metrics called diversity, which can measure things like how novel, diverse, or surprising the recommendations are. In this course, we are going to focus on rating and ranking metrics, which are the most impactful for the kind of use cases we will find in the industry.

If you have done other courses on machine learning, you will be familiar with some of the following rating metrics. RMSE measures the average error in the predicted ratings; because the errors are squared, it can be affected by outliers. MAE is similar to RMSE, but uses the absolute value instead of the square, which makes it a better estimator of the true average error. AUC is a good metric when we model the click-through rate as a click or no-click classification; it represents the ability of the model to discriminate between the positive and negative classes. Log loss heavily penalizes classifiers that are confident about incorrect classifications.

Ranking metrics are not that common in other machine learning areas, but they are very common in search, as well as in reco. Precision measures the proportion of recommended items that are relevant; in other words, it is the ability of the model to label a correct sample as correct. Recall measures the proportion of relevant items that are recommended; it is the ability to capture all the correct samples. NDCG evaluates how well the recommender ranks items for each user, so both hitting the relevant items and placing them in the correct order matter. MAP computes the average precision for each user and then averages it over all users in the dataset. You will notice that these ranking metrics are reported at K: Precision@K, Recall@K, et cetera. That means the metrics are computed only on the top K results. You can see this on an e-commerce site: when you get recommendations, you may see five results, and you want most of those five to be useful recommendations. That would be a high precision and recall at five.

Okay, now what do we do with all these metrics? We create a table to compare them all. The algorithms that provide the best metrics are the ones that we are going to test in production using an A/B test. Evaluation metrics in reco are slightly different from those in other machine learning areas, but at the end of the day we are looking for the same outcome: we want to select the best algorithm for our system.
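To make the rating metrics concrete, here is a minimal sketch in Python using NumPy and scikit-learn. The arrays are made-up toy values standing in for a recommender's predictions on a test split; they are not taken from the course.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    roc_auc_score,
    log_loss,
)

# Explicit feedback: ratings the users actually gave vs. the recommender's predictions.
actual_ratings = np.array([4.0, 3.0, 5.0, 2.0, 1.0])
predicted_ratings = np.array([3.5, 3.0, 4.5, 3.0, 1.5])

rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings))  # squaring penalizes large errors
mae = mean_absolute_error(actual_ratings, predicted_ratings)           # average absolute error

# Implicit feedback: click / no-click labels vs. the predicted click probability.
clicked = np.array([1, 0, 1, 1, 0])
click_probability = np.array([0.8, 0.3, 0.6, 0.9, 0.2])

auc = roc_auc_score(clicked, click_probability)   # ability to separate clicks from non-clicks
logloss = log_loss(clicked, click_probability)    # punishes confident wrong predictions

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  AUC={auc:.3f}  LogLoss={logloss:.3f}")
```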
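The ranking metrics can also be computed by hand for a single user. Below is a small, self-contained sketch of Precision@K, Recall@K, and NDCG@K; the item IDs and the relevant set are hypothetical, and in practice you would average these values over all users (which is what MAP does for precision).

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Proportion of the top-k recommended items that are relevant."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Proportion of all relevant items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Discounted cumulative gain of the top-k list, normalized by the ideal ranking."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal_dcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: the carousel shows 5 items; the user has 3 relevant items overall.
recommended = ["A", "B", "C", "D", "E"]  # ranked output of the recommender
relevant = {"A", "C", "F"}               # items the user actually interacted with

print(precision_at_k(recommended, relevant, 5))  # 2 hits out of 5 -> 0.4
print(recall_at_k(recommended, relevant, 5))     # 2 of 3 relevant items found -> ~0.67
print(ndcg_at_k(recommended, relevant, 5))       # rewards placing "A" at the top of the list
```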
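Finally, a sketch of the comparison table using pandas. The algorithm names and metric values below are placeholders for illustration only, not real benchmark results; in practice you would fill the dictionary with the numbers produced by your own evaluation runs.

```python
import pandas as pd

# Placeholder metric values for illustration only; replace them with the output of
# your own offline evaluation of each candidate algorithm.
metrics_by_algorithm = {
    "content-based": {"RMSE": 1.05, "MAE": 0.84, "Precision@5": 0.12, "NDCG@5": 0.21},
    "collaborative": {"RMSE": 0.98, "MAE": 0.77, "Precision@5": 0.15, "NDCG@5": 0.27},
}

comparison = pd.DataFrame.from_dict(metrics_by_algorithm, orient="index")

# The best-performing candidates offline are the ones to take into an online A/B test.
print(comparison.sort_values("NDCG@5", ascending=False))
```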
