Use Case #1: A Glimpse Into The World of Retail (part 2)
Part 2: Recommendation Systems
Written by Hajar Ait El Kadi & Koffi Cornelis
This is the second part of a two-part article analysing a retail dataset. If you haven’t already, check out part 1 here. Or don’t! It’s a free world.
Whether we want to rekindle an old flame with a dormant customer 🔥, or keep an active one coming for more, nothing says “I care about you 🌹and I want to take us to the next level” more than anticipating their needs and presenting them on a silver platter.
But how do we do that?
Enter Recommendation Systems 🤩
In the following article, we use common recommendation system approaches on retail transaction data analysed in part 1.
4. Recommendation System Approaches
Plan of action:
- Evaluation metrics
- Baselines
- Models
- Performance
- Example
Before we jump in, let’s first establish a few ground rules!
★ Prerequisites:
- An item is the combination of a category and a subcategory.
- Returns are removed from the transactions and only the purchase record is kept. A customer may return a product that doesn’t suit their needs, but they are probably still interested in the category/subcategory and thus the item. Sometimes, especially in e-commerce, customers order several sizes and keep only one.
- All products can be recommended: New items are recommended, but also those that have already been purchased, as a customer may want to buy an item more than once.
If you’re a little lost, then you should have read part 1🤓.
★ What we evaluate on:
- Monthly 2013 recommendations: Most customers only make one transaction a month. For each month M, we use the sales history up to month M-1 to generate recommendations for month M. We discard the 2014 data as there is too little of it.
- Only customers who actually bought something during the month we’re evaluating: Since we generate recommendations per month, we can only evaluate our predictions on customers who actually made a transaction during the month in question.
1. Evaluation Metrics
We need a winner and a loser amongst our models. To determine our champion, we compare the models’ performance on a few metrics.
Say hello to precision and recall:
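With R the set of items we recommend to a customer and T the set of items they actually bought, the usual definitions are:

```latex
\mathrm{Precision} = \frac{|R \cap T|}{|R|}
\qquad
\mathrm{Recall} = \frac{|R \cap T|}{|T|}
```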
However, those two are not enough, as they don’t take into account the recommendation list ordering. Instead, we call in the big guns, figuratively speaking of course, as we are anti-gun violence.
If relevant items are at the top of the ranked list of predictions, it will positively impact the following metrics¹:
- Precision@K: how many of the top K predictions are relevant items.
- Recall@K: how well the recommender retrieved all relevant items among the top K predictions.
- MAP@K — Mean Average Precision@K: computes the Average Precision (AP) over all the customers at rank K. The AP rewards having a lot of relevant recommendations and having them at the top of the list.
- Novelty@K²: provides insight into newly recommended items. If the score is high, it means that we recommend relevant new items to customers.
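To make these concrete, here is a minimal per-customer sketch (names are illustrative: recommended is a ranked list of items, bought the set of items actually purchased in the evaluated month; the novelty variant below is our simplification, see [2] for the formal definition):

```python
def precision_at_k(recommended, bought, k):
    """Share of the top-k recommendations the customer actually bought."""
    return len(set(recommended[:k]) & set(bought)) / k

def recall_at_k(recommended, bought, k):
    """Share of the customer's purchases retrieved in the top-k."""
    return len(set(recommended[:k]) & set(bought)) / len(bought)

def average_precision_at_k(recommended, bought, k):
    """Averages precision@i over the ranks i where a hit occurs,
    rewarding relevant items placed near the top of the list."""
    hits, total = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in bought:
            hits += 1
            total += hits / i
    return total / min(len(bought), k) if bought else 0.0

def novelty_at_k(recommended, previously_bought, k):
    """Share of the top-k items the customer never bought before."""
    return sum(item not in previously_bought for item in recommended[:k]) / k

# MAP@K is the mean of average_precision_at_k over all evaluated customers.
```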
Sadly, we won’t be handing out any participation trophies. And of course for metrics, as for all things in life, bigger is better across the board 👀!
2. Baselines
We use two recommenders as baselines:
- The last sold items recommender: recommends the last items a customer bought. If the customer has not bought anything before, we recommend the last items bought by the other customers.
- The most sold items recommender: recommends the items a customer has bought most often. If the customer has not bought anything before, we recommend the items bought most often by all the other customers.
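A rough sketch of both, assuming a hypothetical transactions DataFrame with customer_id, item and date columns:

```python
import pandas as pd

def last_sold_recommender(transactions: pd.DataFrame, customer_id, k=5):
    """Last k distinct items the customer bought; everyone's latest as fallback."""
    history = transactions[transactions["customer_id"] == customer_id]
    source = history if not history.empty else transactions
    latest_first = source.sort_values("date", ascending=False)
    return latest_first["item"].drop_duplicates().head(k).tolist()

def most_sold_recommender(transactions: pd.DataFrame, customer_id, k=5):
    """Items the customer bought most often; global bestsellers as fallback."""
    history = transactions[transactions["customer_id"] == customer_id]
    source = history if not history.empty else transactions
    return source["item"].value_counts().head(k).index.tolist()
```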
3. Models
All code can be found here.
a. Content-based recommender
By definition, content-based recommender systems use item features to recommend new items to a user. We created our items’ features by encoding the columns “category” and “subcategory” using OneHotEncoder.
Basically, a customer who buys women’s clothes and women’s bags is more likely to be interested in women’s items. We will not include customer-related features (age, sex…) nor context-related features (season, year…).
We talked a lot about categories, subcategories and items. Let’s take a closer look 🔍:
- 6 Categories: Clothings, Footwear, Electronics, Bags, Books and Home and Kitchen
- 18 Subcategories: Women, Men, Kids, Children, Non-Fiction, Academic, Fiction, Audio and Video, Furnishing, Comics, Mobiles, Computers, Personal Appliances, Camera, DIY, Kitchen, Bath and Tools
Based on our item definition and our dataset, we end up with 23 items (not every category pairs with every subcategory).
Step 1: Build two matrices to input into the model
We build a first matrix items_features that describes items in terms of features. From transactions, we build a second matrix users_items_sales describing the history of interactions (purchases) between customers and items:
- items_features matrix: Describes each item’s features.
- users_items_sales matrix: Describes how many units of each item each customer bought.
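In code, building both could look like this (a sketch; the items and transactions tables and their column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# items: one row per item, with its category and subcategory.
encoder = OneHotEncoder()
items_features = encoder.fit_transform(items[["category", "subcategory"]]).toarray()
# shape: (23 items, 24 features) -- 6 category + 18 subcategory columns

# Purchase counts per (customer, item) pair; zero for items never bought.
users_items_sales = transactions.pivot_table(
    index="customer_id", columns="item", values="quantity",
    aggfunc="sum", fill_value=0,
)
```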
Step 2: Compute a user preference matrix
The users_features_preference matrix gives the importance of each feature for all customers that have actually bought items. With S = users_items_sales and F = items_features, one standard formulation (the one our sketches follow) normalizes each customer’s purchase counts and averages the features of the items they bought:
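```latex
P = \hat{S}\,F
\qquad \text{with} \qquad
\hat{S}_{u,i} = \frac{S_{u,i}}{\sum_{j} S_{u,j}}
```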
In code, that computation looks roughly like this:
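```python
import numpy as np

S = users_items_sales.to_numpy(dtype=float)
# Normalize each row so a customer's purchase weights sum to 1.
row_sums = S.sum(axis=1, keepdims=True)
S_hat = np.divide(S, row_sums, out=np.zeros_like(S), where=row_sums > 0)
# Weighted average of the features of the items each customer bought.
users_features_preference = S_hat @ items_features
```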
Step 3: Compute the score of relevance per item per customer
To get the score of relevance per item per customer, we multiply the users_features_preference matrix by the transpose of items_features.
The items with the highest scores are recommended for the customer 🚀.
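Continuing the sketch (customer_index is illustrative):

```python
# Relevance of every item for every customer: shape (# users, # items).
scores = users_features_preference @ items_features.T
# Indices of the top-5 items for one customer, best first.
top_5 = np.argsort(scores[customer_index])[::-1][:5]
```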
b. Collaborative filtering recommender
By definition, collaborative filtering recommender systems rely solely on interactions between customers to recommend their next purchase. Basically, a customer who buys several items that another customer has also purchased is likely to be interested in items the latter has already bought but the former hasn’t yet.
Confused? That’s okay, we’re nice enough to break it down for you:
If you still don’t understand, then we can’t help you 🤷♀️!
How does it work?
Using matrix factorisation, computed via Alternating Least Squares (ALS)³!
Similarly to the content-based recommender, we will compute a relevance matrix using an items_features matrix and a users_features_preference matrix. The difference, this time, is that we let matrix factorisation learn them.
Let’s look at it step by step:
Step 1: Build two matrices to input into the model
Similarly to the content-based recommender, we build the same users_items_sales matrix describing the history of interactions (purchases) between customers and items.
Step 2: Jointly compute an items’ features and a user preference matrix
Matrix factorisation will take it from here. It needs a number of latent features N (we went with lucky number 7 after trial and error) to build our two matrices:
- items_features: of shape (# items, # latent features N); describes items in terms of features.
- users_features_preference: of shape (# users, # latent features N); describes the importance of each feature to users.
The matrix factorisation builds the aforementioned matrices based on the rule:
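In matrix form (with S = users_items_sales, P = users_features_preference and F = items_features):

```latex
S \approx P\,F^{\top}
\qquad\Longleftrightarrow\qquad
S_{u,i} \approx \sum_{n=1}^{N} P_{u,n}\,F_{i,n}
```

ALS alternates between fixing F and solving a least-squares problem for P, then fixing P and solving for F, until the approximation stabilises.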
We used the Implicit python package to implement it. A minimal usage sketch (recent versions of the package expect a sparse user × item matrix; every hyperparameter except factors=7 is left at its default):
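```python
from implicit.als import AlternatingLeastSquares
from scipy.sparse import csr_matrix

# implicit expects a sparse matrix of interaction strengths.
user_items = csr_matrix(users_items_sales.to_numpy())

model = AlternatingLeastSquares(factors=7)  # N = 7 latent features
model.fit(user_items)

users_features_preference = model.user_factors  # shape: (# users, 7)
items_features = model.item_factors            # shape: (# items, 7)
```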
Step 3: Compute the score of relevance per item per customer
Once again, the score of relevance per item per customer is computed by multiplying the users_features_preference matrix by the transpose of items_features.
And voilà!
Based on the highest scores, items are recommended to each customer.
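The package can also produce the ranked list directly; something along these lines, assuming the recent API (customer_index is again illustrative):

```python
# Top-5 items for one customer, straight from the fitted model.
ids, rec_scores = model.recommend(
    customer_index,
    user_items[customer_index],
    N=5,
    filter_already_liked_items=False,  # re-buying an item is allowed in our setup
)
```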
4. Performance
We call upon the metrics mentioned above.
a. Precision@K:
At rank K=1, the difference between the models is most apparent; from rank K=2 onwards, performances start to converge.
Collaborative filtering takes the cake: it makes the most precise recommendations up until rank 10.
As expected, the higher the rank K, the lower the performance for all the models.
b. Recall@K:
The models seem to have similar performance across all ranks, with collaborative filtering being very slightly superior at K=1.
c. MAP@K:
Collaborative filtering does slightly better starting from the first rank, which is to be expected since it performed best at rank 1 (K=1) for Precision@K. Other than that, performance is close for all recommenders.
d. Novelty@K:
As with Precision@K, the biggest difference is registered at rank K=1. Collaborative filtering strikes again.
Remember, we already mentioned in part 1 that most customers do not buy an item more than once. That might explain why collaborative filtering does better than the other recommenders: it recommends more new items (new in the sense that the customer in question never bought them before, not to be confused with new items added to the catalogue).
You probably thought: wait, shouldn’t the novelty score for the Most Sold and Last Sold recommenders be zero? You’re both very perceptive and right!
However, there is always the case where a user previously bought fewer items than the rank K in question. We then recommend items at random to reach K items, which is why the novelty score differs from zero.
5. Example
Remember the customers whose history we looked at in part 1? You probably don’t (last chance to give it a quick read). We will look more in depth at what we would have recommended to them throughout 2013.
Let’s look at customer 7, for example. He generated consistent revenue over the years.
In March, having previously bought men’s clothing and 5 cameras, CFR (the collaborative filtering recommender) strongly recommends fiction books, men’s clothing, a camera, men’s bags and home and kitchen tools. CBR (the content-based recommender), on the other hand, recommends men’s clothing, a camera, women’s clothing, fiction books and kids’ clothing.
We can clearly see that CBR tends to recommend items in the same category whereas CFR branches out a bit more.
The following month, the customer bought fiction books and home and kitchen tools, making collaborative filtering the better recommender here.
Conclusion
And we have a winner ladies and gentlemen🥁🥁 Collaborative Filtering!
Not by a lot, but a win is a win! And the model will only do better once we feed it more customer transaction data.
But keep in mind that collaborative filtering will always have a popularity bias as it tends to recommend popular items. It also has a cold start problem, as it fails to recommend new items added to the categories or lesser-known items.
And that’s a wrap everyone, all good things must come to an end!
Cheers ✌️
References:
[1] A great blog post explaining all the metrics in detail: https://sdsawtelle.github.io/blog/output/mean-average-precision-MAP-for-recommender-systems.html
[2] A research article explaining the novelty metric: http://www.jestr.org/downloads/Volume6Issue3/fulltext25632013.pdf
[3] A blog post on Alternating Least Squares: https://datasciencemadesimpler.wordpress.com/tag/alternating-least-squares/