A recommendation system is used to, well... recommend stuff. Netflix is a great example of this; after you rate several titles it can recommend new titles to you. Have you ever wondered how this really works? Or how you could apply it to your own site or application? Well... you're about to find out.
I'll be working through a complete, simple solution and provide code examples in Python. I will be using movie ratings (1-5 stars) as the data set, but this same method should work for percentages, or anything else that can be numerically represented.
Small Scale
Before we can ramp it up into thousands of ratings and get some meaningful results we need to understand how it works on a smaller scale.
I will be using some sample data provided by this book. HP represents the trilogy of Harry Potter movies, TW is for Twilight and SW is for the first three Star Wars movies.
| | HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 |
|---|---|---|---|---|---|---|---|
| A | 4 | | | 5 | 1 | | |
| B | 5 | 5 | 4 | | | | |
| C | | | | 2 | 4 | 5 | |
| D | | 3 | | | | | 3 |
A to D are four users that have provided movie ratings. I'm not sure why D watched the second Harry Potter movie before watching the first, but let's try to estimate how much he will like the first movie with the data provided. If you used this same process to fill in all of the missing values you could then make recommendations - but we will get into this later.
The simplest way to work out this missing rating (D, HP1) is to find the person with the most similar taste who did see HP1 and use their rating. This isn't very accurate on a system with only 4 users, but the more users and ratings available, the more accurate it becomes.
So how do we determine who has the most similar taste to D? There is a formula called the cosine similarity that takes the ratings of two users and outputs a number between 0.0 and 1.0, where 1.0 means they gave exactly the same ratings to all of their movies. (You will also see the term cosine distance, which is simply 1 minus the similarity.) So we're looking for the user with the greatest cosine similarity.
Cosine Distance/Similarity
The general formula is:
$$\frac{(a_1 \times b_1) + (a_2 \times b_2) + \ldots}{\sqrt{(a_1)^2 + (a_2)^2 + \ldots}\sqrt{(b_1)^2 + (b_2)^2 + \ldots}}$$
Where a and b represent the users. Let's calculate the cosine similarity of A (red) and B (blue):
$$\frac{(\color{red}4 \times \color{blue}5) + (\color{red}0 \times \color{blue}5) + (\color{red}0 \times \color{blue}4) + (\color{red}5 \times \color{blue}0) + (\color{red}1 \times \color{blue}0) + (\color{red}0 \times \color{blue}0) + (\color{red}0 \times \color{blue}0)}{\sqrt{\color{red}4^2 + \color{red}0^2 + \color{red}0^2 + \color{red}5^2 + \color{red}1^2 + \color{red}0^2 + \color{red}0^2}\sqrt{\color{blue}5^2 + \color{blue}5^2 + \color{blue}4^2 + \color{blue}0^2 + \color{blue}0^2 + \color{blue}0^2 + \color{blue}0^2 }}$$
The zero values all cancel out since they have no effect on the result, so a simpler form is:
$$\frac{(\color{red}4 \times \color{blue}5)}{\sqrt{\color{red}4^2 + \color{red}5^2 + \color{red}1^2}\sqrt{\color{blue}5^2 + \color{blue}5^2 + \color{blue}4^2}}$$
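This worked example stops short of a number, so as a quick sanity check (the final value below is my own evaluation, not from the table), we can type the simplified form straight into Python:

```python
from math import sqrt

# Numerator: HP1 is the only movie rated by both A and B.
top = 4 * 5
# Denominators only use each user's own non-zero ratings.
bottom_a = sqrt(4 ** 2 + 5 ** 2 + 1 ** 2)
bottom_b = sqrt(5 ** 2 + 5 ** 2 + 4 ** 2)

print(round(top / (bottom_a * bottom_b), 3))  # 0.38
```

So A and B come out at roughly 0.38 - not especially similar, which matches the table: they only overlap on one movie.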
We can only compare D with two other users, A and B, because C does not have a rating for HP1 so their rating would not be helpful. Using the cosine similarity we can derive:
- A and D = 0.0
- B and D = 0.435
We don't have a lot of choice, but B is the clear winner with the greatest cosine similarity. Here is the code to produce the answer:
```python
from math import sqrt

_ = 0  # A missing value.

users = {
    'A': [4, _, _, 5, 1, _, _],
    'B': [5, 5, 4, _, _, _, _],
    'C': [_, _, _, 2, 4, 5, _],
    'D': [_, 3, _, _, _, _, 3],
}

def cosine_distance(user1, user2):
    top = 0
    user1bottom = 0
    user2bottom = 0
    for i in range(len(user1)):
        top += user1[i] * user2[i]
        user1bottom += user1[i] * user1[i]
        user2bottom += user2[i] * user2[i]
    return top / (sqrt(user1bottom) * sqrt(user2bottom))

print(cosine_distance(users['A'], users['D']))
print(cosine_distance(users['B'], users['D']))
```
An important note here is that missing values are treated as `0`. In reality this would mean that the people who have not rated the movie really dislike it. It is often better to substitute a neutral value for the missing values while calculating; in this case `2.5` might be a good choice. After changing the value of `_` to `2.5` and rerunning it we get:
- A and D = 0.914
- B and D = 0.952
From this point we can look up what score the best match (B) gave for the movie - in this case, 5 stars - and say that D should really like it, estimating that they would rate it 5 stars.
Here is a more general function:
```python
def estimate_rating(users, user_name, movie):
    user_best_match = None
    user_best_match_dist = 0
    for user in users:
        # We don't want to compare a user against themselves.
        if user == user_name:
            continue
        # Ignore users that haven't rated the movie.
        if users[user][movie] == 0:
            continue
        dist = cosine_distance(users[user_name], users[user])
        print('%s -> %s = %.3f' % (user_name, user, dist))
        if dist >= user_best_match_dist:
            user_best_match_dist = dist
            user_best_match = user
    # Return the rating of the best matched user.
    return users[user_best_match][movie]

HP1, HP2, HP3, TW, SW1, SW2, SW3 = range(7)

print(estimate_rating(users, 'D', HP1))
```
To get more accurate results (if we had more data) we would take an average of the ratings from a fixed number of the most similar users, rather than just the single most similar one.
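A minimal sketch of that idea might look like the following (the function name `estimate_rating_knn` and the parameter `k` are my own, not from the code above; it reuses the same sample data and goes back to treating missing values as 0):

```python
from math import sqrt

_ = 0  # A missing value.

users = {
    'A': [4, _, _, 5, 1, _, _],
    'B': [5, 5, 4, _, _, _, _],
    'C': [_, _, _, 2, 4, 5, _],
    'D': [_, 3, _, _, _, _, 3],
}

def cosine_distance(user1, user2):
    top = sum(a * b for a, b in zip(user1, user2))
    return top / (sqrt(sum(a * a for a in user1)) *
                  sqrt(sum(b * b for b in user2)))

def estimate_rating_knn(users, user_name, movie, k=2):
    # Rank every other user that has rated the movie by similarity.
    scored = sorted(
        (cosine_distance(users[user_name], users[other]), other)
        for other in users
        if other != user_name and users[other][movie] != 0
    )
    # Average the ratings of the k most similar users.
    nearest = scored[-k:]
    return sum(users[other][movie] for dist, other in nearest) / len(nearest)

HP1, HP2, HP3, TW, SW1, SW2, SW3 = range(7)

print(estimate_rating_knn(users, 'D', HP1))  # (4 + 5) / 2 = 4.5
```

With only two candidates (A and B) the average covers everyone, but on a real data set `k` would be tuned against held-out ratings.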
Stay Tuned!
In part 2 we will be looking at a much larger, real-world data set, with tips on performance, improving the quality of the recommendations, and objectively measuring how accurate our algorithm is as we make changes to it.
Thank you for reading. I'd really appreciate any and all feedback, please leave your comments below or consider subscribing. Happy coding.
Elliot Chance –
I'm a data nerd and TDD enthusiast living in Sydney, Australia. I love exploring new technologies and working on modern ways to solve age old problems.