Welcome back! ** This is the continuation ofPart 1 ** which covered most of the theory and gave a small scale example. In this article I will be using a much larger and real world data set. We will also look at ways to improve the performance, quality of the algorithm and demonstrate how to objectively test your results.

## Root Mean Squared Error (RMSE)

Possibly the most important aspect of creating your recommendation system is being able to objectively recognise if your calculated ratings are getting better or worse with each tweak to your algorithm.

One metric that is very common for this kind of data is the root-mean-square error, or RMSE. It sounds like a complicated thing, but actually it's just a fancy average.

To be able to *check* our results we need to test them against something that's actually real. We do this by splitting the data set of ratings (these are all real ratings from people) randomly into two parts, usually an 80/20% split. Then we treat the smaller set (the 20%) as the correct answers and try to use the other 80% to calculate what the remaining 20% are. We compare the real 20% results with our 20% results and get a nice simple clean number that tells us, on average, how far our guesses are off by.

Here is an example where we have calculated 4 ratings and comparing them to the real 4 results:

Correct value | Our guess | |
---|---|---|

1 |
4 | 3 |

2 |
5 | 2 |

3 |
3 | 3 |

4 |
1 | 2 |

The general formula for the RMSE is:

$$\sqrt{\frac{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots}{n}}$$

*x* is the correct values and *y* is our guesses; or the other way around, doesn't matter. *n* is the number of guesses. So here is the formula with the values from the table above (correct values in red, our guesses in blue):

$$\sqrt{\frac{(\color{red}4 - \color{blue}3)^2 + (\color{red}5 - \color{blue}2)^2 + (\color{red}3 - \color{blue}3)^2 + (\color{red}1 - \color{blue}2)^2}{4}} \approx 1.658$$

This means that on average the estimated ratings are 1.7 stars off. The idea is to try and get this number as low as possible. However, it's worth noting that very slight gains can have an enormous benefit to the whole system.

## More Data!

Now that we have a way to check our results we need some real data to play with. The lovely folks over at GroupLens provide some free datasets at varying sizes for just what we need.

I will be using the 100K MovieLens dataset . You will need to download this to use the Python example code. You can also find other sizes here .

Once downloaded you will see a lot of files with confusing names, the ones we really care about are `u1.base`

and `u1.test`

. This is the 100,000 ratings already split into 80%/20% for us.

## Some Things Need to Change Around Here

If we were to simply use the code in the previous article it would work but it would be extremely slow. There are some very trivial steps we can take that make a huge difference...

We lookup the same cosine similarities a lot, so caching the result makes an enourmous difference to speed.

On top of that we are dealing with a *lot* more movies and it's unlikely when comparing the tastes of a single user with thousands of other users that they will each share even just a small portion of there movie ratings - which would cause a lot of unessesary processing. We will be using sets (a unique list of items) to find the commonalities between users before calculating their cosine similarity to eliminate a lot of "calculating zeros".

## 1. Random

Before we apply our recommendation algorithm lets look at a worst case scenario. This is where we generate random ratings and look at the RMSE:

from math import sqrt import pprint import random def read_file(file_path): ratings = {} with open(file_path) as f: lines = f.readlines() for line in lines: userId, movieId, rating, _ = line.strip().split("\t") if userId not in ratings: ratings[userId] = {} ratings[userId][movieId] = int(rating) return ratings # So our "randoms" are always the same. random.seed(0) test_ratings = read_file('ml-100k/u1.test') total = 0 n = 0 print "\tItem ID\tRMSE" for test_user in test_ratings: for itemId in test_ratings[test_user]: guess = random.randint(1, 5) real = test_ratings[test_user][itemId] diff = real - guess total += diff * diff n += 1.0 rmse = sqrt(total / n) print '%d\t%s\t%.3f' % (n, itemId, rmse)

This will print out a running RMSE (per movie) where the final value is **1.902** . This means that if we were to guess ratings with no other information we would be around 1.9 stars off their real guesses. Obviously this is terrible, but interesting to know.

## 2. Nearest Neighbour

As stated in the previous article we can use the rating of the nearest neighbour. That is, the user with the most similar taste in movies:

class CosineDistancer: def __init__(self): self.cache = {} self.cache_hits = 0 self.cache_misses = 0 def combine_users(self, user1, user2): # Make sure we do not modify the original user data set. user1 = copy.copy(user1) user2 = copy.copy(user2) # Get a unique list of the all the keys shared between the users. all_keys = set(user1.keys()) | set(user2.keys()) # Fill in missing values so that they both have the same keys. for key in all_keys: if key not in user1: user1[key] = None if key not in user2: user2[key] = None # Sort by key and only get values that are shared by both users. u1 = [] u2 = [] for key in sorted(user1.iterkeys()): if user1[key] is not None and user2[key] is not None: u1.append(user1[key]) u2.append(user2[key]) return u1, u2 def cosine_distance(self, user1, user2): # Test the cache first. cache_key = '%s:%s' % (user1, user2) if cache_key in self.cache: self.cache_hits += 1 return self.cache[cache_key] else: self.cache_misses += 1 u1, u2 = self.combine_users(user1, user2) top = 0 user1bottom = 0 user2bottom = 0 for i in range(0, len(u1)): if u1[i] is not None and u2[i] is not None: top += u1[i] * u2[i] user1bottom += u1[i] * u1[i] user2bottom += u2[i] * u2[i] bottom = sqrt(user1bottom) * sqrt(user2bottom) if bottom == 0: self.cache[cache_key] = 0 else: self.cache[cache_key] = top / bottom return self.cache[cache_key] def get_users_with_item(ratings, itemId): users = [] for userId in ratings: if itemId in ratings[userId]: users.append(userId) return users def calculate_rating_nearest(distancer, users, userId, itemId): best_dist = 0.0 rating = None for related_user in get_users_with_item(users, itemId): if related_user == userId: continue dist = distancer.cosine_distance(users[userId], users[related_user]) if dist > best_dist: best_dist = dist rating = users[related_user][itemId] if rating is None: rating = 2.5 return rating ratings = read_file('ml-100k/u1.base') test_ratings = read_file('ml-100k/u1.test') distancer = CosineDistancer() total = 0 n = 0 print "\tItem ID\tRMSE" for test_user in test_ratings: for itemId in test_ratings[test_user]: guess = calculate_rating_nearest(distancer, ratings, str(test_user), str(itemId)) real = test_ratings[test_user][itemId] diff = real - guess total += diff * diff n += 1.0 rmse = sqrt(total / n) print '%d\t%s\t%.3f' % (n, itemId, rmse)

This takes significantly longer since we are doing some real work now. In reality you would probably not work out all missing ratings because it would be so unnecessary. But I'll leave that decision up to you.

The important thing here is that we have our first proper RSME baseline. The final result comes out at **1.405** .

## 3. Average Nearest Neighbours

Rather than just relying on the *nearest* user it is more significantly more effective to take the average of a fixed number of users that are most similar. It adds very little processing time for a large gain in accuracy.

def calculate_rating_avg(distancer, users, userId, itemId, max_n): dist = {} for related_user in get_users_with_item(users, itemId): if related_user == userId: continue dist[related_user] = distancer.cosine_distance(users[userId], users[related_user]) sorted_dist = sorted(dist.items(), key=operator.itemgetter(1), reverse=True) total = 0.0 n = 0 for i in range(0, max_n): try: total += users[sorted_dist[i][0]][itemId] except: break n += 1 if n == 0: return 2.5 return total / n

The final request is a much more respectable RMSE of **1.047** .

## The Netflix Prize

The Netflix prize was an open competition offering a grand prize of US$1 million to anyone that could provide a collaborative filtering algorithm that would beat their current Cinematch algorithm. The starting RMSE to beat was 0.9514.

In September 2009 the grand prize was awarded to team BellKor's Pragmatic Chaos for beating CineMatch by 10.06% with an RMSE of just 0.8567.

## Results

Summarising all the result from above:

Name | Run Time | RMSE |
---|---|---|

BellKor's Pragmatic Chaos | 0.856 | |

Netflix Cinematch | 0.951 | |

Avg Nearest Neighbours | 225 seconds | 1.047 |

Nearest Neighbour | 220 seconds | 1.405 |

Random | 0.1 seconds | 1.902 |

At first glance it looks like even the original Netflix algorithm is only marginally better than this kind of rudimentary algorithm, but actually even smaller differences become exponentially harder to achieve.

## What's Next?

Cosine similarity is not our only choice, there are others like Jaccard, BM25, TFIDF and others. The further you dig into this field the more infinite and complex it can become. I would suggest you checkout this great article on other distance metrics .

I wanted to focus on a simple enough example that produces decent result with a surprisingly small amount of work and understanding. This is the end, from me, for now. But I would like to explore other options in the future.

Thank you for reading. I'd really appreciate any and all feedback, please leave your comments below or consider subscribing. Happy coding.

You might also be interested in these articles...

Elliot Chance –

I'm a data nerd and TDD enthusiast living in Sydney, Australia. I love exploring new technologies and working on modern ways to solve age old problems.