July 13, 2016
The problem of personalization has become incredibly common in machine learning and many of its applications: social networks, news aggregators and search engines are constantly updating and tweaking their algorithms to give individual users a unique experience. In this post I will present two strategies for building personalized recommendations using Vowpal Wabbit’s fast machine learning capabilities.
Personalization engines suggest relevant content with the objective of maximizing a specific metric. For example, a news website might want to increase the number of clicks in a session; an ecommerce app, on the other hand, needs to identify visitors who are more likely to buy a product in order to target them with special offers. Our data comes from the RecSys Challenge 2015; you can access the complete dataset here.
Installing Vowpal Wabbit & Scikit Learn
Make sure that you have all the Prerequisite Software.
$ git clone https://github.com/JohnLangford/vowpal_wabbit.git
$ cd vowpal_wabbit
$ make
$ make install
Downloading the dataset
For most applications, collaborative filtering yields satisfactory results for item recommendations; there are however several issues that arise that might make it difficult to scale up a recommender system.
- The number of features can grow quite large, and given the usual sparsity of consumption datasets, collaborative filtering needs every single feature and datapoint available.
- For new data points, the whole model has to be re-trained.
We will use Vowpal Wabbit’s matrix factorization capabilities to build a recommender that is similar in spirit to collaborative filtering but that avoids the pitfalls we mentioned before. After installing Vowpal Wabbit, I recommend that you quickly check out its input format and main command line arguments.
We preprocessed the RecSys data to produce a file that is already compatible with Vowpal Wabbit’s input format. For ease of explanation we decided to use a pretty basic feature set, but if you are interested in finding a precise solution to this problem I suggest that you check out the winning solution for the RecSys 2015 challenge.
We will use the purchase information included in buys.vw to fit a VW model. Every data point in this file represents a purchase with a quantity, session id and item id:
[quantity] |s [session id] |i [item id]
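As a quick sanity check, a line in this format can be assembled with a small helper like the one below (the function name and example values are just for illustration, not part of the dataset):

```python
def to_vw_purchase(quantity, session_id, item_id):
    """Format a purchase as a VW data line with the s and i namespaces."""
    return f"{quantity} |s {session_id} |i {item_id}"

# a purchase of 2 units of a hypothetical item in a hypothetical session
print(to_vw_purchase(2, 420374, 214537888))  # 2 |s 420374 |i 214537888
```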
We will add the --rank K argument to use Vowpal Wabbit in matrix factorization mode, where K denotes the number of latent features. We also need to specify at least one pair of variable interactions between namespaces; in this case --interactions is represents interactions between namespace i and namespace s.
$ vw -d buys.vw --rank 20 --interactions is
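Conceptually, --rank K learns a K-dimensional latent vector for each session and each item, and the quadratic s × i interaction contributes the dot product of the two vectors (VW also learns linear and constant terms, which this sketch with made-up vectors ignores):

```python
import numpy as np

K = 20  # number of latent features, matching --rank 20
rng = np.random.default_rng(0)

# stand-ins for the learned latent vectors of one session and one item
session_vec = rng.normal(scale=0.1, size=K)
item_vec = rng.normal(scale=0.1, size=K)

# the quadratic interaction term is simply their dot product
interaction = float(session_vec @ item_vec)
print(interaction)
```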
The previous command fits a model with quadratic interactions for 20 latent features. Vowpal Wabbit does not output the matrix factorization weights by default, but we can use the gd_mf_weights script included in the library directory to dump all the information that we need:
$ ~/vowpal_wabbit/library/gd_mf_weights \
    -I buys.vw --vwparams '-d buys.vw --rank 20 --interactions is'
If you have trouble finding the path for gd_mf_weights:
$ find ~/ -name gd_mf_weights
The file i.quadratic should be among the files that gd_mf_weights writes out. It is a compressed representation of every item and can be used to find pairs of items that are similar to each other and recommend those to users given their past browsing and purchasing history.
I like to use scikit-learn’s kd-tree nearest neighbors implementation, but you can choose any other neighbor search algorithm:
from sklearn.neighbors import NearestNeighbors
import pandas as pd

items_quadratic = pd.read_csv("i.quadratic", sep="\t", header=None)
nbrs = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(items_quadratic)
distances, indices = nbrs.kneighbors(items_quadratic)
print(indices)
Vowpal Wabbit also includes its own recommendation script, recommend; in order to use it you need to specify the following parameters:
- --topk : the number of items to recommend
- -U : a list of the subset of all users for which you want to output a recommendation
- -I : a list of items from which to recommend
- -B : a list of user-item pairs that should not be recommended
We have included a list of all items in items.vw and an empty blacklist.vw file. Let’s say we want to recommend 5 items to session number 420471 amongst all possible items with no blacklisted pairs:
echo '|s 420471' | ~/vowpal_wabbit/library/recommend --topk 5 -U /dev/stdin -I items.vw -B blacklist.vw --vwparams '-d buys.vw --rank 20 --interactions is --quiet'
This outputs the following recommendations:
0.271379 |s 420471|i 3391236
0.271379 |s 420471|i 3915524
0.271506 |s 420471|i 3095836
0.279096 |s 420471|i 2531796
0.279096 |s 420471|i 5677524
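Each line is a score followed by the session/item pair in VW format. If you want to consume these recommendations programmatically, a small parser along these lines (assuming the exact output format shown above) does the job:

```python
def parse_recommendation(line):
    """Split a recommend output line into (score, session_id, item_id)."""
    score, rest = line.split(" |s ")
    session_id, item_id = rest.split("|i ")
    return float(score), session_id.strip(), item_id.strip()

print(parse_recommendation("0.271379 |s 420471|i 3391236"))
# (0.271379, '420471', '3391236')
```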
If you want to read more about the mathematics of matrix factorizations for item recommendation I suggest that you check out Matrix Factorization Techniques for Recommender Systems by Koren, Bell & Volinsky.
We can also use Vowpal Wabbit to predict whether a session will end up with a buy event. We extracted the following features for the RecSys dataset:
- Whether the session ended in a buy event or not: 2 (buy) or 1 (no buy)
- Importance weight
- Session duration in seconds
- Total number of clicks
- Id number of all items visited during that session
We included all the click information in labeled_clicks.vw with the following format:
[label] [label weight] |len [session duration] |cli [number of clicks] |it [item 1] [item 2] …
For instance, the first datapoint:
1 1 |len 352.029 |cli 4 |it 214577561 214536506 214536500 214536502
This represents a session that ended in no buys (label=1), had a total of 4 clicks, a duration of 352.029 seconds and included the items with IDs 214577561, 214536506, 214536500 and 214536502.
We decided to add importance weights to counteract the fact that around 95% of sessions end without an item being bought. This makes our training set highly unbalanced, and training without any weights would result in a highly skewed predictor. We assigned a weight of 10 to all sessions that ended with a buy; these weights are arbitrary, and I recommend that you experiment with different configurations to achieve optimal performance.
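A common heuristic for choosing such a weight, distinct from the hand-picked 10 used here, is the inverse frequency of the minority class; with roughly 95% negative sessions that would suggest a weight of about 19:

```python
positive_fraction = 0.05  # roughly 5% of sessions end in a buy
negative_fraction = 1.0 - positive_fraction

# weight that makes the total importance of both classes equal
balanced_weight = negative_fraction / positive_fraction
print(round(balanced_weight))  # 19
```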
Let’s fit the model with the out-of-the-box VW parameters:
vw -d labeled_clicks.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = labeled_clicks.vw
num sources = 1
average since example example current current current
loss last counter weight label predict features
1.000000 1.000000 1 1.0 1.0000 0.0000 7
0.900698 0.801397 2 2.0 1.0000 0.1048 8
0.774933 0.649168 4 4.0 1.0000 0.2232 5
0.522392 0.269851 8 8.0 1.0000 0.4725 4
1.658445 2.567287 9 18.0 2.0000 0.3977 12
1.355825 1.146318 17 44.0 2.0000 0.7707 5
0.935035 0.514245 43 88.0 1.0000 1.0372 5
0.632303 0.351878 111 183.0 2.0000 0.8911 7
0.476271 0.320239 231 366.0 1.0000 1.4677 6
0.443583 0.410895 462 732.0 1.0000 1.0966 5
0.402748 0.362024 890 1466.0 2.0000 1.3384 5
0.344143 0.285539 1870 2932.0 1.0000 0.9615 5
0.316211 0.288279 3704 5864.0 1.0000 1.2017 10
0.284272 0.252332 7318 11728.0 1.0000 1.0498 6
0.262397 0.240522 14870 23456.0 1.0000 1.1891 5
0.249489 0.236585 29520 46917.0 2.0000 1.3768 4
0.240956 0.232423 58779 93843.0 2.0000 1.1645 8
0.232603 0.224250 118013 187691.0 2.0000 1.3749 6
0.224553 0.216503 235639 375382.0 2.0000 1.7727 9
0.220508 0.216464 472363 750769.0 2.0000 1.3269 5
0.218877 0.217246 953501 1501538.0 2.0000 1.4717 6
0.219469 0.220061 1952128 3003076.0 1.0000 1.6532 5
0.222089 0.224710 3959696 6006161.0 2.0000 1.5566 4
0.215883 0.209676 8018878 12012322.0 1.0000 0.9819 5
number of examples per pass = 9249681
passes used = 1
weighted example sum = 13836918.000000
weighted label sum = 18933848.000000
average loss = 0.214053
best constant = 1.368357
best constant’s loss = 0.232670
total feature number = 54364504
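As a sanity check on the log above, VW’s “best constant” is simply the weighted mean label, which we can recompute from the reported weighted sums:

```python
weighted_label_sum = 18933848.0    # from the training summary
weighted_example_sum = 13836918.0  # from the training summary

# the constant predictor minimizing weighted squared loss is the weighted mean
best_constant = weighted_label_sum / weighted_example_sum
print(round(best_constant, 6))  # matches the reported 1.368357
```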
As we mentioned previously, you can always save the current state of your Vowpal Wabbit model using the -f [file] command line argument and retrain your model with more recent data without having to go through the whole dataset again:
To save the model:
vw -d labeled_clicks.vw -f serialized.model
Load it back and update it with a datapoint:
echo '2 10 |len 35.029 |cli 4 |it 214577561 214536506 214536500 2145365027' | vw -d /dev/stdin -i serialized.model -f updated.model
If you want to output predictions you need to pass the -t flag or input an unlabeled sample:
echo '|len 1.029 |cli 1 |it 214759961' | vw -d /dev/stdin -i serialized.model -p /dev/stdout --quiet
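The raw prediction is a real number between the two labels. One simple (hypothetical, not from the original post) way to turn it into a buy/no-buy decision is to threshold at the midpoint of the labels, 1.5:

```python
def classify_session(prediction, threshold=1.5):
    """Map a raw VW regression score to a buy (2) / no-buy (1) label."""
    return 2 if prediction >= threshold else 1

print(classify_session(1.4717))  # 1, i.e. predicted no-buy
print(classify_session(1.7727))  # 2, i.e. predicted buy
```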