In this post we will show how to tune an MLlib collaborative filtering pipeline using Bayesian optimization viaSigOpt. Code examples from this post can be found on our github repo .
A popular approach when building the basis of a recommendation system is to learn a model that can predict user preferences or product ratings. With an effective predictive model, and enough contextual information about users, online systems can better suggest content or products, keeping customers better engaged and leading to more sales, subscribers or conversions.
A common recommender systems model involves using a low-rank factorization of a user-product ratings matrix to predict the ratings of other products for each user . In general, algorithms related to collaborative filtering and recommendation systems will have tunable parameters similar to ones we havediscussed inprevious posts. In this problem, for example, the regularization term on the user and product factors can be difficult to choose a priori without some trial and error. Using the Bayesian optimization service provided bySigOpt, machine learning engineers and data scientists can quickly tune algorithms and systems to get the most out of models in production.
Show Me The Code
The largest MovieLens dataset ratings matrix has approximately 22 million user ratings for 33,000 movies by 240,000 users. To run this example, we recommend creating a small spark cluster in EC2 using the spark-ec2 tool provided by the spark library. We ran this experiment using a 3 machine cluster (1 master, 2 workers) in AWS using the m1.large instance for all nodes. Once the cluster is setup, you can run the provided setup script on the master instance, build the example project, and submit the job to your master spark instance.
Alternating Least Squares
To solve for the latent user and movie factors, MLlib implements a variant of what is known as quadratically regularized PCA . Intuitively, this optimization problem aims to learn latent factors $X,Y$ that best recreate the ratings matrix $A$, with a regularization penalty coefficient $\lambda$ on the learned factors.
This minimization problem can be solved using a technique known as alternating least squares . A distinct advantage of using this formulation is that it can be easily parallelized into many independent least square problems as outlined in the pseudocode below. Each factor matrix $X,Y$ is randomly initialized and the algorithm alternates between solving for the user factors $X$, holding the movie factors $Y$ constant, then solving for the $Y$ factors, holding $X$ constant. The algorithm takes as input $A$ the ratings matrix, $\lambda$ the regularization term, $k$ the desired rank of the factorization, and $T$ the number of iterations of each alternating step in the minimization. We expose $\lambda$, $k$ and $T$ as tunable parameters to SigOpt.
The regularization term $\lambda$ is particularly difficult to select optimally as it can drastically change performance of the algorithm. Previous work has attempted to use a Bayesian formulation of this problem to avoid optimizing for this regularization term explicitly .
As an error metric for this example, we used the standard measurement of the root mean square error  of the reconstructions on a random subset of nonzero entries from the ratings matrix.
Defining an appropriate error measurement for a recommendation task is critical for achieving success. Many other metrics have been proposed for evaluating recommendation systems and careful selection is required to tune for models that are best for the application at hand. Bayesian optimization methods like SigOpt can be used to tune any underlying metric, or a composite metric of many metrics (like accuracy and training time).
In this example the training, validation and holdout rating entries are randomly sampled non-zero entries from the full ratings matrix $A$ as summarized in the diagram below:
SigOpt tunes the alternating least square algorithm parameters with respect to the root mean squared error of the validation set. We also report the performance on the hold out set as a measure of how well the algorithm generalizes to data it has not seen. We compare parameters tuned using SigOpt against leaving the alternating least square parameters untuned. While the ratings entries for the train, valid and test sets were randomly sampled, they were identical sets for both SigOpt and the untuned comparison.
|SigOpt|| No Tuning
(Default MLlib ALS)
|Hold out RMSE||0.7864 (-40.7%)||1.3263|
Collaborative filtering and recommendation system techniques have the potential to add new levels of personalization and drive user engagement and conversions in online systems. However, it can often be an unintuitive process to tune the parameters governing the machine learning or optimization algorithms involved in building a production recommendation system. SigOpt offers asuccinct API with clients in Python, Java, and R to quickly give machine learning engineers and data scientists access to our hosted, proven Bayesian optimization service.
More from SigOpt
- SigOpt’s 90 Second Demo demonstrates our optimization engine with no sign-up required.
- Using Model Tuning to Beat Vegas (sklearn)
- Automatically Tuning Text Classifiers (sklearn)
- Unsupervised Learning with Bayesian Optimization (xgboost)
- Interested in helping push the state of optimization forward? SigOpt is hiring !
 Madeleine Udell, Corinne Horn, Reza Zadeh and Stephen Boyd. Generalized Low-Rank Models . Foundations and Trends in Machine Learning. 2016 [ PDF ]
 AMPLab UC Berkeley. Movie Recommendation with MLlib . AMP Camp Hands On Exercises. 2014 [ LINK ]
 Yehuda Koren. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Proceedings of the 14th ACM SIGKDD . 2008 [ PDF ]
 Ruslan Salakhutdinov and Andriy Mnih. Probabilistic Matrix Factorization . Advances in Neural Information Processing Systems. 2008 [ PDF ]