PredictionIO is the first really interesting open source attempt at a packaged solution for using the many prediction algorithms that run on top of Hadoop or other processing platforms. The main selling point is that it hides the complexity of integrating directly with algorithm libraries such as Apache Mahout. It does this through a quite intuitive JSON API. The solution also includes a user-friendly administration interface for tuning and training the algorithms, so data scientists can adjust and tune without writing a single line of code. This has obviously resonated well with Silicon Valley investors, and the project recently raised 2.5 million dollars in seed funding.
So what is all the fuss about? Is it really as promising as it sounds? What kind of workflow can you expect to meet as a data scientist or developer? These are a few of the questions I will try to answer in this blog post about PredictionIO.
I am aware that PredictionIO is still at an early stage. It is currently at version 0.7.X, but the dollars received from the risk takers in the valley are probably going to take it to 1.0 pretty fast. It would not surprise me if the tool changes radically before launch. In this post we will just give a brief overview and look at an example of how to create a typical feedback model and interact with PredictionIO.
Currently PredictionIO supports a limited set of functions: ranking, recommendation and similarity. All of these are user centric and typically used in personalization scenarios. There are tons of use cases in ecommerce, social networking and online advertising. I personally love personalized music and book recommendations. If you feed them enough data, they will usually give you something back. However, these engines are created by cutting-edge technology firms, usually situated somewhere in the valley. Traditional brick-and-mortar stores and financial institutions have not embraced this kind of technology widely yet. I have yet to see a personal recommendation engine that suggests a credit card suited to my travelling habits, or insurance based on behavioural data. However, I believe that as the tools for creating such solutions become more commoditized, more and more such offerings will emerge everywhere.
Some more and less familiar personalization engines.
Technical overview and installation
PredictionIO consists of three components:
- An administration interface written as a single-page application (SPA) with the Backbone framework and a Scala backend.
- A job scheduler written in Scala with the Play framework, using Quartz for the scheduling logic.
- An HTTP JSON API for uploading data and requesting predictions. This is written in Scala with the Play framework.
The administration dashboard
In addition it requires you to install these applications:
- A Hadoop 1.X cluster for processing data
- A MongoDB instance for storing data permanently
- GraphChi - "Disk-based large-scale graph computation"
As you can see, it uses a quite modern but conventional technology stack. Nothing exciting or overly bold here. Hadoop needs to be a 1.X version, because the Mahout library does not support 2.0. My guess is that Hadoop will be phased out, since Mahout now has Spark bindings as well.
Installing PredictionIO is pretty straightforward. There is a wide range of official and unofficial installers available. If you have a company credit card, run it on AWS. That is by far the most elegant installation if you are starting from scratch. The less fortunate need to select one of the other options. The unofficial Homebrew installation did it for me!
There are a few concepts in PredictionIO that you need to learn.
- Prediction algorithms
An Item is the main entity in your PredictionIO app. Examples of Items are an article in a news recommendation engine, a recipe in a chef's aid application or a tweet in a tweet recommendation engine. A unique id string is used to represent an Item, and you can annotate it with free-form attributes, a timestamp and a type. As the documentation states, these attributes are not used to aid the predictions; currently the predictions rely 100% on Collaborative Filtering (CF). However, you can query results by type and use the timestamp to take advantage of the freshness functionality in PredictionIO.
Actions are the responses each user delivers to aid the prediction. There are five different actions: conversion, like, dislike, view and rate. Each of them can be sent as an event to the PredictionIO API, typically when an item is viewed, bought, liked or disliked in a review. A rate action scores the item on a scale from 1 to 5, which will probably be the preferred approach in an IMDB-style recommendation engine. From here on, this set of actions is referred to as the feedback model.
A User is the subject entity in your app. Each user needs a unique id string in your recommendation system. A User is connected to Items through a one-to-many relationship with Actions.
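To make the relationship between Users, Items and Actions concrete, here is a minimal in-memory sketch. It is not the PredictionIO SDK: the class names `ActionType`, `Action` and `ActionLog` are invented for illustration, and users and items are represented only by their id strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the concepts described above; not part of PredictionIO.
enum ActionType { VIEW, LIKE, DISLIKE, CONVERSION, RATE }

// One user-to-item action. A rating is only allowed for RATE actions.
final class Action {
    final String userId;
    final String itemId;
    final ActionType type;
    final Integer rating; // 1..5 for RATE, null otherwise

    Action(String userId, String itemId, ActionType type, Integer rating) {
        if (type == ActionType.RATE) {
            if (rating == null || rating < 1 || rating > 5) {
                throw new IllegalArgumentException("rate needs a rating between 1 and 5");
            }
        } else if (rating != null) {
            throw new IllegalArgumentException("only rate actions carry a rating");
        }
        this.userId = userId;
        this.itemId = itemId;
        this.type = type;
        this.rating = rating;
    }
}

// A user has many actions, and each action points at exactly one item.
final class ActionLog {
    private final List<Action> actions = new ArrayList<>();

    void record(Action action) {
        actions.add(action);
    }

    List<Action> byUser(String userId) {
        List<Action> result = new ArrayList<>();
        for (Action a : actions) {
            if (a.userId.equals(userId)) {
                result.add(a);
            }
        }
        return result;
    }
}
```

In a real setup the log would live in PredictionIO's own storage; the sketch only shows the shape of the data that the feedback model produces.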
The algorithm is basically the selection procedure used for ranking, for recommendations and for finding similar items. There are quite a few algorithms available from the GraphChi and Mahout libraries. You need to choose one based on your needs and application requirements. This is usually where the learning curve gets steep; you basically need to test again and again. Most of the algorithms are described as either user based or item based. In short, user-based algorithms pick the highest-ranking items from a selection of similar users. In contrast, item-based algorithms pick items that are similar to the ones the user has already responded to positively. In our example case we had a quite diverse user base, with multiple interests and preferences. It therefore made sense to start with a user-based algorithm, but after examining the results we ended up with the opposite approach.
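The item-based idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Mahout's implementation: it treats feedback as binary likes, measures item similarity as the cosine between the sets of users who liked each item, and scores a candidate item by summing its similarity to the items the user already liked. The class `ItemBasedSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy item-based collaborative filtering over binary "like" data.
final class ItemBasedSketch {

    static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // Turn user -> liked items into item -> users who liked it.
    static Map<String, Set<String>> invert(Map<String, Set<String>> likesByUser) {
        Map<String, Set<String>> usersByItem = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : likesByUser.entrySet()) {
            for (String item : e.getValue()) {
                usersByItem.computeIfAbsent(item, k -> new HashSet<>()).add(e.getKey());
            }
        }
        return usersByItem;
    }

    // Cosine similarity between two binary vectors represented as sets.
    static double cosine(Set<String> a, Set<String> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        int common = 0;
        for (String x : a) {
            if (b.contains(x)) {
                common++;
            }
        }
        return common / Math.sqrt((double) a.size() * b.size());
    }

    // Rank unseen items by their summed similarity to the user's liked items.
    static List<String> recommend(Map<String, Set<String>> likesByUser, String user, int n) {
        Map<String, Set<String>> usersByItem = invert(likesByUser);
        Set<String> seen = likesByUser.getOrDefault(user, Collections.<String>emptySet());
        final Map<String, Double> scores = new HashMap<>();
        for (String candidate : usersByItem.keySet()) {
            if (seen.contains(candidate)) {
                continue;
            }
            double score = 0.0;
            for (String liked : seen) {
                score += cosine(usersByItem.get(candidate), usersByItem.get(liked));
            }
            if (score > 0) {
                scores.put(candidate, score);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties broken by item id to keep the order stable.
        ranked.sort((x, y) -> {
            int c = Double.compare(scores.get(y), scores.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }
}
```

Note that the recommendation only needs the target user's own likes plus item-to-item similarities, so it can still work when no single other user looks much like the target user.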
It is also worth mentioning the extremely useful auto tuning of parameters that can be applied before you train your algorithm. This takes a lot of the guesswork out of configuring algorithms.
An engine is a container for algorithms and data. You have three different engines to choose from: ranking, recommendation and similar items. The most important functionality in the engine is the ability to adjust the freshness and the exploration of your predictions. You can also adjust which objective your predictions should maximize. If you want to predict what users are most likely to buy, you should maximize conversion. However, if you have few conversion actions registered, you may change this to likes or rating >= 3.
Screenshot from the engine settings dashboard
Example: recommending tweets
To illustrate, we have created an example that recommends tweets based on the timelines of the users I follow. You may ask why we selected such a case instead of using one of the test sets found via the PredictionIO site. The answer is quite simply that we wanted an example where we could create a complete feedback model from scratch and avoid any biases that may be present in the example datasets. In addition, our aim was not to make a complete and perfect recommendation engine, but rather to play around with the API and test different workflows. In other words, learning by trial and error.
- Check out the code: github
- Installed Maven
- Some Java skills
- Java 8
- An IDE or code editor
- Installed and working setup of PredictionIO
First we needed to download all the tweets from my friends and store them in CSV format. We used the spring-social-twitter extension to download the last 200 tweets of each timeline and staged them in a CSV file. We had to do it this way to avoid running into the strict API rules enforced by Twitter. We tried to collect as many metadata fields as possible to be able to experiment with different feedback models. The comma-separated header can be seen below. The most important field missing is the tweet text itself. PredictionIO does not support any NLP methods for extracting knowledge from text, so we did not bother to stage it. However, we wanted to use the hashtags to categorize (as "itype") the items we add to the engine. These are therefore added as optional last columns.
The dataset contained around 100 000 tweets from the 550 users I follow. These tweets ranged from very recent to a couple of years old, so they are quite unevenly distributed in time. Fortunately this is not a huge problem, since PredictionIO helps us avoid recommending old tweets through its freshness setting.
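How PredictionIO computes freshness internally is not covered here. Purely to illustrate the idea, a freshness factor can be modelled as an exponential decay that halves an item's score for every fixed number of days of age. The class `FreshnessSketch`, the method `freshScore` and the half-life parameter are assumptions made for this sketch, not PredictionIO settings.

```java
// Hypothetical freshness weighting; PredictionIO's real formula may differ.
final class FreshnessSketch {

    // Halve the base score for every halfLifeDays of item age.
    static double freshScore(double baseScore, double ageDays, double halfLifeDays) {
        if (ageDays < 0 || halfLifeDays <= 0) {
            throw new IllegalArgumentException("age must be >= 0 and half-life must be > 0");
        }
        return baseScore * Math.pow(0.5, ageDays / halfLifeDays);
    }
}
```

With a half-life of 30 days, a two-year-old tweet with a strong base score ends up far below a day-old tweet with a mediocre one, which is the behaviour we want for an unevenly aged dataset like this.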
The feedback model
The most challenging part is to create a functional feedback model that will not confuse the algorithm. The most sensible approach is to just write a use case diagram and stick to variations of it. Do not fall into the temptation of creating a fancy rating scheme based on properties that merely seem sensible. We first created a weighted feedback model that derived a rating from whether the tweet had a tag or not, plus a few complicated calculations. A complete waste of time. Instead we assumed that tweets have four associated actions: the timeline owner could view, favourite, retweet or dislike a tweet. However, we could not retrieve either view or dislike data from the Twitter API, so we had to assume that the timeline owner views all tweets, and that a tweet that is neither favourited nor retweeted is disliked. The feedback model we settled on is then: dislike, like or conversion.
Java code for submitting an action:
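    // Map the tweet's flags to one of the feedback actions
    String action = selectAction(isRetweeted, isFavorited);

    // Build the user-to-item action request and set its timestamp
    UserActionItemRequestBuilder userActionItemRequestBuilder =
            client.getUserActionItemRequestBuilder(timelineOwner, action, item)
                  .t(dateTime);

    // Send the action to the PredictionIO API
    client.userActionItem(userActionItemRequestBuilder);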
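The `selectAction` helper is not shown in the snippet above. A minimal sketch of what it could look like follows, given the feedback model of dislike, like or conversion. The mapping is an assumption: a retweet is treated as a conversion, a favourite as a like, and a retweet wins when both flags are set. The class name `FeedbackModel` and the action name strings are also assumptions and should be checked against the PredictionIO API documentation.

```java
// Hypothetical helper; the mapping and the action names are assumptions.
final class FeedbackModel {
    static final String CONVERSION = "conversion";
    static final String LIKE = "like";
    static final String DISLIKE = "dislike";

    // Retweet -> conversion, favourite -> like, anything else -> dislike.
    static String selectAction(boolean isRetweeted, boolean isFavorited) {
        if (isRetweeted) {
            return CONVERSION;
        }
        if (isFavorited) {
            return LIKE;
        }
        // We assume the timeline owner saw the tweet and chose not to react.
        return DISLIKE;
    }
}
```

Keeping the mapping in one small function makes it cheap to experiment with variations of the feedback model, for example treating favourites and retweets as equally strong signals.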
Selecting an engine and algorithm
We wanted to create a recommendation engine based on the tweets of my friends, so choosing the PredictionIO Item Recommendation engine was rather easy. However, when selecting the algorithm to calculate the predictions we immediately faced a wide array of different algorithms to choose between, which made it quite difficult. First we went for a safe bet: Mahout's kNN User Based Collaborative Filtering. A quick refresher from the university textbooks was necessary, but this is a reasonably simple algorithm to understand, and a quick wiki lookup will help you a lot. It basically tries to find the closest neighbouring tweets based on similar users' actions (there is a little more to it than that, but it is beyond the scope of this post). These characteristics sounded like what we were looking for in our engine, so we deployed it and started to train it.
We started the training from the user interface and nothing happened. The user interface gave no feedback at all, and a quick tail of the scheduler logs showed that everything was running fine. After a quick search in the PredictionIO group I found that other people had had the same issue, and that it was probably caused by too little correlating action data. The kNN algorithm had no chance of finding any similar users, since they were way too independent. We tested a few different settings on the user-based kNN with no luck and concluded that this was probably not the right path.
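A toy version of user-based kNN shows why sparse, non-overlapping data leaves the algorithm with nothing to work with. This is a sketch under stated assumptions and not Mahout's implementation: feedback is reduced to binary likes, user similarity is the Jaccard index of the liked item sets, and the class `UserBasedSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy user-based kNN over binary "like" data.
final class UserBasedSketch {

    static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // Jaccard index: shared items divided by all items either user liked.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Up to k other users with a similarity above zero, most similar first.
    static List<String> neighbours(Map<String, Set<String>> likesByUser, String user, int k) {
        Set<String> mine = likesByUser.getOrDefault(user, Collections.<String>emptySet());
        final Map<String, Double> similarities = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : likesByUser.entrySet()) {
            if (e.getKey().equals(user)) {
                continue;
            }
            double s = jaccard(mine, e.getValue());
            if (s > 0) {
                similarities.put(e.getKey(), s);
            }
        }
        List<String> ranked = new ArrayList<>(similarities.keySet());
        ranked.sort((x, y) -> {
            int c = Double.compare(similarities.get(y), similarities.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }

    // Items liked by the neighbours that the user has not liked yet.
    static Set<String> recommend(Map<String, Set<String>> likesByUser, String user, int k) {
        Set<String> mine = likesByUser.getOrDefault(user, Collections.<String>emptySet());
        Set<String> result = new TreeSet<>();
        for (String neighbour : neighbours(likesByUser, user, k)) {
            result.addAll(likesByUser.get(neighbour));
        }
        result.removeAll(mine);
        return result;
    }
}
```

When every timeline owner has reacted to a disjoint set of tweets, every pairwise similarity is zero, the neighbour list is empty and so is the recommendation. That matches the "too little correlating action data" explanation: the algorithm runs fine, it just has no neighbours to draw from.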
The next step was to test Mahout's kNN Item Based Collaborative Filtering. We used the default settings and tested. This went much better, and we received tweet recommendations that were relatively closely related to what I would normally retweet. One tweet was about Norwegian beer prices and one was about some quite cutting-edge web frameworks. By testing it against a random set of recommended tweets we confirmed that it was not just random luck.
A tweet recommended by the engine
There are still large limitations to our recommendation engine. We still need to continuously feed it with new tweets, update actions and work with the freshness and exploration settings of the engine. Fortunately this is rather simple to set up in PredictionIO. It would also be really cool to add some geolocations and play with that on a larger scale. However, we would then need to up the budget a bit and buy data from either GNIP or DataSift.
Since the PredictionIO project is still at a very early stage, there are a few bugs here and there. This is totally expected and I am not going to rant about them. However, I hope they focus on solving a few important problems, first of all better visualization and operator feedback in the admin interface. One of the selling points is hiding a lot of the complexity of operating solutions like Mahout and GraphChi, but that is still a bit difficult to live up to when you need to tail logs and dive into CSV files to understand that something is wrong. Some data visualization for the “your algorithm does not work” situations would also be nice. Going forward I also hope to see some support for NLP algorithms, which implies adding full-text support to the data APIs.