Realtime funnel analysis using Solr and Cassandra

Datetime:2016-08-23 00:41:05          Topic: Solr  Cassandra           Share

One of the big challenges in providing a good funnel analysis solution is being able to provide your users with data while not complicating the implementation process.

Because of our video session recording capabilities, a lot of our users use Jaco to learn how their users behave within their product and optimize their funnels.

In order to do that, they are always looking for new ways to investigate flows and funnel in their web apps.

A funnel is defined as a sequence of events that a user went through in your application.

For example for a funnel in e-commerce apps is a sequence of events that define the checkout flow of the app.

Let’s take a hypothetical Amazon funnel for example:

  1. A user gets an email from Amazon with book recommendations.
  2. Clicks on a book
  3. Proceeds to the product page
  4. Clicks on “Buy now”
  5. Gets to checkout page
  6. Pays
  7. Exits Amazon

As a product owner, knowing your funnels and how they’re performing is a great way to get the bottom line of the state of your features. Problem is, that when one of your features isn’t performing well, traditional analytics tools don’t provide a way to find out why a feature doesn’t perform optimally. You can see where the user exited, but numbers don’t help you understand why he did so.

The problem with funnel measurement and optimization

To get meaningful insights about your funnel’s performance you have to collect as much data as possible. The problem is that as you research your data, you will often find missing data points that you didn’t track.

Why? Because you can’t track everything.

It will usually resolve with adding more tracking events tag to capture the next user who goes through the funnel. But it will waste you time and money. Also, there’s always “one more thing” you can track.

We knew this was an issue when optimizing funnels.

The problem is that as you research your data, you will often find missing data points that you didn’t track.

In Jaco we wanted to take it a step further by allowing our customers to slice & dice their sessions, using events they didn’t need to pre-define (Clicks, Input changes, URLs, Errors) and be able to dynamically play with the funnel definition.

We wanted the funnel capability experience to be in unison with our already existing querying mechanism, so all you’ll need to do to test it out is flip a switch.

Before you continue reading the post – this is not a blog post about CRO.It’s a blog post about Jaco’s technique that allows our users to dissect and understand their funnel and flows.

This is how we track funnels in Jaco (Tech talk)

We use Cassandra (aka C*) to store and keep track of all our customers’ sessions and the events they blast at us, peaking at 10k events per second. We chose C* as our database, as it fits well with our write-heavy workload. Unfortunately, C* query engine compromises query power in favor of performance, that’s why most companies combine C* with Solr/ElasticSearch. Since we already use DataStax Enterprise, it was natural for us to use it’s tight Solr integration. To view our entire tech stack, you can visit our Stackshare page.

Funnel analysis is not an easy problem to solve, especially at scale. On top of that, we wanted to be able to run any funnel query and get results back instantly.

In the wild, there aren’t many resources on how other companies approach funnels technically. The best one we’ve found is a great post by Heap , that shows a very expressive way in PostgreSQL to calculate funnels. We’ve found a very cool way to do it with Solr, and we wanted to share it.

Funnel analysis is not an easy problem to solve, especially at scale

Our data model before implementing funnels was two Solr cores:

Sessions Core– This is where we store all the metadata of a session for all of our customers’ sessions.

Timeline Events Core– Stores for each session all of the events that are queryable (URL, Clicks, Input changes, Errors).

Using the Solr Join feature, this allowed us to run queries like:

“Give me all the sessions where a user clicked on ‘Sign up'”

“Give me all the sessions where a user had an exception”

For funnel analysis, it’s not feasible to use this data model for getting back a summary of the funnel steps and the sessions matching it, since there’s no option in Solr to run a recursive query, which would allow to go over each session and check if it’s a match for the funnel.

After some research, we found two Solr features that allow you to run an ordered query against documents, which is something that can be very useful for funnel analysis. The former is the SurroundQueryParser , which lets you search for phrases in documents that are far apart from each other by a given distance. The latter is the ComplexPhraseQueryParser which is similar to SurroundQueryParser except it doesn’t have a distance limitation. This fits our requirements as it’s possible that between two steps in the funnel there are thousands of events. To utilize ComplexPhrase for funnels, we needed to figure out how to serialize session events to a string.

We’ve introduced a new string column to the Sessions Core which encapsulates all of the events that occurred in a session as a single string – This string preserves the order of the events, and each event is represented as a hash. This allowed us to create powerful ComplexPhrases that find sessions with a sequence of events. However, ComplexPhrase has a limitation where queries with wildcards can quickly hit the MaxBooleanClause limit . The way we decided to overcome this limitation is to store each event as a hash of its metadata and create another Core ( Projects Events Core ) that maps between a hash and the original event metadata.

The cool thing about ComplexPhrase is you can execute both ordered and unordered queries with it by defining it at the query level, using the inOrder attribute. Ordered queries are relevant when we’d like to query for funnels, and unordered queries are relevant for any typical query.

Let’s take this funnel for example, and see how we’ll go about querying for it:

Step 1:“Give me all the sessions where a user navigated to ‘example.com/registration’”

Step 2:“Give me all the sessions where a user clicked on ‘Sign Up!”

The steps we need to take to do these queries are:

  1. Query the Project Events Core to get back all the matches(Event Hashes) in the project for events that match the steps query.
  2. Execute a ComplexPhrase query on the Sessions Core to get back all the sessions that had a match for the Event Hashes found in step 1.

Conclusion

To sum things up, we think this is a very cool use case for the ComplexPhrase feature in Solr. So far it’s been working great for us here at Jaco, allowing us to run complex funnel queries over hundreds of millions of events with an almost instant response time. We did have one set back with the MaxBooleanClauses limit, which we solved by the hashing trick, but other than that this has proved to be a very successful approach to the funnel problem.

Want to try it for yourself? go ahead and get your freeJaco account now!





About List