As I was taking in the beauty of Big Sur during Thanksgiving weekend, an alert goes off indicating that theBotmetrics (the conversational intelligence platform we’re building) was experiencing high page-load latencies for our dashboards pages.
We’d just signed up a large customer the previous week and while we were processing millions of events per day (3–5k requests per minute), our query engine was having intermittent trouble (with request latencies >15s) digesting large volumes of data.
With the help of our friends at Citus , we re-architected our system in a matter of days , and were able to speed up our query engine 150x . This is how we did it.
Journey from a 100’s of req/min to 1000’s of req/min
Botmetrics is a conversational intelligence platform that allows customers to measure and analyze conversations . We support Facebook, Kik, Slack, SMS (Twilio), Google Home and web-based conversations. We did have a couple of high volume customers but a week before Thanksgiving, we onboarded one of the top Facebook Messenger bots on our platform and the volume of data we were processing increased by an order of magnitude.
Stepping Back: Our Data Model
Botmetrics uses Postgres as its primary data store and has served us well. The key data model that is relevant to this story is the
events table — which stores all of the different events that are sent by our customers (these can include events such as plain text, but also complicated structured data such as images, location coordinates, button clicks and so on). Different messenger platforms have different data schemas for their message payloads but with Postgres’ JSONB data structure , we are able to ingest all of the varied data and store them in a single table .
With JSONB, we can set uniqueness constraints on child elements of the JSON column so that we don’t duplicate events. For e.g. Facebook uses a combination of
seq to determine the uniqueness of an event, Slack uses a combination of
timestamp and with Postgres we are able to set conditional uniqueness constraints (depending on the messenger platform).
The Original Data Flow Design
As with any well-designed event ingestion architecture, our Events Collection API is decoupled from the Events Processing Engine, thus allowing us to scale each component independently. The HTTP endpoints queues up events in our message queue and event processors then pick up the events and performs pre-processing on it (validation, normalization of data fields and de-deduplication) before INSERTing into the events table in Postgres.
The Query Engine would then query the
events table to generate time-series data, which is used by the web frontend to display the pretty graphs and tables.
Limits of The Design
The limitation of this design is that as the number of rows in the database approaches tens of millions, querying this table (even with good indexes) results in very large query times. Some queries require >15s query time which results in an inferior user experience.
Use the Rollups, Luke
One of the solutions (which we had intended to get to at some point) is for the engine to not query the
events table directly but to cache the counts in a separate column or table and query that instead.
Coming from a Rails world, the obvious solution would have been to use
after_create callbacks and store the rolled-up counts for events in a separate table. There are a few problems with this however:
- Delegating this responsibility to the app layer is fraught with danger as we would have to do extra checks to make sure that we wrap these database writes/updates in transactions.
- Account for app deploys and restarts interrupting these callbacks and causing inconsistent values to be presented to the user.
- Callbacks in the app code are cause unintended effects (especially while running tests) that you have to account for.
- At the scale at which we were operating, we had to be careful for the callbacks to itself cause a thundering herd problem with updates to the same row in the rolled up table.
It was clear to us at this point that we needed a solution that was native to the database. Enter Citus.
Postgres Triggers, Upserts and Queues
Thanks to the amazing content from the folks at Citus , I came across a reference implementation of how rollups can be implemented in a sane way. You should definitely watch the video tutorial, but the TL;DR is:
- Use Postgres triggers (upon
INSERTof row in the events table) to queue up count data to an intermediate table (called
- At a specified frequency (with some jitter so as to avoid the thundering herd problem), we use Postgres 9.5's
UPSERTfeature to take all of the pending counts in the
rolledup_events_queuetable and roll them up (by hour) in the
The function that is invoked to flush the
rolledup_events_queue table is presented here:
- The Query Engine then queries from the
rolledup_eventstable. Due to the highly dense nature of conversational metrics (messages between agent and user happen in a matter of seconds and minutes), in initial testing we were able to achieve rollup gains of 7x and massive increases in query performance.
The updated data flow model is presented here:
Deploying to Production
The next step was to deploy the system in production with zero-downtime and zero data-loss. Our de-coupled event processing system allowed us to perform upgrades to our database without any downtime.
(For the curious, we performed three dry runs of this process, ran scripts to make sure there was no data loss before we deployed the changes to production).
- During the upgrade process, we turned off the event processor engines which would take raw event data from the message queue and insert them into the database. (As a consequence of this, we warned users beforehand that their dashboards would not update in real-time for a few hours).
- Since our Postgres trigger was activated on
INSERTit would not take care of the millions of
eventrows that were in our database. So we created a second trigger function that also worked on
UPDATEand updated the
updated_atcolumn for all of the events rows in batches. Once the migration was complete, we deleted the trigger (although updates to event data are non-existent in any case).
- Our Query Engine was then deployed which read from the
rolledup_eventstable instead of the
eventstable and voila, we were in business.