Rating system in a high-load project

Datetime:2017-04-21 05:17:35         Topic: DataBase  Nginx  Software Architecture          Share        Original >>
Here to See The Original Article!!!

Original article available at https://habrahabr.ru/post/275281/

In this article, I’m going to tell you about a content project for which I had to redesign the architecture. The project was initially built with a classic LAMP (Linux, Apache, MySQL, PHP) stack. However, the user base was growing steadily and approaching one million visitors, so the database server couldn’t process all the traffic properly. First and foremost, I suggested getting another server, but in this market segment, partner programs don’t see particularly high conversion rates, so project managers wouldn’t stump up the money.

If you’d like to know what changes I had to make to the existing architecture and how I implemented a rating and rotation system, read on.

What’s so special about this project is that it distributes video content hosted on donor sites like YouTube. The site had to display only videos by their IDs with BBCode, so there was no need to constantly generate HTML on the fly — it was done periodically (for example, once a day), and then the system was changed to rotate the content each 1,000 impressions. Apache was replaced with Nginx, which simply served generated static HTML content.

Each time a user visits the site, they should see something new. But everything new is well-forgotten old, as is often the case. In short, I needed to implement a video preview rotation system. There are quite a few rotation algorithms. You can’t even imagine how ingenious marketers are. That’s why I’ll tell you about only one simplest algorithm.

The first ten slots get filled with only new video previews. The next 90 slots get filled with video previews having the highest click-through rate (CTR, number of image clicks divided by the number of image impressions) in the category.

A video may potentially be popular, while its preview doesn’t reflect its content very well. This scenario is highly likely, because instead of letting a human select the best moments from a video, a company may choose to use a computer program to generate a preview from randomly selected frames. That’s why an interesting video may have more downvotes than it has upvotes. To make the site content more diverse and to account for computer-generated previews, the local rating is used: each video has three generated previews that are also rotated, and only the most attractive are chosen through natural selection. Then there’s a voting system with its upvotes and downvotes, but its technical implementation is almost identical to that of a rotation system.

But we’re here not to listen to SEO tales, but to share technical details. To make a long story short, the whole LAMP stack was replaced with a site generator. Nginx was used to serve static content. What was left was to implement CTR calculation.

As the number of videos on the site was around 100K, it was OK to go with persistent in-memory storage, available options being Redis , Aerospike and Tarantool .

Due to decent functionality and friendly support of the Russian-speaking developers from Mail.Ru Group, I chose Tarantool. I was still using MySQL to store video IDs for BBCode, lists of categories, content description and other information necessary for site generation. As the database was scarcely used, it was allocated minimum required memory.

A few more details about Tarantool (I’ll be using T* for short). Numerous articles have been written about it, so I’ll try to explain how to use it in real life, omitting the installation and setup steps.

But first we’ll need some boring theory to understand how things work. T* keeps all data in spaces that are analogous to SQL tables or MongoDB collections. As a table consists of rows and a collection of documents, so a space is made up of tuples.

A tuple is comprised of elements or fields. I’m used to calling tuple elements fields, so I’ll stick with this name, which is in line with the documentation . Unlike table rows, tuple fields have no names, but only ordinal numbers. However, as you’ll see, it doesn’t really make any difference.

Each tuple must have a primary key. Primary index can be of the following types: TREE, HASH, BITSET or RTREE. You can also define a secondary index on a space, which allows for unique selections that are impossible in Redis.

Analogy between MySQL and T*

To store ratings, we’ll create a space called stats . To do that programmatically, we need to execute the following commands in the console:

box.cfg{} –- loading the default configuration
box.schema.space.create(“stats”) –- creating a new space

Let’s check whether our space has been created:

- stats: 
 temporary: false 
 engine: memtx 

Assigning the space to the stats variable:

stats = box.space

If we were creating a schema for our database or MongoDB, we’d go with something like this:

  1. key — primary key, same as a video ID
  2. clicks_1 — image 1 clicks
  3. clicks_2 — image 2 clicks
  4. clicks_3 — image 3 clicks
  5. clicks_sum_1 — total image 1 clicks
  6. clicks_sum_2 — total image 2 clicks
  7. clicks_sum_3 — total image 3 clicks
  8. show_1 — same for impressions
  9. ctr_1 — image 1 CTR for the last specified period
  10. ctr_2 — image 2 CTR for the last specified period
  11. ctr_3 — image 3 CTR for the last specified period
  12. ctr_sum_1 — image 1 CTR for the whole period
  13. ctr_sum_2 — image 2 CTR for the whole period
  14. ctr_sum_3 — image 3 CTR for the whole period
  15. ctr — CTR across all images for the last specified period
  16. ctr_sum — CTR across all images for the whole period

The first column is the field number. Let’s assign each a constant number:

-- the first field is a primary key
clicks_1 = 2 
clicks_2 = 3
. . .
ctr_sum = 22

Creating a primary key of the HASH type in our space:

stats:create_index(‘primary’, {type = ‘hash’, parts = {1, ‘NUM’}})

Making sure it did work:

- 0: &0 
 unique: true 
 — type: NUM 
 fieldno: 1 
 id: 0 
 space_id: 513 
 name: primary 
 type: HASH 
 primary: *0 

In case of success, we’re going to create a function that increments the clicks_1 field, but let’s first insert a few records for debugging:


Checking one of the inserted records:

- — [2, 0, 0, 0, 0, 0, 0] 

Great, everything works! Now let’s write the code for incrementing the field:

- [2, 1, 0, 0, 0, 0, 0] 
- [2, 2, 0, 0, 0, 0, 0]

The update command has the following parameters:

  • primary key — key of the record for which update is performed
  • list of actions, where each item is a triplet (list of size 3): action type (addition in our case), number of the field to change, and a number

More information on the update command is available in the documentation .

As you can see, with each stats:update the data in the second field of the record with primary key 2 gets incremented by 1. Let’s present it in a more readable form. Earlier in the text, we did the following:

clicks_1 = 2

Let’s execute the command like this:

- [2, 4, 0, 0, 0, 0, 0]

Wrapping it up in a function:

function click_inc(key)

Checking that the function works:

- — [2, 5, 0, 0, 0, 0, 0] 
- — [2, 6, 0, 0, 0, 0, 0] 

Let’s add an image number to our function (images are counted from zero):

function click_inc(key, img_num)
    stats:update(key,{{‘+’,clicks_1 + img_num, 1}})

After making sure it works as expected, let’s move the function to a separate file called click.lua and expand it a little:

function click_inc(key, img_num)
    if img_num >3 then
        return false
    box.space.stats:update(key, {{‘+’,clicks_1 + img_num,1}})
    return true

As you can see, the logic here is pretty straightforward: the first argument is a video ID, the second is the number of its preview. Let’s now see how it can be used in real life. In a web project, this function can be called in three different ways:

  • through a user API: from PHP/Python/Perl/Java and other scripts
  • through tarantool-http, where Nginx will be proxying all the requests, or a custom Lua script using http.lib or another web server (for example, Xavante)
  • directly from Nginx by using the nginx_upstream module

We opted for the last one. My article is too long as it is, so you can check this article by the T* developers on how to install and configure the module.

So, our click.lua file will look like this:

    log_level = 5;
    listen = 10001;
click_1 = 2;
function click_inc(key, img_num)
    if img_num >3 then
        return 0
    box.space.stats:update(key, {{‘+’,click_1 + img_num,1}})
    return 1

Let’s check how it works:

curl — data ‘{“method”:”click_inc”,”params”:[2,1], “id”:0}’ {“id”:0,”result”:[[1]]}

Connecting to a running T* instance:

tarantool: connected to 
- true 
stats = box.space.stats 
- — [2, 7, 0, 0] 

We can also increment the counter of the second image:

curl — data ‘{“method”:”click_inc”,”params”:[2,2], “id”:1}’

Checking the result:

- — [2, 7, 1, 0] 

We’ve learned how to easily implement click counting. Time to see how to count impressions properly.

Each page containing previews that belong to one category should, in theory, call the show_inc function (that increments the impression counter) a hundred times (let’s pretend one category page contains a hundred previews from that category). As we all understand, it’s suboptimal, though. Another option would be to generate a variable in the body of an HTML page.

show_pictures=”1,2,3,4,5” /* list of IDs of all images to show */

Then we could send this list with the help of AJAX. But, apart from the ID of an image, here we need to pass in its impression option as well, which might result in the following list: “1–1, 2–1, 3–1, 4–2”, where the number after the hyphen specifies the impression option of an image.

Unfortunately, Lua doesn’t have any functions analogous to explode() in PHP, so I found this code snippet on the Internet:

function split(inputstr, sep)
    if sep == nil then
        sep = “%s”
    local t={} ; i=1
    for str in string.gmatch(inputstr, “([^”..sep..”]+)”) do
        t[i] = str
        i = i + 1
    return t

Let’s iterate over the table in a loop with the following function:

function values(t)
    local i = 0
    return function()
        i = i + 1;
        return t[i]
for it in values(tt) do
    show_inc(it, 2)

As you’ve probably guessed, show_inc is almost the same as click_inc , except that we’re using show_1 instead of click_1 . That’s why we can create a more versatile function called stat_inc :

function stat_inc(key, field, img_num)
    if img_num >3 then
        return 0
    box.space.stats:update(key, {{‘+’,field + img_num,1}})
    return 1

As we’re calculating two types of CTR — for the period that started at the time of the last generation and for the whole period — let’s create the click function that will be called through Nginx:

function click(key, img_num)
    stat_inc(key, clicks_1, img_num)
    stat_inc(key, clicks_sum_1, img_num)

The show function will look as follows:

function show(key_list)
    list = split(key_list, ‘,’)
    for it in value(list)
            pos = string.find(it, “-”);
            key = string.sub(it, 0, pos-1);
            img_num = string.sub(it,pos+1)
            stat_inc(key, shows_1, img_num)
            stat_inc(key, shows_sum_1, img_num)

This way we’re calculating both clicks and impressions.

Thanks for reading this article!