Aggregation features, Elasticsearch vs. MySQL (vs. MongoDB)

Datetime:2016-08-23 02:07:10          Topic: MySQL  Elastic Search  SQL           Share

To make the MySQL Document Store primary programming interface, the X DevAPI , a success we should provide building blocks to solve common web development tasks, for example, faceted search . But is there any chance a relational database bend towards a document store can compete around aggregation (grouping) features? Comparing Elasticsearch , MySQL and MongoDB gives mixed results. GROUP BY on JSON gets you pretty far but you need to watch out for pitfalls…

Scale-out, sharding, schema-less, big data all hiding a giant developer task: giving user access to semi-structured data in near real-time. This can mean providing user driven full text search, facet search or analytics to suggest topics to users, e.g. recommender systems. Elasticsearch is a popular choice here. As much as Elasticsearch differs from both MySQL and MongoDB and comparisons are not “fair” it is interesting to compare the Elasticsearch aggregation framework functionality with the other two. What’s there that MySQL could learn, which feature can be done using SQL, which comparable SQL is so tricky that it deserves to be wrapped in the DevAPI? An overview based on Elasticsearch 2.3.5, MySQL 5.7/8.0 and MongoDB 3.2.

Elasticsearch aggregations overview

The query languages of the three systems are quite different. The technology is different, the concepts differ and the terminology differs. The Elasticsearch aggregation framework speaks of three building blocks: metrics, buckets, pipelining. Because my main focus is on comparing the feature set, I’ll use SQL to describe the terms. The SQL examples help to stay focussed. Actual Elasticseach query language examples will be given after the introduction.

The Metric aggregations compute metrics over a set of documents ;-). Examples of metrics given as MYSQL SQL function names are: MIN(), MAX(), STDDEV(), SUM() and so forth.

single value metric
           |
           v
SELECT MIN(price) FROM products

Elasticsearch features metric aggregations that produce one metric/value or multiple metrics/values in one go.

multi-value metric
          |          |
          v          v
SELECT MIN(price), MAX(price) FROM products

Bucket aggregations partition the data set. They describe rules for building different sets of documents into which documents then “fall” into. In its simplest form, think of a SQL GROUP BY criteria in a SQL statement. A statement like SELECT * FROM products GROUP BY size partitions all products by their size and puts rows with different size column values into different “buckets”.

partition, bucket, ...
                                             |
                                             v
SELECT size COUNT(*) FROM products GROUP BY size 

+----------------------+
| size     |  COUNT(*) |
+----------------------+ 
| S        |   123     | <--- set of rows with size = S
| M        |   456     |
| ...      |  ...      |

Pipeline aggregations operate on the result of an aggregation instead of the original data set. Imagine you have a day-trade online shop and want to know the average price of all products depending on the day they have been stocked. In SQL you could write:

SELECT in_stock_since, AVG(price) FROM products GROUP BY in_stock_since

Additionally, you are interested in the average price across all the results the aggregation has produced. You could either issue a second query or “pipe” the result into a second aggregation. Here’s a SQL statement to illustrate the “pipe into”:

SELECT AVG(_avg_price) FROM (
  SELECT AVG(price AS _avg_price) AS _avg_price
  FROM products
  GROUP BY in_stock_since 
) AS _sub

Beats SQL: many aggregations in one go

A neat thing about Elasticsearch is that it lets you compute the result of both of the above two aggregations using one query. Elasticsearch can return the average price per day and the average over all them in one result set. I am unaware of a MySQL SQL statement that could do the same. Something similar can be achieved in SQL using a UNION ALL. This has two drawbacks. First, EXPLAIN hints that MySQL will scan the underlying data set twice and fail to optimize the query much. Second, the results from the two SELECT statements must be “sequeezed” into one row result set. Because the two SQL statements produce different columns additional columns have to be added to them to make storing their columns in one combined result set possible. On an aside: as long as MySQL lacks JSON aggregation functions similar to GROUP_CONCAT it is impossible to overcome the column issue by returning a nested JSON document (awareness of a problem is the first step towards a solution…).

SELECT 
    in_stock_since, 
    AVG(price AS _avg_price) AS _avg_price
    NULL AS _avg_price_total
  FROM products
  GROUP BY in_stock_since 
UNION ALL
  SELECT
     NULL AS in_stock_since, 
     NULL AS _avg_price,
     AVG(_avg_price) AS _avg_price_total
  FROM (SELECT 
          in_stock_since, 
          AVG(price AS _avg_price) AS _avg_price
          NULL AS _avg_price_total
        FROM products
        GROUP BY in_stock_since) AS _sub

Executing multiple aggregations as one query is not limited to pipelined aggregations. Any kind of aggregation can be used in a query. And, any number of them may appear in a single query.

(My)SQL: multiple aggregations glued together
      |
      v  
  SELECT ... GROUP BY ...      < --- bucket aggregation 
UNION ALL 
  SELECT MAX(...) ...          <--- metric aggregation 
UNION ALL
  SELECT ... FROM (SELECT...)  <--- pipeline aggregation 
UNION ALL
  ...                          <--- additional aggregations

DevAPI opportunity: go beyond SQL/CRUD

One of the things MySQL should learn from Elasticsearch is the ability to run an arbitrary number of independent aggregations on the same data set (ElasticSearch: index, MySQL: table/collection, MongoDB: collection) using only one query. Use case?Faceted search…

Possible approaches to solve the task include optimizing UNION ALL handling, giving users access to lower level aggregation features to write queries that break out of SQL limitations or exapnding commands to multiple SQL statements. All these things can easily be done “behind the scenes” and be hidden from the users. The X Plugin and the X DevAPI are perfect verhicles to shield users from the very details and the actual SQL. Technically, the song I’m singing is the same MySQL has been singing for years: MySQL server plugins make it possible to innovate on the client side independently of server development constraints.

What’s new in MySQL world is the availability of a full stack (AKA MySQL Document Store) to execute on the idea.

Component Options to innovate
Application
X DevAPI Higher level query API
|
MySQL drivers Pre-parse into command tree, translate command into series of SQL statement and aggregate results…
|
X Protocol Pipelining, send from server to client, …
X Plugin Rewrite SQL, use lower-level server data access APIs over SQL, …
MySQL Server

What’s missing in MySQL world are ideas what to do with the new stack – time to be the bad boy talking about features missing in MySQL.

Elasticsearch aggregation query syntax

With some key Elasticsearch terms and concepts explained, first comparisions with SQL made, its time to show the basic syntax of an Elasticsearch aggregation query. Elasticsearch has an HTTP query interface. All queries can be written as JSON documents and be send to the search engine via HTTP. The basic structure of an aggregation query is as follows:

"aggregations" : {
    "

  
   
 " : {
        "
 
   
    
  " : {
            
  
    
     
   
        }
        [,"meta" : {  [
   
     
      
    ] } ]?
        [,"aggregations" : { [
    
      
       
     ]+ } ]?
    }
    [,"
     
       
        
      " : { ... } ]*
}

     
       
    
      
   
     
  
    
 
   

  

The aggregation part of a search query is using the key “aggregations”, or short “aggs” in the query document:

{ <--- query document
  "aggs": { <--- aggregation part
...
  }
}

A single query can use one or more aggregations. Each aggregation has a user-defined name:

{
  "aggs": {
    "materials": { <-- aggregation using user-defined name
    }
  }
}

It follows the definition of the type of the aggregation and all relevant parameters for it:

{
  "aggs": {
    "materials": {
      "terms": {  <-- aggregation type
        "field": "material" <-- parameters
      }
    }
  }
}

The terms aggregation is a multiple bucket aggregation that matches GROUP BY field . Results will be grouped by unique values found in the field. Queries are issued again an index. Think of an index as the counterpart of a table/collection in MySQL.

# curl -XGET '127.0.0.1:9200/products/_search?pretty' -d@aggs.json

Beats SQL again: aggregation and search results in one go

When the above query is executed, Elasticsearch beats SQL once more. Elasticsearch will return all documents matching the query (no filter set) plus the result of the aggregation. The SQL counterpart requires two queries. Ignoring some pagination details, the SQL equivalent of the above is:

SELECT * FROM products
SELECT material, COUNT(*) FROM products

The need to use two queries has the afore mentioned downsides: potentially two round trips from the client to the server, potentially scanning the same data stream twice. The issue of two round trips could easily be solved in the MySQL Document Store by having the X Plugin execute the two queries locally and combine the results before returning them to the client, or, maybe, using X protocol pipelining. But the issue of having to scan the data stream twice is a SQL limitation and potentially a server matter.

If one is not interested in the search results but only the aggregation results, then one can set the desired search result size to zero in the Elasticsearch JSON query document.

{
  "size": 0,  <--- skip search results
  "aggs": {
    "materials": {
      "terms": {
        "field": "material"
      }
    }
  }
}

# curl -XGET '127.0.0.1:9200/products/_search?pretty' -d@aggs.json
{
  "took": 89,  <-- 89ms / 0.089s runtime
  ...
  "hits": {
    "total": 1000000, <--- 1e6 documents scanned
    ...
    "hits": [  <--- do not return any search results    
    ]
  },
  "aggregations": {
    "materials": { <--- material aggregation 
      ...
      "buckets": [
        {
          "key": "wool",
          "doc_count": 167074
        },
        {
          "key": "cotton",
          "doc_count": 166950
        },
        ...
      ]
    }
  }
}

Metric aggregations compared

Most but not all Elasticsearch metric aggregations have direct counterparts in both MySQL and MongoDB. For example, all systems can compute the average value of an aggegrated numeric field. While the syntax differs there are no groundbreaking differences in the functionality.

Avg Aggregation
Elasticsearch
{
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}
MySQL
SELECT AVG(price) FROM products
MongoDB
[
  {
    "$group": {
      "_id": null,
      "val": {
        "$avg": "$price"
      }
    }
  }
]

There are two metric aggregations that are hard to emulate in MySQL: percentiles and top hits per group. Proposals to calculate percentiles in MySQL using SQL include using GROUP_CONCAT or SQL session variables. Either approach results in SQL better not shown. Likely, the problem could be easily solved using an UDF respectively adding a grouping function to stock MySQL.

Aggregation Elasticsearch MySQL MongoDB
Avg Yes Yes Yes
Cardinality Yes (Sample based) Yes (Exact) Yes (Exact)
Extended Stats Yes StdDev bounds missing Variance, StdDev bounds missing in aggrgation framework?
Geo Bounds Yes for future blog post
Geo Centroid Yes for future blog post
Max Yes Yes Yes
Percentiles Yes Complex SQL or UDF Complex
Percentile Ranks Yes Complex SQL or UDF Complex
Scripted Yes No Map-Reduce
Stats Yes Yes Yes
Top Hits Yes Complex Unknown (No)
Value Count Yes Yes Yes

Beats MySQL: Top hits – SQL:99 LATERAL

Calculating the top hits per group is a functionality that should be highly valuable when building a facet navigation. SQL:99 contains a SQL feature called LATERAL which could be used for the task. Unfortunately MySQL does not support it. The LATERAL keyword gives a sub-select access to FROM clauses that appear before it. It is often described as SQL’s foreach loop. It best explained at the example. All examples so far assumed a collection/index of product documents respectively a products table like:

+-------------------+---------------+------+-----+---------+----------------+
| Field             | Type          | Null | Key | Default | Extra          |
+-------------------+---------------+------+-----+---------+----------------+
| product_id        | int(11)       | NO   | PRI | NULL    | auto_increment |
| title             | varchar(127)  | YES  |     | NULL    |                |
| size              | varchar(255)  | YES  |     | NULL    |                |
| material          | varchar(255)  | YES  | MUL | NULL    |                |
| usage             | varchar(255)  | YES  |     | NULL    |                |
| gender            | varchar(255)  | YES  |     | NULL    |                |
| price             | double        | YES  |     | NULL    |                |
| in_stock_since    | datetime      | YES  |     | NULL    |                |
| description       | varchar(1024) | YES  |     | NULL    |                |
+-------------------+---------------+------+-----+---------+----------------+

Let’s try fetch the five most expensive products (fabrics) for each material (cotton, silk, ..) by help of LATERAL. We will build it up of two simple SQL queries. The grouping comes first:

SELECT material
FROM products
GROUP BY material

A query to fetch the five most expensive products made of cotton is straightforward as well:

SELECT product_id 
FROM products
WHERE material = 'cotton'
ORDER BY price DESC
LIMIT 5

The two queries now need to be combined in such a way that for each facet _material found by the first query, the second query is invoked with the material found to compute the top five hits. The second query needs access values produced by the first query. And, this is what LATERAL is about. It gives a subquery access to the results found by a previous query. A subquery does not execute in isolation but can break out of its isolation. The general pattern goes:

SELECT          <--- for each 
  inner.*
FROM  outer     <--- result produced (e.g. distinct material)
  ...
  JOIN          <--- add results 
  LATERAL       <--- allow access to previous results/FROM 
  (
     SELECT ... FROM inner <--- run a subquery to add results
     WHERE 
      inner.field = outer.result <--- based on outer query results
  ) AS inner

Because I have neither Postgres, Oracle or any other database featuring SQL:99 LATERAL installed I haven’t check the following SQL but it should be along the lines of the actual SQL:99 to solve the task. It is a little more complex to satisfy GROUP BY constraints ( see also this Postgres example ):

SELECT 
  top_hits.*
FROM 
  (SELECT material FROM products 
  GROUP BY products) AS materials
JOIN LATERAL
  (SELECT product_id, price, material
  FROM products WHERE material = material.material
  ORDER BY price DESC
  LIMIT 5) AS top_hits

If MySQL would feature LATERAL, then the DevAPI could make a top hits feature available using a syntax close to SQL, or a function based variant of the Elasticsearch query document, whatever turns out more powerful and convenient when writing application code. Here’s the Elastic search query to find the top five products grouped by material:

{
  "aggs": {
    "materials": {
      "terms": {
        "field": "material"
      },
      "aggs": {
        "top-tag": {
          "top_hits": {
            "sort": {
              "price": "desc"
            },
            "size": 5
          }
        }
      }
    }
  }
}

On an aside, the Elasticsearch query is also an example of nesting aggregations. The data stream is partioned by material and then further processed using a nested aggregation for the top hits. Judging from the few Elasticsearch queries I examined, nesting can often be mapped to SQL either using derived tables ( SELECT nested FROM (SELECT outer) ) or using a WHERE-clause.

Bucket Aggregations

Bucket aggregations partition the data stream. Most, if not all, Elasticsearch bucket aggregations do not offer features that cannot be achieved with either MySQL or MongoDB. The Filter, Filters and Global bucket aggreations are tricky to compare with the other systems. As explained above Elasticsearch differs from MySQL (and MongoDB) by returning search results and aggregation results together in one result set as a reply to a single query. For example, one can search all products that have the word Beach in their title line and gather distribution statistics on some product attributes like size and material. The Elasticsearch query is:

{
  "query": {
    "match": {
      "title": "Beach"
    }
  },
  "aggs": {
    "by_size": {
      "terms": {
        "field": "size"
      }
    },
    "by_material": {
      "terms": {
        "field": "material"
      }
    }
  }
}

nixnutz@linux-ouxx:~/src/php-src> curl -XGET '127.0.0.1:9200/products/_search?pretty' -d@aggs.json
{
...
    "hits": [   < --- all matching products
      {
        "_index": "products",
        "_type": "product",
        "_id": "86141",
        "_score": 3.9986782,
        "_source": {
          "size": "L",
          "price": 1.24,
          "title": "L {ukqevpzmht jyorthop Beach",
          "usage": "baby",
          "gender": "female",
          "material": "cotton",
          "description": "",
          "in_stock_since": "2016-08-01 20:08:40"
        }
      }, 
  "aggregations": {
    "by_size": { < -- size distribution statistics
...
        {
          "key": "L",
          "doc_count": 194
        },
...
    "by_material": { <--- material distribution statistics
...

A SQL query that comes somewhat close to it is made of three individial queries glued together with UNION ALL (see above). The Filter bucket aggregation can be used to filter out values from one of the aggregations. Maybe, we only need to count how many of matching products are available in size “M”:

{
  "query": {
    "match": {
      "title": "Beach"
    }
  },
  "aggs": {
    "by_size": {
      "filter": {        <-- filter bucket
        "term": {
          "size": "M"
        }
      }
    },
...

In SQL the filter condition would be added to the WHERE clause of the corresponding query in the UNION ALL statement.

SELECT ... FROM products WHERE MATCH(title) AGAINST('Beach')
  UNION ALL
SELECT ... FROM products WHERE size = 'M' AND ... < -- filter bucket condition 
  UNION ALL
SELECT ... FROM products GROUP BY material WHERE ...

n/a (yes) – apples and oranges…

Albeit the Filter, Filters and Global aggregations can be emulated in SQL, I marked them as “n/a (yes)” in the feature comparison table. In the end, its comparing apples and oranges. The the wonderful, unique Elasticsearch feature of returning aggregation values together with search results has no direct counterpart in the SQL world. The Elasticsearch global aggregation is specifically tailored to this different way of operation. It breaks and aggregation out of the current search context and allows it to perform an aggregation on the entire data set. The below query calculated the average price of those products matching “Beach” in their title and the average price of all products:

{
  "query": {
    "match": {
      "title": "Beach"  <-- search products matching "Beach"
    }
  },
  "aggs": {
    "avg_price_beach": {
      "avg": {
        "field": "price" <-- average price for "Beach" products
      }
    },
    "avg_price_all": {
      "global": {
        
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price" < average price for all products
          }
        }
      }
    }
  }
}

The SQL equivalent is:

SELECT 
  COUNT(*) AS _hits, AVG(price) AS _avg_beach, 
  (SELECT AVG(price) FROM products) AS _avg_all <-- break out search context
FROM products 
  WHERE MATCH(title) AGAINST ("Beach");
Aggregation Elasticsearch MySQL MongoDB
Childen Yes for future blog post
Date Histogram Yes Complex Complex
Date Range Yes Complex Complex
Filter Yes n/a (yes) n/a (yes)
Filters Yes n/a (yes) n/a (yes)
Geo Distance Yes for future blog post
GeoHash grid Yes for future blog post
Global Yes n/a (yes) n/a (yes)
Histogram Yes Complex Complex
IPv4 Range Yes Complex Complex
Missing Yes Yes Yes
Nested Yes for future blog post
Range Yes Complex Complex
Reverse Nested Yes for future blog post
Sampler Yes Complex Yes
Significant Terms Yes No Unknown (no)
Terms Yes Yes Yes

Histogram and Range aggregations

The group of histogram and range aggregation in Elasticsearch could be described as convenience functionality from a SQL database vendor perspective. Take the histogram aggregation as an example:

{
  "aggs": {
    "prices": {
      "histogram": {
        "field": "price",
        "interval": 10
      }
    }
  }
}

The SQL equivalent is:

SELECT
  COUNT(*) AS _num, 
  IF(
    price % 10 < 0, 
    price - (10 + (price % 10)), 
    price - price % 10
  ) AS _bucket
FROM 
  products 
GROUP BY _bucket

+----------+---------+
| _num     | _bucket |
+----------+---------+
|   168697 |       0 |
|   243999 |      10 |
|   288138 |      20 |
|   116725 |      30 |
|    77041 |      40 |
|    50007 |      50 |
|    34483 |      60 |
|    16487 |      70 |
|     3866 |      80 |
|      557 |      90 |
+----------+---------+

Unquestionable SQL can deliver. Unquestionable using IF is no rocket science. And, nobody should pollute an API with convenience functions. But there is more

than convenience here. MySQL evaluates the IF for every row: it’s slow. How slow? That’s for the next blog posting.

Happy hacking!





About List