Fooling around with MySQL JSON full-text indexing

Datetime: 2016-08-22 22:59:47          Topic: MySQL

MySQL 5.7/8.0 does not allow creating a full-text index on a JSON column, although such an index would be useful – unstructured data deserves unstructured search. I am unaware of any workaround for this limitation that does not require storing the JSON data twice. So I started messing around, allowed indexing of JSON columns, saw great execution times, but eventually followed MySQL's decision to call it a day. Friday afternoon rambling on indexing and easy hacking…

Why full-text: JSON indexing limitations and pitfalls

The current version of the MySQL Document Store is not yet optimized for performance. When we started this effort, we quickly gathered a long list of features that we considered must-haves prior to any serious performance tuning. The teams focused on building infrastructure and delivered many, yet not all, of their ideas. Unlike in the past, I'm really relaxed about the future ahead. But we should explain the properties of today's solutions and be open about their limits. JSON indexing has some limits.

The primary pattern for indexing JSON columns is extracting values into a generated column and indexing that generated column. There's a pitfall I failed to mention in all previous blog posts: charsets. The MySQL JSON SQL data type uses the charset utf8mb4. Now, check my typical example table definition:

CREATE TABLE products (
  product_id int(11) NOT NULL AUTO_INCREMENT,
  product json DEFAULT NULL, <-- utf8mb4
  facet_title varchar(127) <-- charset?
    GENERATED ALWAYS AS (
      json_unquote(
        json_extract(
          product,
          '$.title'
        )
      )
    ) STORED,
  PRIMARY KEY (product_id),
  FULLTEXT KEY idx_ft_title (facet_title)
) ENGINE=InnoDB 
  AUTO_INCREMENT=50001 
  DEFAULT CHARSET=latin1 <-- latin1

What might the charset of the facet_title column be? Correct, it is latin1:

SELECT CHARSET(facet_title), CHARSET(JSON_UNQUOTE(JSON_EXTRACT(product, "$.title"))) FROM products LIMIT 1;
+----------------------+---------------------------------------------------------+
| CHARSET(facet_title) | CHARSET(JSON_UNQUOTE(JSON_EXTRACT(product, "$.title"))) |
+----------------------+---------------------------------------------------------+
| latin1               | utf8mb4                                                 |
+----------------------+---------------------------------------------------------+

If, for some reason, you need the charset of the generated column to be utf8mb4, as it is in the JSON, then change the charset of the column or of the entire table.

ALTER TABLE 
  products
MODIFY 
  facet_title VARCHAR(127) CHARACTER SET 'utf8mb4' 
  GENERATED ALWAYS AS (
    JSON_UNQUOTE(JSON_EXTRACT(product, "$.title"))
  ) STORED
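
A quick check confirms the change – a sketch assuming the example products table from above:

-- Verify the generated column now uses the same charset as the JSON value
SELECT CHARSET(facet_title) FROM products LIMIT 1;
-- Returns: utf8mb4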

Stored… why stored?

Generated columns must be materialized (stored) to support full-text indexing. Materializing a value extracted from a JSON document effectively means duplicating data on disk: slower reads, more space needed in the buffers, and so on. There also seems to be an issue when combining a stored generated column with a virtual generated column – I got wrong query results and filed a bug. The workaround is to materialize all generated columns, but that takes away one of their biggest benefits.
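
For comparison: when only a regular B-tree index is needed, the generated column can stay VIRTUAL and nothing is materialized. A sketch using a hypothetical $.price field that is not part of the example above:

-- Hypothetical example: a VIRTUAL generated column takes no extra row storage,
-- and InnoDB allows a plain (non-FULLTEXT) secondary index on it.
ALTER TABLE products
  ADD COLUMN facet_price DECIMAL(10,2)
    GENERATED ALWAYS AS (
      JSON_UNQUOTE(JSON_EXTRACT(product, '$.price'))
    ) VIRTUAL,
  ADD INDEX idx_price (facet_price);

Only FULLTEXT (and spatial) indexes force the STORED variant.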

And, if you deal with unstructured data how do you know which JSON fields to extract?
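
At best you can inspect the documents themselves. A sketch: JSON_KEYS lists the top-level keys per row, which helps when deciding what to extract – but it is a per-row inspection, not a schema.

-- See which top-level fields the stored documents actually contain
SELECT DISTINCT JSON_KEYS(product) FROM products LIMIT 10;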

Jippie – full-text index on JSON

I turned to my remote friend GCC and together we convinced MySQL to cut one or the other safety belt to have some fun. MySQL agreed that there is no fun without risk, agreed that a backtrace is not the end of the world and gave me a full-text index on a JSON column:

ALTER TABLE products ADD FULLTEXT idx_ft_product(product);
Query OK, 0 rows affected (34,09 sec)
Records: 0  Duplicates: 0  Warnings: 0

The table products now has two full-text indexes: one on the generated column facet_title and one on the entire JSON document. It now supports “unstructured string search” with the expected performance characteristics. Full-text search is quick, and results are correct:

SELECT COUNT(*) 
FROM products 
WHERE MATCH(product) AGAINST("Beach"); <-- FT on JSON
+----------+
| COUNT(*) |
+----------+
|       48 |
+----------+
1 row in set (0,00 sec)

SELECT COUNT(*) 
FROM products 
WHERE MATCH(facet_title) AGAINST ("Beach"); <-- FT on generated column
+----------+
| COUNT(*) |
+----------+
|       48 |
+----------+
1 row in set (0,01 sec)

SELECT COUNT(*) 
FROM products 
WHERE 
  JSON_UNQUOTE(JSON_EXTRACT(product, "$.title")) 
  LIKE "%Beach%"; <-- no index
+----------+
| COUNT(*) |
+----------+
|       48 |
+----------+
1 row in set (3,98 sec)

WOW? No. MySQL is using safety belts for good reasons. I for sure broke a lot without knowing – or bothering much.

One of the things that goes wrong here is having the default, built-in full-text parser work on JSON. The InnoDB full-text parser is handed a char* and asked to make sense of it. That char* holds the binary representation that MySQL uses for on-disk JSON storage. I have no idea which words ended up in the full-text index.

The MySQL JSON binary representation

Assuming the InnoDB full-text parsing code can deal with words coming from JSON columns – and I would be surprised if much had to be done to make this possible – the task boils down to writing a full-text parser plugin for JSON column values. The MySQL manual page on writing a full-text parser plugin is not too long – why not give it a try.

The binary format of JSON columns is documented in WL#8132 (see also related WLs). A couple of type flags, a little “dictionary” for object and array columns – nothing too fancy going on there. But there is, of course, code in the server to do the job. The server has three classes that could come in handy: Value from json_binary.h, Json_dom from json_dom.h (https://github.com/mysql/mysql-server/blob/5.7/sql/json_dom.h) and Json_wrapper found in json_dom.cc.

Class Value is about serializing values into the binary JSON representation, and unserializing them again. Json_dom provides an API to manipulate Values. Json_wrapper lets you operate on a Value or a Json_dom without knowing which is being used. And the “DOM” is what the JSON_* SQL functions use to perform their work. Both Value and Json_dom are pretty much self-contained.

But, oh – not again: THD. Json_dom needs a THD for memory allocations and error/warning logging. THD stands for thread handler. Traditionally, MySQL has used one thread per client connection to perform all the work, and THD carries a lot of useful information for doing so. Lots of code requires passing a THD around. As long as your code is part of the traditional processing path, everything is fine.

But plugins are not. Plugins may not have access to THD, and if they figure out a way to access the THD, it is not always a recommended way. My hacked MySQL then told me: enjoy your weekend, I’ll segfault now. I respect that decision.

I’ve learned two things. For one, I think we should allow full-text indexing of JSON columns in some future version. For two, managers should not constantly push their development teams to their limits. It is probably no big deal to make the JSON processing classes available to plugins. In fact, json_binary seems to be available already; only Json_dom is missing. But given the very tight schedules, there is often no time to go the extra mile. The good news is that the JSON processing classes should be available to UDFs with no hassle, and that some cool refactoring is being done in the server to make it more modular in the long term.

Maybe I should stop talking about desired features myself… Whatever. Next for me is either UDFs or the cool new plugin services in MySQL 5.7 that pretty much nobody has talked about, because everybody talks about the end-user use cases instead.

Happy hacking!




