Comparing LevelDB and RocksDB, take 2

Datetime: 2016-08-23 00:46:57    Topic: LevelDB

I previously explained problems to avoid when comparing RocksDB and LevelDB. I am back with more details and results because someone is wrong on the Internet. The purpose of this test was to determine whether we had introduced any regressions, after reading a comparison published by someone else in which RocksDB had some problems. Note that the LevelDB and RocksDB projects have different goals. I expect RocksDB to be faster, but that comes at a cost in code and configuration complexity. I am also reluctant to compare different projects in public. The good news is that I didn't find any performance regressions in RocksDB: it is faster, as expected, but the overhead from performance monitoring needs to be reduced.

I made a few changes to LevelDB before running tests. My changes are on GitHub and the commit message has the details. Adding the --seed option for read-heavy tests is important, or LevelDB can overstate QPS. The next step was to use the same compiler toolchain for RocksDB and LevelDB. I won't share the diff to the Makefile, as that is specific to my work environment.

I used the following pattern for tests. The pattern was repeated for N=1M, 10M, 100M and 1000M keys with 800-byte values and a 50% compression rate. The database sizes were approximately 512M, 5G, 50G and 500G. The test server has 40 hyperthread cores, 144G of RAM and fast storage.
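As a rough sanity check on those sizes: each key contributes about (key size + value size) × compression ratio bytes on disk, which lands in the same ballpark as the ~512M to ~500G figures above. A minimal sketch, where the 16-byte key size is my assumption (the db_bench default) rather than something stated in this post:

```python
# Rough on-disk size estimate: N keys, 800-byte values, ~50% compression.
# The 16-byte key size is an assumption (the db_bench default).
def approx_db_size_gb(num_keys, value_size=800, key_size=16, compression=0.5):
    return num_keys * (key_size + value_size) * compression / 2**30

for n in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} keys -> ~{approx_db_size_gb(n):.1f} GB")
```

The estimate ignores space amplification from the LSM tree, so the real databases are somewhat larger than this lower bound.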

  1. fillseq to load a database with N keys
  2. overwrite with 1 thread to randomize the database
  3. readwhilewriting with 1 reader thread and the writer limited to 1000 Puts/second. The rate limit is important to avoid starving the reader. Read performance is better when the memtable is empty and when queries are done immediately after fillseq. But for most workloads those are not realistic conditions, thus overwrite was done prior to this test.
  4. readwhilewriting with 16 reader threads and the writer limited to 1000 Puts/second
  5. readrandom with 1 thread
  6. readrandom with 16 threads
  7. overwrite with 1 thread
  8. overwrite with 16 threads
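The sequence above can be sketched as a series of db_bench invocations. The flag names here are illustrative: --benchmarks, --num, --value_size and --threads are real db_bench flags, but the write rate limit flag differs across db_bench versions, so treat this as a sketch of the sequence rather than the exact command lines I used (those are linked below).

```python
# Sketch of the benchmark sequence as db_bench invocations.
# Flag names are illustrative, not the exact command lines from the test.
NUM_KEYS = 1_000_000  # repeated for 10M, 100M and 1000M

steps = [
    ("fillseq", 1),
    ("overwrite", 1),
    ("readwhilewriting", 1),   # writer limited to 1000 Puts/second
    ("readwhilewriting", 16),  # writer limited to 1000 Puts/second
    ("readrandom", 1),
    ("readrandom", 16),
    ("overwrite", 1),
    ("overwrite", 16),
]

for bench, threads in steps:
    print(f"./db_bench --benchmarks={bench} --num={NUM_KEYS} "
          f"--value_size=800 --threads={threads} --seed=1")
```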

Results

I ran the RocksDB tests twice, with statistics enabled and disabled. We added a lot of monitoring in RocksDB to make it easier to explain performance. But some of that monitoring needs to be more efficient for workloads with high throughput and high concurrency. I have a task open to make this better.

These are the command lines for LevelDB, for RocksDB with stats, and for RocksDB without stats, using 1M keys. There are many more options in the RocksDB command lines. Once we decide on better defaults, the number of options can be reduced. This post has more details on the differences in options between LevelDB and RocksDB. There are some differences between LevelDB and RocksDB that I did not try to avoid.

  • LevelDB uses 2MB files and I chose not to change that in source before compiling. It tries to limit the LSM to 10M in L1, 100M in L2, 1000M in L3, etc. It also uses a 2M write buffer, which makes sense given that L0->L1 compaction is triggered when there are 4 files in L0. I configured RocksDB to use a 128M write buffer and to limit levels to 1G in L1, 8G in L2, 64G in L3, etc.
  • For the 100M and 1000M key tests, the value of --open_files wasn't large enough in LevelDB to cache all files in the database.
  • Statistics reporting was enabled for RocksDB. This data has been invaluable for explaining good and bad performance. That feature isn't in LevelDB. This is an example of the compaction IO stats we provide in RocksDB.
  • Flushing memtables and compaction are multithreaded in RocksDB. It was configured to use 7 threads for flushing memtables and 16 threads for background compaction. This is very important when the background work is slowed by IO and compression latency, and compression latency can be very high with zlib, although these tests used snappy. A smaller number would have been sufficient, but one thread would have been too little, as seen in the LevelDB results. Even with many threads there were stalls in RocksDB. In this output from the overwrite test with 16 threads, look at the Stall(cnt) column for L0 and then the Stalls(count) line. The stalls occur because there were too many L0 files. It is a challenge to move data from the memtable to L2 with leveled compaction because L0->L1 compaction is single threaded and usually cannot run concurrently with L1->L2 compaction. We have work in progress to make L0->L1 compaction much faster.
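The per-level size targets described above follow simple geometric progressions: LevelDB grows each level by 10x from 10M in L1, while my RocksDB configuration grows by 8x from 1G in L1. The progressions are from this post; the code below is just the arithmetic:

```python
# Per-level size targets implied by the two configurations:
# LevelDB: 10 MB in L1 with a 10x multiplier per level.
# RocksDB (as configured here): 1 GB in L1 with an 8x multiplier.
def level_targets(base_bytes, multiplier, levels):
    return [base_bytes * multiplier**i for i in range(levels)]

MB, GB = 2**20, 2**30
leveldb = level_targets(10 * MB, 10, 4)  # L1..L4
rocksdb = level_targets(1 * GB, 8, 4)    # L1..L4

print([t // MB for t in leveldb])  # [10, 100, 1000, 10000] (MB)
print([t // GB for t in rocksdb])  # [1, 8, 64, 512] (GB)
```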

Details

The data below shows the QPS (ops/sec) and for some tests also shows the ingest rate (MB/sec). I like to explain performance results, but the lack of monitoring in LevelDB makes that difficult. In my experience it suffers from not having concurrent threads for compaction and memtable flushing, especially when the database doesn't fit in RAM, because compaction then stalls more often on disk reads.

My conclusions are:

  • read throughput is a bit higher with RocksDB
  • write throughput is a lot higher with RocksDB and the advantage increases as the database size increases
  • worst case overhead for stats in RocksDB is about 10% at high concurrency. It is much less at low concurrency.
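That worst-case overhead can be computed directly from the result tables. For example, readrandom with 16 threads at 10M keys drops from 676858 to 604278 ops/sec when stats are enabled, while a single-threaded overwrite at 1M keys loses far less:

```python
# Overhead of statistics collection, computed from measured QPS.
# The input numbers come from the result tables in this post.
def stats_overhead_pct(qps_nostats, qps_stats):
    return 100.0 * (qps_nostats - qps_stats) / qps_nostats

# readrandom, 16 threads, 10M keys: the worst case observed (~10%)
print(f"{stats_overhead_pct(676858, 604278):.1f}%")

# overwrite, 1 thread, 1M keys: much smaller at low concurrency (~3%)
print(f"{stats_overhead_pct(152709, 148254):.1f}%")
```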

--- 1M keys, ~512M of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 231641   181.1  :   243161   190.2  :    156299   121.6  : fillseq
 145352   113.7  :   157914   123.5  :     21344    16.6  : overwrite, 1 thread
 113814          :   116339          :     73062          : readwhilewriting, 1 thread
 850609          :   891225          :    535906          : readwhilewriting, 16 threads
 186651          :   192948          :    117716          : readrandom, 1 thread
 771182          :   803999          :    686341          : readrandom, 16 threads
 148254   115.9  :   152709   119.4  :     24396    19.0  : overwrite, 1 thread
 109678    85.8  :   110883    86.7  :     18517    14.4  : overwrite, 16 threads

--- 10M keys, ~5G of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 226324   177.0  :   242528   189.7  :   140095   109.0   : fillseq
  86170    67.4  :    86120    67.3  :    12281     9.6   : overwrite, 1 thread
 102422          :    95775          :    54696           : readwhilewriting, 1 thread
 687739          :   727981          :   513395           : readwhilewriting, 16 threads
 143811          :   143809          :    95057           : readrandom, 1 thread
 604278          :   676858          :   646517           : readrandom, 16 threads
  83208    65.1  :    85342    66.7  :    13220    10.3   : overwrite, 1 thread
  82685    64.7  :    83576    65.4  :    11421     8.9   : overwrite, 16 threads

--- 100M keys, ~50GB of data

  RocksDB.stats  :  RocksDB.nostats  :     LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  :   ops/sec  MB/sec  : test
 227738   178.1  :   238645   186.6  :    64599    50.3   : fillseq
  72139    56.4  :    73602    57.6  :     6235     4.9   : overwrite, 1 thread
  45467          :    47663          :    12981           : readwhilewriting, 1 thread
 501563          :   509846          :   173531           : readwhilewriting, 16 threads
  54345          :    57677          :    21743           : readrandom, 1 thread
 572986          :   585050          :   339314           : readrandom, 16 threads
  74292    56.7  :    72860    57.0  :     7026     5.5   : overwrite, 1 thread
  74382    58.2  :    75865    59.3  :     5603     4.4   : overwrite, 16 threads

--- 1000M keys, ~500GB of data

Tests are taking a long time...

  RocksDB.stats  :    LevelDB
ops/sec  MB/sec  :  ops/sec  MB/sec  : test
 233126   182.3  :     7054     5.5  : fillseq
  65169    51.0  :                   : overwrite, 1 thread
   6790          :                   : readwhilewriting, 1 thread
  72670          :                   : readwhilewriting, 16 threads
