Distributed database for small records on top of LevelDB/RocksDB

Datetime: 2016-08-23 00:46:36          Topic: LevelDB, Database

LevelDB (and RocksDB as well) is a very well-known local storage engine for small records. It is implemented on top of log-structured merge trees and optimized for write performance.

LevelDB is quite horrible for records larger than 1 KB because of its merge operation: it quickly reaches a level where it always has to merge trees, and each merge takes seconds to complete. But for small records it is very useful and fast.

RebornDB is a proxy storage that operates on top of LevelDB/RocksDB and provides a Redis API to clients. Basically, it is Redis on top of on-disk leveldb storage.

The sharding scheme is interesting: there are 1024 slots, each of which uses its own replica set, which the admin can configure. When a client writes a key, the key is hashed and one of the 1024 slots is selected using a modulo (% 1024) operation. When the admin decides that some slots should be moved to a new or different machine, they use a command-line tool to reconfigure the proxies. During migration the slots in question remain available for IO, although a request may span multiple servers, since the proxy does not yet know whether the required key has already been moved to the new destination. Having many slots is a bit more flexible than old-school sharding by number of servers, but it is still quite far from automatic ID range generation: manual resharding does not scale for admins.
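The key-to-slot mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text says the key is hashed but does not name the hash here, so CRC32 is an assumption, and the slot table, server addresses and the `move_slots` helper are hypothetical names, not RebornDB's actual API.

```python
import zlib
from typing import Dict, Iterable, List

NUM_SLOTS = 1024  # slot count stated in the text


def slot_for_key(key: bytes) -> int:
    # Assumption: CRC32 as the hash; the text only says the key is hashed.
    return zlib.crc32(key) % NUM_SLOTS


# Hypothetical slot table: slot id -> replica set (list of server addresses).
slot_table: Dict[int, List[str]] = {
    slot: ["server-a:6380", "server-b:6380"] for slot in range(NUM_SLOTS)
}


def replicas_for_key(key: bytes) -> List[str]:
    # Route a key to the replica set that owns its slot.
    return slot_table[slot_for_key(key)]


def move_slots(slots: Iterable[int], new_replicas: List[str]) -> None:
    # Stand-in for the admin's manual reconfiguration of selected slots.
    for slot in slots:
        slot_table[slot] = list(new_replicas)
```

Because the slot count is fixed, moving data to a new machine only changes entries in the slot table; the key-to-slot mapping itself never changes.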

RebornDB uses zookeeper/etcd to store the slot-to-server mapping and per-slot replication policies. This does not force every operation to contact zookeeper (which would actually kill that service); instead, every proxy keeps the above-mentioned information locally, and every reconfiguration also updates it on every proxy.
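A minimal sketch of that idea, with a local copy on each proxy and updates pushed on reconfiguration, might look like the following. The `ConfigStore` class is an in-memory stand-in for zookeeper/etcd, not a real client, and every class and function name here is hypothetical.

```python
from typing import Dict, List


class ConfigStore:
    """In-memory stand-in for zookeeper/etcd (hypothetical, not a real client)."""

    def __init__(self, slot_table: Dict[int, List[str]]) -> None:
        self.slot_table = dict(slot_table)
        self.reads = 0  # counts how often the store is actually contacted

    def load(self) -> Dict[int, List[str]]:
        self.reads += 1
        return dict(self.slot_table)


class Proxy:
    def __init__(self, store: ConfigStore) -> None:
        # The store is read once at startup; routing uses the local copy.
        self.local_table = store.load()

    def route(self, slot: int) -> List[str]:
        # No round trip to the config store on the request path.
        return self.local_table[slot]

    def apply_update(self, slot: int, replicas: List[str]) -> None:
        self.local_table[slot] = list(replicas)


def reconfigure(store: ConfigStore, proxies: List[Proxy],
                slot: int, replicas: List[str]) -> None:
    # Update the authoritative copy, then push the change to every proxy.
    store.slot_table[slot] = list(replicas)
    for proxy in proxies:
        proxy.apply_update(slot, replicas)
```

With two proxies, the store is read twice in total no matter how many requests are routed, which is the property the paragraph above describes.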

There is not much information about data recovery (and migration), except that it is implemented on a key-by-key basis. Given that leveldb databases usually contain tens to hundreds of millions of keys, recovery may take quite a while; snapshot migration is on the todo list.
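To make the cost of the key-by-key approach concrete, here is a rough sketch in Python. Plain dicts stand in for the source and destination servers, and the read fallback (check the destination, then the source) is only one possible way to handle a key whose location is unknown during migration; the text does not say how RebornDB resolves this. All names are hypothetical.

```python
from typing import Dict, Iterator, Optional


def migrate_slot(source: Dict[bytes, bytes],
                 dest: Dict[bytes, bytes]) -> Iterator[bytes]:
    # Move one key per step; the total work grows linearly with the key count.
    for key in list(source.keys()):
        dest[key] = source[key]
        del source[key]
        yield key


def read_during_migration(key: bytes,
                          source: Dict[bytes, bytes],
                          dest: Dict[bytes, bytes]) -> Optional[bytes]:
    # Assumed policy: a key that has already moved is served from the
    # destination, anything else from the source.
    if key in dest:
        return dest[key]
    return source.get(key)
```

Each step handles a single key, so a database with hundreds of millions of keys needs as many steps, which is why a snapshot-based migration would be attractive.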




