Computing stuff tied to the physical world

The LevelDB database

In Software on Sep 26, 2013 at 00:01

In recent posts, I’ve mentioned LevelDB a few times – a database with some quite interesting properties.

Conceptually, LevelDB is ridiculously simple: a “key-value” store which stores all entries in sorted key order. Keys and values can be anything (including empty). LevelDB is not a toy database: it can handle hundreds of millions of keys, and does so very efficiently.

It’s completely up to the application how to use this storage scheme, and how to define key structure and sub-keys. I’m using a simple prefix-based mechanism: some name, a “~” character, and the rest of the key, of which each field can be separated with more “~” characters. The implicit sorting then makes it easy to get all the keys for any prefix.

LevelDB is good at only a few things, but these work together quite effectively:

  • fast traversal in sorted and reverse-sorted order, over any range of keys
  • fast writing, regardless of which keys are being written, for add / replace / delete
  • compact key storage, by compressing common prefix bytes
  • compact data storage, by a periodic compaction using a fast compression algorithm
  • modifications can be batched, with a guaranteed all-or-nothing success on completion
  • traversal is done in an isolated manner, i.e. freezing the state w.r.t. further changes

Note that there are essentially just a few operations: GET, PUT, DEL, BATCH (of PUTs and DELs), and forward/backward traversal.

Perhaps the most limiting property of LevelDB is that it’s an in-process embedded library, and that therefore only a single process can have the database open. This is not a multi-user or multi-process solution, although one could be created on top by adding a server interface to it. This is the same as for Redis, which runs as one separate process.

To give you a rough idea of how the underlying log-structured merge-tree algorithm works, assume you have a datafile with a sorted set of key-value pairs, written sequentially to file. In the case of LevelDB, these files are normally capped at 2 MB each, so there may be a number of files which together constitute the entire dataset.

Here’s a trivial example with three key-value pairs:


It’s easy to see that one can do fast read traversals either way in this data, by locating the start and end position (e.g. via binary search).

The writing part is where things become interesting. Instead of modifying the files, we collect changes in memory for a bit, and then write a second-level file with changes – again sorted, but now not only containing the PUTs but also the DELs, i.e. indicators to ignore the previous entries with the same keys.

Here’s the same example, but with two new keys, one replacement, and one deletion:


A full traversal would return the entries with keys A, AB, B, and D in this last example.

Together, these two levels of data constitute the new state of the database. Traversal now has to keep track of two cursors, moving through each level at the appropriate rate.

For more changes, we can add yet another level, but evidently things will degrade quickly if we keep doing this. Old data would never get dropped, and the number of levels would rapidly grow. The cleverness comes from the occasional merge steps inserted by LevelDB: once in a while, it takes two levels, runs through all entries in both of them, and produces a new merged level to replace those two levels. This cleans up all the obsolete and duplicate entries, while still keeping everything in sorted order.

The consequence of this process, is that most of the data will end up getting copied, a few times even. It’s very much like a generational garbage collector with from- and to-spaces. So there’s a trade-off in the amount of writing this leads to. But there is also a compression mechanism involved, which reduces the file sizes, especially when there is a lot of repetition across neighbouring key-value pairs.

So why would this be a good match for HouseMon and time series data in general?

Well, first of all, in terms of I/O this is extremely efficient for time-series data collection: a bunch of values with different keys and a timestamp, getting inserted all over the place. With traditional round-robin storage and slots, you end up touching each file in some spot all the time – this leads to a few bytes written in a block, and requires a full block read-modify-write cycle for every single change. With the LevelDB aproach, all the changes will be written sequentially, with the merge process delayed (and batched!) to a different moment in time. I’m quite certain that storage of large amounts of real-time measurement data will scale substantially better this way than any traditional RRDB, while producing the same end results for access. The same argument applies to indexed B-Trees.

The second reason, is that data sorted on key and time is bound to be highly compressible: temperature, light levels, etc – they all tend to vary slowly in time. Even a fairly crude compression algorithm will work well when readings are laid out in that particular order.

And thirdly: when the keys have the time stamp at the end, we end up with the data laid out in an optimal way for graphing: any graph series will be a consecutive set of entries in the database, and that’s where LevelDB’s traversal works at its best. This will be the case for raw as well as aggregated data.

One drawback could be the delayed writing. This is the same issue with Redis, which only saves its data to file once in a while. But this is no big deal in HouseMon, since all the raw data is being appended to plain-text log files anyway – so on startup, all we have to do is replay the end of the last log file, to make sure everything ends up in LevelDB as well.

Stay tuned for a few other reasons on why LevelDB works well in Node.js …

  1. Typo in the title…

  2. Please create a post that discusses how your leveldb design stores the timestamp data. Do you have to put the timestamp in the key? = 67.1 or does leveldb store the timestamp as metadata for the key/value pair. Is the timestamp localtime or GMT? Then please discuss how a query would work using timestamp ranges to generate a dataset for my kitchen temperature. Thanks. Love watching your progress.

    • I do both – for storing historical data, the timestamp is appended to the key, for storing a table of actual values it is omitted and hence overwrites the previous value – see github. All timestamps are in milliseconds GMT to avoid summer-/wintertime issues.

      The trick to searching is to invent a range around the values. I.e. to search everything starting with “abc~def”, I define the search as “abc~def” .. “abc~def~”. More importantly, for everything starting with “abc~def~”, the range becomes “abc~def~” .. “abc~def~~”

  3. Looked over your coffee code. Looks good, thanks for answering.

Comments are closed.