Computing stuff tied to the physical world

Data, data, data

In Software on Feb 17, 2013 at 00:01

If all good things come in threes, then maybe everything that is a triple is good?

This post is about some choices I’ve just made in HouseMon for the way data is managed, stored, and archived (see? threes!). The data in this case is essentially everything that gets monitored in and around the house. This can then be used for status information, historical charts, statistics, and ultimately also some control based on automated rules.

The measurement “readings” exist in the following three (3!) forms in HouseMon:

  • raw – serial streams and byte packets, as received via interfaces and wireless networks
  • decoded – actual values (e.g. “temperature”), as integers with a fixed decimal point
  • formatted – final values as shown on-screen, formatted and with units, e.g. “17.8 °C”
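To illustrate the step from the second form to the third: a decoded reading is a plain integer with an implied decimal point, and formatting only happens at display time. Here is a minimal sketch in Python (the function name and `scale` parameter are my own; HouseMon itself runs on Node.js):

```python
def format_reading(raw_value, scale=1, unit=""):
    """Turn a fixed-point integer reading into a display string.

    `scale` is the number of implied decimal places, e.g. a decoded
    value of 178 with scale=1 represents 17.8.
    """
    value = raw_value / 10 ** scale
    return f"{value:.{scale}f} {unit}".rstrip()

print(format_reading(178, scale=1, unit="°C"))  # 17.8 °C
```

Keeping readings as integers until the last moment avoids floating-point rounding creeping into storage and makes the stored values compact.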

For storage, I’ll be using three (3!) distinct mechanisms: log files, Redis, and archive files.

The first decision was to store everything coming in as raw text in daily log files, which roll over at midnight (UTC); each entry is tagged with the interface and a time stamp in milliseconds. This format has been in use here at JeeLabs since 2008, and has served me really well. Here is an example, taken from the file called “20130211.txt”:

    L 01:29:55.605 usb-AH01A0GD OK 19 115 113 98 25 39 173 123 1
    L 01:29:58.435 usb-AH01A0GD OK 9 4 50 68 235 251 166 232 234 195 72 251 24
    L 01:29:58.435 usb-AH01A0GD DF S 6373 497 68
    L 01:29:59.714 usb-AH01A0GD OK 19 96 13 2 11 2 30 0
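Each line follows the same simple shape: a tag, a millisecond time stamp, the interface name, and the raw payload. A parser for it could look like this sketch (field names are my own choice, not HouseMon's):

```python
def parse_log_line(line):
    """Split one daily-log entry into its parts.

    Format: 'L HH:MM:SS.mmm <interface> <raw payload...>'
    The payload is kept verbatim; decoding happens later, in a
    separate step.
    """
    tag, timestamp, interface, rest = line.split(" ", 3)
    return {"tag": tag, "time": timestamp, "interface": interface, "raw": rest}

entry = parse_log_line("L 01:29:55.605 usb-AH01A0GD OK 19 115 113 98 25 39 173 123 1")
print(entry["interface"])  # usb-AH01A0GD
```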

Easy to read and search through, but clearly useless for seeing the actual values, since these are the RF12 packets before being decoded. The benefit of this format is precisely that it is as raw as it gets: by storing this on file, I can improve the decoders, fix the inevitable bugs which will crop up from time to time, and then simply re-parse the files and run them through the decoders again. Given that the data comes from sketches which change over time, and which can also contain bugs, the mapping of which decoders to apply to which packets is an important one. It will in fact depend on the timeline: the same node may have been re-used for a different sketch, with a different packet format over time (though this has rarely happened here once a node has been put to permanent use).
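One way to capture that timeline-dependent mapping is a per-node list of (start date, decoder) entries, where the most recent entry not after the packet's date wins. A hypothetical sketch (node ids, dates, and decoder names are all made up for illustration):

```python
from datetime import date

# Hypothetical history: node 19 ran one sketch, then was re-flashed.
DECODER_TIMELINE = {
    19: [(date(2008, 1, 1), "roomNode"),
         (date(2012, 6, 1), "roomNodeV2")],  # re-used with a new packet format
}

def decoder_for(node_id, day):
    """Pick the decoder that was active for this node on this date."""
    chosen = None
    for first_day, name in DECODER_TIMELINE.get(node_id, []):
        if first_day <= day:
            chosen = name
    return chosen

print(decoder_for(19, date(2013, 2, 11)))  # roomNodeV2
```

With such a table, re-processing old log files automatically applies the decoder that was in effect when each packet was logged.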

In HouseMon, the “logger” briq now ties into serial I/O and creates such daily log files.

The second format is more meaningful. This holds “readings” such as:

    { id: 1, group: 5, band: 868, type: 'roomNode', value: 123, time: ... }

… which might represent a 12.3°C reading in the living room, for example.

This is now stored in Redis, using the new “history” briq (a “briq” is simply an installable module in HouseMon). There is one sorted set per parameter, to which new readings (i.e. integers) are added as they come in. To support staged archive storage, the sorted sets are segmented per 32 hours, i.e. there is one sorted set per parameter per 32-hour period. At most two periods are needed to store full details of every reading from at least the past 24 hours. With two periods saved in Redis for each parameter, even a setup with a few hundred parameters will require no more than a few dozen megabytes of RAM. This is essential, given that Redis keeps all its data in memory.
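The 32-hour segmentation boils down to dividing the millisecond timestamp by the period length and folding the quotient into the key name. A sketch of how this could work (the `hist:<param>:<segment>` key naming is my assumption, not necessarily the history briq's actual scheme):

```python
PERIOD_MS = 32 * 3600 * 1000  # one 32-hour segment, in milliseconds

def segment_key(param, time_ms):
    """Build the name of the sorted set holding this reading.

    The key naming here is illustrative only.
    """
    return f"hist:{param}:{time_ms // PERIOD_MS}"

# A reading would then be added with its timestamp as the score, e.g.
# with redis-py:  r.zadd(segment_key(param, t), {packed_reading: t})
print(segment_key("roomNode.1.temp", 1360891798435))
```

Since any 24-hour window spans at most one segment boundary, reading back “the last day” never touches more than two sorted sets per parameter.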

And lastly, there is a process which runs periodically, to move data older than two periods ago into “archival storage”. These are not round-robin databases, in the sense of a circular buffer which gets overwritten as new data comes in and wraps around, but they do use a somewhat similar format on disk. Archival storage can grow indefinitely; here I expect to end up with about 50..100 MB per year once all the log files have been re-processed. Less, if compression is used (to be decided once the speed trade-offs of the RPi have been measured).

The files produced by the “archive” briq have the following three (3!) properties:

  • archives are redundant – they can be reconstructed from scratch, using the log files
  • data in archives is aggregated, with one data point per hour, and optimised for access
  • each hourly aggregation contains: a count, a sum, a minimum, and a maximum value
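The hourly aggregation step itself is straightforward. As a sketch (the four-element slot layout is my own representation of the count/sum/min/max tuple described above):

```python
HOUR_MS = 3600 * 1000  # one hour, in milliseconds

def aggregate_hourly(readings):
    """Collapse (time_ms, value) readings into per-hour slots, keeping
    only the four numbers stored per slot: count, sum, min, max."""
    slots = {}
    for time_ms, value in readings:
        hour = time_ms // HOUR_MS
        if hour not in slots:
            slots[hour] = [0, 0, value, value]  # count, sum, min, max
        s = slots[hour]
        s[0] += 1
        s[1] += value
        s[2] = min(s[2], value)
        s[3] = max(s[3], value)
    return slots
```

Count and sum together give the hourly mean, while min and max preserve the extremes that averaging would otherwise hide.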

I’ll describe the archive design choices and tentative file format in the next post.

  1. Great reading. I’m also converting my home automation system towards Redis as my caching and “current” persistency layer, while still using MySQL for the long-term storage (logged asynchronously). This makes it possible to build a simple query interface which can query and aggregate over old data. Still no aggregation needed so far, though (and I’ve got around 66 million readings by now).

    For the Redis part I’m also using a Hash to keep the current sensor values. Like: HSET currentvalues zwave.12.temp 21

    This creates one single hash which I fetch a lot in my web interface, touchscreens, rule engine, etc.

  2. Minimum and maximum values sound like sensible choices, but if you want to get some statistical data, might the addition of the variance (or standard deviation) be a good idea? That should be a good indication of, well, variance: the performance of a thermostat, light level fluctuations (both indoors—lighting—and outdoors—PV production).

    • Variance within each hour? Would it not be sufficient to do this over larger periods of time, i.e. using the hourly values as basis for calculating variance per week / month / year? I haven’t given statistics much thought yet…

  3. Hello jcw, have you ever looked at mongodb? I’ve been following your development of HouseMon of course and see you’ve chosen to work with Redis. I am messing with both of them out of curiosity. I like the quick setup of mongodb. Just wondering if you have considered it for HouseMon…

    • Yes, I’ve looked into mongodb and mongoose. I think it’s not such a good fit for time series and numeric data, in the sense that the capabilities of mongodb won’t be of much use in this context. Storing data in columnar format with implicit “time slots” will give more opportunities to optimise, I expect.

Comments are closed.