Computing stuff tied to the physical world

Who needs a database?

In Software on Feb 18, 2013 at 00:01

Yesterday’s post was about the three (3!) ways HouseMon manages data. Today I’ll describe the initial archive design in HouseMon. Just to reiterate its main properties:

  • archives are redundant – they can be reconstructed from scratch, using the log files
  • data in archives is aggregated, with one data point per hour, and optimised for access
  • each hourly aggregation contains: a count, a sum, a minimum, and a maximum value

The first property implies that the design of this stuff is absolutely non-critical: if a better design comes up later, we can easily migrate HouseMon to it by implementing the new code and replaying all the log files, to decode and store all readings in the new format.

The second decision was a bit harder to make, but I’ve chosen to save only hourly values for all data older than 48 hours. That means today and yesterday remain accessible from Redis with every single detail still intact, while weekly, monthly, and yearly graphs always resolve to hourly values. The simplification over progressive RRD-like aggregation is that there really are only two types of data, and that archived data access is instant for any time range: simple file seeking.

Which brings me to the data format: I’m going to use Plain Old Binary Data Files for all archive data. No database, nothing – although a filesystem is just as much a database as anything, really (as “git” also illustrates).

Note that this doesn’t preclude also storing the data in a “real” database, but as far as HouseMon is concerned, that’s an optional export choice (once the code for it exists).

Here’s a sample “data structure” of the archive on disk:

    archive/
      index.json
      p123/
        p123-1.dat
        p123-2.dat
      p124/
        p124-1.dat

The “index.json” file is a map from parameter names to a unique ID (IDs 1 and 2 are used in this example). The “p123” directory has the data for time slot 123. This is in multiples of 1024 hours, as each data file holds 1024 hourly slots. So “p123” would be unix time 123 x 1024 x 3600 = 453427200 (or “Tue May 15 00:00:00 UTC 1984”, if you want to know).

Each file has 1024 hourly data points and each data point uses 16 bytes, so each data file is 16,384 bytes long. The reason they have been split up is to make varying dataset collections easy to manage, and to allow optional compression. Conceptually, the archive is simply one array per parameter, indexed by the absolute hour since Jan 1st, 1970 at midnight UTC.
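
To make the slot arithmetic concrete, here’s a minimal sketch in Python (not HouseMon code – the names are made up purely for illustration):

    # Illustrative sketch of the archive layout described above.
    import os

    SLOTS_PER_FILE = 1024   # hourly slots per .dat file
    SLOT_SIZE = 16          # bytes per hourly data point

    def archive_location(param_id, unix_time):
        """Return the (file path, byte offset) of the slot for one parameter and timestamp."""
        hour = int(unix_time) // 3600       # absolute hour since Jan 1st, 1970 UTC
        seq = hour // SLOTS_PER_FILE        # e.g. 123 -> directory "p123"
        slot = hour % SLOTS_PER_FILE        # position inside that file
        path = os.path.join('archive', 'p%d' % seq, 'p%d-%d.dat' % (seq, param_id))
        return path, slot * SLOT_SIZE

    # Example: 123 x 1024 x 3600 = 453427200 lands in slot 0 of archive/p123/p123-1.dat
    print(archive_location(1, 453427200))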

The 16 bytes of each data point contain the following information in little-endian order:

  • a 4-byte header
  • the 32-bit sum of all measurement values in this hour
  • the 32-bit minimum value of all measurements in this hour
  • the 32-bit maximum value of all measurements in this hour

The header contains a 12-bit count of all measurement values (i.e. 0..4095). The other 20 bits are reserved for future use (such as a few extra bits for the sum, if we ever need ’em).

The average value for an hour is of course easy to calculate from this: the sum divided by the count. Slots with a zero count represent missing values.
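
Packing and unpacking such a data point is equally simple. Here’s an illustrative Python sketch – note that the post doesn’t specify whether the 12-bit count sits in the low or the high bits of the header, so this assumes the low 12 bits:

    import struct

    # One 16-byte slot: a 4-byte header, then sum, min, max as signed 32-bit ints,
    # all little-endian.
    SLOT_FORMAT = '<Iiii'

    def pack_slot(count, total, lo, hi):
        assert 0 <= count <= 4095           # the count must fit in 12 bits
        return struct.pack(SLOT_FORMAT, count, total, lo, hi)   # reserved bits left at zero

    def unpack_slot(raw):
        header, total, lo, hi = struct.unpack(SLOT_FORMAT, raw)
        count = header & 0xFFF              # assumption: count in the low 12 bits
        avg = total / count if count else None   # zero count = missing value
        return count, total, lo, hi, avg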

Not only are these files trivial to create, read, modify, and write – it’d also be very easy to implement code which does this in any programming language. Simplicity is good!

Note that there are several trade-offs in this design. There can be no more than 4095 measurements per hour, which is just enough for one measurement per second, for example. And measurements have to be 32-bit signed ints. But I expect that these choices will work out just fine, and as mentioned before: if the format isn’t good enough, we can quickly switch to a better one. I’d rather start out really simple (and fast) than come up with a sophisticated setup which adds too little value and too much complexity.

As far as access goes, you can see that no searching at all is involved to access any time range for any parameter. Just open a few files, send the extracted data points to the browser (or maybe even entire files), and process the data there. If this becomes I/O bound, then per-file compression could be introduced – readings which are only a small integer or which hardly ever change will compress really well.
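
Tying the two sketches above together, fetching a time range boils down to one seek and one fixed-size read per hour – roughly like this (still illustrative Python, reusing the hypothetical archive_location and unpack_slot helpers):

    def read_range(param_id, start_time, end_time):
        """Yield (hour, count, sum, min, max) for each non-empty hourly slot in the range."""
        for hour in range(int(start_time) // 3600, int(end_time) // 3600):
            path, offset = archive_location(param_id, hour * 3600)
            try:
                with open(path, 'rb') as f:
                    f.seek(offset)
                    raw = f.read(SLOT_SIZE)
            except FileNotFoundError:
                continue                    # no archive file for this 1024-hour period
            if len(raw) < SLOT_SIZE:
                continue                    # slot not present
            count, total, lo, hi, avg = unpack_slot(raw)
            if count:                       # zero count = missing value
                yield hour, count, total, lo, hi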

Time will tell. Premature optimisation is one of those pitfalls I’d rather avoid for now!

  1. Perhaps a useful enhancement to the hourly structure would be to replace sum by average (no risk of overflow, and sum can be reconstructed). The reserved bits are available then for standard_deviation (possibly uncorrected by the n-1 factor to simplify computing standard_deviation over sequential hours). This allows data display with confidence bands – otherwise, without saving a bulky sum_of_squares value, this information is lost.

    • Interesting idea – in my original design that wasn’t feasible, but now that aggregation happens parameter by parameter, it is indeed an option. Other info I was planning to track is the max time between readings. This would give an indication of large outages (or long RF interference periods).

      I’ll wait a bit to see just how large these datasets end up becoming before deciding to add more info – perhaps even increase the per-hour datapoint item size to 20 or 24 bytes.

    • Also a reply on yesterday’s musing about standard deviation: of course you can use the hourly averages to calculate a variance over longer periods of time, but the variance always decreases when taking averages (the std of the average of N readings is 1/sqrt(N) times the std of the un-averaged readings), so it is not the same as the std of the raw readings.

      It is always possible to calculate the longer-period variance/std from the hourly values, as Martyn indicates; the other way around is impossible. Also: the variance of e.g. outdoor light levels aggregated per hour isn’t really that interesting: it would give the variance of a sine wave (night/day lighting), whereas the std within an hour could for instance be a measure of passing clouds…

    • Good points, Geert. Indeed, stddev of averages is going to be a lot less useful. But I do need to find a way to aggregate which can be done progressively – my statistics is a bit rusty, but I assume a sum and a sum of squares will probably do. I’m not yet convinced that I want this information, but I’ll keep it in mind.
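
      As a sanity check: a running count n, sum s, and sum of squares q per hour should indeed do – all three simply add up across hours, and the (uncorrected) variance over any combined period is then

          variance = q/n - (s/n)^2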

  2. I log data to a csv file every minute, but poll and collect data once a second. This data is then averaged over the minute and written to file along with a timestamp. Log files are also per day: each log file is in a yyyy/mm/ folder and has a dd in the filename for easy retrieval. This makes it very easy to parse in js, and mostly, especially for solar, I want to see stats per day. The timestamp helps me visually identify outages on the graph. When generating the graph, I zero out the display array for every minute and then populate it with the csv data. I find this approach, although more byte-heavy, makes it very easy to process the data later, even in Excel or by whatever means necessary.

    • Do you log the raw incoming data as well, or just these per-minute aggregations? What about motion detectors, for example?

  3. Surely it would be a worthwhile and relatively simple extension to make the aggregation period arbitrary, rather than forever locked to hourly? The optimal period depends a lot on the type and volatility of the data. Some types of data can be summarised more aggressively with little loss of value (e.g. 4-hourly or daily), whereas others may demand more retained granularity. It should only affect the index file, which can contain the information about period length, and the naming scheme on the *.dat files (replace the 3600 by the declared period length?).

    • Possible, of course. Let’s wait for a use case before making such changes, though. Keep in mind that full detail is retained in Redis for at least the past 32 hours (I’ll probably increase that to 64, to cover today and yesterday at all times). Also, a log file could be replayed and reparsed on-demand, to again obtain the complete info for any specific day, so drill-down is still possible.

      As for retaining less information: I intend to compress each of the files with 1024 measurements, and expect many of them to drop substantially in size. The server would then send these compressed sets as-is to the clients, which also saves on bandwidth and processing power. The whole archive then behaves as a file server, in 1024-hour units of per-parameter data.

      None of this is set in stone, but I’m inclined to stick to these choices until some real-world experience can be used to see what they lead to.

  4. It would be nice to compare Akash’s approach in terms of storage space to that of JCW. A year/month/day file structure is easier to grasp …

  5. I’ll let Akash describe his trade-offs, but note that the logs are daily (named YYYYMMDD.txt.gz) and the archive files are in such a format that small values and slow-changing values will compress extremely well. I’ll hopefully have some real figures in the near future.

    Note also that there are subtle issues when using daily files (daylight saving time and time-zones). With a web-based (i.e. client-side) export mechanism, I expect that the internal format of archive data files will become less relevant.

  6. Well, as my system runs on a tiny (512k flash and 128k ram) processor, I was unable to find a database engine that would run on it and decided to go for a CSV file-based approach – I did think of a record-based approach but decided against it, mainly because CSV is more readable and easily parsable. I log multiple csv files with the variable name included in the file name. Right now I am only storing the processed readings in float format, but one could easily store both. Here is a snapshot of what is stored per day:

    20130215sa.txt  42788 bytes
    20130215ia.txt  46521 bytes
    20130215v.txt   15488  bytes
    
    Where v is voltage, s is solar amps and i is inverter amps. The voltage just logs voltage and timestamp as follows:
    1360925641,23.440162
    1360925701,23.506372
    1360925761,24.102262
    1360925821,23.705001
    
    Whereas the inverter and solar log multiple things:
    1360925641,953.303833,10.250224,0,3109.680419,49.886066,14.450335
    1360925701,953.303833,10.250224,0,3109.923095,50.128730,14.559834
    1360925761,953.303833,10.250224,0,3110.168212,50.373962,14.713999
    1360925821,953.303833,10.250224,0,3110.410400,50.616249,14.537171
    Fields: Timestamp, total ah, ah today, amp minutes, three blank load vars, and watt hours today.

    Yes, I know, I could easily remove lots of it, but I wanted a similar format for load and generation for my graphing engine and decided to stick with it. So as you can see it’s over 100 k for just 3 variables logging once a minute 24/7 – is that a lot or not? That depends on your storage size and how many years of logs you want to keep. I have a 4 gb micro sd card, which means about 1.2 GB per year. For my purposes that is more than sufficient, and I have daily stats very easily available via a known yy/mm/dd_var.txt mechanism. But as jcw pointed out, one could just as easily modify it to only log changed data if you are monitoring a sensor, in which case the file size would reduce drastically – but the disadvantage is that you would have no way of knowing whether the sensor was operational during that time.
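
    For what it’s worth, a two-column file like the voltage log above is indeed trivial to parse – a purely illustrative sketch in Python (the function name is made up):

        import csv

        def read_day_log(path):
            """Return a list of (unix timestamp, value) tuples from a per-day CSV log."""
            with open(path) as f:
                return [(int(row[0]), float(row[1])) for row in csv.reader(f) if row]

        # e.g. read_day_log('2013/02/20130215v.txt')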

  7. … catching up with you guys …

    There can be some other considerations. For example, for my solar electricity production, I just like to be able to see the time series for any moment in the past. I like the concept of the raw logging, never to lose information, but I also feel the need for “processed” archives with a higher granularity. See for example my solar website (made for my daughter’s primary school presentation): http://www.sevenwatt.com/zon/ Currently I use CSV files similar to Akash. I store data every ten seconds, and every two minutes. BTW, I get this from my plugwise through Linux. The two-minute readings are aggregated and buffered by the PW-nodes themselves, while the ten-second readings require actual real-time access to the plugs, using the 8-second integration. Due to communication interference I might miss a couple of readings. I never miss a two-minute reading, as PW caches them for sixty days.

    The archive briq seems to aggregate per hour, but my requirement would be that I could specify per parameter how to aggregate. And possibly do one-minute, one-hour, and one-day aggregations per parameter.

    My second consideration is on performance. Currently I offload any heavy lifting to the client side. So I just pass text files (csv), and if the client needs to turn the text into floats to graph it, that does not load my Pi, or to be more precise my NAS with the same ARM processor, which does the web page serving.

    So HouseMon is strongly optimized towards storage efficiency, judging by the use of a binary format in the archives. So I assume that the archive values are converted to text when transported over the socketstream. Right? Or does the client side receive binary? And maybe node.js will pleasantly surprise me. The NAS is running Apache (but not mysql, which is way too slow).

    SevenW

    • My current thinking is to aggregate per hour, indefinitely (archiver), and to keep full details around for the last 48 hours (history). The aggregation granularity could be made configurable (site-wide will be easier than per-parameter). I’m going to follow this path because I’m not that interested in finer granularity after a few days (weekly per hour suits me fine), and there is always the option to replay any of the daily logfiles on-demand if doing so turns out to be needed after all.

      Archive files will be served as is, as static files, so the server won’t get bogged down. History data comes out of RAM via Redis. In both cases, the RPi will not do much, so I expect it to scale, but evidently real testing will have to bear this out.

      Archive files will be stored as is initially (i.e. binary slots), but probably gzipped later – this could save substantially on disk space, bandwidth, and CPU power, since those files would also be served as is, in compressed form. If this all turns out to leave lots of space and processing power to spare, it would then be easy to switch to a smaller unit of aggregation, say 30 or 15 minutes. One benefit of not aggregating further is that charts can zoom in without having to go back for more detail. A year of hourly values should still be fairly manageable on the client.

      I have several years of logfiles, which should come in handy when everything is ready to make these trade-offs. With a bit of luck, even pumping a year’s worth of parameter data to the client should be reasonably snappy.

      But it’s all guesswork at this stage, really…

  8. I’m now also aggregating the standard deviation. Each slot is now 20 bytes, i.e. five 32-bit ints. Each file is 20,480 bytes – uncompressed.
