As hinted at yesterday, I intend to use the ZeroMQ library as foundation for building stuff on. ZeroMQ bills itself as “The Intelligent Transport Layer”, and frankly, I’m inclined to agree. Platform and vendor agnostic. Small. Fast.
So now we’ve got ourselves a pipe. What do we push through it? Water? Gas? Electrons?
Heh – none of the above. I’m going to push data / messages through it, structured data that is.
The next can of worms: how does a sender encode structured data, and how does a receiver interpret those bytes? Have a look at this Comparison of data serialization formats for a comprehensive overview (thanks, Wikipedia!).
Yikes, too many options! This is almost the dreaded language debate all over again…
Ok, I’ve travelled the world, I’ve looked around, I’ve pondered on all the options, and I’ve weighed the ins and outs of ’em all. In the name of choosing a practical and durable solution, and to create an infrastructure I can build upon. In the end, I’ve picked a serialization format which most people may have never heard of: Bencode.
Not XML, not JSON, not ASN.1, not, well… not anything “common”, “standard”, or “popular” – sorry.
Let me explain, by describing the process I went through:
While the JeeBus project ran, two years ago, everything was based on Tcl, which has implicit and automatic serialization built-in. So evidently, this was selected as mechanism at the time (using Tequila).
XML and ASN.1 were rejected outright. Way too much complexity, serving no clear purpose in this context.
But JSON is too complex for really low-end use, and requires relatively much effort and memory to parse. It’s based on reserved characters and an escape character mechanism. And it doesn’t support binary data.
Next in the line-up: Bernstein’s netstrings. Very elegant in its simplicity, and requiring no escape convention to get arbitrary binary data across. It supports pre-allocation of memory in the receiver, so datasets of truly arbitrary size can safely be transferred.
But netstrings are a too limited: only strings, no structure. Zed Shaw extended the concept and came up with tagged netstrings, with sufficient richness to represent a few basic datatypes, as well as lists (arrays) and dictionaries (associative arrays). Still very clean, and now also with exactly the necessary functionality.
(Tagged) netstrings are delightfully simple to construct and to parse. Even an ATmega could do it.
But netstrings suffer from memory buffering problems when used with nested data structures. Everything sent needs to be prefixed with a byte count. That means you have to either buffer or generate the resulting byte sequence twice when transmitting data. And when parsed on the receiver end, nested data structures require either a lot of temporary buffer space or a lot of cleverness in the reconstruction algorithm.
Which brings me to Bencode, as used in the – gasp! – Bittorrent protocol. It does not suffer from netstring’s nested size-prefix problems or nested decoding memory use. It has the interesting property that any structured data has exactly one representation in Bencode. And it’s trivially easy to generate and parse.
Bencode can easily be used with any programming language (there are lots of implementations of it, new ones are easy to add), and with any storage or communication mechanism. As for the Bittorent tie-in… who cares?
So there you have it. I haven’t written a single line of code yet (first time ever, but it’s the truth!), and already some major choices have been set in stone. This is what I meant when I said that programming language choice needs to be put in perspective: the language is not the essence, the data is. Data is the center of our information universe – programming languages still come and go. I’ve had it with stifling programming language choices.
Does that mean everybody will have to deal with ZeroMQ and Bencode? Luckily: no. We – you, me, anyone – can create bridges and interfaces to the rest of the world in any way we like. I think HouseAgent is an interesting development (hi Maarten, hi Marco :) – and it now uses ZeroMQ, so that might be easy to tie into. Others will be using Homeseer, or XTension, or Domotiga, or MisterHouse, or even… JeeMon? But the point is, I’m not going to make a decision that way – the center of my universe will be structured data. With ZeroMQ and Bencode as glue.
And from there, anything is possible. Including all of the above. Or anything else. Freedom of choice!
Update – if the Bencode format were relaxed to allow whitespace between all elements, then it could actually be pretty-printed in an indented fashion and become very readable. Might be a useful option for debugging.