Computing stuff tied to the physical world

Structured data

In Software, Musings on Jun 22, 2012 at 00:01

As hinted at yesterday, I intend to use the ZeroMQ library as a foundation for building stuff on. ZeroMQ bills itself as “The Intelligent Transport Layer”, and frankly, I’m inclined to agree. Platform and vendor agnostic. Small. Fast.

So now we’ve got ourselves a pipe. What do we push through it? Water? Gas? Electrons?

Heh – none of the above. I’m going to push data / messages through it, structured data that is.

The next can of worms: how does a sender encode structured data, and how does a receiver interpret those bytes? Have a look at this Comparison of data serialization formats for a comprehensive overview (thanks, Wikipedia!).

Yikes, too many options! This is almost the dreaded language debate all over again…

Ok, I’ve travelled the world, I’ve looked around, I’ve pondered on all the options, and I’ve weighed the ins and outs of ’em all. All in the name of choosing a practical and durable solution, and of creating an infrastructure I can build upon. In the end, I’ve picked a serialization format which most people may never have heard of: Bencode.

Not XML, not JSON, not ASN.1, not, well… not anything “common”, “standard”, or “popular” – sorry.

Let me explain, by describing the process I went through:

  • While the JeeBus project ran, two years ago, everything was based on Tcl, which has implicit and automatic serialization built in. So evidently, this was selected as the mechanism at the time (using Tequila).

  • But that more or less constrains all inter-operability to Tcl (similar to using pickling in Python, or even – to some extent – JSON in JavaScript). All other languages would be second-rate citizens. Not good enough.

  • XML and ASN.1 were rejected outright. Way too much complexity, serving no clear purpose in this context.

  • Also on the horizon: JSON, a simple serialization format which happens to be just about the native source code format for data structures in JavaScript. It is rapidly displacing XML in various scenarios.

  • But JSON is too complex for really low-end use, and takes a relatively large amount of effort and memory to parse. It’s based on reserved characters and an escape character mechanism. And it has no native way to carry binary data.

  • Next in the line-up: Bernstein’s netstrings. Very elegant in their simplicity, and requiring no escape convention to get arbitrary binary data across. Each string is prefixed with its length, which allows pre-allocation of memory in the receiver, so datasets of truly arbitrary size can safely be transferred.

  • But netstrings are too limited: only strings, no structure. Zed Shaw extended the concept and came up with tagged netstrings, with sufficient richness to represent a few basic datatypes, as well as lists (arrays) and dictionaries (associative arrays). Still very clean, and now also with exactly the necessary functionality.

  • (Tagged) netstrings are delightfully simple to construct and to parse. Even an ATmega could do it.

  • But netstrings suffer from memory buffering problems when used with nested data structures. Everything sent needs to be prefixed with a byte count. That means you have to either buffer or generate the resulting byte sequence twice when transmitting data. And when parsed on the receiver end, nested data structures require either a lot of temporary buffer space or a lot of cleverness in the reconstruction algorithm.

  • Which brings me to Bencode, as used in the – gasp! – BitTorrent protocol. It does not suffer from the nested size-prefix problems of netstrings, nor from their memory use when decoding nested data. It has the interesting property that any structured data has exactly one representation in Bencode. And it’s trivially easy to generate and parse.
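To make that last point a bit more concrete, here is a minimal Python sketch of a Bencode encoder. The function name `bencode` and the choice to encode text as UTF-8 are assumptions made for illustration, not taken from any particular library. It covers the four standard types (integers, byte strings, lists, dictionaries) and sorts dictionary keys, which is what gives each structure a single representation.

```python
def bencode(value) -> bytes:
    """Encode ints, strings, lists and dicts as Bencode (illustrative sketch)."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; plain Bencode has no boolean type
        raise TypeError("plain Bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value                      # integer: i<digits>e
    if isinstance(value, str):
        value = value.encode("utf-8")               # assumption: text goes out as UTF-8
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)       # string: <length>:<bytes>
    if isinstance(value, (list, tuple)):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        pairs = []
        for key, item in value.items():
            if isinstance(key, str):
                key = key.encode("utf-8")
            if not isinstance(key, bytes):
                raise TypeError("Bencode dictionary keys must be strings")
            pairs.append((key, item))
        pairs.sort(key=lambda pair: pair[0])        # sorted keys: one representation
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


# Two dictionaries built in a different order encode to the same bytes.
a = bencode({"temp": 215, "node": "kitchen"})
b = bencode({"node": "kitchen", "temp": 215})
print(a)       # b'd4:node7:kitchen4:tempi215ee'
print(a == b)  # True
```

Note that only the strings carry a length prefix. Lists and dictionaries are simply terminated with an `e`, so a nested structure can be written out while walking it, without knowing its total size up front.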

Bencode can easily be used with any programming language (there are lots of implementations of it, new ones are easy to add), and with any storage or communication mechanism. As for the BitTorrent tie-in… who cares?
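As a rough illustration of how little a new implementation needs, here is a sketch of a decoder in Python. The names `bdecode` and `_decode_at` are made up for this example, error handling is minimal, and it does not reject non-canonical input such as unsorted keys or integers with leading zeros.

```python
def _decode_at(data: bytes, pos: int):
    """Decode one value starting at pos; return (value, position after it)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                                # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                                # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = _decode_at(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                                # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = _decode_at(data, pos)
            value, pos = _decode_at(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                              # string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError("unexpected byte at position %d" % pos)


def bdecode(data: bytes):
    """Decode a complete Bencode message into Python ints, bytes, lists, dicts."""
    value, pos = _decode_at(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after Bencode value")
    return value


print(bdecode(b"d4:node7:kitchen4:tempi215ee"))
# {b'node': b'kitchen', b'temp': 215}
```

Strings come back as raw bytes here, since Bencode itself says nothing about text encodings.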

So there you have it. I haven’t written a single line of code yet (first time ever, but it’s the truth!), and already some major choices have been set in stone. This is what I meant when I said that programming language choice needs to be put in perspective: the language is not the essence, the data is. Data is the center of our information universe – programming languages still come and go. I’ve had it with stifling programming language choices.

Does that mean everybody will have to deal with ZeroMQ and Bencode? Luckily: no. We – you, me, anyone – can create bridges and interfaces to the rest of the world in any way we like. I think HouseAgent is an interesting development (hi Maarten, hi Marco :) – and it now uses ZeroMQ, so that might be easy to tie into. Others will be using Homeseer, or XTension, or Domotiga, or MisterHouse, or even… JeeMon? But the point is, I’m not going to make a decision that way – the center of my universe will be structured data. With ZeroMQ and Bencode as glue.

And from there, anything is possible. Including all of the above. Or anything else. Freedom of choice!

Update – if the Bencode format were relaxed to allow whitespace between all elements, then it could actually be pretty-printed in an indented fashion and become very readable. Might be a useful option for debugging.
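Here is a small sketch of what such a pretty-printer could look like in Python. This is a hypothetical relaxed format, not standard Bencode: the function name `pretty`, the two-space indent, and the layout of keys and values are all choices made up for this example. It assumes dictionary keys are text strings.

```python
def pretty(value, depth: int = 0) -> str:
    """Render a structure as indented, whitespace-relaxed Bencode (sketch)."""
    pad = "  " * depth
    if isinstance(value, int) and not isinstance(value, bool):
        return f"{pad}i{value}e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        # latin-1 maps each byte to one character, so the length stays correct
        return f"{pad}{len(value)}:{value.decode('latin-1')}"
    if isinstance(value, list):
        body = [pretty(item, depth + 1) for item in value]
        return "\n".join([pad + "l", *body, pad + "e"])
    if isinstance(value, dict):
        lines = [pad + "d"]
        for key in sorted(value):                   # assumes all keys are str
            lines.append(pretty(key, depth + 1))
            lines.append(pretty(value[key], depth + 2))
        lines.append(pad + "e")
        return "\n".join(lines)
    raise TypeError("cannot render %r" % type(value))


text = pretty({"temp": 215, "tags": ["a"]})
print(text)
```

For this example, stripping the whitespace from each line and joining the lines gives back the strict form `d4:tagsl1:ae4:tempi215ee`. That shortcut only holds when the strings themselves contain no whitespace or newlines. A real parser for the relaxed format would skip whitespace between elements and leave the contents of strings alone.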

  1. Have you looked at MQTT as an embedded alternative to ZeroMQ? It’s a lightweight messaging protocol for pubsub, one of the patterns supported by ZeroMQ.

    There are client and server implementations for most languages (C, Python, Lua, Java, JavaScript, etc). There’s code for Arduino + ethernet shield – and it’s accepted by Pachube/Cosm.

    Like ZeroMQ, the message format isn’t specified. So, bencode would make a good choice here too.

    • Yes, I have (to a certain extent). Very comparable. I expect that RPC will also be useful for me, not just pubsub. The generality of zeromq appeals to me. Then again, severely limited clients could wing it and implement only a subset, such as pubsub.

      Either option can probably be used for pubsub underneath bencode without affecting the rest of the system.

      This is what I mean by “there is no center in the software universe” … no matter where you go, someone else will tackle the same issues with a different set of choices.

  2. On first viewing I like the look of 0MQ and I don’t like the look of Bencode. That’s just my 2c worth.

    • Bencode is not really visible, you’ll have native data structures in the programming language you use.

  3. I really like the idea of having a standard interface to the “jee world” that would allow us the freedom to use whatever HA product takes our fancy. Let’s face it, everyone here is interested in the jee world, but if you took a survey, we would have a million different preferences for what we do with the data. As you point out JCW, there are plenty of great HA software packages out there already.


  4. ZMQ, oh yes, it caught my eye some time ago. Now I only hope that the world will get a Tcl – ZMQ binding as a side effect and, again, it will become a better place. Just like when Tclkit was created.

    P.S. This is not an invitation to a “language war”. It is meant to be humorous… but still true :)

  5. You have chosen a very nice and simple de facto data encoding protocol. I hope though that you will be using the extended bencode versions that also allow bool and float to be part of the data… I’m using them a lot for temperature and humidity for instance!

    If I look at the major changes I made for connecting a JeeNode to Homeseer, they are about protocol behaviour, i.e. a jn announces itself on the network to others, I can define the node id from within Homeseer, self-descriptive sensors, configurable event-based or polling operation, etc.

    None of these things are defined by either bencode or zeromq, but they are needed to use a jn with several HA programs.

    Nonetheless, I’m very happy with both choices!

  6. Can you please explain why the property that any structured data has exactly one representation is worthwhile?

    Floating point, boolean and nil are important data types in my opinion. How do you expect to simulate them?

    What are the reasons not to use Protocol Buffers? Is readability of a raw dump of the encoded data so important?

    Let’s suppose that we would like a simple and convenient way for code to communicate. Why, in that scenario, bother with the socket paradigm, and not just call a function on a remote node (whatever it is) the same way as a local function?

    Why bother with a particular protocol addressing scheme (IPv4, IPv6, … TCP ports), rather than just “configure” the communicating parties and use node identifiers later?

    Even in that scenario, both point-to-point and point-to-multipoint (functions without return values) are possible.

    In that scenario we would suffer from synchronous programming problems, but what if e.g. Proto Threads were incorporated as one of the basic system features?

    And finally, let’s think about the UI. Let me dream a little bit. Suppose we have an HTML + JavaScript UI where, without any other components on the user’s PC, we can interface with and program embedded nodes … What if the whole IDE were in that environment, and we were able to tokenize programs, upload them, and interpret them on resource-limited nodes – something like

    What are your opinions?

    • Single rep means equality of messages (or parts of messages) can be done via hashing / string compares.

      Yes, richer data types are needed at some level, but there is no end to that (dates? money? md5’s?)

      Not having them in Bencode might actually be an advantage – it’s very easy to have type info encoded and either send it along or make it implicit in the receiver. Example: send a typecode integer plus a binary string, to handle floats, or anything else. The idea is that the encoding is self-descriptive enough to support structure (i.e. nesting), but that you don’t go back to extending that low level whenever a new data type or representation comes up.

      There’s a lot more to it than basic data types. In Metakit, I use a range of extremely compact vector types – should I be going back to change Bencode if I were to transmit something like that one day?

      Bencode and protocol buffers are not exclusive choices. If protocol buffers have compelling advantages, or any other existing or future data representation, they can easily be transported as a binary string inside a minimal Bencode structure (just add its length and a colon in front). It’s not so much about readability – it’s also to keep byte-order and machine representations out of this level of the interface layer.

      I don’t have much to add to the rest of your comments at this point. I’m only addressing the lowest-level data interchange so far. Issues such as rpc vs pubsub, client-server topologies, threads, and UI are more about functionality, i.e. code – hence tied to programming language.
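The typecode idea mentioned above could look roughly like this in Python. It is only a sketch under stated assumptions: the typecode value `TYPE_FLOAT64`, the helper names, and the choice of a big-endian IEEE 754 double as the payload are all hypothetical, agreed on between sender and receiver one level above Bencode.

```python
import struct

TYPE_FLOAT64 = 1  # hypothetical typecode, agreed on by sender and receiver


def encode_typed(typecode: int, payload: bytes) -> bytes:
    """Frame a typecode and an opaque payload as the Bencode list [int, string]."""
    return b"li%de%d:%se" % (typecode, len(payload), payload)


def decode_typed(frame: bytes):
    """Parse a frame made by encode_typed back into (typecode, payload)."""
    if not (frame.startswith(b"li") and frame.endswith(b"e")):
        raise ValueError("not a [typecode, payload] frame")
    int_end = frame.index(b"e", 2)                  # end of the integer
    typecode = int(frame[2:int_end])
    colon = frame.index(b":", int_end)              # end of the length prefix
    length = int(frame[int_end + 1:colon])
    payload = frame[colon + 1:colon + 1 + length]
    if colon + 1 + length != len(frame) - 1:
        raise ValueError("payload length does not match frame size")
    return typecode, payload


def encode_float(x: float) -> bytes:
    # assumption: floats travel as 8-byte big-endian IEEE 754 doubles
    return encode_typed(TYPE_FLOAT64, struct.pack(">d", x))


def decode_float(frame: bytes) -> float:
    typecode, payload = decode_typed(frame)
    if typecode != TYPE_FLOAT64:
        raise ValueError("unexpected typecode %d" % typecode)
    return struct.unpack(">d", payload)[0]


frame = encode_float(21.5)
print(frame[:6])            # b'li1e8:'
print(decode_float(frame))  # 21.5
```

The same `encode_typed` framing would carry any other opaque payload, such as a serialized protocol buffer, without touching the Bencode level.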

    • What is an equality test on messages good for?

    • Oh, I don’t know, just thought it may become useful. Duplicate messages? Unchanged configs? Compression?

    • Michal – thanks for the pointer. Superb!

    • See also ;-)

      I am sorry, I still did not get the point of equality checks. Why detect duplicate messages (duplicate packets are resolved at lower layers)? Configuration serialization will probably always be generated with the same library, so a different representation is unlikely to happen. Compression – hmm, that could be the case, but a higher layer would definitely be more efficient for that …

      What is your opinion of the ProtoThreads approach to coding synchronously for asynchronous events?

    • ProtoThreads are neat, but perhaps a bit too tricky for widespread use. I’d prefer an interpreted language which keeps all that hidden beneath the surface, although that’s not always an option for small ultra-low power nodes.

  7. bencode looks like a good choice.

    The only other one that springs to mind is protobuf (protocol buffers) by Google. About the only thing it seems to add over bencode is a schema. And like bencode it seems to have implementations for pretty much every language except Tcl ;-)

    There appear to be a few implementations for small/embedded systems.

    • Protocol Buffers cannot be decoded without a schema (imagine how complicated troubleshooting would be), and it is always a problem to keep schema and code consistent without a complicated framework around them (additional .proto files or additional type/structure definitions in code).

  8. ;-)

    There seem to be many extended versions around. The discussion has been going on for years and years already…

    However, most use ‘b’ for boolean and ‘f’ for floats. Wow, that’s a surprise, isn’t it? And the ‘None’ type is in most cases also defined, to accommodate data that is expected by the structure but not filled in by the sender.

  9. Folks, I knew this topic would lead to discussion…

    All I ask is: please try to take this approach as a starting point and see whether it can be made to accommodate whatever we come up with. In other words: take ZeroMQ and Bencode as a given, and look for ways to fit in whatever road-blocks you see looming. My prediction: you might be surprised by what can be done, and how extensible this is.

    This isn’t simply a “please give me some slack” remark. The reason I’m saying this is that – by definition – we cannot solve future issues today. What I’m after is a set of decisions which is strong enough to keep going, and weak enough to let us solve anything you and I will be throwing at it for years to come. The lack of types in Bencode means we immediately hit the need to map richer types to it – but once we solve that, we end up with a solution which can deal with any future datatype, while leaving Bencode 100% intact as bottom-level protocol. Or, another way to put it: Bencode doesn’t aim to handle all real-world data types, that’s one level up in the design.

    Perhaps a somewhat odd analogy: mime types define two parts, i.e. “text/html”, “image/gif”, etc. What this approach does is to fix the first part – there is only Bencode to get structure across. But that structure could be as rich and fine-grained as needed (from flat embedded protocol buffers, to very detailed type + data descriptions). Bencode is just the messenger! Or the envelope, if you will!

  10. What I forgot to ask is what the intention of these choices is.

    As you put the data into the center of the universe, I assume that you would like to use bencode to communicate between JeeNodes, meaning each sketch would be able to communicate to other nodes and at least the JeeLink ?

    The JeeLink receiver on the PC then would see standard bencode messages, could translate them one level higher, and offer a zeromq interface to the rest of the world, making interfacing with several JeeNode networks a breeze…

    Could you elaborate a bit on these assumptions?

    BTW. As you already stated: bencode could be used by itself to define more datatypes (XPL for instance defines a lot of types like humi, temp, pressure, etc., but that makes it a bit more complex), but if this is to be used by JeeNodes the ‘b’ and ‘f’ extensions would be logical, although NOT a necessity…

    Do as you see fit!!!!!!!

    • No, this isn’t for comms between JeeNodes. Far too many bytes for ultra-low power transmission. This is for use in home automation systems, middleware, message-based systems, etc.: the system a central JeeNode or JeeLink gets attached to.

  11. Aha, good that I asked then.

    Assumption is the mother of all Floats ;-). (or anything else starting with an ‘f’ and ending with an ‘s’…)
