There is a file format which I’ve been using for decades in various projects,
and which has recently been revived in the
Monty project. From now on, I will
call this format “VaryVec”.
A VaryVec is a binary file format which stores N variable-sized byte strings. It can easily be generated and allows random access to the entries and their size. The format is as follows:
- an index, consisting on N+1 fixed-size integers in little-endian binary format …
- … followed by each of the N entries, in consecutive order
Each index entry contains the byte offset of the start of its corresponding data value. Here are some examples, using zero-terminated byte strings as example:
|An empty VaryVec:||<1>||1 byte|
|A single-char string entry:||<2> <4>||
|Two single-character ebtries:||<3> <5> <7>||
|A single short string:||<2> <6>||
|Two strings:||<3> <7> <13>||
|Three empty entries:||<4> <4> <4> <4>||4 bytes|
These examples show two parts for clarity, but on file the data follows right after the index.
This format has a number of useful characteristics:
- arbitrary byte data can be stored in each entry
- random access is trivial, and lengths are easily calculated
- there is one more index entry than the number of byte strings
- it’s very compact (but also a bit tedious to modify in this basic form)
- the number of entries N is
index/W-1(W = 1, 2, or 4 - see below)
- an empty file is not a valid VaryVec
- the total file size is
- … this also makes it easy to detect incomplete files
1, 2, or 4-byte offsets¶
An interesting property of all is that this same approach can be used with 1-byte, 2-byte, and 4-byte index entries, limiting the total file size to 255 / 65,535 / 4,294,967,295, respectively.
The file length is enough to determine which width is in effect, assuming that the minimum width is always used. Applications which create the index in memory can convert it to a smaller entry width before saving, once the total content has been indexed.
When saving potentially large files, a quick scan over the data can be used to decide instead.
The 1-byte index format is extremely compact. It can only represent up to 127 entries (which would then all have to be empty strings if zero-terminated). But for some cases, such as a pre-parsed single line of text, this might actually be fine.
Such a “tiny VaryVec” could be an excellent format for passing key-value style
settings or a typed-in command line, for example. The benefit here is the
pre-parsing this offers: the recipient of such data can then access each of
the values without having to scan text again. I’m considering this for a small
shell interpreter, which would then pass a pointer to such a data structure
instead of the usual
(int argc, char** argv) passed on to command-line apps.
Given that text input code already does a minimal amount of “parsing” to read up to the next end-of-line character, and sometimes does a lot more when supporting line editing, it really is a very small change to add this sort of VaryVec parsing and conversion well.
But why stop there?
With the 2- and 4-byte index formats, this same approach can in fact be used for any text file. Instead of storing text as line-delimited byte streams, why not store them as VaryVecs instead? This essentially comes down to inserting a random-access index in front, and replacing the newline characters with 0-bytes.
Obviously, this is going to break every convention out there, which states that “plain” text is the most generic format, and that it is (therefore) the preferred way to store all textual information.
So much for convention. The benefit of this is a simpler way to handle all sorts of text files, allowing direct indexing of lines and referring to sections in config files by their index, for example. For “low-fat” computing (small µCs), it seems like a direction worth exploring further.