Squashing tons of source code Jan 2016
And then came the internet, open source software, and public code repositories…
Never before has it been possible to access so much software, of all kinds, in any programming language, for any application, and of all qualities & complexities. Even if not of immediate use “as is”, source code is a phenomenal resource - for knowledge, development ideas, engineering tricks, clever hacks - or simply to learn what others have done and how they have done it.
At JeeLabs, there’s an archive of source code - both “unzipped” and “checked out” - which has grown to some 125 GB over the years. Over two and a half million files.
Much of it is unknown, will never be seen, and is perhaps even obsolete.
But that’s not the point. Having access to the code, to find things, to see what’s available and to learn from others where possible - it’s a truly fantastic resource. And it’s definitely not the same as going through everything online: local disk access is faster, you can access it via the editor and other tools you know, and hey… it’s even there when the internet link is having a bad day.
Having a large source code repository (or actually a huge set of them) at arm’s length can be very useful during development. But it’s also easy to create a big mess of it. Which version is which, where did it come from? Before you know it, you can drown in semi-identical copies…
Fortunately, most source code now lives in public repositories (git, svn, mercurial, bazaar, cvs, rcs, etc). Which means you can simply “clone” from the source, and you get as much history as you like with it, as well as README’s, docs, links to the original “repo”, and more. Extremely convenient with Git & GitHub. That’s exactly what’s being saved more and more at JeeLabs.
As mentioned before, this is really a read-only archive - for reference, browsing, searching, sometimes for re-use, and occasionally even as a basis for modification or derived work.
Storing these files as is on a local disk can be a bit inconvenient:
- it eats up space (most files are tiny, with a lot of partial-block waste)
- it’s too easy to accidentally change things
- it’s hard to refer to, if the archive changes regularly
For this reason, the current collection of snapshots over the past 10 years has now been turned into a highly-compressed read-only archive. To be extended once a year, perhaps.
There are many ways to do this (one way is to turn file collections into ISO’s, i.e. CD-ROM images, even if not physically stored that way). Which is in fact exactly how the source code collection has been managed here, until now. But it gets messy, and worst of all: there is a huge amount of duplication, as source gets checked out again, perhaps with some changes.
But there’s a solution, which looks like it could work quite well, called SquashFS. It’s a file system with a number of very useful properties in this context:
- the file system is created once and is then essentially read-only
- each file is compressed, but so is most of the file system meta information
- _duplicate files_ are identified and only included once (“poor man’s de-duplication”)
- no free space is wasted due to “blocking” with some fixed granularity
With lots of small source files and some occasional duplication across them, SquashFS achieves remarkable compression: those 125 GB ended up as a single 32 GB disk file. Note that this is a lossless transformation: the exact same directory tree ends up inside SquashFS.
SquashFS requires Linux (sort of, see below). But that’s no big deal: we can simply copy that sources.sqsh archive file to a Linux setup, which could be a Raspberry Pi or an Odroid board, and it’ll do the work for us. The result can then be shared as a Samba file server volume.
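Such a Samba share could look something like this smb.conf fragment - a minimal read-only sketch, assuming the archive gets mounted at /mnt/sources (the share name “sources” is just a placeholder):

```ini
[sources]
   comment = Read-only source code archive
   path = /mnt/sources
   read only = yes
   guest ok = yes
```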
Here’s how to “mount” that file in Linux on an (existing) directory, called /mnt/sources in this case:
sudo mount /sources.sqsh /mnt/sources -t squashfs -o loop
Or, to make this happen automatically on reboot, we can add this line to /etc/fstab:
/sources.sqsh /mnt/sources squashfs ro,defaults 0 0
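One way to try out that fstab entry without actually rebooting (this needs root, and assumes the mount point exists):

```shell
# create the mount point if needed, then mount everything listed in /etc/fstab
sudo mkdir -p /mnt/sources
sudo mount -a

# confirm the squashfs archive is now mounted
mount | grep squashfs
```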
It’s extremely simple to create such a SquashFS archive:
mksquashfs /path/to/original/sources sources.sqsh
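mksquashfs also accepts options to tune the result - for example (the paths are placeholders, and xz support depends on how squashfs-tools was built):

```shell
# higher compression with xz and 1 MiB blocks, at the cost of more CPU time
mksquashfs /path/to/original/sources sources.sqsh -comp xz -b 1M

# exclude files which don't belong in the archive
mksquashfs /path/to/original/sources sources.sqsh -e .DS_Store
```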
Compression of a large set of files requires a lot of processing power (in this case an 8-core i7 running several minutes full blast). But it’s no big deal, since: 1) it only needs to be done once, and 2) you don’t have to compress on the same machine as where the result will be mounted. There’s even a build of the SquashFS tools for Mac OSX (via brew install squashfs).
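The archive can also be inspected or unpacked again without mounting it, using the companion unsquashfs tool from the same package:

```shell
# list the contents of the archive
unsquashfs -l sources.sqsh

# extract everything into a local "restored/" directory
unsquashfs -d restored sources.sqsh
```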
See the SquashFS HOWTO for further information and examples.