Computing stuff tied to the physical world

Data storage and backups

In Musings on Dec 12, 2012 at 00:01

Having just gone through some reshuffling here, I thought it might be of interest to describe my setup, and how I got there.

Let’s start with some basics – apologies if this all sounds too trivial:

  • backups are not archives: backups are about redundancy, archives are about history
  • I don’t want backups, but the real world keeps proving that things can fail – badly!
  • archives are for old stuff I want to keep around for reference (or out of nostalgia…)

If you don’t set up a proper backup strategy, then you might as well go jump off a cliff.

If you don’t set up archives, fine: some hold onto everything, others prefer to travel light. I used to collect lots of movies and software archives. No more: there’s no end to it, and movies especially take up large amounts of space. Dropping all that gave me my life back.

We do keep all our music, and our entire photo collection (each 100+ GB). Both include digitised collections of everything before today’s bits-and-bytes era. So about 250 GB in all.

Now the deeply humbling part: everything I’ve ever written or coded in my life will easily fit on a USB stick. Let’s be generous and assume it will grow to 10 GB, tops.

What else is there? Oh yes, operating systems, installed apps, that sort of thing. Perhaps 20..50 GB per machine. The JeeLabs Server, with Mac OS X Server, four Linux VMs, and everything else needed to keep a bunch of websites going, clocks in at just over 50 GB.

For the last few years, my main working setup has been a laptop with a 128 GB SSD, and it has been fairly easy to keep disk usage under 100 GB, even including a couple of Linux and Windows VMs. Music and photos were stored on the server.

I’m rambling about this to explain why our entire “digital footprint” (for Liesbeth and me) is substantially under 1 TB. Some people will laugh at this, but hey – that’s where we stand.
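As a rough sanity check on that figure, the numbers above can simply be added up. This is a back-of-the-envelope sketch in Python, not a measurement: it takes the upper end of the 20..50 GB per-machine estimate, and the machine count of three (two personal computers plus the server, as described further down) is an assumption about what to include.

```python
# Back-of-the-envelope estimate of the household's digital footprint.
# All sizes are in GB and come from the rough figures quoted in this post.
MEDIA_GB = 250                # music + photos, "about 250 GB in all"
WRITING_GB = 10               # everything written or coded, generous upper bound
SYSTEM_GB_PER_MACHINE = 50    # upper end of the 20..50 GB estimate
MACHINES = 3                  # assumption: two personal computers plus the server


def footprint_gb(media, writing, system_per_machine, machines):
    """Return the total estimated footprint in GB."""
    return media + writing + system_per_machine * machines


total = footprint_gb(MEDIA_GB, WRITING_GB, SYSTEM_GB_PER_MACHINE, MACHINES)
print(f"Estimated footprint: {total} GB")  # 250 + 10 + 3 * 50
```

Even with generous inputs, the total stays well below 1000 GB, which leaves plenty of headroom for VMs and other odds and ends not counted here.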


Ah, yes, back to the topic of this post. How to manage backups of all this. But before I do, I have to mention that I used to think in terms of “master disks” and “slave disks”, i.e. data which was the real thing, and copies on other disks which existed merely for convenience, off-line / off-site security, or just “attics” with lots of unsorted old stuff.

But that has changed in the past few months.

Now, with an automatic off-site backup strategy in place, there is no longer a need to worry so much about specific disks or computers. Any one of them could break down, and yet it would be no more than the inconvenience of having to get new hardware and restore data – it’d probably take a few days.

The key to this: everything that matters, now exists in at least three places in the world.

I’m running a mostly-Mac operation here, so that evidently influences some of the choices made – but not all, and I’m sure there are equivalent solutions for Windows and Linux.

This is the setup at JeeLabs:

  • one personal computer per person
  • a central server

Sure, there are lots of other older machines around here (about half a dozen, all still working fine, and used for various things). But our digital lives don’t “reside” on those other machines. Three computers, period.

For each, there are two types of backups: system recovery, and vital data.

System recovery is about being able to get back to work quickly when a disk breaks down or some other physical mishap occurs. For that, I use Carbon Copy Cloner, which does full disk tree copying, and is able to create bootable images. These copies include the O/S, all installed apps, everything to get back up to a running machine from scratch, but none of my personal data (unless you consider some of the configuration settings to be personal).

These copies are made once a day, a week, or a month – some of these copies are fully automatic, others require me to hook up a disk and start the process. So it’s not 100% automated, but I know for sure I can get back to a running system which is “reasonably” close to my current one. In a matter of hours.

That’s 3 computers with 2 system copies for each. One of the copies is always off-site.

Vital data is of course just that: the stuff I never want to lose. For this, I now use CrashPlan+, with an unlimited 10-computer paid plan. There are a couple of other similar services, such as BackBlaze and Carbonite. They all do the same: you keep a process running in the background, which pumps changes out over the internet.

In my case, one of the copies goes to the CrashPlan “cloud” itself (in the US), the other goes to a friend who also has fast internet and a CrashPlan setup. We each bought a 2.5″ USB-powered disk with lots of storage, placed our initial backups on them, and then swapped the drives to continue further incremental backups over the net.

The result: within 15 minutes, every change on my disk ends up in two other places on this planet. And because these backups contain history, older versions continue to be available long after each change, and even long after any deletion (I limit the history to 90 days).
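To make the idea of a limited history concrete, here is a minimal Python sketch of a 90-day retention rule for the versions of a single file that still exists. It is not how CrashPlan works internally (I don't know those details); the function name `versions_to_keep` and the rule of always keeping the newest version are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def versions_to_keep(version_times, now, history_days=90):
    """Return the version timestamps that survive pruning, oldest first.

    Versions newer than the cutoff are kept. The most recent version is
    always kept too (an assumption of this sketch), so a file never loses
    its last copy just because it has not changed for a long time.
    """
    if not version_times:
        return []
    cutoff = now - timedelta(days=history_days)
    ordered = sorted(version_times)
    newest = ordered[-1]
    return [t for t in ordered if t >= cutoff or t == newest]


# Hypothetical version history for one file.
now = datetime(2012, 12, 12)
history = [datetime(2012, 6, 1), datetime(2012, 10, 1), datetime(2012, 12, 1)]
kept = versions_to_keep(history, now)
print([t.date().isoformat() for t in kept])
```

With these example dates, the June version falls outside the 90-day window and is dropped, while the October and December versions remain.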

That’s 1 TB of data, always in good shape. Virtually no effort, other than an occasional glance at the menu bar to see that the backup is operating properly. Any failure of 3 or more days for any of these backup streams leads to a warning email in my inbox (which is at an ISP, i.e. off-site). Once a week I get a concise backup status report, again via email.
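The "3 or more days" rule is easy to express in code. The following Python sketch only illustrates the check, assuming the time of the last successful backup is known for each stream; the stream names and timestamps are made up, and this is not CrashPlan's own alerting mechanism.

```python
from datetime import datetime, timedelta

# Threshold taken from the post: warn after 3 or more days without a backup.
MAX_AGE = timedelta(days=3)


def stale_streams(last_success, now, max_age=MAX_AGE):
    """Return the names of backup streams whose last success is too old.

    last_success maps a (hypothetical) stream name to the datetime of its
    most recent completed backup.
    """
    return sorted(
        name for name, finished in last_success.items()
        if now - finished >= max_age
    )


# Hypothetical example data, not real CrashPlan output.
now = datetime(2012, 12, 12, 0, 0)
last_success = {
    "laptop -> cloud": datetime(2012, 12, 11, 23, 50),
    "laptop -> friend": datetime(2012, 12, 8, 9, 0),
}
for name in stale_streams(last_success, now):
    print(f"WARNING: no successful backup for '{name}' in 3+ days")
```

In a real setup the warning would go out by email to an off-site inbox, so that it still arrives when the machines at home are the ones in trouble.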

The JeeLabs server VMs get their own daily backup to Amazon S3, which means I can re-launch them as EC2 instances in the cloud if there is a serious problem with the Mac Mini used as server here. See an older post for details.

Yes, this is all fairly obvious: get your backups right and you get to sleep well at night.

But what has changed is that I no longer use the always-on server as the “stable disk” for my laptop. I used to try putting more and more data on the central server here, since it was always on and available anyway. But that means you need a 1 Gbit wired ethernet connection for really good performance. Trivial stuff, but not so convenient when sitting on the couch in the living room. And frankly also a bit silly, since I’m the only person using those large PDF and code collections I’m relying on more and more these days.

So now, I’ve gone back to the simplest possible setup: one laptop, everything I need on there (several hundred GB in total), and an almost empty server again. On the server, just our music collection (which is of course shared) and the really always-on stuff, i.e. the JeeLabs server VMs. Oh, and the extra hard disk for my friend’s backups…

Using well under 1 TB for an entire household will probably seem ridiculous. But I’m really happy to have a (sort of) NAS-less, and definitely RAID-less, setup here.

Now I just need to sort out all the old disks….

  1. The carbonite link also points to the BackBlaze site at the moment.

    fixed – thank you

  2. My backup solution consists of two 3 TB NAS drives. I have one at home and one at my mom’s place.

    At home, we also have our data stored locally on the laptops / desktops and run an rsync command to create versioned backups on the nas once every few hours. The same thing happens automatically at my mom’s place.

    Those two NAS drives also use rsync through ssh to sync the backups from one house to the other every night. So basically we have a baseline of less than 3 TB for four people and 7 computers in total. (This runs for OSX, Windows and Linux boxes btw).

    We are also not cutting back on what we store anywhere other than the “crap in, crap out” principle, but we all do digitize our paper administration. So your baseline of less than 500 GB per person on average doesn’t seem too weird to me.

    The backups are also compartmentalized by the way….

  3. I’ve started being much more organised with backups – only took 1 near disaster to convince me! A couple of notes –

    1) I’ve just switched to using Amazon Glacier for long-term archiving. Amazon S3 can now be configured to automatically migrate archives to Glacier after a certain amount of time.

    2) Many of the long-term archiving solutions like CrashPlan use S3 as their backend! For Linux backups, IMO it’s often easier to use S3 directly from rsync-like command-line scripts.

    Cheers Dan

  4. If you’re into this sort of thing, here’s a very interesting article by BackBlaze.

  5. SmartOS and ZFS? Or what happened with bitrot?
