Computing stuff tied to the physical world

Getting the most out of rsync

The rsync utility is an amazing workhorse tool: it synchronises file system trees very efficiently, sending only the data that has changed. This is particularly useful across a network connection, where it can dramatically reduce the bandwidth needed and speed up the whole process, often by several orders of magnitude compared to a full copy.


Perhaps even more impressive is that rsync knows how to send changes between binary files when some parts of the data have moved up or down in the file. In other words, even when you insert or remove some bytes in a file, moving the rest of the information to a different offset in the file, rsync will still know how to avoid sending most unchanged data.

Check these pages on the algorithm rsync uses, if you’re interested. It’s incredibly clever.

The second reason why rsync is such a useful tool is that it has very sensible defaults and a huge set of command-line options to adjust its behaviour.

On the new server at JeeLabs, rsync is used for backup. As previously described, these cron entries create a set of 7 daily backups, as an easy way to recover from all sorts of mishaps:

B = /ssd/backups/xudroid
30 23 * * 1 rsync -ax --delete / $B-1/ && touch $B-1/
30 23 * * 2 rsync -ax --delete / $B-2/ && touch $B-2/
30 23 * * 3 rsync -ax --delete / $B-3/ && touch $B-3/
30 23 * * 4 rsync -ax --delete / $B-4/ && touch $B-4/
30 23 * * 5 rsync -ax --delete / $B-5/ && touch $B-5/
30 23 * * 6 rsync -ax --delete / $B-6/ && touch $B-6/
30 23 * * 7 rsync -ax --delete / $B-7/ && touch $B-7/

Given the current server’s root disk contents of 11 GB, this requires about 77 GB on the backup SSD. It’s a great safety net, but there are still some drawbacks:

  • each backup is a “full backup”, hence the 7x storage need for a full set
  • local backups can easily get damaged (either by accident or maliciously)
  • each backup will copy more files than strictly necessary

To explain that last point: although rsync only copies differences, the above design uses last week’s copy as reference when making a new backup. So when a file changes, it will need to be copied during each of the next 7 backups, to replace that weekday’s previous version.

Which brings us to a feature in rsync called “--link-dest”, which is extremely useful for rotating backup scenarios like these. Here’s what it does:

  • when copying (eh, rsync’ing) from A to B, we can point rsync at a third area, which it will use as an alternate reference for the end result in B – let’s call it B’ for now

  • when rsync decides that file F in A differs from file F in B, it will first check file F in B’ – if present, and if F in B’ is already identical to F in A, then instead of copying F to B, it will create a hard link from F in B’ to F in B

It can be a bit tricky to wrap your mind around this, but the effect is that B’ is treated as a “preferred backup source” for B. If B’ already has the right file, rsync re-links to it instead.

If F in A is different from both F in B and F in B’, then copying will take place as before, including the clever difference-only sending that rsync always uses. Likewise for all edge cases, such as F not being present in B or B’. The “--link-dest” option only matters in the above very specific case: F in A is the same as F in B’, but differs from F in B. Then, we link.
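To make the decision described above concrete, here is a minimal sketch in plain shell. It is not rsync's actual code: `backup_one` is a made-up helper, it handles a single file, and it compares file contents with `cmp`, whereas rsync by default only compares size and modification time.

```shell
#!/bin/sh
# Hypothetical helper, NOT rsync's real implementation: decide how file F
# from source dir A ends up in backup dir B, given reference dir B' (REF).
backup_one() {
    a=$1 ref=$2 b=$3 f=$4
    if [ -f "$b/$f" ] && cmp -s "$a/$f" "$b/$f"; then
        echo "unchanged"            # B already has the right content
    elif [ -f "$ref/$f" ] && cmp -s "$a/$f" "$ref/$f"; then
        ln -f "$ref/$f" "$b/$f"     # B' has it: hard-link instead of copying
        echo "linked"
    else
        rm -f "$b/$f"               # drop the stale name first, so a file
        cp -p "$a/$f" "$b/$f"       # shared with another backup stays intact
        echo "copied"
    fi
}

# Demo in a scratch directory: A and B' agree, B holds an older version.
tmp=$(mktemp -d)
mkdir "$tmp/A" "$tmp/REF" "$tmp/B"
echo new > "$tmp/A/F"
echo new > "$tmp/REF/F"
echo old > "$tmp/B/F"
backup_one "$tmp/A" "$tmp/REF" "$tmp/B" F    # prints: linked
```

After the call, `$tmp/B/F` and `$tmp/REF/F` are two names for the same data on disk, which is exactly the saving the option is after.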

So why not simply re-link file F in area B to file F in area A?

After all, that too would appear to have the same effect. Two reasons:

  1. F in A is the original: if it changes, we do not want F in B to change (then it would no longer be a backup, but just an extra link to the “live” original!)

  2. The “--link-dest” approach works across file systems and across different machines. Hard links can’t. This makes link-dest an extremely useful option, as you will see.
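The first reason is easy to try out. This small experiment uses only `ln`, `cp` and `cat` in a scratch directory (the file names are made up for the demo) and shows that a hard link to the live file follows in-place changes, while a real copy does not:

```shell
#!/bin/sh
demo=$(mktemp -d)
echo "version 1" > "$demo/live.txt"

ln "$demo/live.txt" "$demo/linked.txt"     # just an extra name for the same data
cp "$demo/live.txt" "$demo/copied.txt"     # an independent copy

echo "version 2" > "$demo/live.txt"        # overwrite the original in place

cat "$demo/linked.txt"    # the "backup" changed along with the original
cat "$demo/copied.txt"    # the copy still holds the old contents
```

Note that this only shows the in-place case. Programs that save by writing a new file and renaming it over the old one would leave the link untouched, but a backup should not depend on that.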

Here are the modified cron entries to activate the link-dest mechanism:

B = /ssd/backups/xudroid
30 23 * * 1 rsync -ax --delete --link-dest=$B-7/ / $B-1/ && touch $B-1/
30 23 * * 2 rsync -ax --delete --link-dest=$B-1/ / $B-2/ && touch $B-2/
30 23 * * 3 rsync -ax --delete --link-dest=$B-2/ / $B-3/ && touch $B-3/
30 23 * * 4 rsync -ax --delete --link-dest=$B-3/ / $B-4/ && touch $B-4/
30 23 * * 5 rsync -ax --delete --link-dest=$B-4/ / $B-5/ && touch $B-5/
30 23 * * 6 rsync -ax --delete --link-dest=$B-5/ / $B-6/ && touch $B-6/
30 23 * * 7 rsync -ax --delete --link-dest=$B-6/ / $B-7/ && touch $B-7/
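The seven entries differ only in the weekday number and its predecessor, so the same thing could also be done from a single script. What follows is just a sketch of that idea, not what runs at JeeLabs: `prev_slot` is a made-up helper, and the command is printed rather than executed so it is safe to experiment with.

```shell
#!/bin/sh
# Map a weekday number (1..7, as printed by `date +%u`) to the previous slot,
# wrapping around from 1 back to 7.
prev_slot() {
    echo $(( ($1 + 5) % 7 + 1 ))
}

B=/ssd/backups/xudroid           # same base path as in the cron entries
today=$(date +%u)                # 1 = Monday ... 7 = Sunday
yesterday=$(prev_slot "$today")

# Print the command instead of running it.
echo "rsync -ax --delete --link-dest=$B-$yesterday/ / $B-$today/ && touch $B-$today/"
```

A script avoids the repetition, but the seven explicit lines have the advantage that the whole schedule is visible at a glance in the crontab.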

We simply add one extra command-line option, in which the backup from the previous day is used as B’ reference. The effects are quite dramatic:

  • changed files get copied over, as before
  • files which have not changed since the previous day become a hard link
  • as a result, total disk usage for these 7 backups is nearly the same as 1 backup
  • note also that if all files change every day, then this won’t save any disk space
  • less copying: a change gets copied over once, then all future backups link to it
  • directories can’t be hard-linked, they are still copied – this only affects files inside
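One way to see how much sharing actually takes place is to count the files with more than one hard link. The helper below is a small sketch (`count_linked` is a made-up name), followed by a demo on two fake "daily" directories rather than on real backups:

```shell
#!/bin/sh
# Count regular files below a directory that have more than one hard link,
# i.e. files whose data is shared with at least one other name.
count_linked() {
    find "$1" -type f -links +1 | wc -l | tr -d ' '
}

# Demo: file "a" is unchanged between the two days, file "b" has changed.
d=$(mktemp -d)
mkdir "$d/day1" "$d/day2"
echo same > "$d/day1/a"
ln "$d/day1/a" "$d/day2/a"       # unchanged: both days share one copy
echo old > "$d/day1/b"
echo new > "$d/day2/b"           # changed: each day has its own copy

count_linked "$d/day2"           # prints: 1
```

For the total, running `du` over all seven backup directories in one invocation counts each hard-linked file only once, so it reports the real disk usage of the whole set.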

In case of the JeeLabs server, disk space for the 7 backups dropped by over 50 GB. That’s not only less data to store, it’s also less data to copy in the first place. At the moment (end of 2015), each daily full backup scan on the server takes 3 to 4 minutes for 11 GB of data.

And what about off-site backups?

Ah, yes. That’s where things really take off with this rsync trick: the link-dest mechanism also works across the network! Again, file changes get sent over as before. But only once – after that, a re-link on the remote backup is all that’s needed to track each changed file.

This is precisely how the backup server at JeeLabs has been set up. Every day, shortly after the main server backup has run, a cron job starts and uses that backup as source for (again) a rotating backup set. The backup server has an extra level of security, in that it can’t be written to from any other machine. Instead, it pulls its backups from the main server.

The result is a second rotating backup set on another system, which can’t be changed from the original server. This is currently also a set of 7 daily backups (with hard links to save on disk space), but it could have been set up for any number of backups and any update rate.

Off-site backups are just as easy. Given that rsync is so efficient in its use of bandwidth, such backups can be maintained over a fairly modest internet connection.

Also good to know: rsync supports ssh-based encrypted sessions and optional compression.
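As a rough sketch of what such a pulled backup could look like, the script below builds the rsync command a backup server might run for a given weekday. The host name `mainserver`, the destination path `/backups/xudroid` and the `pull_cmd` helper are assumptions made up for this example, not the actual JeeLabs setup. The `-e ssh` and `-z` options select the ssh transport and compression, and the link-dest directory refers to the previous day's copy on the backup server itself.

```shell
#!/bin/sh
SRC_HOST=mainserver              # hypothetical name of the main server
SRC=/ssd/backups/xudroid         # rotating backup set on the main server
DST=/backups/xudroid             # hypothetical path on the backup server

# Print the pull command for weekday slot $1 (1..7).
pull_cmd() {
    prev=$(( ($1 + 5) % 7 + 1 ))     # previous slot, wrapping from 1 to 7
    echo "rsync -axz --delete -e ssh --link-dest=$DST-$prev/ $SRC_HOST:$SRC-$1/ $DST-$1/"
}

pull_cmd "$(date +%u)"           # show today's command without running it
```

Because the command runs on the backup server and only reads from the main server, the main server never needs write access to the backups.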

So let’s cherish rsync as one of the unsung heroes in today’s sea of data!