1. Why off-site backup?

Surely you jest? Any online service or OnlineCommunity suffers from the same MurphysLaw as the rest of your life. At the worst possible time, your server will melt into liquid mercury and wash away down the nearest storm drain. While it is tempting to believe that it won't happen to you, an off-site backup will eventually be necessary when the hard drive crashes, the server melts down, or your ISP goes out of business.

Many people's concept of a backup is simply to copy data around the same hard drive, but this only protects you from your own mistakes, say if you corrupt the data store. It won't protect you when the server goes offline. The ideal is to have a full system image in a secure second location that you can bring online on a completely new server or ISP.

2. What to back up?

You need to back up things like:

The database
- RDBMS like PostgreSQL or MySQL
- Flat files
- Blobs (images, attachments, etc.)
- User accounts

Configuration files
- Application configuration
- Web server
- Email server
- Scheduled tasks (e.g. cron)
- Start up scripts (e.g. rc)

The code
- The actual code
- Installed modules (e.g. Perl, Python) (you may just need a list)
- Installed libraries (you may just need a list, and the install files)

Log files
- Web server and email server log files, email inboxes

Users' home directories

Website (HTML, JS, CSS, etc.)

3. Backup constraints

Ideally a backup is

Off-site
Compressed
Encrypted
Secure
Differential (i.e. only back up what has changed since last time, which limits space used)
Versioned
Redundantly stored
Cheap
Accessible
Resilient (i.e. unlikely to go bankrupt)
Reliable

4. Mirror as backup

The simplest strategy is to mirror the entire system on another server or servers. A mirror is simply a 1:1 copy of one system's data onto another. The benefit of this technique is that it is very simple to both create the backup and restore from the offsite backup, as either directly amounts to a simple copy operation. Additionally, you might make your mirrors public so that others can replicate your site. However, mirroring takes up a lot of disk space and bandwidth in order to make full copies of the data. Often, you will be stuck with only the latest copy of the data, so you will be left exposed if data has been corrupted since before the last backup cycle. You are also left exposed if the backup server crashes, leaving you without any defense in the event of the main server crashing.

An alternate strategy to fix the last two problems is to create many, many mirrors of the site, often each being at a different version. The more the data is duplicated, the more resilient it is against destruction. Since the typical mirroring process is spread out over time, restoring after a catastrophic crash is often a huge nightmare if you have to figure out which of the mirrors has the latest version of the file system. If you cannot find all the latest files, you'll also have to figure out if files from two different versions of the file system are compatible.

In general, it's better to start with just a mirror than to have nothing. The typical strategy is to create a compressed archive of the file system nightly or weekly and FTP or scp it to a server off site. After that, it's relatively simple to create a rudimentary versioned backup system by adding a timestamp to the filenames. You can also use rsync, or rdiff-backup, which is built on the rsync algorithm. Of course, keeping full timestamped copies will quickly chew through hard drive space and bandwidth, so it's not ideal either.
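
A minimal sketch of the timestamped archive idea, assuming a tar that supports gzip (-z), as GNU tar and bsdtar do. The function name, the "site-" filename prefix, and the host and paths in the usage comment are placeholders, not part of any existing tool.

```shell
#!/bin/sh
# Sketch only: make_archive and the "site-" prefix are illustrative names.

# make_archive SRC_DIR OUT_DIR
# Writes OUT_DIR/site-YYYYmmdd-HHMMSS.tar.gz and prints its path.
# OUT_DIR should live outside SRC_DIR so the archive does not include itself.
make_archive() {
    src=$1
    out=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$out/site-$stamp.tar.gz"
    mkdir -p "$out" || return 1
    # -C keeps the paths inside the archive relative to the source directory.
    tar -czf "$archive" -C "$src" . || return 1
    printf '%s\n' "$archive"
}

# Example use from a nightly cron job (hypothetical host and paths):
#   archive=$(make_archive /var/www /var/backups) &&
#       scp "$archive" backup@offsite.example.com:/backups/
```

Because the timestamp sorts lexically, a plain directory listing on the off-site server shows the archives in chronological order.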

In addition to mirroring the data of the site, a "hot rollover" system also mirrors the active applications onto the backup server. When disaster strikes, users are redirected to the backup server. Ideally, users continue to use the backup server as normal, perhaps never even noticing that there was a problem. It may take a few days to build a server, install all the active applications, and load it up with backed-up data. With a "hot rollover" system, you have already done that before the disaster, so the down-time is very short. DavidCary is building a "hot rollover" style [distributed wiki], even though it has all the flaws listed above, and more.

5. Backups with VersionHistory

Mirrors are not ideal. First, mirrors are often shallow. You can only go back to very recent data. If you don't notice a data corruption failure for a long time, the automatic mirrored backup will overwrite a good version with a corrupted version. At that point, original data is irretrievable. If you try resolving this by having a cloud of differing versions mirrored across several servers, you face the problem of identifying the latest version in the event of a crisis. The trick of simply copying multiple copies with timestamped filenames on each server eats up storage space and bandwidth, which makes this strategy expensive.

A more rational approach is to use a backup service with built-in versioning, so you will a) always know what is the latest version, b) be able to go back in time to the latest clean version, c) store and transmit only differences between versions, not entire blobs.

The typical version system is based on key frames (i.e. full backups) and differences:

The first backup creates the first key frame; i.e. it backs up the entire site, with no diffs. Whenever a backup is a key frame, record it as the last key frame (e.g. in some global variable) so it is easy to find later. It is only relative to this key frame, F[i], that the diffs F[i+1], F[i+2], ... have any meaning. To restore the system as it was at backup point j, where j > i:

Restore F[i] as if it were the complete filesystem.
Apply F[i+1] through F[j], in order, as patches to the filesystem restored from F[i].

So, to do a restore, do roughly the reverse of creating a backup point:

Download the encrypted, compressed archive that represents the last key frame.
Decrypt and uncompress the archive into some temporary directory.
For each of the backups since the last key frame
1. Download the encrypted, compressed archive that represents the backup.
2. Decrypt and uncompress the archive into some other temporary directory.
3. Patch this other temporary directory over the key frame's directory tree.
  1. Apply file-level patches to particularly large files that constantly change (e.g. server logs, database dumps).
  2. Copy files to their respective positions.
  3. Delete files and directories that no longer exist.
  4. Reload dumped data into applications; e.g. databases will have to reload from a SQL dump file.
Swap the current filesystem for this temporary directory.
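
The patching loop above can be sketched as follows. This assumes the archives have already been downloaded, decrypted, and unpacked into directories, and it assumes a made-up layout for each incremental backup: a "files" subdirectory holding new and changed files, plus an optional "deleted.txt" listing removed paths, one per line. File-level patches and database reloads are left out, and the final swap into place is left to the operator.

```shell
#!/bin/sh
# Sketch only: the files/ + deleted.txt layout is an assumption for illustration.

# apply_increment STAGING INCR_DIR
# Patches one unpacked incremental backup over the staging tree.
apply_increment() {
    ai_staging=$1
    ai_incr=$2
    # Copy new and changed files to their respective positions.
    if [ -d "$ai_incr/files" ]; then
        cp -Rp "$ai_incr/files/." "$ai_staging/" || return 1
    fi
    # Delete files and directories that no longer exist at this backup point.
    # Paths are trusted here; a real script should reject "../" entries.
    if [ -f "$ai_incr/deleted.txt" ]; then
        while IFS= read -r path; do
            [ -n "$path" ] && rm -rf "${ai_staging:?}/$path"
        done < "$ai_incr/deleted.txt"
    fi
    return 0
}

# restore_chain STAGING KEYFRAME_DIR [INCR_DIR...]
# Restores the key frame, then applies each incremental in the order given.
restore_chain() {
    rc_staging=$1
    rc_keyframe=$2
    shift 2
    mkdir -p "$rc_staging" || return 1
    cp -Rp "$rc_keyframe/." "$rc_staging/" || return 1
    for rc_incr in "$@"; do
        apply_increment "$rc_staging" "$rc_incr" || return 1
    done
}
```

Building the result in a staging directory, rather than over the live filesystem, means an interrupted restore can simply be rerun from the start.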

Careful! If the restore fails part way through (e.g. power outage), you want to ensure that the system will be able to boot up and resume or retry. You'll probably want to do this in controlled segments, possibly backing up key system files and directories just before you begin to overwrite them with the restored data.

Once in a long while, it's a good idea to do a full backup again to create a new key frame. While theoretically you can restore the first backup and then apply every single backup since that point as a patch to get any subsequent backup point, that is a bad idea for two reasons. First, it will take a very long time. Second, the backups may become corrupted. Common problems include files that go missing for some reason, and diff algorithms that do not properly understand radical changes to the structure of the file system (e.g. moving files from one directory to another). It's always good to be somewhat redundant.

You may want to delete all the backups prior to the last full key frame, although for the sake of redundancy this is rather unwise. Deleting everything older than the last two or three key frames is better. If you do a key frame once a quarter, delete everything over a year old, as that data is usually useless.
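
As a rough sketch of that retention rule, suppose (purely for illustration) that key frames are stored as full-YYYYMMDD.tar.gz and incremental backups as incr-YYYYMMDD.tar.gz in one directory. The function below only prints what could be deleted; it removes nothing itself.

```shell
#!/bin/sh
# Sketch only: the full-/incr- naming scheme is an assumption, not a standard.

# prunable DIR KEEP
# Prints the names of backups older than the KEEP-th newest key frame.
prunable() {
    dir=$1
    keep=$2
    # The oldest key frame we still want to keep.
    cutoff=$(ls "$dir" | grep '^full-' | LC_ALL=C sort -r | sed -n "${keep}p")
    # Fewer than KEEP key frames exist: nothing is old enough to prune.
    [ -n "$cutoff" ] || return 0
    cutoff_stamp=${cutoff#full-}
    cutoff_stamp=${cutoff_stamp%%.*}
    for f in "$dir"/full-* "$dir"/incr-*; do
        [ -e "$f" ] || continue
        name=${f##*/}
        stamp=${name#*-}
        stamp=${stamp%%.*}
        [ "$stamp" -lt "$cutoff_stamp" ] && printf '%s\n' "$name"
    done
    return 0
}

# Example: review the list first, then delete by hand or pipe it onward.
#   prunable /backups 3
```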

Some systems, such as Subversion, store "reverse differences". When it's time for another incremental update, the full version of the current system is created on the backup server, then the backup server replaces the previous version with a much smaller "reverse difference" file. Normal difference files tell the restore program how to build the next version from some previous version. Reverse difference files tell the restore program how to build that previous version from the following version.

6. Automatically test restoring

A critical part of any resilient backup system is testing whether or not it works, so don't forget to test restoring. An easy way is to create a directory with some random data in it that is backed up along with everything else. To test the restore, rename the directory, restore it from backup, and compare the two versions. Ideally, do this automatically every time you make a backup. This strategy does not necessarily work for databases, however, so you may also want to make a mock database and do the same sort of thing.

The rsync "dry-run" option (-n or --dry-run) is useful for rapidly comparing two directories over slow network connections, since it reports what would be transferred without copying anything.
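
A minimal sketch of the random-data test, using only dd and diff. The function names and the canary file names are made up for this example, and the step that actually restores the directory from backup is whatever your backup tool provides.

```shell
#!/bin/sh
# Sketch only: make_canary and verify_restore are illustrative names.

# make_canary DIR
# Fills DIR with a few small files of random data to be backed up.
make_canary() {
    mkdir -p "$1" || return 1
    for n in 1 2 3; do
        dd if=/dev/urandom of="$1/canary-$n.bin" bs=4096 count=1 2>/dev/null || return 1
    done
}

# verify_restore ORIGINAL RESTORED
# Succeeds only if the two directory trees have identical contents.
verify_restore() {
    if diff -r "$1" "$2" > /dev/null; then
        echo "restore OK"
    else
        echo "restore MISMATCH" >&2
        return 1
    fi
}
```

When the restored copy sits on another machine, something like `rsync -rnci original/ host:restored/` should list the differing files without transferring them (-n is dry-run, -c compares checksums, -i itemizes changes).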

7. Backup and restore on Linux

7.1. Basics: Creating a backup

To create a backup point,

Set a cron job to call the backup script
Pick the file tree
Diff the file tree from the last backup
Archive the changes
Compress them with gzip or bzip2
Encrypt the compressed archive
Copy the encrypted archive to the remote backup site
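
The archive and compress steps above can be sketched as a single function, assuming a tar that accepts -z and -T (GNU tar and bsdtar both do) and a list of changed paths produced by some earlier diff step. The function name, the cron schedule, the GPG recipient, and the remote host in the comments are all placeholders.

```shell
#!/bin/sh
# Sketch only: archive_changes and every path or host below are illustrative.

# archive_changes ROOT LIST OUT
# Packs the files named in LIST (paths relative to ROOT, one per line)
# into the gzip-compressed tar archive OUT.
archive_changes() {
    root=$1
    list=$2
    out=$3
    # An empty list means nothing changed, so no archive is written.
    if [ ! -s "$list" ]; then
        echo "nothing to back up" >&2
        return 0
    fi
    tar -czf "$out" -C "$root" -T "$list"
}

# The remaining steps, as they might appear in a backup script:
#   gpg --encrypt --recipient backup@example.com \
#       --output changes.tar.gz.gpg changes.tar.gz
#   scp changes.tar.gz.gpg backup@offsite.example.com:/backups/
#
# And a crontab entry to run that script at 02:30 every night:
#   30 2 * * * /usr/local/bin/backup.sh
```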

The hardest part of all of this is creating a reasonable diff. The cheapest method is to simply back up everything, whether or not it has changed since the last backup. This is really easy, and more likely to be correct than any diff algorithm. Diffs tend to be fragile because they can be confused. For instance, if you use the second cheapest diff strategy of backing up only those files whose last modification timestamp is later than the timestamp of the last backup, the diff can miss a file if that file's last modification timestamp was forcibly set to something wrong. (If a file was simply touched but has no bitwise difference, a redundant copy is made, but no data is lost.) Further, the last-modified diff will totally miss a file being deleted.
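
Here is a sketch of the last-modified strategy together with one way to catch deletions: keep a sorted list of all files from the previous backup and compare it with the current list. The function names are invented for this example, the stamp file is assumed to be an empty file touched at the end of each backup and given as an absolute path, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch only: function names and the stamp-file convention are assumptions.

# list_files ROOT
# Prints the sorted relative paths of all regular files under ROOT.
list_files() {
    (cd "$1" && find . -type f | sed 's|^\./||' | LC_ALL=C sort)
}

# changed_since ROOT STAMP
# Prints files under ROOT modified more recently than the file STAMP.
# STAMP must be an absolute path because the search runs inside ROOT.
changed_since() {
    (cd "$1" && find . -type f -newer "$2" | sed 's|^\./||' | LC_ALL=C sort)
}

# deleted_since OLD_LIST NEW_LIST
# Prints paths present in OLD_LIST but missing from NEW_LIST.
# Both lists must be sorted, as list_files produces them.
deleted_since() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

A file whose timestamp was forced backwards still slips past changed_since, which is exactly the fragility described above; only comparing contents or checksums would catch it.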

NOTE: A simple filesystem backup will not suffice to safely archive most RDBMS databases unless the database server is shut down first (less than ideal). Use the database's own backup/archive tool, such as pg_dump for PostgreSQL, just prior to the filesystem backup.

Additionally, some files that change frequently are really huge. These files are typically log files or database dumps, which are often very amenable to file-wise diffs from one version to the next. Many people believe that this requires keeping the old version around (like Subversion does), which doubles the storage cost. But keeping the old version around is unnecessary when using a program based on the rsync algorithm. Using file-wise diffs reduces the bandwidth needed to transfer to the off-site backup site and reduces storage once there. If you're tempted to consider doing something special for these files, test to make sure the compressed versions of these files are really still relatively large. Database dumps compress very well, often to 5% of their original size. Alternatively, try an application-specific backup tool which takes into account the structure of the data.

Alternatively, append-only log files can be emptied after a successful backup, which removes the need for diffs. To reconstruct the full log, cat the log backups together in order, if needed. Similar strategies can be employed for time-ordered data sets like wiki changes and blog entries.
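
A small sketch of that approach, with invented function names and a made-up "segment-NNNNNN.log" naming scheme. Note the caveat in the comments: anything the application writes between the copy and the truncation is lost, so a real setup would pause or signal the writer first.

```shell
#!/bin/sh
# Sketch only: rotate_log, rebuild_log and the segment naming are assumptions.

# rotate_log LOG SEGMENT_DIR
# Saves the current contents of LOG as the next numbered segment,
# then empties LOG.
rotate_log() {
    log=$1
    dir=$2
    mkdir -p "$dir" || return 1
    # Next sequence number = how many segments already exist.
    n=$(( $(ls "$dir" | wc -l) ))
    segment=$(printf '%s/segment-%06d.log' "$dir" "$n")
    cp "$log" "$segment" || return 1
    # Empty the live log only after the copy succeeded.
    # Lines appended between the cp and this truncation are lost.
    : > "$log"
}

# rebuild_log SEGMENT_DIR
# Prints the full log by concatenating the segments in order.
rebuild_log() {
    cat "$1"/segment-*.log
}
```

The zero-padded sequence numbers make the shell glob expand in chronological order, so rebuild_log needs no explicit sorting.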

7.2. Backing up a Linux system

http://www.bluehaze.com.au/unix/cdbkup.html -- with backup and cpio
http://www.die.net/doc/linux/man/man8/dump.8.html -- dump utility for ext2/3
tar
cpio
rsync
http://www.nongnu.org/rdiff-backup/
http://www.cis.upenn.edu/~bcpierce/unison/ -- Unison, a two-way file synchronizer for Unix and Windows with a pretty graphical front-end; it uses an rsync-like transfer algorithm.

7.3. S3

Amazon's SimpleStorageService (S3) is very cheap for less than ~50GB of storage. Jeremy Zawodny's [analysis] suggests it is cheaper than building your own backup server. Moreover, it's fun to play with and Amazon is unlikely to go out of business soon. In fact, it's more likely to put the other contenders out of business.

Useful anecdotes

Tools

http://s3sync.net/ -- s3sync.rb -- rsync clone. Loads the entire blob in RAM before PUTting, which is kind of sucky, but otherwise it is very good.
http://js3tream.sourceforge.net -- stdio stream to s3 (java)
https://jets3t.dev.java.net/
http://www.sync2s3.com/
http://www.nongnu.org/duplicity/ -- very nice rsync-based tool with GPG encryption, file bitwise diffs, and directory versioning. Beta. Requires some Python BitBucket thing.
http://s3backup.blogspot.com/
http://www.s3bk.com

7.4. Duplicity + S3

The best available option at the current time (2007) is using [Duplicity] to back up to Amazon's S3.

Duplicity script:

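#!/bin/sh
# S3 credentials, passed to duplicity through the environment.
# Replace the bracketed placeholders with your own values.
export S3KEY=[key]
export S3SECRET=[secret]
# Back up the paths selected by backuplist.txt, taken relative to /, into the bucket.
/usr/local/share/duplicity/duplicity-bin --include-filelist="backuplist.txt" / s3+http://my-bucket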
