The quest for the end-all backup solution
So I've been looking around for the perfect backup solution - for both my home machines and my server farm (including my customers' data) - and I think I've figured out the final destination for the data: Amazon's S3 service. How to get the data there, however, has been the question.
I have a few basic requirements:
- Encrypted: the content must be encrypted - in transit and at rest. However, if it is encrypted before it is transferred, I think encryption during transit would be optional.
- Nondescript: the third party storage provider should have no knowledge of the file contents NOR the file names; the files should be generic, gibberish or otherwise indistinguishable.
- Compressed: to save bandwidth costs and storage space, the content should be compressed. bzip2 seems to be the winner, although it's a little more CPU-intensive. If all else fails, we'll go with gzip.
- Incremental: in addition to compressing the data to save storage space, doing an incremental/differential backup will transfer and store only the files that have changed (and possibly only the changed content inside those files, too)
- Versioned: backups are only as good as their source data. If something goes corrupt, or is altered incorrectly, it will overwrite the "good" copy from the last backup. By versioning the files, you can go back in history at any point.
- Cross-platform: I want to run this on my Linux servers as well as on Windows, from the command line (crontab in Linux, Scheduled Tasks in Windows - or even better, as a service). Note that if anything elegant comes out of this venture, I'll share the results so it can be reused anywhere (including by you folks on OS X)
- Cheap: I'm sure there are some very expensive enterprise solutions, as well as services like rsync.net (more expensive than S3, but it allows the use of rsync, for instance) and many other options, but for this I need something cost-effective.
The programs/scripts/utilities/whathaveyou I've looked at:
- DAR - supports encryption, incremental archiving
- rsync - supports transfer encryption, incremental transfers
- rdiff-backup - supports incremental archiving, versioning
- Subversion - supports incremental archiving, versioning
- backup-manager
- duplicity - seems to solve nearly everything (?)
Services:
... edited later ...
I've been messing around with this for too long now, and it's time to look at the two best options.
So far the best choice seems to be Duplicity (a rough sketch of how I'd drive it from cron follows this list):
- Encrypted: yes, all archives are encrypted (and signed) using GPG prior to transfer.
- Nondescript: yes, it packages up the files prior to sending them to the third party:
- The remote location will not be able to infer much about the backups other than their size and when they are uploaded.
- Compressed: yes, by default gzip level 6; easy enough to change it to 9 though.
- Differential: yes, on the file and intra-file level - uses the rsync algorithm.
- Incremental: yes, even makes restoring easy too:
- Restoring traditional incremental backups can be painful if one has to apply each incremental backup by hand. Duplicity automatically applies the incrementals to the full backup without any extra intervention.
- Versioned: it appears so.
- Cross-platform: not quite; it requires a POSIX-like system, which on Windows means running inside a full Cygwin environment. If this meets my needs, I'll look into making it a standalone executable (using py2exe, perhaps?) bundled with the Cygwin DLLs; otherwise I may go to Rent-A-Coder and pay someone to port it.
- Cheap: the software is FOSS - so most definitely yes.
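To make the crontab side concrete, here is the kind of nightly job I have in mind - just a thin Python wrapper around the duplicity command line. The source path, bucket name and GPG key ID are placeholders, and I'm going from the duplicity man page for the s3+http:// target and the --encrypt-key/--sign-key options, so treat this as a sketch rather than a working script:

```python
#!/usr/bin/env python
# Sketch of a nightly backup job (placeholder paths, bucket and key ID).
# duplicity's S3 backend expects AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# to be present in the environment (e.g. set by the crontab entry).
import subprocess
import time

SOURCE = "/home/me"                          # what to back up
TARGET = "s3+http://my-backup-bucket/home"   # hypothetical bucket/prefix
GPG_KEY = "DEADBEEF"                         # placeholder GPG key ID

cmd = ["duplicity"]
if time.localtime().tm_wday == 6:
    cmd.append("full")        # force a fresh full backup on Sundays;
                              # otherwise duplicity does an incremental
cmd += ["--encrypt-key", GPG_KEY,   # encrypt the archives to this key
        "--sign-key", GPG_KEY,      # and sign them so tampering shows up
        SOURCE, TARGET]

subprocess.check_call(cmd)
```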
The other option is a somewhat DIY solution (which in hindsight resembles this one); a rough sketch of it follows the steps below:
- First, use DAR to create the archive - it supports differential, incremental, versioning and compression.
- Next, use GPG to encrypt the DAR archive files.
- Finally, send it to S3 - this part is still open. I almost have a very simple, clean, procedural PHP script for S3 interaction; however, it isn't very efficient (from what I can tell, there is currently no way to stream the upload, which means the PHP script reads the entire file content into memory - so unless the files are small enough, it may not be an option at all). There are other ways to send the files, though, and I'll visit those if needed.
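To compare it fairly against duplicity, here's roughly what the DIY version would look like glued together in Python, using boto for the S3 leg instead of my PHP script (boto appears to stream the file from disk rather than slurping it into memory). The paths, bucket and key ID are made up, and the dar/gpg flags are my best reading of their man pages, so double-check before trusting it:

```python
#!/usr/bin/env python
# DIY sketch: dar -> gpg -> S3. All names, paths and keys are placeholders.
import glob
import os
import subprocess

import boto
from boto.s3.key import Key

ROOT = "/home/me"                      # directory tree to archive
BASENAME = "/var/backups/home"         # dar writes home.1.dar, home.2.dar, ...
REFERENCE = "/var/backups/home-last"   # previous archive, for a differential
GPG_KEY = "DEADBEEF"                   # placeholder GPG key ID
BUCKET = "my-backup-bucket"            # hypothetical S3 bucket

# 1. Create a compressed, sliced, differential archive with dar.
subprocess.check_call([
    "dar", "-c", BASENAME, "-R", ROOT,
    "-A", REFERENCE,    # differential against the previous archive
    "-z9",              # gzip level 9
    "-s", "5M",         # 5 MB slices, like duplicity's volumes
])

# 2. Encrypt each slice with GPG, then 3. push it to S3 with boto.
conn = boto.connect_s3()          # reads the AWS keys from the environment
bucket = conn.get_bucket(BUCKET)
for slice_path in sorted(glob.glob(BASENAME + ".*.dar")):
    encrypted = slice_path + ".gpg"
    subprocess.check_call([
        "gpg", "--batch", "--yes",
        "--recipient", GPG_KEY, "--output", encrypted,
        "--encrypt", slice_path,
    ])
    key = Key(bucket)
    # To really meet the "nondescript" requirement I'd rename these to
    # something meaningless before uploading.
    key.key = os.path.basename(encrypted)
    key.set_contents_from_filename(encrypted)   # streams from disk, not RAM
```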
Another idea I had was to use Subversion as a transparent change-control layer, since it handles versioning properly: do an svn dump, compress and encrypt it, and send that to S3 (roughly the pipeline sketched below). That would require an extra copy of all the files to be stored in change control, though, as well as (it appears) a local SVN server running on the machine.
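If I ever did go down that road, the whole chain is pipe-friendly - something like the following, where the repository path, key ID and output name are all made up:

```python
import subprocess

# Hypothetical chain: svnadmin dump | gzip | gpg, so the dump never sits on
# disk unencrypted; the resulting file would then go to S3 like the dar
# slices above.
out = open("/var/backups/svn-dump.gz.gpg", "wb")
dump = subprocess.Popen(["svnadmin", "dump", "/var/svn/backup-repo"],
                        stdout=subprocess.PIPE)
gz = subprocess.Popen(["gzip", "-9"],
                      stdin=dump.stdout, stdout=subprocess.PIPE)
gpg = subprocess.Popen(["gpg", "--batch", "--recipient", "DEADBEEF",
                        "--encrypt"],
                       stdin=gz.stdout, stdout=out)
gpg.wait()
out.close()
```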
UPDATE: Things look good. I think this is what I've been looking for (except for the Windows support) - however, I do have a couple suggestions for the duplicity team in the meantime:
- Add support for bzip2 (can be through GPG; I couldn't figure it out myself)
- Add support for altering the compression level using a command line option (for either compressor)
- Add support for defining the max file size (5MB is the default) - it's easy enough to edit in the source, so it should be easy enough to make it a command-line parameter (again, I'd do this myself, but I'm not a Python guru and don't want to fuss with implementing it incorrectly)
- Make it run properly in Windows (I don't care how!)
The only question left unanswered - and it may be a question with any backup tool - is how it deals with corrupt files. Will a corrupt file be transferred, with the corruption overwriting the previous good version? Can you still access the previous version, unaffected by the corruption? I assume that if a file becomes corrupt its checksum will have changed, so the tool will treat it as a new version and let us revert back to the pre-corruption copy (assuming it's easy to view the history and see when the file became corrupt).
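Assuming duplicity behaves that way, recovering the pre-corruption copy should just be a point-in-time restore of that one file - something like the following, where the file path, bucket and one-day offset are placeholders and the --file-to-restore/--time options are taken from the man page:

```python
import subprocess

# Hypothetical recovery: pull yesterday's copy of a single file out of the
# backup set into a scratch location, leaving the current (possibly corrupt)
# version on disk untouched. All names below are made up.
subprocess.check_call([
    "duplicity", "restore",
    "--file-to-restore", "projects/important.doc",  # path relative to the backup root
    "--time", "1D",                                 # the file as it was one day ago
    "s3+http://my-backup-bucket/home",
    "/tmp/important.doc.recovered",
])
```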