
The quest for the end-all backup solution

December 28th, 2006

So I've been looking around for the perfect backup solution - for both my home machines and my server farm (including my customers' data). I think I've figured out the final destination for the data: Amazon's S3 service. However, how to get the data there has been the question.

I have a few basic requirements:

  • Encrypted: the content must be encrypted - in transit and at rest. However, if it is encrypted before it is transferred, I think encryption during transit would be optional.
  • Nondescript: the third party storage provider should have no knowledge of the file contents NOR the file names; the files should be generic, gibberish or otherwise indistinguishable.
  • Compressed: to save bandwidth costs and storage space, the content should be compressed. bzip2 seems to be the winner, although it's a little more CPU-intensive. If all else fails, we'll go with gzip.
  • Incremental: in addition to compressing the data to save storage space, doing an incremental/differential backup will transfer and store only the files that have changed (and possibly only the changed content inside those files, too)
  • Versioned: backups are only as good as their source data. If a file becomes corrupt, or is altered incorrectly, it will overwrite the "good" copy from the last backup. By versioning the files, you can go back to any point in history.
  • Cross-platform: I want to run this on my Linux servers as well as on Windows, running from the command line (crontab on Linux, Scheduled Tasks on Windows - or even better, as a service; a sample crontab entry follows this list). Note: if anything elegant comes out of this venture, I'd share the results so it could be reused anywhere (including by you folks on OS X)
  • Cheap: I'm sure there are some very expensive enterprise solutions, including using services like rsync.net (more expensive than S3, but allows for using rsync, for instance) and many other options, but for this I need something cost-effective.
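
On the scheduling side, the Linux half is just a cron entry - something like the following (a sketch only; the wrapper script name and the time are placeholders for whatever ends up driving the real tool):

    # /etc/crontab - run the (eventual) backup wrapper every night at 03:15
    # (nightly-backup.sh is a hypothetical wrapper script, not an existing tool)
    15 3 * * *  root  /usr/local/bin/nightly-backup.sh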

The programs/scripts/utilities/what-have-you I've looked at:

  • DAR - supports encryption, incremental archiving
  • rsync - supports transfer encryption, incremental transfers
  • rdiff-backup - supports incremental archiving, versioning
  • Subversion - supports incremental archiving, versioning
  • backup-manager
  • duplicity - seems to solve nearly everything (?)

Services:

... edited later ...

I've been messing around with this too long now, and it's time to look at the two best options.

So far the best choice seems to be Duplicity (an example invocation follows the list):

  • Encrypted: yes, all archives are encrypted (and signed) using GPG prior to transfer.
  • Non-descript: yes, it packages up the files prior to sending to the third party:
    • The remote location will not be able to infer much about the backups other than their size and when they are uploaded.
  • Compressed: yes, by default gzip level 6; easy enough to change it to 9 though.
  • Differential: yes, on the file and intra-file level - uses the rsync algorithm.
  • Incremental: yes, even makes restoring easy too:
    • Restoring traditional incremental backups can be painful if one has to apply each incremental backup by hand. Duplicity automatically applies the incrementals to the full backup without any extra intervention.
  • Versioned: it appears so.
  • Cross-platform: not quite; it requires a POSIX-like system, which means running inside a full Cygwin environment. If this meets my needs, I'll look into trying to make it a standalone executable (using py2exe perhaps?) plus the Cygwin DLLs; otherwise I may go to Rent-A-Coder and pay someone to port it.
  • Cheap: the software is FOSS - so most definitely yes.
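
For reference, a typical invocation looks roughly like this (a sketch only - the GPG key ID, host and paths are placeholders; I've used the scp backend in the example since it's the simplest, so swap in whatever backend ends up talking to S3):

    # Back up /home/mike, encrypting and signing with the given GPG key.
    # Duplicity decides on its own whether this run is full or incremental.
    duplicity --encrypt-key DEADBEEF --sign-key DEADBEEF \
        /home/mike scp://backup@example.com//backups/mike

    # Restore the latest backup into a scratch directory.
    duplicity scp://backup@example.com//backups/mike /tmp/restore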

The other option is a somewhat DIY solution (which in hindsight resembles this one):

  1. First, use DAR to create the archive - it supports differential, incremental, versioning and compression.
  2. Next, use GPG to encrypt the DAR archive files.
  3. Finally, send it to S3 - this is still the open question. I almost have a very simple, clean, procedural PHP script for S3 interaction; however, it is not that efficient. From what I can tell, there is currently no way to stream the upload, which means the PHP script reads the entire file contents into memory - unless the files are small enough, that might not be workable at all. There are some other options for sending the files, though, and I will visit those if needed. (A rough sketch of the first two steps follows.)
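
To make steps 1 and 2 concrete, the pipeline could look something like this (a rough sketch with made-up archive names, slice size and key ID; the exact dar options would need tuning):

    # Step 1: full archive of /home/mike, gzip-compressed, sliced into ~5 MB pieces
    dar -c /backup/home-full -R /home/mike -z -s 5M

    # ...and later, a differential archive against that full one
    dar -c /backup/home-diff1 -R /home/mike -A /backup/home-full -z -s 5M

    # Step 2: encrypt each slice with GPG before it leaves the machine
    for f in /backup/home-full.*.dar; do
        gpg --encrypt --recipient DEADBEEF "$f"
    done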

Another idea I had was to use Subversion as a transparent change-control layer, since it handles versioning properly: do an svn dump, compress and encrypt that, and send it to S3. That would require keeping an extra copy of all the files in change control, though, as well as (it appears) a local SVN repository on the machine.
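
That variant would boil down to something like this one-liner (again just a sketch, with a made-up repository path and symmetric encryption standing in for a proper key):

    # Dump the repository, compress with bzip2, encrypt, and write a single blob to ship to S3
    svnadmin dump /var/svn/backup-repo | bzip2 -9 | gpg -c > repo-dump.bz2.gpg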

UPDATE: Things look good. I think this is what I've been looking for (except for the Windows support) - however, I do have a couple suggestions for the duplicity team in the meantime:

  • Add support for bzip2 (possibly by way of GPG; I couldn't figure it out myself - see the note after this list)
  • Add support for altering the compression level using a command line option (for either compressor)
  • Add support for defining the max file size (5MB is the default) - it's easy enough to edit, so it should be easy enough to make it a command-line parameter (again, I'd do this, but I'm not a Python guru, nor do I want to fuss with implementing it incorrectly)
  • Make it run properly in Windows (I don't care how!)
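
On the first two points, GnuPG itself can be told to handle the compression, which might be the path of least resistance - these are standard gpg.conf options, though whether duplicity plays nicely when they're set is exactly the part I couldn't figure out:

    # ~/.gnupg/gpg.conf - ask GPG to compress with bzip2 at maximum level
    compress-algo bzip2
    bzip2-compress-level 9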

The only question left unanswered - which may be a question with any backup tool - is how it deals with corrupt files. Will a corrupt file be transferred, and the corruption overwrite the previous good version? Can you still access the previous version without the corruption affecting it? I assume that if a file becomes corrupt, its checksum will have changed, so the tool will treat it like a new version and allow us to revert to the pre-corruption copy (assuming it's easy to view the history and see when it became corrupt).

Categories: Software
  1. Sergey
    December 29th, 2006 at 04:04 | #1

    Well, S3 Backup pretty much covers all that. http://s3bk.com/

  2. mike
    December 29th, 2006 at 09:24 | #2

    Actually, I use S3 Backup on my Windows machine right now for some temporary offsite storage, since I've had some failures as of late (not to mention all my desktops are near failure)

    However, it does not have a CLI mode (AFAIK), and Linux support is only "planned", so it is not good for my servers. I also want something that can encrypt the filenames - what good is privacy if it leaves the titles of the documents wide open? Basically, encrypt each object's "key" using what I believe should be a third key (there's the access key and the secret access key, but Amazon knows both of those - so a third one, specific to just your own personal setup, that nobody else knows is what I'm thinking).

    Since it is written in Python, it could be easy to adapt for CLI mode; however, it uses SmartHash (which appears to be Windows-only, with very little support or documentation), which would need to be ported or replaced, and I do not know Python, nor do I wish to tinker with it that much.

    I wonder if I can pay the creator to make some of the modifications I mentioned; it could be just like rsync... however, it still does not take care of my "versioning" requirement, which isn't the most important one, but it is something I think would be wise to have.

    p.s. I just put this up last night... you're fast 🙂

  3. leigh
    June 18th, 2007 at 18:52 | #3

    how does duplicity do versioning? i don't see any way to restore a specific stage of the backup...

  4. mike
    June 19th, 2007 at 21:55 | #4

    From the man page, some combination of these might work:

    --remove-older-than time
    Delete all backup sets older than the given time. Old backup sets will not be deleted if backup sets newer than time depend on them. See the TIME FORMATS section for more information.

    -t time, --restore-time time
    When restoring, specify the time to restore to.

    I don't know whether you can use the include/exclude/regexp options as well. I haven't used it much - only a couple of test runs... A concrete example of restoring a point in time is sketched below.
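
    For instance (untested - the URL and paths are made up), pulling back the state from three days ago should be something along these lines:

        # restore the whole backup as it looked 3 days ago
        duplicity -t 3D scp://backup@example.com//backups/mike /tmp/restore-3days

        # or restore just a single file from that point in time
        duplicity -t 3D --file-to-restore docs/report.txt \
            scp://backup@example.com//backups/mike /tmp/report.txt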

  5. ralph
    August 7th, 2008 at 19:22 | #5

    duplicity and S3 Backup do not do versioning. Super Flexible File Sync is better, but it lacks encryption of file names when on Amazon. Jungle Disk does, but the restore feature is super clunky. I am still looking.

  6. mike
    October 15th, 2008 at 01:11 | #6

    I've been using Mozy and have been happy so far, at least for my Windows machines. I added my parents under my account as well. Better safe than sorry. I've also thought of adding a second service "just in case" ...
