Home > Development > Large file uploads over HTTP - the final solution (I think)

Large file uploads over HTTP - the final solution (I think)

August 26th, 2008 Leave a comment Go to comments

Problem statement: HTTP sucks for file uploads.

You know it. I know it. The problems?

  • No resuming
  • POST multipart/form-data bloats the size of the file due to encoding
  • Slow connections typically time out on large files
  • Any server resets or any other network "burps" on the path from client to server effectively kills the upload
  • People have had moderate success by tuning their webserver and PHP to accept large POSTs, and in general, it works - but not for everyone and it suffers from everything previously noted.

What would the ideal file upload experience support?

  • It should resume after manual pause, a server reset, a timeout, or any other error.
  • It should allow for multiple files being uploaded at once.
  • It should work transparently over HTTP - which means proxies will support it like any normal web request, it can be done over HTTPS (SSL), it will reuse cookies and standard HTTP authentication methods.
  • (Ideally!) the browser would handle this itself without requiring Java, Flash, or any other applets.

With all this in mind, I somehow stumbled across the idea (roughly posted here) based on the time-tested learnings from Usenet and NZB files, and BitTorrent. The main idea? Splitting the file up into manageable segments. There's also some other logic too, but that's the main idea.

Why do I claim this is the final solution?

  • It can reuse the same HTTP/HTTPS connection, so proxies and HTTP authentication can be honored.
  • It doesn't care what speed of your connection - due to the small size of the files, it's easier to get them to the server and each piece can be confirmed one step at a time. No more having to start from the beginning due to a failure or timeout.
  • It will support multiple files at once. The server could (although we might not implement it this way) support multi-threaded uploading of the same file, too, just like BitTorrent or Usenet downloading - upload multiple pieces at the same time and assemble them in the end. We're trying to make a decision whether or not we want to do that right now. The fundamental difference is an implementation detail on the server end.
  • It allows for any client that can split a file up, hash it, encode it and upload it via POST
  • It will still require an applet, since browsers have no support for anything but standard file upload semantics (Although this would be a neat thing to get into a specification)

What's required, how does it work?

As of right now, this is what I have down (it has changed already since the PHP post):

  1. The client contacts the server to begin the transaction. It supplies the following information:
    • Action = "begin"
    • Final filesize
    • Final filename
    • Final file hash (SHA256 or MD5, still haven't determined which one)
    • A list of all the segments - their segment id, byte size, hash (again SHA256 or MD5) - XML or JSON or something
  2. Server sends back "server ready, here's your $transaction_id"
  3. Client starts sending the file, one segment at a time, with the following information:
    • Action = "process"
    • Transaction ID = $transaction_id
    • Segment ID = $segment_id
    • Content = base64 or uuencoded segment (for safe transit)
  4. Server replies back "segment received, transaction id $transaction_id, segment id $segment_id, checksum $checksum"
  5. Client compares the checksum for $segment_id, if it matches, move on to the next segment. If not, retransmit.
  6. When the client is done sending all the segments, client sends message to the server:
    • Action = "finish"
    • Transaction ID = $transaction_id
  7. Server assembles all the segments (if they're separate) and sends to the client:
    • Transaction ID = $transaction_id
    • Checksum = $checksum of the final file
  8. Client compares the checksum to it's own checksum. If it matches, client sends message to server:
    • Action = "complete"
    • Transaction ID = $transaction_id

Viola, done. I think the "protocol" transmits some extra information that isn't needed; so some of this might need to be cleaned up. This is the initial idea though. Props to Newzbin for inventing NZB files which was a big influence in this concept.

I'm somewhat rushing this post out, hopefully it solicits some feedback. I'm going to be revising this and working with a Java developer to work on a client written in Java. Hopefully someday we'll get one with less overhead. I'll post PHP code as I write it too to handle the server portion of it.

Categories: Development