Tag: file storage
2010
12.02

What is it?

A few weeks ago, I went to a hackathon (hackabit) hosted by bit.ly. Originally, I planned to spend my time working on some personal projects: however, once I got to bit.ly’s offices, I decided to embrace the spirit of the event and work on a project related to bit.ly’s API.

After some consultation with my friend Paul Kernfeld, I began working on bit.ly File Storage (available on github). The idea was fairly simple: files get broken up into chunks, the chunks are stored on bit.ly, and the chunks can be reassembled from a single bit.ly URL.

How does it work?

To start with, I determined that bit.ly could store 2048 total characters in a URL. I also realized that bit.ly would store arbitrary data if it was put in the query string of a URL. So, the first thing I decided was to prefix all my content with ftp://?, to minimize the space taken up by unnecessary data (notice that I didn’t need a hostname: not sure why that was considered optional).

I also ended up base64-encoding the chunks of data, since I wasn’t sure how binary data would interact with the bit.ly API and I didn’t want to find out. I chose to take chunks 1024 bytes long: despite the size increase caused by base64-encoding I probably could have taken larger chunks of data, but I wasn’t particularly worried about the efficiency of the implementation.

Finally, once the chunks were encoded (and turned into bit.ly short URLs), I needed a way to turn them into a single URL. I decided on a binary tree structure because it was simple to implement and provided an easy way for me to expand the data on the other end (via in-order traversal).

I grouped the URLs into pairs: the first two were a pair, the second two were a pair, etc. I then asked bit.ly to generate a short URL for a long URL that looked something like http://?base64_encode(first-url|second-url) (the base64-encoding was not strictly necessary). If there were an odd number of URLs, the last URL on the list would simply be added on to the list of newly generated URLs. I repeated that process on each list of newly generated URLs, building each level of the tree until finally there was only one element: the root.

From the root, it was possible to retrieve the data via the following process:

  1. Call the bit.ly API to get the expanded version of the current node/URL.
  2. If the URL starts with ftp://?, we’ve reached a piece of the file (a leaf node). base64 decode it and echo it out.
  3. If the URL starts with http://?, there’s still more to do. base64 decode the content, split on the pipe character, and then visit the left (and then the right) URL (go to step 1).

I packaged the code as two CLI PHP applications, using an existing bit.ly API wrapper. The end result, excluding the API wrapper, contained fewer than 100 lines of code. I submitted it to the contest email address and headed home.

What happened?

This morning, I got an email telling me I had tied for second place in the contest, which blew me away. I went to the bit.ly blog and I read the following (you can read the announcement here):

* bit.ly File Storage (http://bit.ly/hx2kAc) - A hack for storing data with bit.ly.

We were especially impressed by the use of a binary tree for data retrieval with a single bit.ly URL! While this is not necessarily a use of bit.ly that we would encourage, credit is due for cleverness and chutzpah.

Cleverness and chutzpah? I’ll take it :-)

I also found out that there’s an existing project, furl, which does similar things using other URL shorteners: http://code.google.com/p/furl/.