Twitter on S3: S3 objects as lists

S3 as a data store was still on my mind as I rolled out of the fog and into Marin on my bike ride today. Let me take the thought experiment a bit further. Let's assume tweets could be reliably batched into blocks and written to S3 objects. Just to put a number on it, let's say 1,000 tweets are stored in one S3 object.

If every tweet in one object is by a different user, how does the application iterate over one user's tweets? There needs to be an index that points to the location of the user's next tweet. One option would be to store the previous tweet's location (object name + offset) with the current tweet. But as with a linked list, the location of the head of the list would need to be stored somewhere and updated with every tweet. As I mentioned, it would be cost-prohibitive to store this in an S3 object (unless, again, this data could be effectively cached and flushed periodically).
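To make the back-pointer idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: a plain dict stands in for S3, and the names `TweetLog`, `heads`, and the `tweets/NNNNNNNN` object naming are made up for illustration. Each tweet carries the location of the same user's previous tweet, and a separate head index remembers where each user's newest tweet lives.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A location is (object name, offset within that object).
Location = Tuple[str, int]


@dataclass
class Tweet:
    user: str
    text: str
    prev: Optional[Location]  # this user's previous tweet, or None if first


class TweetLog:
    """Batches tweets into fixed-size blocks; a dict stands in for S3."""

    def __init__(self, block_size: int = 1000) -> None:
        self.block_size = block_size
        self.objects: Dict[str, List[Tweet]] = {}  # hypothetical S3 bucket
        self.heads: Dict[str, Location] = {}       # user -> newest tweet
        self._current: List[Tweet] = []            # block not yet written
        self._seq = 0                              # next object number

    def _name(self) -> str:
        return f"tweets/{self._seq:08d}"

    def append(self, user: str, text: str) -> Location:
        loc = (self._name(), len(self._current))
        # The new tweet points back at the user's previous head...
        self._current.append(Tweet(user, text, self.heads.get(user)))
        # ...and the head index must be updated on every single tweet.
        self.heads[user] = loc
        if len(self._current) == self.block_size:
            self.flush()
        return loc

    def flush(self) -> None:
        """Write the pending block out as one immutable object."""
        if self._current:
            self.objects[self._name()] = self._current
            self._current = []
            self._seq += 1
```

Note that `heads` is written on every `append`, which is exactly the piece that would be too expensive to keep in S3 itself without some caching in front of it.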

Also, with such a strategy, the application would need to do a significant amount of caching to achieve reasonable performance. Iterating over a list would require a round trip to S3 for each tweet, and those requests could not be pipelined, because the location of each tweet is only known once the one after it has been read.
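A rough sketch of that traversal shows where the round trips come from. The `CountingStore` class and the tuple layout `(text, previous location)` are assumptions for illustration only; the store is just a dict that counts whole-object fetches in place of real S3 GET requests.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Location = Tuple[str, int]                 # (object name, offset)
Record = Tuple[str, Optional[Location]]    # (text, previous tweet location)


class CountingStore:
    """Hypothetical stand-in for S3 that counts whole-object fetches."""

    def __init__(self, objects: Dict[str, List[Record]]) -> None:
        self.objects = objects
        self.fetches = 0

    def get(self, name: str) -> List[Record]:
        self.fetches += 1
        return self.objects[name]


def iter_tweets(
    store: CountingStore,
    head: Optional[Location],
    cache: Optional[Dict[str, List[Record]]] = None,
) -> Iterator[str]:
    """Walk one user's tweets, newest first, by following back pointers."""
    loc = head
    while loc is not None:
        name, offset = loc
        if cache is not None and name in cache:
            block = cache[name]
        else:
            # The next location is unknown until this fetch returns,
            # so the requests are strictly sequential.
            block = store.get(name)
            if cache is not None:
                cache[name] = block
        text, loc = block[offset]
        yield text
```

Without a cache, every tweet costs one fetch even when two of them sit in the same object. Passing a dict as `cache` collapses that to one fetch per distinct object, which only helps if a user's tweets happen to cluster.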

The other problem with this strategy is that deletes are expensive. Because S3 objects cannot be updated in place, changing the list pointers to bypass the deleted record would require reading the entire object, updating the pointer, and writing the whole object back out to S3. Plus, it would require some sort of locking strategy in case the object was being updated by two users simultaneously. That gets pretty ugly quickly.
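Here is a sketch of what bypassing a deleted record might look like, again with a dict standing in for S3 and with hypothetical names (`unlink`, `heads`). It deliberately leaves out locking, so two concurrent callers touching the same object could overwrite each other's changes, which is the ugly part.

```python
from typing import Dict, List, Optional, Tuple

Location = Tuple[str, int]                 # (object name, offset)
Record = Tuple[str, Optional[Location]]    # (text, previous tweet location)


def unlink(
    objects: Dict[str, List[Record]],
    heads: Dict[str, Location],
    user: str,
    target: Location,
) -> None:
    """Bypass `target` in `user`'s list. No locking: not safe concurrently."""
    t_name, t_offset = target
    _, target_prev = objects[t_name][t_offset]   # one read to learn its prev

    # Deleting the newest tweet only moves the head pointer.
    if heads.get(user) == target:
        if target_prev is None:
            del heads[user]
        else:
            heads[user] = target_prev
        return

    # Otherwise walk from the head to find the record pointing at target.
    loc = heads.get(user)
    while loc is not None:
        name, offset = loc
        block = list(objects[name])              # read the entire object
        text, prev = block[offset]
        if prev == target:
            block[offset] = (text, target_prev)  # update one pointer
            objects[name] = block                # write the entire object
            return
        loc = prev
    raise KeyError(f"{target} is not in {user}'s list")
```

The deleted record itself stays in its object as dead data; the sketch only rewires the pointer around it. Reclaiming the space would mean rewriting that object too.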

But deletes are kind of an oddball operation with Twitter. Once you tweet something, you can delete it from your history, but you've already broadcast it to the world.

Well, that line of thought raises more questions than answers.