Twitter storage on S3
Following Twitter's well-publicized scalability problems, I've been wondering if it would be possible to use Amazon's S3 service as a data store for tweets and EC2 to handle tweet distribution. Assuming a rate of 10k tweets/second, it would be pretty tricky to do cost-effectively, especially given Amazon's new fixed per-transaction cost, which makes storing a high volume of small objects extremely expensive.
S3 presents a couple of problems as a general-purpose structured data store. There is no way to update an object in place or append to it; an object can only be rewritten whole. This limitation points toward a solution with a lot of small objects, for instance one object per tweet. Unfortunately PUT operations are expensive at $0.01 per 1,000 PUTs. With a sustained rate of 10k tweets per second, the transaction cost just to write the data into S3 would be around $360 / hour or $3 million / year.
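To make the arithmetic explicit, here is a small sketch of the calculation. The rate and the PUT price are the figures quoted above; the function name is just for illustration, and storage and bandwidth charges are left out.

```python
# Figures quoted in the post; everything else here is illustrative.
TWEETS_PER_SECOND = 10_000
PRICE_PER_1000_PUTS = 0.01  # USD


def put_cost_per_hour(tweets_per_second, price_per_1000_puts=PRICE_PER_1000_PUTS):
    """Hourly PUT charge, assuming one PUT per tweet."""
    puts_per_hour = tweets_per_second * 3600
    return puts_per_hour / 1000 * price_per_1000_puts


hourly = put_cost_per_hour(TWEETS_PER_SECOND)
yearly = hourly * 24 * 365  # sustained rate, every hour of the year

print(f"${hourly:,.0f} per hour, ${yearly:,.0f} per year")
```

That works out to 36 million PUTs an hour, which is where the roughly $360 / hour comes from.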
So obviously multiple tweets would have to be batched into a larger PUT. This has some merit because S3 does allow random access using HTTP ranges on GETs. But it immediately leads to two new problems. How are the tweets indexed (i.e. how do you find all the tweets by one user), and how is data queried before it is flushed to S3?
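As a rough sketch of the batching idea, the snippet below packs several tweets into one block and keeps a per-user index of byte offsets, so that a single tweet could later be fetched with a ranged GET. The JSON record format, the function names and the index layout are all my own assumptions, and the "GET" is simulated by slicing a local byte string rather than talking to S3.

```python
import json
from collections import defaultdict


def pack_block(tweets):
    """Concatenate tweets into one blob and index them by user.

    The index maps user -> list of (start, end) byte offsets, with
    `end` inclusive to match the HTTP Range convention.
    """
    blob = bytearray()
    index = defaultdict(list)
    for tweet in tweets:
        record = json.dumps(tweet).encode("utf-8")
        start = len(blob)
        blob += record
        index[tweet["user"]].append((start, len(blob) - 1))
    return bytes(blob), dict(index)


def range_header(start, end):
    """Header a client would send to read bytes start..end of an object."""
    return {"Range": f"bytes={start}-{end}"}


def read_range(blob, start, end):
    """Local stand-in for a ranged GET against the stored block."""
    return json.loads(blob[start:end + 1].decode("utf-8"))


tweets = [
    {"user": "alice", "text": "first"},
    {"user": "bob", "text": "hello"},
    {"user": "alice", "text": "second"},
]
blob, index = pack_block(tweets)  # one PUT instead of three
alice = [read_range(blob, s, e) for s, e in index["alice"]]
```

Note that the index itself still has to live somewhere, which is exactly the first of the two problems above.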
Because Twitter doesn't have a high data integrity requirement, unlike say banking transaction servers, it might be possible to manage indexes on EC2 which are periodically flushed to S3, but this very quickly turns into a complicated solution.
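A toy version of that idea, to show where the complexity starts: recent tweets sit in memory, get written out as a block once enough have accumulated, and a query has to look in both places. The class and its names are hypothetical, the "S3" side is just a Python list, and anything still in the buffer is lost if the machine dies before a flush.

```python
class BufferedStore:
    """Hypothetical in-memory buffer in front of a block store."""

    def __init__(self, flush_size):
        self.flush_size = flush_size
        self.pending = []         # tweets not yet written out
        self.flushed_blocks = []  # stand-in for objects stored in S3

    def add(self, user, text):
        self.pending.append((user, text))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        """Write the buffer out as one block (one simulated PUT)."""
        if self.pending:
            self.flushed_blocks.append(list(self.pending))
            self.pending.clear()

    def by_user(self, user):
        """Search the flushed blocks first, then the unflushed buffer."""
        found = [t for block in self.flushed_blocks for u, t in block if u == user]
        found += [t for u, t in self.pending if u == user]
        return found


store = BufferedStore(flush_size=2)
store.add("alice", "one")
store.add("bob", "two")      # buffer is full, so this triggers a flush
store.add("alice", "three")  # still only in memory
```

Even this version ignores multiple EC2 nodes, index persistence and recovery after a crash.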
S3's fixed transaction costs are unfortunate because they change the economics for write-intensive applications and relegate the service to a large static object cache. (Update: I removed a comparison with GFS here.)