Since you are still reading, then I will assume that you too enjoy being an armchair architect. Since my day job is as an architect at eBay, I tell myself that exercises such as this make me better at my job. Heh heh. Let's start with Facebook.
For several months now, I've noticed an interesting phenomenon at Facebook. My news feed would often have big gaps in it. I have about 200 friends on Facebook, and I'd say that around 70% of these friends are active, and probably 20-25% are very active Facebook users. So at any time I could look at my feed, and there would be dozens of posts per hour. However, if I scrolled back around 3-4 hours, I would usually find a gap of say 4-6 hours of no posts. The first time I ever noticed this, it was in the morning. So I thought that this gap must have been normal -- people were asleep. Indeed, most of my friends are in the United States. However, I started noticing this more and more often, and not always in the morning. It could be the middle of the day or late at night, and I would still see the same thing: big gaps. So what was going on?
Well here's where the "I don't know what I'm talking about" becomes important. Facebook has been very happy to talk about their architecture, so that has given me speculation ammo. It is well known that Facebook has probably the biggest memcached installation in the world, with many terabytes of RAM dedicated to caching. Facebook has written about how they have even used memcached as a way to synchronize databases. It sure sounds a lot like memcached has evolved into something of a write-through cache. When you post something to Facebook, the web application that you interact with only sends your post to the cache.
Now obviously reads are coming from cache, that's usually the primary use case for memcached. Now I don't know if the web app can read from either memcached and a data store (either a MySQL DB, or maybe Cassandra?) or if Facebook has gone for transparency here too, and augmented memcached to have read-through cache semantics as well. Here's where I am going to speculate wildly. If you sent all your writes to a cache, would you ever try to read from anything other than the cache? I mean, it would be nice to only be aware of the cache -- both from a code complexity perspective and from a performance perspective as well. It sure seems like this is the route that Facebook has taken. The problem is that not all of your data can fit in cache, even when your cache is multiple terabytes in size. Even if your cache was highly normalized data (which would be an interesting setup, to say the least) a huge site like Facebook is not going to squeeze all of their data into RAM. So if your "system of record" is something that cannot fit all of your data... inevitably some data will be effectively "lost." News feed gaps anyone?
Maybe this would just be another useless musing -- an oddity that I noticed that maybe few other people would notice, along with a harebrained explanation. However, just this week Facebook got a lot of attention for their latest "redesign" of their home application. Now we have the News Feed vs. the Live Feed. The News Feed is supposed to be the most relevant posts, i.e. incomplete by design. Now again, if your app can only access cache, and you can't store all of your data in cache, what do you do? Try to put the most "relevant" data in cache, i.e. pick the best data to keep in there. Hence the new News Feed. The fact that a lot of users have complained about this isn't that big of a deal. When you have a very popular application, any changes you make are going to upset a lot of people. However, you have to wonder if this time they are making a change not because they think it improves their product and will benefit users overall, but if instead it is a consequence of technology decisions. Insert cart before horse reference here...
Facebook has a great (and well deserved) reputation in the technology world. I'm probably nuts for calling them out. A much easier target for criticism is Twitter. I was lucky enough to be part of their beta for lists. Now lists are a great idea, in my opinion. Lots of people have written about this. However, the implementation has been lacking to say the least. Here is a very typical attempt to use this feature, as seen through the eyes of Firebug:
However, this is much more disturbing to me than what's going on at Facebook. I don't get how you can release a feature, even in beta, that is this buggy. Since its release, Twitter has reported a jump in errors. I will speculate and say that this is related to lists. It would not be surprising for a feature having this many errors to spill over and affect other features. If your app server is taking 6-10 seconds to send back (error) responses, then your app server is going to be able to handle a lot less requests overall. So not only is this feature buggy, but maybe it is making the whole site buggier.
Now, I know what we (eBay) would do if this was happening: We'd wire-off the feature, i.e. disable it until we had fixed what was going wrong. Twitter on the other hand...