Sunday, October 25, 2009

Social Technology Fail

This is the kind of posting that needs a disclaimer. I'm going to talk a little about recent changes at Facebook and Twitter, but strictly from a technology perspective. It goes without saying that I have no idea what I'm talking about. I am fortunate enough to be acquainted with several engineers at both companies, and I have a college classmate (and fellow Pageboy) who seems to be a pretty important dude at Facebook, but I have no more knowledge of these companies' technology than anybody else. So just to repeat: I have no idea what I'm talking about. You should really stop reading.

Since you are still reading, I will assume that you too enjoy being an armchair architect. Since my day job is as an architect at eBay, I tell myself that exercises such as this make me better at my job. Heh heh. Let's start with Facebook.

For several months now, I've noticed an interesting phenomenon at Facebook. My news feed would often have big gaps in it. I have about 200 friends on Facebook, and I'd say that around 70% of these friends are active, and probably 20-25% are very active Facebook users. So at any time I could look at my feed, and there would be dozens of posts per hour. However, if I scrolled back around 3-4 hours, I would usually find a gap of say 4-6 hours of no posts. The first time I ever noticed this, it was in the morning. So I thought that this gap must have been normal -- people were asleep. Indeed, most of my friends are in the United States. However, I started noticing this more and more often, and not always in the morning. It could be the middle of the day or late at night, and I would still see the same thing: big gaps. So what was going on?

Well here's where the "I don't know what I'm talking about" becomes important. Facebook has been very happy to talk about their architecture, so that has given me speculation ammo. It is well known that Facebook has probably the biggest memcached installation in the world, with many terabytes of RAM dedicated to caching. Facebook has written about how they have even used memcached as a way to synchronize databases. It sure sounds a lot like memcached has evolved into something of a write-through cache. When you post something to Facebook, the web application that you interact with only sends your post to the cache.

Now obviously reads are coming from cache; that's usually the primary use case for memcached. I don't know if the web app can read from either memcached or a data store (a MySQL DB, or maybe Cassandra?), or if Facebook has gone for transparency here too and augmented memcached to have read-through cache semantics as well. Here's where I am going to speculate wildly. If you sent all your writes to a cache, would you ever try to read from anything other than the cache? I mean, it would be nice to only be aware of the cache -- both from a code complexity perspective and from a performance perspective. It sure seems like this is the route that Facebook has taken. The problem is that not all of your data can fit in cache, even when your cache is multiple terabytes in size. Even if your cache held highly normalized data (which would be an interesting setup, to say the least), a huge site like Facebook is not going to squeeze all of its data into RAM. So if your "system of record" is something that cannot fit all of your data... inevitably some data will be effectively "lost." News feed gaps anyone?
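To make the speculation concrete, here is a toy sketch in Python of a feed that only ever reads from a bounded cache. Everything in it is hypothetical: the class name CacheOnlyFeed, the capacity of 6, the least-recently-used eviction, and one post per hour are assumptions picked for illustration, not anything Facebook has described. Writes go through to a data store, but the feed never reads from it, so whatever the cache evicts simply vanishes from view.

```python
from collections import OrderedDict


class CacheOnlyFeed:
    """Hypothetical feed whose readers only ever see a bounded LRU cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # post_id -> hour posted, least recently used first
        self.data_store = {}        # written through, but never read by the feed

    def post(self, post_id, hour):
        # Write-through: the post lands in the cache and in the data store.
        self.cache[post_id] = hour
        self.data_store[post_id] = hour
        # Evict the least recently used posts once the cache is over capacity.
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

    def read(self, post_id):
        # A read refreshes the post's position, so busy posts survive longer.
        if post_id in self.cache:
            self.cache.move_to_end(post_id)
            return self.cache[post_id]
        return None  # assumption: no fallback to the data store on a miss

    def visible_hours(self):
        return sorted(set(self.cache.values()))

    def missing_hours(self):
        return sorted(set(self.data_store.values()) - set(self.cache.values()))


def simulate():
    feed = CacheOnlyFeed(capacity=6)
    for hour in range(6):       # one post per hour for hours 0-5
        feed.post(post_id=hour, hour=hour)
    feed.read(0)                # the two oldest posts keep getting attention
    feed.read(1)
    for hour in range(6, 10):   # four more hours of posts push out the middle
        feed.post(post_id=hour, hour=hour)
    return feed.visible_hours(), feed.missing_hours()


print(simulate())  # ([0, 1, 6, 7, 8, 9], [2, 3, 4, 5])
```

The old posts that kept getting read and the newest posts survive, while hours 2 through 5 drop out of the cache. That is a gap in the middle of the feed, even though nothing was actually deleted from the data store.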

Maybe this would just be another useless musing -- an oddity that I noticed that maybe few other people would notice, along with a harebrained explanation. However, just this week Facebook got a lot of attention for their latest "redesign" of their home application. Now we have the News Feed vs. the Live Feed. The News Feed is supposed to be the most relevant posts, i.e. incomplete by design. Now again, if your app can only access cache, and you can't store all of your data in cache, what do you do? Try to put the most "relevant" data in cache, i.e. pick the best data to keep in there. Hence the new News Feed. The fact that a lot of users have complained about this isn't that big of a deal. When you have a very popular application, any changes you make are going to upset a lot of people. However, you have to wonder if this time they are making a change not because they think it improves their product and will benefit users overall, but if instead it is a consequence of technology decisions. Insert cart before horse reference here...

Facebook has a great (and well deserved) reputation in the technology world. I'm probably nuts for calling them out. A much easier target for criticism is Twitter. I was lucky enough to be part of their beta for lists. Now lists are a great idea, in my opinion. Lots of people have written about this. However, the implementation has been lacking to say the least. Here is a very typical attempt to use this feature, as seen through the eyes of Firebug:

It took me five attempts to add a user to a list. Like I said, this has been very typical in my experience. I've probably added 100+ users to lists, so I've got the data points to back up my statement. What the hell is going on? Let's look at one of these errors:

Ah, a 503 Service Unavailable response... So it's a temporary problem. In fact look at the response body:

I love the HTML tab in Firebug... So this is the classic fail whale response. However, I'm only getting this on list requests. Well, at the very least I'm only consistently getting this on list requests. If the main Twitter site was giving users the fail whale at an 80% clip... In this case, I can't say exactly what is going on. I could try to make something up (experiments with a non-relational database?).
However, this is much more disturbing to me than what's going on at Facebook. I don't get how you can release a feature, even in beta, that is this buggy. Since its release, Twitter has reported a jump in errors. I will speculate and say that this is related to lists. It would not be surprising for a feature having this many errors to spill over and affect other features. If your app server is taking 6-10 seconds to send back (error) responses, then your app server is going to be able to handle far fewer requests overall. So not only is this feature buggy, but maybe it is making the whole site buggier.
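A back-of-the-envelope calculation shows how slow errors could eat capacity. The sketch below uses the usual workers-divided-by-service-time estimate for a fixed pool of request handlers. Every number is an assumption for illustration: 100 workers, 0.2 seconds for a healthy response, 8 seconds for an error (the middle of the 6-10 seconds above), an 80% failure rate on list requests (roughly the four failures in five attempts I saw), and list requests making up 10% of traffic. None of this comes from Twitter.

```python
def mean_service_time(error_rate, error_seconds, ok_seconds):
    """Average time one request occupies a worker, given a share of slow errors."""
    return error_rate * error_seconds + (1 - error_rate) * ok_seconds


def max_throughput(workers, service_seconds):
    """Upper bound on requests per second for a fixed pool of workers."""
    return workers / service_seconds


# All values below are assumptions, not measurements of Twitter.
WORKERS = 100
OK_SECONDS = 0.2
ERROR_SECONDS = 8.0
LIST_ERROR_RATE = 0.8
LIST_TRAFFIC_SHARE = 0.1


def healthy_throughput():
    return max_throughput(WORKERS, OK_SECONDS)


def degraded_throughput():
    # List requests are slow on average; everything else stays healthy.
    list_seconds = mean_service_time(LIST_ERROR_RATE, ERROR_SECONDS, OK_SECONDS)
    blended = LIST_TRAFFIC_SHARE * list_seconds + (1 - LIST_TRAFFIC_SHARE) * OK_SECONDS
    return max_throughput(WORKERS, blended)


print(round(healthy_throughput()), round(degraded_throughput()))
```

Under these made-up numbers the pool drops from about 500 requests per second to roughly 120, even though only a tenth of the traffic touches lists. The exact figures do not matter. The point is that slow failures in one feature tie up workers that every other feature needs.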
Now, I know what we (eBay) would do if this was happening: We'd wire-off the feature, i.e. disable it until we had fixed what was going wrong. Twitter on the other hand...

Huh? You've got a very buggy feature, so you're going to roll it out to more users? This just boggles my mind. I cannot come up with a rationale for something like this. I guess we can assume that Twitter has the problem figured out -- they just haven't been able to release the fix for whatever reason. Even if that was the case, shouldn't you roll out the fix and make sure that it works and nothing else pops up before increasing usage? Like I said, I just can't figure this one out...


Dave Briccetti said...

I instrumented TalkingPuffin with JMX so I could gather some statistics on how many Twitter requests worked on the first try, after an immediate retry, after an additional 250 ms wait and retry, 1 sec, and so on. When I got it working I tried it on adding 75 users to a list. Twitter worked perfectly! Bad timing for me, since I wanted to test the new code. But the next day things were back to normal and I started to see a more interesting distribution among the buckets. Hopefully Twitter knows what’s wrong with lists and will fix it soon.

Dave Briccetti said...

“However, you have to wonder if this time they are making a change not because they think it improves their product and will benefit users overall, but if instead it is a consequence of technology decisions.”

Remember the Twitter #fixreplies? (Twitter started hiding from us tweets from people we follow that start with @somebody-else.) Twitter announced the change, and later told us they were compelled to do so for technical reasons: they simply couldn’t cast all the data so wide.