
Wednesday, January 04, 2012

Mobile Application Architecture, Part 3

[Note: So it's taken me a long time to get this third part out. I changed jobs in the middle of writing all of this, and there's a lot to learn at the new gig. Also in case you are wondering, this is heavily based on my experiences working on eBay Mobile for Android and Bump for Android. Both apps use an event driven architecture in places, or at least they did while I was working on them. What follows is not the secret sauce from either application, but some good things that were common to both along with some ideas for improving on the patterns.]

This is the third post in a series on the architecture of mobile applications. In the first post we looked at the classical architecture of mobile apps that evolved from web apps. In the second post we looked at event driven architecture and when it should be used. Now we're going to dive a little deeper into event driven architectures and in particular talk about some of the significant design decisions that must be made and the pitfalls that need to be avoided.

So you say you want an event driven architecture? I'll assume you've done your homework and aren't just doing this because it seems like the cool thing to do. I'll assume that it really fits your application. One of the first things that you will realize as you get started is that an event driven architecture does not exempt you from many of the usual concerns. Security is a great example. You must figure out how you will authenticate with your server and how to secure any sensitive information that you send over the wire. Just because you are no longer in the world of HTTP and HTTPS does not mean that you get to send over people's credit card numbers in plain text. It's not any harder to sniff your bits, and chances are that you are just as vulnerable to things like man-in-the-middle attacks. So go read up on things like TCP headers and TLS.
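
Getting TLS on top of a raw socket does not require anything exotic. Here is a minimal Java sketch using the standard JSSE classes; the host name and port are hypothetical placeholders, not a real endpoint.

import java.io.IOException;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SecureConnection {
    // Open the app's long-lived connection over TLS rather than plain TCP.
    // events.example.com and 8443 are made-up placeholders.
    public static SSLSocket connect() throws IOException {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        SSLSocket socket = (SSLSocket) factory.createSocket("events.example.com", 8443);
        socket.startHandshake(); // fail fast if the certificate chain is bad
        return socket;
    }
}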

Once the basics are in place, you can think more about the architecture of your application. Your servers are going to be sending events to your application. So what is going to receive them? Do you have each UI (controller, Activity, etc.) receive them? How about something that is more long-lived? The most important thing to remember is that whatever receives events generally has to be able to deal with any kind of event. You can never assume that just because your controller only cares about some small set of events that those will be the only events that it ever receives. Thus you probably need an event receiver whose lifecycle extends beyond that of any given screen in your application and that can handle any and all events. Now keep in mind that it will often delegate events to other components in your application, so no need to worry about this global receiver being stuffed with knowledge of your app. On Android you could use a Service for this or even a subclass of Application. You could also create some other singleton object that is tied to your application's lifecycle.
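
To sketch the Android flavor of this, here is a hypothetical Application subclass that owns the receiver, so its lifecycle matches the process rather than any one screen. The EventDispatcher it holds is sketched a little further down.

import android.app.Application;

public class MyApplication extends Application {
    // Hypothetical sketch: the dispatcher lives as long as the process does.
    private EventDispatcher dispatcher;

    @Override
    public void onCreate() {
        super.onCreate();
        dispatcher = new EventDispatcher();
        // A background thread (or a Service) would read events off the
        // persistent socket here and hand every one to the dispatcher.
    }

    public EventDispatcher getDispatcher() {
        return dispatcher;
    }
}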

Now you might ask: does this really need to be a singleton? Not necessarily. However I think that in many (most?) circumstances a singleton will simplify things for you. What happens to those events that are not needed by the UI? Such events probably mutate some internal state of your application. If that internal state is transient, then it probably makes sense for there to be only one copy of it. If it is persistent...

Inevitably there will be some events coming in from your server that you do care about at the UI level. These will probably arrive "second-hand", i.e. the events will have been received by the longer-lived object we talked about above. Now you need to deal with how to get these events to the UI. The most accepted way to do this is to use an Observer pattern. You define an interface for an object that cares about a certain event (or potentially a certain set of related events). Then the object registers itself with the central receiver. When a new event comes in from the server, the central receiver examines it and decides which observers to delegate it to.
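
Here is a rough sketch of what that might look like, assuming a hypothetical Event class that carries a type tag plus its (still raw) payload.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// The observer interface: anything that cares about some kind of event.
interface EventObserver {
    void onEvent(Event event);
}

// The central receiver's delegation logic, keyed by event type.
class EventDispatcher {
    private final Map<String, List<EventObserver>> observers =
        new ConcurrentHashMap<String, List<EventObserver>>();

    public void register(String eventType, EventObserver observer) {
        observers.putIfAbsent(eventType, new CopyOnWriteArrayList<EventObserver>());
        observers.get(eventType).add(observer);
    }

    public void unregister(String eventType, EventObserver observer) {
        List<EventObserver> registered = observers.get(eventType);
        if (registered != null) {
            registered.remove(observer);
        }
    }

    // Called by the long-lived receiver for every event from the server.
    public void dispatch(Event event) {
        List<EventObserver> registered = observers.get(event.getType());
        if (registered != null) {
            for (EventObserver observer : registered) {
                observer.onEvent(event);
            }
        }
    }
}

An Activity would register itself (or an anonymous inner class) in onResume and unregister in onPause, so the dispatcher never delegates to a dead screen.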

I must add a few words of caution here. For each event you can easily wind up with a class for the event's data structure, an interface for the observer, and an implementation of the observer. The last of these is often handled by having the Activity/Fragment/controller implement the interface, or by using an anonymous inner class to implement it. In any case, you wind up with multiple types being defined for each event. So if you have many types of events that your server will push to your application, then you can end up with an explosion of code. Android developers should consider using Handlers and Messages as a leaner, more decoupled solution. The big downside is that you may wind up doing manual type checking and casting. This can be avoided by late binding, i.e. keeping the data from the server raw as long as possible and doing the parsing after the dispatch to the observer.
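
Here is a sketch of that leaner approach: the central receiver only needs a Handler to post to, and the raw JSON string is not parsed until it reaches the UI side. The LOGIN_EVENT code is a hypothetical example.

import android.os.Handler;
import android.os.Message;

public class LoginScreenHandler extends Handler {
    static final int LOGIN_EVENT = 1; // hypothetical message code

    @Override
    public void handleMessage(Message msg) {
        switch (msg.what) {
            case LOGIN_EVENT:
                // Late binding: the payload is still the raw string from
                // the server; parse it here, after the dispatch.
                String rawJson = (String) msg.obj;
                // ... parse rawJson and update the UI ...
                break;
        }
    }
}

The central receiver then just does something like handler.obtainMessage(LoginScreenHandler.LOGIN_EVENT, rawJson).sendToTarget(), with no app-specific interface shared between the two.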

Finally I will close this long series of posts with a brief word about server side technologies. This post in particular has focused on how to write your mobile/client applications with an event driven architecture. There is a flip-side to this, and that is how to implement an event driven architecture on your server. Just like in our classical architectures, most servers are dominated by HTTP. That is not only the default interface into them, but it is at the heart of the server's architecture. A connection is a short-lived, usually stateless object. There is generally a one-to-one mapping of HTTP requests to either threads or processes. This is a poor fit for event driven architectures, where a connection is extremely long lived. It only dies when the client app dies or explicitly closes it for some reason.

Fortunately there are numerous server side technologies that do not use the request-per-thread/process model. Certainly the most hyped one of these is Node.js. The Internet is chock full of benchmarks showing how Node's non-blocking model out-scales Apache, Nginx, etc. However there are numerous other event driven app servers in Python, Java, Ruby, etc. Personally I have a bias towards my ex-colleague Jamie Turner's Diesel framework. The point is that there are many choices on the server, and I expect that you will see many more moving forward.

Thursday, December 08, 2011

Mobile Application Architecture, Part 2

This is the second post about mobile architecture. In the first post, I talked about how the evolution from web applications has led to architectures centered around asynchronous access to remote resources using HTTP. I also mentioned how Android applications are able to benefit from cloud-to-device messaging (C2DM) to push new data to apps. The most common use for this is push notifications, but C2DM is not limited to push notifications. The key to pushing data via C2DM is that it uses a persistent socket connection to the C2DM servers. So whenever any new data is sent to those servers, they can send it over this connection to mobile apps. Now you might think that there is something magical, or even worse, proprietary, about using sockets within a mobile application. This is not the case at all. Any native application on iOS or Android can use sockets, not just ones written by Google. Socket support was recently added to Windows Phone 7 as well. For web apps there is the Web Socket API, but it is not yet implemented across platforms.

So what are the benefits of using sockets? Well there is the data-push capability that you see in C2DM. In general you can make a much faster app by using sockets. There is a lot of wasted time and overhead in making HTTP connections to a server. This wasted time and overhead gets multiplied by the number of HTTP calls used in your application. There is also overhead in creating a socket connection to a server, but you only pay it one time. The extra speed and server push-iness allows for very interactive apps. Sounds great, but how does this affect the application architecture? Here's an updated diagram.

Event driven architecture

The big difference here is that instead of requests and corresponding responses, we have events being sent from client to server and back again. Now obviously some of these events may be logically correlated, but our communication mechanism no longer encodes this correlation. The "response" to an event from the app does not come immediately and may not be the next event from the server. Everything is asynchronous. Now in our traditional architecture we also had "asynchronous things", but it was different. Each request/response cycle could be shoved off on its own thread or in its own AsyncTask, for example. This is a lot different.

Let's take a simple example. One of the first things that many applications require a user to do is to login, i.e. send a combination of name and password and get some kind of authenticated materials in response (token). In a traditional architecture you would send a login request to the server, wait for the response, and then either get the token or display an error if there was a problem with the username+password combination. In an event driven architecture, your application fires a login event that contains the username+password. There is no response per se. Instead at some point in the future the server will fire an event indicating a successful login that includes a token, or an error that may contain a reason as well. You don't know when this will happen, and you may receive other events from the server before getting a "login response" event.
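
Sketched in code, the event driven version looks something like this. The Event, Connection, EventObserver, and EventDispatcher types here are all hypothetical scaffolding; the shape is the point: nothing blocks, and the token shows up whenever the server gets around to it.

public class LoginFlow {
    private volatile String token; // set whenever the login_ok event arrives

    public void login(EventDispatcher dispatcher, Connection connection,
            String username, String password) {
        dispatcher.register("login_ok", new EventObserver() {
            public void onEvent(Event e) {
                token = e.getString("token"); // some unknown time later
            }
        });
        dispatcher.register("login_error", new EventObserver() {
            public void onEvent(Event e) {
                // Other, unrelated events may well have arrived in between.
                showError(e.getString("reason"));
            }
        });
        connection.send(new Event("login", username, password)); // fire and forget
    }

    private void showError(String reason) {
        // update the UI
    }
}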

Logging in is just one example. There are numerous other things that are easy to think of in terms of request/response. The whole idea of REST architectures is built around this request/response cycle and HTTP. Doing a search, getting user information, posting a comment, the list goes on and on. Now of course it is possible to recreate the illusion of a request/response cycle -- just fire your request event and block until you get the response event. But clearly that is Doing It Wrong.

So maybe these event driven architectures are not right for all applications. Yes they are sexier, but don't try to put a square peg in a round hole. This kind of architecture is best suited for apps where you are interacting with somebody else. Let's say you are bidding on an auction. Somebody else bids on the same auction and ideally you will see this happen very quickly. Or maybe you sent a message to a friend on some type of social network, and you can't wait to see their response. Social games are also a great example of the types of apps that would benefit from an event driven architecture. Even if the game doesn't have to be real-time, it would probably still benefit from "lower latency" between moves.

Hopefully you have a better idea now of whether your application should use a traditional architecture or an event driven one. Of course there is a lot of room in between. You could easily have an architecture that uses both, picking the right interaction model depending on the use case within your app. Maybe most things use a classical model, but then on certain screens you switch to an event-driven one. Or vice versa. Either way you'll want to read my next post, where I will talk about the nuances of implementing an event driven architecture.

Friday, December 02, 2011

Mobile Application Architecture, Part 1

The boom in mobile applications has been fueled by excellent application frameworks from Apple and Google. Cocoa Touch and Android take some very different approaches on many things, but both enable developers to create applications that end users enjoy. Of course "excellent" can be a relative term, and in this case the main comparison is web application development. For both experienced and novice developers, it takes less effort to be productive developing a native mobile application than a web application (mobile or not.) This is despite the fact that you really have to deal with a lot more complexity in a mobile app (memory, threads, etc.) than you do for web applications.

Despite having significant advantages over web development, mobile applications often have some similar qualities, and this often leads to similar limitations. Most mobile applications are network based. They use the network to get information and allow you to interact with other users. Before mobile applications, this was the realm of web applications. Mobile apps have shown that it is not necessary to use a browser to host an application that is network based. That's good. However the way that mobile apps interact with servers over the network has tended to resemble the way that web applications do this. Here's a picture that shows this.
Traditional mobile app architecture
This diagram shows a traditional mobile application's architecture. In particular it shows an Android app, hence that little red box on the right saying C2DM, which we will talk more about later. Most of this is applicable to iOS as well though. Most apps get data from servers or send data to servers using HTTP, the same protocol used by web browsers. HTTP is wonderfully simple in most ways. Significantly, it is a short-lived, synchronous communication. You make a request and you wait until you get a response. You then use the response to update the state of your application, show things to the user, etc.

Now since HTTP is synchronous and network access is notoriously slow (especially for mobile apps), you must inevitably banish HTTP communication to a different thread than the one where you respond to user interactions. This simple paradigm becomes the basis for how many mobile applications are built. Android provides some nice utilities for handling this scenario, like AsyncTask and Loaders. Folks from the Android team have written numerous blogs and presentations on the best way to set up an application that follows this pattern, precisely because this pattern is so prevalent in mobile applications. It is the natural first step from web application to mobile application, as you can often re-use much of the backend systems that you used for web applications.
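
For illustration, here is the pattern in its simplest AsyncTask form. The URL and the (non-existent) error handling are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import android.os.AsyncTask;

public class FetchTask extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... urls) {
        try {
            // Runs on a background thread, so blocking I/O is ok here.
            HttpURLConnection conn =
                (HttpURLConnection) new URL(urls[0]).openConnection();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            in.close();
            return body.toString();
        } catch (Exception e) {
            return null; // a real app would surface the error to the UI
        }
    }

    @Override
    protected void onPostExecute(String body) {
        // Runs back on the UI thread: update your views with the response here.
    }
}

From the UI thread you just call new FetchTask().execute(someUrl), and the framework handles the thread hop in both directions.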

Before we go any further, take another look at the diagram, in particular that little red C2DM box I promised to talk more about. That is the cloud-to-device messaging system that is part of Android (or at least Google's version of Android; it's not available on the Kindle Fire, for example.) It provides a way for data to get to the application without the usual HTTP. Your server can send a message to C2DM servers, which will then "push" the message to the device and your app. This is done through a persistent TCP socket between the device and the C2DM servers. Now these messages are typically very small, so often it is up to your application to retrieve additional data -- which could very well go back to using HTTP. Background services on Android can make this very simple to do.
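
The receiving side might look roughly like this sketch. The "payload" extra and the FetchService are hypothetical names; the actual extras are whatever your server puts into the C2DM message.

import android.content.BroadcastReceiver;
import android.content.Context;
import android.content.Intent;

public class C2DMReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context context, Intent intent) {
        if ("com.google.android.c2dm.intent.RECEIVE".equals(intent.getAction())) {
            // The pushed message is small; grab it and hand off to a
            // background Service that fetches the full data (maybe via HTTP).
            String payload = intent.getStringExtra("payload"); // hypothetical key
            Intent fetch = new Intent(context, FetchService.class); // hypothetical Service
            fetch.putExtra("payload", payload);
            context.startService(fetch);
        }
    }
}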

Notice that in the previous paragraph we never once used the word "notification." C2DM was seen by many as Android's equivalent to Apple's Push Notifications. It can definitely fulfill the same use cases. Once your service receives the message, it can then choose to raise a notification on the device for the user to respond to. But C2DM is not limited to notifications, not by a long shot. Publishing a notification is just one of many things that your application can do with the message that it receives from C2DM. What is most interesting about C2DM is that it leads us into another type of application architecture, one not built around HTTP. In part 2 of this post we will take a deeper look at event-driven architectures.

Sunday, October 25, 2009

Social Technology Fail

This is the kind of posting that needs a disclaimer. I'm going to talk a little about recent changes at Facebook and Twitter, but strictly from a technology perspective. It goes without saying that I have no idea what I'm talking about. I am fortunate enough to be acquaintances with several engineers at both companies, and I have a college classmate (and fellow Pageboy) who seems to be a pretty important dude at Facebook, but I have no more knowledge of these companies' technology than anybody else. So just to repeat: I have no idea what I'm talking about. You should really stop reading.

Since you are still reading, then I will assume that you too enjoy being an armchair architect. Since my day job is as an architect at eBay, I tell myself that exercises such as this make me better at my job. Heh heh. Let's start with Facebook.

For several months now, I've noticed an interesting phenomenon at Facebook. My news feed would often have big gaps in it. I have about 200 friends on Facebook, and I'd say that around 70% of these friends are active, and probably 20-25% are very active Facebook users. So at any time I could look at my feed, and there would be dozens of posts per hour. However, if I scrolled back around 3-4 hours, I would usually find a gap of say 4-6 hours of no posts. The first time I ever noticed this, it was in the morning. So I thought that this gap must have been normal -- people were asleep. Indeed, most of my friends are in the United States. However, I started noticing this more and more often, and not always in the morning. It could be the middle of the day or late at night, and I would still see the same thing: big gaps. So what was going on?

Well here's where the "I don't know what I'm talking about" becomes important. Facebook has been very happy to talk about their architecture, so that has given me speculation ammo. It is well known that Facebook has probably the biggest memcached installation in the world, with many terabytes of RAM dedicated to caching. Facebook has written about how they have even used memcached as a way to synchronize databases. It sure sounds a lot like memcached has evolved into something of a write-through cache. When you post something to Facebook, the web application that you interact with only sends your post to the cache.

Now obviously reads are coming from cache; that's usually the primary use case for memcached. Now I don't know if the web app can read from either memcached or a data store (either a MySQL DB, or maybe Cassandra?) or if Facebook has gone for transparency here too, and augmented memcached to have read-through cache semantics as well. Here's where I am going to speculate wildly. If you sent all your writes to a cache, would you ever try to read from anything other than the cache? I mean, it would be nice to only be aware of the cache -- both from a code complexity perspective and from a performance perspective as well. It sure seems like this is the route that Facebook has taken. The problem is that not all of your data can fit in cache, even when your cache is multiple terabytes in size. Even if your cache was highly normalized data (which would be an interesting setup, to say the least), a huge site like Facebook is not going to squeeze all of their data into RAM. So if your "system of record" is something that cannot fit all of your data... inevitably some data will be effectively "lost." News feed gaps anyone?
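
To be concrete about the pattern I am speculating on, here is a toy write-through (and read-through) cache. This is just an illustration of the concept, obviously not Facebook's code; the Store interface stands in for whatever database sits behind the cache.

import java.util.concurrent.ConcurrentHashMap;

// The database behind the cache.
interface Store<K, V> {
    void write(K key, V value);
    V read(K key);
}

public class WriteThroughCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<K, V>();
    private final Store<K, V> store;

    public WriteThroughCache(Store<K, V> store) {
        this.store = store;
    }

    public void put(K key, V value) {
        cache.put(key, value);
        store.write(key, value); // write-through: the store stays in sync
    }

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = store.read(key); // read-through: fall back on a miss
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }
}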

Maybe this would just be another useless musing -- an oddity that I noticed that maybe few other people would notice, along with a harebrained explanation. However, just this week Facebook got a lot of attention for their latest "redesign" of their home application. Now we have the News Feed vs. the Live Feed. The News Feed is supposed to be the most relevant posts, i.e. incomplete by design. Now again, if your app can only access cache, and you can't store all of your data in cache, what do you do? Try to put the most "relevant" data in cache, i.e. pick the best data to keep in there. Hence the new News Feed. The fact that a lot of users have complained about this isn't that big of a deal. When you have a very popular application, any changes you make are going to upset a lot of people. However, you have to wonder if this time they are making a change not because they think it improves their product and will benefit users overall, but if instead it is a consequence of technology decisions. Insert cart before horse reference here...

Facebook has a great (and well deserved) reputation in the technology world. I'm probably nuts for calling them out. A much easier target for criticism is Twitter. I was lucky enough to be part of their beta for lists. Now lists are a great idea, in my opinion. Lots of people have written about this. However, the implementation has been lacking to say the least. Here is a very typical attempt to use this feature, as seen through the eyes of Firebug:

It took me five attempts to add a user to a list. Like I said, this has been very typical in my experience. I've probably added 100+ users to lists, so I've got the data points to back up my statement. What the hell is going on? Let's look at one of these errors:

Ah, a 503 Service Unavailable response... So it's a temporary problem. In fact look at the response body:

I love the HTML tab in Firebug... So this is the classic fail whale response. However, I'm only getting this on list requests. Well, at the very least I'm only consistently getting this on list requests. If the main Twitter site was giving users the fail whale at an 80% clip... In this case, I can't say exactly what is going on. I could try to make something up (experiments with a non-relational database?)

However, this is much more disturbing to me than what's going on at Facebook. I don't get how you can release a feature, even in beta, that is this buggy. Since its release, Twitter has reported a jump in errors. I will speculate and say that this is related to lists. It would not be surprising for a feature having this many errors to spill over and affect other features. If your app server is taking 6-10 seconds to send back (error) responses, then your app server is going to be able to handle a lot fewer requests overall. So not only is this feature buggy, but maybe it is making the whole site buggier.

Now, I know what we (eBay) would do if this was happening: We'd wire off the feature, i.e. disable it until we had fixed what was going wrong. Twitter on the other hand...

Huh? You've got a very buggy feature, so you're going to roll it out to more users? This just boggles my mind. I cannot come up with a rationale for something like this. I guess we can assume that Twitter has the problem figured out -- they just haven't been able to release the fix for whatever reason. Even if that was the case, shouldn't you roll out the fix and make sure that it works and nothing else pops up before increasing usage? Like I said, I just can't figure this one out...

Monday, February 09, 2009

Give me the Cache

Web scale development is cache development. Oversimplification? Yes, but it is still true. Figuring out what to cache, how to cache it, and (most importantly) how to manage that cache is one of the more difficult things about web application development on a large scale. Most other things have been solved before, but chances are that your caching needs are quite unique to your application. You are probably not going to be able to use Google to answer this kind of design question. Unless of course your application is extremely similar to the one I am about to describe.

I had an application that performed a lot of reads, in the neighborhood of 1000 per second. The application's data needed to be modified very infrequently, less than 100 times per day. Ah, perfect caching! Indeed. In fact the cache could even be very simple: a list of value objects. Not only that, but it's a very small list. It gets even better: the application has a pretty high tolerance for cache staleness, so it's ok if it takes a few minutes for an update to show up. A few different caching strategies emerged out of this too-good-to-be-true scenario. So it was time to write some code and do some testing on these strategies. First, here is an interface for the cache.

public interface EntityCache {
    List<Entity> getData();
    void setData(Collection<? extends Entity> data);
}

Again, the Entity class is just a value object (Java bean) with a bunch of fields. So first, let's look at the most naive approach to this cache, a class I called BuggyCache.

public class BuggyCache implements EntityCache {
    private List<Entity> data;

    public List<Entity> getData() {
        return data;
    }

    public void setData(Collection<? extends Entity> data) {
        this.data = new ArrayList<Entity>(data);
    }
}

So why did I call this buggy? It's not thread safe. Just imagine what happens if Thread A calls getData() and is in the middle of iterating over the data, and Thread B calls setData(). In this application, I could guarantee that the readers and writers would be from different threads, so it was just a matter of time before the above scenario would happen. Hello race condition. Ok, so here was a much nicer thread safe approach.

public class SafeCache implements EntityCache {
    private List<Entity> data = new ArrayList<Entity>();
    private ReadWriteLock lock = new ReentrantReadWriteLock();

    public List<Entity> getData() {
        lock.readLock().lock();
        List<Entity> copy = new ArrayList<Entity>(data);
        lock.readLock().unlock();
        return copy;
    }

    public void setData(Collection<? extends Entity> data) {
        lock.writeLock().lock();
        this.data = new ArrayList<Entity>(data);
        lock.writeLock().unlock();
    }
}

This makes use of Java 5's ReadWriteLock. In fact, it's a perfect use case for it. Multiple readers can get the read lock, with no contention. However, once a writer gets the write lock, everybody is blocked until the writer is done. Next I wrote a little class to compare the performance of the Safe and Buggy caches.

public class Racer {
    // volatile so that reader threads are guaranteed to see the stop signal
    static volatile boolean GO = true;
    static int NUM_THREADS = 1000;
    static int DATA_SET_SIZE = 20;
    static long THREE_MINUTES = 3*60*1000L;
    static long THIRTY_SECONDS = 30*1000L;
    static long TEN_SECONDS = 10*1000L;
    static long HALF_SECOND = 500L;

    public static void main(String[] args) throws InterruptedException {
        final AtomicInteger updateCount = new AtomicInteger(0);
        final AtomicInteger readCount = new AtomicInteger(0);
        final EntityCache cache = new SafeCache();
        long startTime = System.currentTimeMillis();
        long stopTime = startTime + THREE_MINUTES;
        Thread updater = new Thread(new Runnable(){
            public void run() {
                while (GO){
                    int batchNum = updateCount.getAndIncrement();
                    List<Entity> data = new ArrayList<Entity>(DATA_SET_SIZE);
                    for (int i = 0; i < DATA_SET_SIZE; i++){
                        Entity e = Entity.random();
                        e.setId(batchNum);
                        data.add(e);
                    }
                    cache.setData(data);
                    try {
                        Thread.sleep(THIRTY_SECONDS);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            }
        });
        updater.start();
        Thread.sleep(TEN_SECONDS);

        List<Thread> readers = new ArrayList<Thread>(NUM_THREADS);
        for (int i = 0; i < NUM_THREADS; i++){
            Thread reader = new Thread(new Runnable(){
                public void run() {
                    while (GO){
                        int readNum = readCount.getAndIncrement();
                        List<Entity> data = cache.getData();
                        assert(data.size() == DATA_SET_SIZE);
                        for (Entity e : data){
                            System.out.println(Thread.currentThread().getName() +
                                " Read #" + readNum + " data=" + e.toString());
                        }
                        try {
                            Thread.sleep(HALF_SECOND);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
            });
            readers.add(reader);
            reader.start();
        }

        while (System.currentTimeMillis() < stopTime){
            Thread.sleep(TEN_SECONDS);
        }
        GO = false;
        for (Thread t : readers){
            t.join();
        }
        long duration = System.currentTimeMillis() - startTime;
        updater.join();
        int updates = updateCount.get();
        int reads = readCount.get();
        System.out.println("Duration=" + duration);
        System.out.println("Number of updates=" + updates);
        System.out.println("Number of reads=" + reads);
        System.exit(0);
    }
}

This class creates a writer thread. This thread updates the cache every 30 seconds. It then creates 1000 reader threads. These read the cache, iterate over and print its data, and pause for half a second. This goes on for some period of time (3 minutes above), and the number of reads and writes is recorded.

Testing the BuggyCache vs. the SafeCache, I saw a 3-7% drop in throughput from using the SafeCache. Actually this was somewhat proportional to the size of the data (the DATA_SET_SIZE variable.) If you made it bigger, you saw a bigger hit as the reading/writing took longer and there was more contention.

So this seemed pretty acceptable to me. In this situation, even a 7% performance hit for the sake of correctness was worth it. However, another approach to this problem came to mind. I like to call it a watermark pattern, but I called the cache LeakyCache. Take a look to see why.

public class LeakyCache implements EntityCache {
    private final List<Entity> data = new CopyOnWriteArrayList<Entity>();
    private final AtomicInteger index = new AtomicInteger(0);

    public List<Entity> getData() {
        // a view from the watermark to the end of the list, not a copy
        List<Entity> current = data.subList(index.get(), data.size());
        return current;
    }

    public void setData(Collection<? extends Entity> newData) {
        int oldSize = this.data.size();
        this.data.addAll(newData);
        if (oldSize > 0){
            // the new data begins at oldSize, so that becomes the watermark
            index.set(oldSize);
        }
    }
}

The idea here is to keep one ever-growing list for the cache and an index (or watermark) to know where the current cache starts. Every time you "replace" the cache, you simply add to the end of the list and adjust the watermark. When you read from the cache, you simply copy from the watermark on. I used an AtomicInteger for the index. I probably did not need to, and a primitive int would have been good enough. I used a CopyOnWriteArrayList for the cache's list. You definitely need this. Without it you will wind up with ConcurrentModificationExceptions when you start mutating the cache with one thread while another thread is iterating over it.

So you probably see why this is called LeakyCache. That internal list will grow forever, well at least until it eats all of your heap. So that's bad. It also seems a bit more complicated than the other caches. However, it is thread safe and its performance is fantastic. How good? Even better than the BuggyCache, actually 3x as good. That deserves some qualification. Its throughput was consistently more than 3x the throughput of the other caches, but I didn't run any long tests on it. It would eventually suffer from more frequent garbage collection as it leaks memory. However, if your updating is not too frequent, the entities are relatively small, and you've got lots of memory, then maybe you don't care and can just recycle your app once a month or so?

Maybe you aren't satisfied with that last statement. You're going to force me to fix that memory leak, I knew it. Here is a modified version that does just that.

public class LeakyCache implements EntityCache {
    private final List<Entity> data = new CopyOnWriteArrayList<Entity>();
    private final AtomicInteger index = new AtomicInteger(0);
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public LeakyCache(){
        Thread cleaner = new Thread(new Runnable(){
            public void run() {
                while (true){
                    try {
                        Thread.sleep(60*60*1000L); // wake up once an hour
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    lock.writeLock().lock();
                    if (data.size() > 500000){
                        List<Entity> current = new ArrayList<Entity>(getData());
                        index.set(0);
                        data.clear();
                        data.addAll(current);
                    }
                    lock.writeLock().unlock();
                }
            }
        });
        cleaner.start();
    }

    public List<Entity> getData() {
        lock.readLock().lock();
        List<Entity> current = data.subList(index.get(), data.size());
        lock.readLock().unlock();
        return current;
    }

    public void setData(Collection<? extends Entity> newData) {
        // readers and writers share the read lock; only the cleaner
        // takes the write lock, which blocks everybody else out
        lock.readLock().lock();
        int oldSize = this.data.size();
        this.data.addAll(newData);
        if (oldSize > 0){
            // the new data begins at oldSize, so that becomes the watermark
            index.set(oldSize);
        }
        lock.readLock().unlock();
    }
}

So what does this do? In the constructor we spawn Yet Another Thread. This one periodically (once an hour in the example) checks to see how big the cache is. If it is over some limit, it gets the current data, clears the cache, adds the current data back to it, and resets the watermark to 0. It also stops the world to do this, by once again using a ReentrantReadWriteLock. Notice how I have abused the lock by using the read lock for both getting and setting the cache. Why use it for setting? The cleaner thread gets exclusive access to the write lock. It uses it when it is cleaning up. By having the setData method use the read lock, it will be blocked if the cleaner thread is in the middle of a cleanup.

Adding this extra bit of complexity fixes the memory leak while maintaining thread safety. What about performance? Well, the performance is highly configurable, depending on how often the cleaner thread runs (well, how long it sleeps, really) and how big you are willing to let the cache grow before cleaning it up. I put it on some very aggressive settings, and it caused about a 15% hit relative to the leaky version. The performance is still much better than any of the other versions of the cache.

Next up ... write the cache in Scala using Actors.

Friday, May 02, 2008

As the Bird Turns

Another week, another Twitter outage ... and a new round of technology questions and rumors. TechCrunch now thinks that Twitter is abandoning Rails. This time out, Arrington attempts to be a little more fair-and-balanced than when he wrote about Blaine leaving. He points to other sites that claim to have scaled Rails. This was of particular interest to me, so let's take a look.

Scribd -- Slide 7 claims three databases! Uh oh, is DHH right and I'm going to have to eat crow? Well maybe, but not because of Scribd. They only use a master-slave relationship, but cleverly offload expensive queries to the slaves. Still only one place to write data. When (if) their data gets too big for a single database, they are going to be the ones singing "bring that beat back!"

Friends for Sale -- I could write a lot just about these guys and how ... umm ... interesting their setup is. I'll just quote them: "The most important thing we learned is that your scalability problems is pretty much always, always, always the database" but "on the database side we're still with a monolithic master and we're trying to push off sharding for as long as we can." They still have no problem claiming that "The whole 'but does Rails scale?' discussion sounds like a bunch of masturbation - the point is moot." You can't make up stuff like this!

Wednesday, April 23, 2008

Why Johnny Can't Scale

Today TechCrunch "reported" that Blaine Cook has left Twitter. I had the pleasure of interviewing Blaine and his fellow Twitter developer Alex Payne last year. I was working on a book about Ruby on Rails, so they made the perfect people to talk to. After all, they had written a Ruby on Rails application that was truly pushing the scalability limits of Rails. I was most interested in those limits and how they were attacking them. Let's get back to the present though.

Michael Arrington pins a lot of Twitter's notorious instability squarely on Blaine. He has a point. He points out that Blaine did a presentation at a Rails conference on how Twitter had scaled Rails. If you go out and say "here's how to scale Rails", your app is part of your proof. Facebook can go out and tout the scalability of PHP, MySQL, and memcache. You can dispute their rationale, but their results are hard to argue against. Blaine could have a great rationale behind how they scaled Rails, but the instability of Twitter discredits any argument.

Some of TechCrunch's readers object to this. They point out that Twitter only had three developers. Others say only ignorant non-programmers would blame Rails or Blaine. Maybe they are right. Here goes my explanation.

Take a look at that presentation that Blaine did. In particular, look at slide 3. I quote "180 Rails instances (Mongrel). Growing fast." That is essentially 180 web servers. That is 180 instances of the Twitter application serving requests made by Twitter users. Keep in mind that this was a year ago, so you can only imagine what the numbers are like now. Now take a look at the next line in that slide "1 Database Server". No mention of that growing. I won't claim to know the intricacies of Twitter's operations, but this is an obvious bottleneck.

Since then Blaine & co. did a lot to alleviate the pressure on their bottleneck. They created an innovative messaging system called Starling. They made heavier use of memcache. To my knowledge, they still have that 1 Database Server.

If you are familiar with Rails, then you know that this is a flaw in Rails (a flaw in ActiveRecord, to be more precise.) RoR associates a class to a database connection. You can write code that alters this behavior, but it is very hard to do this efficiently. Now let's be fair. This is true of a lot of frameworks, maybe all of them. Java practitioners like me know that our favorite technology with similar functionality, Hibernate, has the same flaw. Google has kindly produced an extension, Shards, to address this. Certainly the general JavaEE specification does not address this.

So it's not just a Rails problem, and we shouldn't blame Rails, right? Not so fast. Rails is the poster child for rapid, high productivity development frameworks. Remember how TC readers pointed out that Twitter only had three developers? Rails is a big part of that story. In general, Rails is one of many technologies that seek to simplify web development as much as possible. This allows three developers to build such a popular site, but that's also the problem.

It's easier than ever to build a website. Social media makes it easier than ever to gain eyeballs. But it is just as hard to scale your site to handle massive traffic. Actually, maybe it is harder. When you rely heavily on frameworks and out-of-the-box goodness, it is that much harder to rip it out or reject it when you outgrow it. Rails magnifies this problem, because it is not just a technology, it is a philosophy. Rails developers pride themselves on the elegance and terseness of their code. If you are already writing ugly code, it's a lot easier to throw it away and write code that is an order of magnitude uglier but solves your scale problems. When you write beautiful code and don't know of a beautiful way to solve your scale problem, what do you do?

Finally, it is easy to trivialize Twitter's problems. "Oh they just need to use PHP" or whatever. Think about their application a little more before you say that. Let's think about a really naive approach to their application. You have a User who creates Updates. One User has many Updates. For example, I am a Twitter User and today I have made three Updates. I follow 65 other Users. The naive approach (also the one a DBA would tell you to use) is to have a join table to associate who you follow. So what has to happen when I load my home page? First, we need to figure out the 65 people I follow. Next we need to get their Updates. How many of their Updates? Well there are 20 Updates shown, but they are the 20 newest across all of my 65 friends. So do you grab all of the updates from all of my friends and then sort them? That is certainly the most naive approach. In that naive approach, we need one query to get my friends and then N queries to get all of their updates. We can put a sort on those N queries and then do a merge sort, potentially quitting once we have the first 20. Still, each of those N queries could be returning a lot of data. We all follow @Scobleizer, right?
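
Here is a sketch of that merge in Java, assuming a hypothetical Update value object with a getTimestamp() accessor. Each friend's list is already sorted newest-first, and we stop as soon as we have the 20 we need, but note that we still had to run all N queries to get here.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class Timeline {
    // Pairs a friend's newest unconsumed Update with the rest of their list.
    static class Head {
        final Update update;
        final Iterator<Update> rest;
        Head(Update update, Iterator<Update> rest) {
            this.update = update;
            this.rest = rest;
        }
    }

    // Each list in perFriend is one friend's updates, sorted newest first.
    public static List<Update> newest(List<List<Update>> perFriend, int limit) {
        PriorityQueue<Head> heads = new PriorityQueue<Head>(
            Math.max(1, perFriend.size()), new Comparator<Head>() {
                public int compare(Head a, Head b) {
                    long diff = b.update.getTimestamp() - a.update.getTimestamp();
                    return diff < 0 ? -1 : (diff > 0 ? 1 : 0); // newest first
                }
            });
        for (List<Update> updates : perFriend) {
            Iterator<Update> it = updates.iterator();
            if (it.hasNext()) {
                heads.add(new Head(it.next(), it));
            }
        }
        List<Update> result = new ArrayList<Update>(limit);
        while (result.size() < limit && !heads.isEmpty()) {
            Head head = heads.poll();
            result.add(head.update);
            if (head.rest.hasNext()) {
                heads.add(new Head(head.rest.next(), head.rest));
            }
        }
        return result;
    }
}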

So even if we only needed "1 Database Server", things are non-trivial. What about splitting your database? Obviously you want to split the Updates table. Let's think about your home page. Ideally all of the Updates on that page would be in the same database, so all of your friends need to be on the same database server. But what about the other people following your friends? You see where this is going. The Kevin Bacon six degrees of separation problem is going to kick your ass.

The point is that even though Twitter might seem simple to the outsider, it is not. It is complex. Even if they didn't use Rails, it is non-trivial. Rails definitely does not help, at least in some regards, and that's the real story. It may be easier than ever to build something and, perhaps because of social networking, easier to get an audience. But all of the syntactic sugar in the world doesn't make it any easier to scale an application.