Monday, April 28, 2008

Web 3x and Web Lite

Earlier today, I read this interesting article about the growth of web pages. In short, average web page size has tripled in the last five years.
Yes this swell in page size corresponds very nicely with everyone's favorite cliche, Web 2.0. It is obviously not a coincidence. More features and interactivity is going to be mean more initial download, which is the very coarse metric being used here. What may have been four web pages may now be one Ajax-ified one, whose weight is three times as much. More interesting tidbits.

  • Average page weight: 310 KB
  • Average total JS: 68 KB
  • Average # of external JS files: 7
  • Average # of unique external JS files: 6 (gotta love duplicates!)
  • Average total CSS: 15 KB
  • Average total image pixels: 49,144
  • Percentage of HTTP requests dedicated to images: 75%

That last stat is not really as bad as it seems. Images are (can be) loaded in parallel. There is a limit of connections per domain, however (generally two.) A common trick is if you have to load 20 images all from the same server, trick the browser. Make the first two images from images0.mydomain.com, the next two from images1.mydomain.com, etc. Obviously images0 and images1 point to the same IP address, but you get the gist. The browser will load all of the images in parallel (or close to it.) The one disadvantage is you can lose some browser caching. The browser will think that http://images0.mydomain.com/foo.gif and http://images1.mydomain.com/foo.gif are different images.

The other interesting thing talked about in the article, is that broadband speed has more than kept up with page bloat, err Web 2.0. However, not everybody has broadband, and you are basically suffering now more than ever if you do not. These studies are only talking about home users, what about mobile phones? If you are an iPhone user on EDGE, how do you like 310 KB pages?

It seems likely that we will start seeing "lite" versions of websites in the future. We kind of see this for mobile devices, i.e. m.mydomain.com or mobile.mydomain.com. At some point, will we start directing dial-up users to the lite sites? If you were a dial-up user, how would you feel about that? I'm from the South, so it kind of feels like segregation to me: "equal but separate." 

Sunday, April 27, 2008

My JavaOne 2008 Schedule

Here is my schedule. I will probably miss a few of these, that always seems to happen. In particular, I may have to miss the opening day(5/6) because of an important project at work. 

Thursday, April 24, 2008

Diophantine Equation

Earlier today I solved Problem 31 on Project Euler. The problem is to find the number of ways of making change for 2 British pounds (or 200 pence) given 8 types of coins. For a mathematical person like myself, the problem reduces to a Diophantine Equation:

a + 2b + 5c + 10d + 20e + 50f + 100g + 200h = 200

Where 1,2,5,10,20,50,100,200 are the values of the various types of coins. I had some meetings to go to, so I wrote a brute force solution (in Ruby) and let it run. I got back an hour later, and it was still running. I felt quite stupid. I then came up with a much better solution using some dynamic programming and recursion. It ran in 0.23s. Here it is:


def solve(n,coefficients)
x = coefficients.pop
if (coefficients.length == 0)
if (n % x == 0)
return 1
else
return 0
end
else
d = (n/x).to_i
cnt = 0
0.upto(d){|y|
xc = coefficients.collect{|p| p} #must copy since we pop array
cnt += solve(n-x*y,xc)
}
return cnt
end
end

c = [1,2,5,10,20,50,100,200]
n = 200
puts solve(n,c)

Wednesday, April 23, 2008

Why Johnny Can't Scale

Today TechCrunch "reported" that Blaine Cook has left Twitter. I had the pleasure of interviewing Blaine and his fellow Twitter developer Alex Payne last year. I was working on a book about Ruby on Rails, so they made the perfect people to talk to. After all, they had written a Ruby on Rails application that was being truly pushing the scalability limits of Rails. I was most interested in those limits and how they were attacking them. Let's get back to the present though.

Michael Arrington pins a lot of Twitter's notorious instability squarely on Blaine. He has a point. He points out that Blaine did a presentation at a Rails conference on how Twitter had scaled Rails. If you go out and say "here's how to scale Rails", your app is part of your proof. Facebook can go out and tout the scalability of PHP, MySQL, and memcache. You can dispute their rationale, but their results are hard to argue against. Blaine could have a great rationale behind how they scaled Rails, but the instability of Twitter discredits any argument.

Some of TechCrunch's reader object to this. They point out that Twitter only had three developers. Others say only ignorant non-programmers would blame Rails or Blaine. Maybe they are right. Here goes my explanation.

Take a look at that presentation that Blaine did. In particular, look at slide 3. I quote "180 Rails instances (Mongrel). Growing fast." That is essentially 180 web servers. That is 180 instances of the Twitter application serving requests made by Twitter users. Keep in mind that this was a year ago, so you can only imagine what the numbers are like now. Now take a look at the next line in that slide "1 Database Server". No mention of that growing. I won't claim to know the intricacies of Twitter's operations, but this is an obvious bottleneck.

Since then Blaine & co. did a lot to alleviate the pressure on their bottleneck. They created an innovative messaging system called Starling. They made heavier user of memcache. To my knowledge, they still have that 1 Database Server.

If you are familiar with Rails, then you know that this is a flaw in Rails (a flaw in ActiveRecor to be more precise.) RoR associates a class to a database connection. You can write code that alters this behavior, but it is very hard to do this efficiently. Now let's be fair. This is true of a lot of frameworks, maybe all of them. Java practitioners like me know that our favorite technology with similar functionality, Hibernate, has the same flaw. Google has kindly produced an extension, Shards, to address this. Certainly the general JavaEE specification does not address this.

So it's not just a Rails problem, and we shouldn't blame Rails, right? Not so fast. Rails is the poster child for rapid, high productivity development frameworks. Remember how TC readers pointed out that Twitter only had three develoeprs? Rails is a big part of that story. In general, Rails is one of many technologies that seek to simplify web development as much as possible. This allows three developers to build such a popular site, but that's also the problem.

It's easier than ever to build a website. Social media makes it easier than ever to gain eyeballs. But it is just as hard to scale your site to handle massive traffic. Actually, maybe it is harder. When you rely heavily on frameworks and out-of-the-box goodness, it is that much harder to rip it out or reject it when you outgrow it. Rails magnifies this problem, because it is not just a technology, it is a philosophy. Rails developers pride themselves on the elegance and terseness of their code. If you are already writing ugly code, it's a lot easier to throw it away and write code that is an order of magnitude uglier but solves your scale problems. When you write beautiful code and don't know of a beautiful way to solve your scale problem, what do you do?

Finally, it is easy to trivialize Twitter's problems. "Oh they just need to use PHP" or whatever. Think about their application a little more before you say that. Let's think about a really naive approach to their application. You have a User who creates Updates. One User has many Updates. For example, I am a Twitter User and today I have made three Updates. I follow 65 other Users. The naive approach (also the one a DBA would tell you to use) is to have a join table to associate who you follow. So what has to happen when I load my home page? First, we need to figure out the 65 people I follow. Next we need to get their Updates. How many of their Updates? Well there are 20 Updates shown, but they are the 20 newest across all of my 65 friends. So do you grab all of the updates from all of my friends and then sort it? That is certainly the most naive approach. In that naive approach, we need one query to get my friends and then N queries to get all of their updates. We can put a sort on those N queries and then do a merge sort, quitting once we get the first 20 potentially. Still each of those N queries could be returning a lot of data. We all follow @Scobleizer, right?

So even if we only needed "1 Database Server", things are non-trivial. What about splitting your database? Obviously you want to split the Updates table. Let's think about your home page. Ideally all of the Updates on that page would be in the same database, so all of your friends need to be the same database server. But what about the other people following your friends? You see where this is going. The Kevin Bacon six degrees of separation problem is going to kick your ass.

The point is that even though Twitter might seem simple to the outsider, it is not. It is complex. Even if they didn't use Rails, it is non-trivial. Rails definitely does not help, at least in some regards, and that's the real story. It may be easier than ever to build something and, perhaps because of social networking, easier to get an audience. But all of the syntactic sugar in the world doesn't make it any easier to scale an application.

Tuesday, April 22, 2008

New Scala Article

IBM published my first Scala article today. It's on using Scala to work with XML. It was a lot of fun to write about Scala. I am very lucky to get to share something that I enjoy. However, it was also a little challenging. It is so easy to be so terse with Scala that it seemed to carry over into my writing. I thought it would have the opposite effect, i.e. I would feel like I would need to write a lot more to describe what was going on with Scala code than I would with Java. Instead I had to really force myself to be descriptive about things. I would look at some of the code and just think: "the reader does not need me to explain that, it is so obvious." That is a testimony to Scala. Its handling of XML takes advantage of its DSL-friendly nature. A lot of complex and powerful expressions in Scala simply do exactly what they look like they should do. Or maybe its just my mathematical mind that sees things that way, who knows. Anyways, I have another article in the works for developerWorks that will cover Lift, a web framework developed with Scala.

Monday, April 21, 2008

Video Solution

Last week my oldest son, Michael, Jr. was in a musical production at his preschool. Of course we videotaped it so we could share it. This wasn't the first time I had shared video online, but this time really made me annoyed. I was really annoyed with the poor quality of sharing sites like YouTube. I tried Facebook's video sharing, and it wasn't much better. I decided it was time to roll my own. After some experimenting here is the stack of tools and service I came up with.

Video Importing/Editing: iMovie. This is what I was already using. iMovie is famously easy to use. The importing is easy, the editing is easy. The default export is .m4v files, which are H.264 files designed to work on iPods. This is important.

Video Playing: Flash. I wrote my own video player using Flex. The Flash player has built-in support for H264 video. Thus the videos I export out of iMovie play naturally in the Flash player. They just need some Flash code to load and control the video stream. I wrote this myself as it was pretty straightforward. I created an external XML file as a metadata repository. This tells my player what videos are available and information about each video, like its size. I hosted the custom built player and metadata files on Google Page Creator.

Video Storage: Amazon S3. This was the most difficult part. S3 is relatively cheap. It is NOT easy to use, contrary to what others might say. I was shocked to discover that there is no interface (web or desktop) for uploading and managing files stored on S3. I wound up using S3Fox. It seems like a nice interface, but buggy. Amazon really should offer administrative tools.

Next Steps: Ideally I would build a plugin for iMovie that would upload the video to S3 and write the metadata about the movie to my Google Pages. I am thinking of doing an AIR app for this first, and then maybe go Cocoa. I must also monitor my usage to make sure that S3 is cheaper than other alternatives like Box.net.

Saturday, April 19, 2008

NBA Playoffs

I am very excited about the NBA playoffs, and for one very simple reason...


Go Magic!

I went to my first Magic game before they had played a regular season game (it was a pre-season game in Tallahassee.) I've lived through the betrayals of both Shaquille O'Neal and Tracy McGrady. Now we have Superman. Nobody is talking about Orlando, and that is perfect. 

Now I have to admit that Boston scares me. I really think Orlando can handle Detroit in the second round, but Boston... I also think Orlando could do well against any team in the West, but Boston... People don't seem to realize that Boston just had a historically good season. If Orlando wasn't so awesome this year, I would be rooting for Boston big time, as Kevin Garnett is one of my all time favorite players to watch.

Back in the West, everybody around here is pissed that Golden State didn't make the playoffs despite having a much better record than most of the teams in the East. It's a good reason to be upset, but Golden State controlled their own destiny. If they would have beaten Denver, at home on April 10, they would almost certainly be in the playoffs and Denver fans would be the ones pissed about the Atlanta Hawks being in the playoffs while their Nuggets were not.

Finally, here are some predictions:

East
First Round Winners: Boston(4), Detroit(6), Orlando(5), Washington (6)
Second Round Winners: Boston (4), Orlando(7)
Eastern Finals: Orlando (7)

West
First Round Winners: Lakers (6), Dallas (7), Utah (5), Phoenix (6)
Second Winners: Phoenix (6), Utah (6)
Western Finals: Utah (6)

Utah, like Orlando is much better than people realize. Phoenix traded for Shaq so they can deal with San Antonio and the Lakers, but they won't be able to deal with Utah. 

NBA Finals: Orlando over Utah (6)

Wednesday, April 16, 2008

eBay Wins MySQL Award

Yesterday eBay won an award at the O'Reilly MySQL conference. Congratulations to the Chris Kasten's team. Chris also did a presentation today at the MySQL conference. The caching layer his team created allows us to store "session" data on the server. This data can actually be persisted across sessions, depending on the use case, so it is really a little better than your typical HTTPSessions. Plus it actually scales :-)

The MySQL cache compliments HTTP cookies. We also use Flash for local storage as another alternative to the tiny world of HTTP cookies. Flash has a lot of advantages. The most obvious is the 100K default limit as opposed to the 4K limit of HTTP cookies. Flash "cookies" are not sent to the server, making them much more secure. One of the fun things I did recently was help setup guidelines on when an application should use plain 'ol HTTP cookies, Flash local storage, or the MySQL cache. I got to become much more familiar with the MySQL cache, and thus it wasn't suprising to me when I heard about it winning the award from MySQL.

Saturday, April 12, 2008

Yes to Bitterness

A lot of folks think that Barack Obama has really shot himself with his comments about "bitter" Americans:
"It's not surprising, then, they get bitter, they cling to guns or religion or antipathy to people who aren't like them or anti-immigrant sentiment or anti-trade sentiment as a way to explain their frustrations."
That's the quote that everybody cites. It incited Obamafan Dave Winer to take umbrage at Obama's quote and state that "To equate geography with intellect is as wrong as to equate it with race, ethnicity, gender or age." That quote is taken from a blog post by Winer titled Is my candidate too elite?

Before I state my opinion on the matter, let's get a few things out of the way. A lot of people would say that I am the worst kind of elitist: a libertarian. I think Republicans are stupid for wanting to tell people what is right and wrong and how to live their lives. I think Democrats are stupid for thinking they are smart enough to fix everyone's lives and treating people like children. In other words, I think everyone is an idiot. I am a supporter of Obama, but only after Ron Paul fell out of things and Obama became my best bet to ending the ongoing atrocity that is the War and Occupation of Iraq.

That being said, I also consider myself to have a very scientific mind. Thus Obama's quote must be examined in context. Taking quotes out of context is the worst kind of yellow journalism (and perhaps the most common as well.) Obama was explaining why he was having a hard time winning over white working class. He used the bitterness bit as a counter to the notion that it is just because he is black.

You might be wondering what my point is. This is the kind of argument that always annoys my wife. The semantics of the statement are what is important. Obama was not talking about how certain people (white working class in the midwest) are, but why they are skeptical of him as a leader. Look at the recent history of presidential campaigns. The Republican Party has done an exemplary job of using divisive issues to win elections. What are some of those issues? Things like religion and moral values (abortion), immigration, and gun control. These are the exact things brought up by Obama.

So was Obama saying that midwest white working class people are gun toting, racist, religious zealots? I don't think so. I think he was simply stating the fact that certain issues have been effective in deciding the votes of those people. That is not a stereotype or generalization, that is a statistical fact. In the past those people have been swayed by divisive issues by the Republican Party. Now we are seeing them swayed by divisive issues by Hillary Clinton. Obama is just relating these as the facts that answer the question posed to him.

Finally, my propensity for scientific thought forces me to denounce the hypocrisy of the outrage. It is publicly acceptable to call the people on the East and West coast elitists. Everyone else says that the "liberals" think everyone else are dumb. In other words stereotypes of people in "blue states" is ok, but stereotypes of people in "red states" are not. Why is this? Because there are more red states than blue. Might makes right. This is the reason why divisive issues have worked for the GOP. They can play all of the stereotypical, prejudiced cards they want because the people who will be offended are outnumbered. Well outnumbered from an Electoral College perspective at least...

Google App Engine ... PyNing?

There has been a lot of excitement this week about the Google App Engine, and it deserves the press. Most people are comparing it to the Amazon's triumvirate of S3, EC2, and SimpleDB. That is a fair comparison, but there is another one that came to mind for me: Ning.

Ning co-founder Marc Andreessen wrote a now famous (famous in SiValley at least) blog about the three kinds of platforms on the Internet. Andreessen claimed that Ning was the only Level 3 platform, where your code runs directly on the platform. Clearly the Google App Engine is also a Level 3 platform.

Ning and App Engine seem different, and they are. Ning has built an application called the Ning Social Network, which is an application that runs on the Ning platform. When you use Ning, you get the code for your own instance of the Social Network and then customize it from there. Can you rip it all out completely and build something from scratch? I don't know. But essentially that is what Google gives you: Ning with non Social Network app.

But wait, what about BigTable? Ning has a little thing called its content store which lets you store your own data structures and then query them ... a lot like BigTable. I am no expert on either, but any Level 3 platform has a similar need that is satisfied by these technologies.

Of course the other big difference between Ning and App Engine is that Ning is PHP based and App Engine is Python. That is also a major difference they both have from the AWS collection. They both support a single language with a set of advanced platform APIs implemented in that language. The other big difference is that both Ning and App Engine are free.

Personally I am thinking of writing a generic web proxy to deploy on App Engine. This would be useful for any Flash/Silverlight development to allow scripting to a site that does not have a nice cross-domain policy, such as ... Ning and Google :-)

Wednesday, April 09, 2008

Robbed

On Sunday, I took my kids to the Oakridge Mall. We often go there on Sundays to get ice cream and to play at the indoor playground there. That's exactly what we were doing this past Sunday. As we were walking to our car, I saw a terrible site. Our car had been broken in to. It was my wife's Toyota Sienna. Somebody had broken the driver side window so they could steal the DVD system we had in the car for the kids to watch. It was a portable system we bought over a year ago, with two monitors, one for each of our kids. One of the major selling points for us was that it was portable, so we could use it either in our minivan or in my car. Turns out that this was a selling point for thieves as well.

Needless to say I was pretty furious to have my car damaged and my property stolen. The thief was obviously an idiot. He forgot to take the power cord to the DVD system, so it will not even play for him. Of course the worst part was having to explain to my children what had happened and why they no longer had a "TV in the car."

One of the other disturbing things that came out of all of this was that it led Crystal to discover CrimeReports.com. We went on there and put in our address. It is truly disturbing to see how much crime goes on. It made me depressed to think that maybe we have picked a bad place to live. I scrolled around the Bay Area, and everywhere was pretty bad. Even very expensive areas like Los Gatos showed a huge amount of theft, burglaries, etc. in just the last week.

Tuesday, April 08, 2008

Twitter API in ActionScript

I noticed that the Twitter ActionScript API was a bit out of date to say the least. I had to update it for a small project I was playing around with. I sent a note to al3x about this, and suggested open sourcing the API. He agreed and so here it is.
It needs a lot of work. I basically removed the old authentication scheme, since it relied on setting the Authorization HTTP header and that is no longer allowed starting in Flash Player 9.0.115. This also let me remove a dependency on a third party base 64 encoder. I guess a base64 encoded string that includes your username and password is supposed to be secure :-) I also fixed the loadFriends method, as this now returns users not status messages. I will get the API up to date, but if anyone wants to help out then just send me a note.

Friday, April 04, 2008

MLB 2007

It's almost a week into the baseball season, and I haven't written about it at all! I am always psyched when baseball starts. This year is no different. Performance enhancing drugs be damned!
I am very optimistic about my favorite team, the Atlanta Braves. The Braves were really one of the top 2 or 3 teams in the National League last year. Their record did not show this, but their stats did. That makes them an easy pick to improve.
I am also optimistic about Jeff Francoeur. Jeff's big weakness has always been his plate discipline. He nearly doubled his walk total and rate last year. This was reflected in his increase in pitches per plate appearance. His home run total was down last year compared to 2006, but his doubles(40) were way up. It's easy to feel optimistic about a young player who is showing more patience and hitting a lot of doubles. Not to mention that he is hitting behind a pair of guys who regularly post .400+ OBP (Chipper Jones and Mark Texeira.)
So if the Braves can just get enough pitching out of their rather old rotation (Hudson, Smoltz, Glavine, Hampton) then it should be a great year. Contrary to popular belief, Turner Field is a hitter's park. People think it is great for pitching just because the Braves had so many great pitchers for so long. It's not. It's actually at the second highest altitude of any MLB park, next to Coors Field of course. So the Braves are going to once again going to score a lot of runs.
Finally, gotta mention the local teams. I don't know what to make of the Oakland A's. I think they will surprise, but then again expectations are so low that they almost have to. The Giants on the other hand are going to be miserable, and they deserve it. They were just a little too happy to get rid of Barry Bonds, the guy who built their current stadium and gave them a future as a big market team (once they finish paying for the stadium.) I do hope to catch a Tim Lincecum game though...

Wednesday, April 02, 2008

Long Division

Earlier this week, I solved another problem from Project Euler. This was #26. Here is the description:

A unit fraction contains 1 in the numerator. The decimal representation of the unit fractions with denominators 2 to 10 are given:

1/2= 0.5
1/3= 0.(3)
1/4= 0.25
1/5= 0.2
1/6= 0.1(6)
1/7= 0.(142857)
1/8= 0.125
1/9= 0.(1)
1/10= 0.1

Where 0.1(6) means 0.166666..., and has a 1-digit recurring cycle. It can be seen that 1/7 has a 6-digit recurring cycle.

Find the value of d < 1000 for which 1/d contains the longest recurring cycle in its decimal fraction part.

I knew this problem was related to the totient function, but I went a brute force route instead. The brute force route required writing a long division algorithm, which just seemed like a fun thing to do anyways. So I solved the problem, and the solution was plenty fast (0.185s for the calculation.)

I was ready to post my solution to the forums. I was also planning on reading the forums to see if anyone had a clever solution using the totient function. So I was greatly disappointed to see that the forms were down. Still my ego required me to publish my long division algorithm, so here it is:


def cycle(d)

m,i = 1,1
arr = Hash.new

while m > 0
while m < d
m *= 10

i += 1
end
m = m % d

if arr[m]
return i - arr[m]

end
arr[m] = i
end
return 0

end

max = 0
m = 2
3.upto(ARGV[0].to_i){|n|
c = cycle(n)

if c > max
max = c
m = n

end
}
puts m

Obviously this is in Ruby. The cycle calculates the length of the recurring cycle. The m variable is essentially the digits in the decimal representation of 1/d. The algorithm doesn't capture it, since it is not needed for the problem, but you could easily capture it (append to a string or array or whatever) and return it for a true long division algorithm.