Programming and politics: amazon

Last week at Google I/O, big G finally delivered on Google Music. Well, sort of. As many have pointed out, it's taken a long time to get Music out the door, despite it being announced a year ago. What is most interesting is that it comes after Amazon made a similar offering with its Cloud Drive/Player service. I have now used both services, along with their Android apps, quite a bit. So I thought I'd share my experiences, in no particular order...

Uploading 4000 songs takes a long time. That's about how many songs I have on my MBP, and it comes out to a little more than 50 GB. I was one of the lucky attendees of I/O, so not only do I have access to Google Music, but it is currently free. Amazon gives you 5 GB for free, and 20 GB free if you buy an MP3 album. I did the latter. However, 20 GB is not enough space for me, so I have not uploaded a lot of music to Amazon. I have uploaded all of it to Google Music. It took many days, and it tends to wreak havoc on WiFi networks (which should be the subject of a future blog post/rant).
The Android players are good, but both have room for improvement. Google Music has an instant mix feature, similar to the Genius feature in iTunes. I would say that it is better than Genius for several reasons. First, it seems to do fine with "non-standard" songs. I mean stuff like Girl Talk, or remixes and live versions of popular (or not) songs. Genius fails on this consistently, maybe because these are songs (or in the case of Girl Talk, artists) that are not in iTunes? Genius also fails for newer music. Google Music seems to do fine in this situation too. The Cloud Player does not have this feature, and that is a shame. However, it does have an equalizer. This is something that Google Music lacks, and that is a shame too. I generally find that mobile devices (and mobile headphones, if you will) especially need equalization. The Amazon EQ is not that great though, as it only has a list of presets (Jazz, Rock, etc.).
I don't like listening to music in the browser. For desktop computers, both of these services have you open a browser and listen to music that way. I'd say that Google's is a little better, but they both seem clunky. The Amazon one does not have the equalizer that their Android app has. The sound on the Google one also seems a little better, which is counter-intuitive. It is my impression that Google may downsample your music during playback, based on bandwidth, whereas Amazon plays your music back as-is. Anyways, neither sounds as good as iTunes. Of course they aren't the ginormous mess that iTunes is either.
Google Music works better over crappier networks. It seems to do fine over EDGE, even though I *think* I can hear a difference in sound quality. This could be psychosomatic. Amazon, on the other hand, has a lot more noticeable pauses.
Google Music seems to manage metadata better, both metadata about songs and about collections (albums, playlists, etc.) However, I have heard other users complain about this.

I am generally pleased with both services. Since I was able to upload all of my music to Google for free, I have used it more. However, it has convinced me to upload more of my music to Amazon, and to consider paying for it. That would cost me $100, though, since I have more than 50 GB of music.

So far I have only uploaded music from my laptop. I have about 80 GB of music on my desktop computer, though this is pretty much a superset of what is on my laptop. I am going to start the Google Music uploader on it too. Hopefully it will not do two copies of songs that are in common, and only upload the 30 GB of music that is not already present. If it does that well, and I have all of my music on their servers, it will be very tempting to pay for this service once the beta holiday ends.

These cloud based services have made me wish that I could have similar access in my car. I have an old iPod Touch (8 GB) hooked up in my glove box currently. It would be nice to have 10x capacity, but at what cost? Not to mention that the car interface (I have a large Sony head unit with a touchscreen interface) leaves a lot to be desired. That only gets worse with 10x data to deal with.

This weekend I was helping a friend with loading some data to Amazon's SimpleDB. The problem was fairly simple. He had a flat file with 170K lines of data. Each line represented a video from YouTube along with some metadata about the video. He wanted to turn that file into a "table" on SimpleDB, where each line (video) from the file would become a "row" in the table.

I decided to use Java for the task. I found a useful Java library for using SimpleDB. Some users of the library didn't like it, as it uses JAXB to turn Amazon's XML based API into a Java based API directly. That didn't bother me so I used it.

I wrote a quick program to do the upload. I knew it would take a while to run, but didn't think too much about it. I had some other things to do, so I set it running. Some three hours later, it was still going. I felt pretty silly. I should have done some math on how long this was going to take. So I scrapped it and adjusted my program.
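The math in question fits in a few lines. The sketch below is only a back-of-the-envelope estimate: the 170,000 item count comes from the file described above, but the 100 ms per request is an assumed round-trip time chosen for illustration, not a measured figure, and the class and method names are made up for this example.

```java
import java.util.concurrent.TimeUnit;

class UploadEstimate {

    // Estimated wall-clock seconds to upload `items` rows when each request
    // takes `millisPerRequest` and `parallelism` requests are in flight at once.
    static long estimateSeconds(long items, long millisPerRequest, int parallelism) {
        long totalMillis = items * millisPerRequest / parallelism;
        return TimeUnit.MILLISECONDS.toSeconds(totalMillis);
    }

    public static void main(String[] args) {
        long items = 170000;          // lines in the flat file
        long millisPerRequest = 100;  // ASSUMPTION: illustrative latency, not measured
        System.out.println("1 at a time:  " + estimateSeconds(items, millisPerRequest, 1) + " s");
        System.out.println("40 at a time: " + estimateSeconds(items, millisPerRequest, 40) + " s");
    }
}
```

Under that assumed latency, one request at a time works out to 17,000 seconds, a bit under five hours, which is at least in the same ballpark as a run that is still going after three hours.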

Amazon has no bulk API, and this is the source of the problem. So you literally have to add one item at a time to SimpleDB. The best I could do was to parallelize the upload, i.e. load multiple items simultaneously, one per thread. Java's concurrency APIs made this very easy. Here is the code that I wrote.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import com.amazonaws.sdb.AmazonSimpleDB;
import com.amazonaws.sdb.AmazonSimpleDBClient;
import com.amazonaws.sdb.AmazonSimpleDBException;

import com.amazonaws.sdb.model.CreateDomain;
import com.amazonaws.sdb.model.CreateDomainResponse;
import com.amazonaws.sdb.model.PutAttributes;
import com.amazonaws.sdb.model.ReplaceableAttribute;

import com.amazonaws.sdb.util.AmazonSimpleDBUtil;


public class Parser {

   private static final String DATA_FILE="your file here";
   private static final String ACCESS_KEY_ID = "your key here";
   private static final String SECRET_ACCESS_KEY = "your key here";
   private static final String DOMAIN = "videos";
   private static final int THREAD_COUNT = 40;
  
   public static void main(String[] args) throws Exception{

       List<Video> videos = loadVideos();
       AmazonSimpleDB service =
           new AmazonSimpleDBClient(ACCESS_KEY_ID, SECRET_ACCESS_KEY);
       setupDomain(service);
       addVideos(videos,service);
   }


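   // Reads the data file from the classpath, one video per line
   private static List<Video> loadVideos() throws IOException {
       InputStream stream =
           Thread.currentThread().getContextClassLoader().getResourceAsStream(DATA_FILE);
       if (stream == null) {
           // getResourceAsStream returns null when the file is not on the classpath
           throw new IOException("Data file not found on classpath: " + DATA_FILE);
       }
       BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
       List<Video> videos = new ArrayList<Video>();
       try {
           String line = reader.readLine();
           while (line != null){
               Video video = Video.parseVideo(line);
               videos.add(video);
               line = reader.readLine();
           }
       } finally {
           // close the reader (and the underlying stream) even if parsing fails
           reader.close();
       }
       return videos;
   }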
  
   // This creates a table in SimpleDB
   private static void setupDomain(AmazonSimpleDB service) {

       CreateDomain request = new CreateDomain();
       request.setDomainName(DOMAIN);
       try {

           CreateDomainResponse response = service.createDomain(request);
           System.out.println(response);
       } catch (AmazonSimpleDBException e) {

           e.printStackTrace();
       }
   }   
  
   // adds all videos to SimpleDB
   private static void addVideos(List<Video> videos, final AmazonSimpleDB service) throws Exception{
       // create a thread pool; the queue capacity must be at least 1
       ThreadPoolExecutor pool =
           new ThreadPoolExecutor(THREAD_COUNT, THREAD_COUNT, 10,
                   TimeUnit.SECONDS,
                   new ArrayBlockingQueue<Runnable>(Math.max(1, videos.size())));
       // Create a task for each video and hand it to the thread pool
       for (final Video v : videos){
           Runnable r= new Runnable(){
               public void run() {
                   addVideo(v, service);
               }
           };
           pool.execute(r);
       }
       // Stop accepting new tasks, then wait for the queued uploads to finish.
       // Without shutdown() the idle core threads would keep the JVM alive.
       pool.shutdown();
       pool.awaitTermination(1, TimeUnit.DAYS);
   }
  
   // This adds a single item to SimpleDB

   private static void addVideo(Video v, AmazonSimpleDB service){

       PutAttributes request = new PutAttributes();
       request.setDomainName(DOMAIN);
       request.setItemName(v.getVideoId());
       List<ReplaceableAttribute> attrs = videoToAttrs(v);
       request.setAttribute(attrs);
       try {

           service.putAttributes(request);
       } catch (AmazonSimpleDBException e) {

           e.printStackTrace();
       }
   }
  
   // Turns a video into a list of name-value pairs

   private static List<ReplaceableAttribute> videoToAttrs(Video v){
       ReplaceableAttribute author = new ReplaceableAttribute();
       author.setName("author");
       author.setValue(v.getAuthor());
       ReplaceableAttribute date = new ReplaceableAttribute();
       date.setName("date");
       date.setValue(Long.toString(v.getDate().getTime()));
       // for votes we pad so we can sort

       ReplaceableAttribute votes = new ReplaceableAttribute();
       votes.setName("votes");
       votes.setValue(AmazonSimpleDBUtil.encodeZeroPadding(v.getVotes(), 4));
       return Arrays.asList(author, date, votes);
   }

  

}
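A note on the "pad so we can sort" comment in videoToAttrs: SimpleDB stores every attribute value as a string and compares values lexicographically, so unpadded numbers sort in the wrong order. The sketch below illustrates the idea with plain JDK calls rather than the library's AmazonSimpleDBUtil helper; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PaddingDemo {

    // Left-pads a non-negative number with zeros to a fixed width.
    static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // Sorts a copy of the list lexicographically, the way a string-only store compares values.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<String>();
        List<String> padded = new ArrayList<String>();
        for (int votes : new int[] {9, 10, 250}) {
            raw.add(Integer.toString(votes));
            padded.add(pad(votes, 4));
        }
        System.out.println(sortAsStrings(raw));     // [10, 250, 9]
        System.out.println(sortAsStrings(padded));  // [0009, 0010, 0250]
    }
}
```

With a width of 4, the ordering only holds up to 9999 votes; larger counts would need a wider pad.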

And for completeness, here is the Video class:

import java.util.Date;


public class Video {

   private final String videoId;
   private final int votes;
   private final Date date;
   private final String author;
   private Video(String videoId, int votes, long date, String author) {

       super();
       this.videoId = videoId;
       this.votes = votes;
       this.date = new Date(date);
       this.author = author;
   }

   public String getVideoId() {
       return videoId;
   }

   public int getVotes() {
       return votes;
   }

   public Date getDate() {
       return date;
   }

   public String getAuthor() {
       return author;
   }


   public static Video parseVideo(String data){
       String[] fields = data.split(" ");
       return new Video(fields[1], Integer.parseInt(fields[0]), 1000*Long.parseLong(fields[2]), fields[3]);
   }

}
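Reading parseVideo backwards, each line of the data file appears to be four space-separated fields: vote count, video id, a timestamp in seconds, and the author. The sketch below repeats that parsing on a sample line so the field order is easier to see. The sample line is hypothetical (the real file is not shown here), and the class and method names are made up for this example.

```java
class LineFormatDemo {

    // Mirrors the field order that parseVideo reads: votes, id, epoch seconds, author.
    static String describe(String line) {
        String[] fields = line.split(" ");
        int votes = Integer.parseInt(fields[0]);
        String videoId = fields[1];
        long epochMillis = 1000 * Long.parseLong(fields[2]); // seconds to milliseconds
        String author = fields[3];
        return videoId + " by " + author + ": " + votes + " votes, epochMillis=" + epochMillis;
    }

    public static void main(String[] args) {
        // HYPOTHETICAL sample line, for illustration only
        System.out.println(describe("42 abc123 1211846400 someuser"));
    }
}
```

Note that splitting on a single space means an author name containing spaces would be cut off at the first one.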

Some interesting things... I played around with the number of threads to use. Everything seemed to max out at around 3-4 threads, regardless of whether I ran it on my two-core laptop or my four-core workstation. Something seemed amiss. I opened up the Amazon Java client code. I was pleased to see it used a multi-threaded version of the Apache HttpClient, but it was hard-coding the maximum number of connections per host to ... 3. I switched to compiling against source so I could set the maximum number of connections to be the same as the number of threads I was using.
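The effect of a connection cap like that can be reproduced without any network at all. The sketch below is a simulation, not the Amazon client or HttpClient code: a Semaphore stands in for the connection pool, a short sleep stands in for a request, and all names are made up for this example. However many worker threads are started, the number of requests in flight never exceeds the number of permits.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ConnectionCapDemo {

    // Runs `tasks` fake requests on `threads` workers while a semaphore with
    // `maxConnections` permits stands in for the HTTP connection pool.
    // Returns the highest number of requests that were in flight at once.
    static int peakInFlight(int threads, int maxConnections, int tasks) {
        final Semaphore connections = new Semaphore(maxConnections);
        final AtomicInteger inFlight = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        connections.acquire(); // wait for a free "connection"
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        int now = inFlight.incrementAndGet();
                        int seen;
                        do { // raise the recorded peak if this is a new high
                            seen = peak.get();
                        } while (now > seen && !peak.compareAndSet(seen, now));
                        Thread.sleep(5); // pretend to do a network round trip
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        connections.release();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("40 threads, 3 connections: peak in flight = "
                + peakInFlight(40, 3, 200));
    }
}
```

That would explain why adding threads beyond three or four makes no difference: the extra threads just queue up waiting for a connection.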

Now I was able to achieve much better throughput. I kept the number of threads and the maximum number of HTTP connections the same. For my two-core laptop, I got optimal throughput at 16 threads and connections. For my four-core workstation, I got optimal throughput at 40 threads and connections. I think I will refactor the Amazon Java API and offer it to the author as a patch. There is no reason to hard-code the number of connections to three; just make it configurable. The underlying HttpClient code is highly optimized to allow for this.
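For anyone who wants to repeat this kind of tuning, a small timing harness is enough. The sketch below is not the code used for the numbers above: it runs a task a fixed number of times on pools of different sizes and prints the elapsed time. The fake upload that sleeps for 10 ms is a hypothetical stand-in for a SimpleDB put, and the names are made up for this example. In a real run the task would call the upload code, and the connection limit would have to be raised to match each thread count.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ThroughputSweep {

    // Runs `task` `count` times on a fixed pool of `threads` workers and
    // returns the elapsed wall-clock time in milliseconds.
    static long timeRunMillis(int threads, int count, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            pool.execute(task);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // HYPOTHETICAL stand-in for one upload request: sleeps for 10 ms
        Runnable fakeUpload = new Runnable() {
            public void run() {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        int count = 400;
        for (int threads : new int[] {1, 4, 16, 40}) {
            long ms = timeRunMillis(threads, count, fakeUpload);
            System.out.println(threads + " threads: " + ms + " ms for " + count + " tasks");
        }
    }
}
```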


Tuesday, May 17, 2011

We All Love Music

Tuesday, May 27, 2008

Bulk Upload to Amazon SimpleDB
