Monday, July 18, 2011

GraphLab - Machine Learning in the Cloud

A few days ago we have released a detailed technical report about the performance of distributed GraphLab on Amazon EC2 with up to 64 nodes (512 cores total) : http://arxiv.org/abs/1107.0922

We compared GraphLab using three applications: matrix factorization, CoEM (a variant of personalized pagerank, a named entity recognition algorithm), and video co-segmentation. 

As a reference we compared three platforms: Hadoop, MPI (message-passing-interface) and GraphLab. In a nutshell, GraphLab runs about 20x to 100x times faster than Hadoop, depending on the data and application. The main reason is that we perform all computation in memory and do not provide any fault tolerance. Compared to MPI, GraphLab has a similar performance. The drawback of MPI is that the code has to be rewritten for each application, while GraphLab provides building blocks for iterative computation.

The following graph shows the speedup of the 3 applications using 64 Amazon HPC machines:

The baseline for speedup calculation are 4 machines. For matrix factorization (line denoted as Netflix) we get a speedup of x16 on x64 machines.  For video co-segmentation we get a speedup of x40 on x64 EC2 nodes.
When we increase factorized matrix width, the problem becomes computation heavy and we get
even a better speedup of x40 on 64 nodes.

No comments:

Post a Comment