Friday, February 4, 2011

Mahout - SVD matrix factorization - formatting input matrix

Converting Input Format into Mahout's SVD Distributed Matrix Factorization Solver

Purpose
The code below, converts a matrix from csv format:
<from row>,<to col>,<value>\n
Into Mahout's SVD solver format.


For example, 
The 3x3 matrix:
0    1.0 2.1
3.0  4.0 5.0
-5.0 6.2 0


Will be given as input in a csv file as:
1,0,3.0
2,0,-5.0
0,1,1.0
1,1,4.0
2,1,6.2
0,2,2.1
1,2,5.0

NOTE: I ASSUME THE MATRIX IS SORTED BY THE COLUMNS ORDER
This code is based on code by Danny Leshem, ContextIn.

Command line arguments:
args[0] - path to csv input file
args[1] - cardinality of the matrix (number of columns)
args[2] - path the resulting Mahout's SVD input file

Method:
The code below, goes over the csv file, and for each matrix column, creates a SequentialAccessSparseVector which contains all the non-zero row entries for this column.
Then it appends the column vector to file.

Compilation:
Copy the java code below into an java file named Convert2SVD.java
Add to your IDE project path both Mahout and Hadoop jars. Alternatively, a command line option for compilation is given below.


import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

/**
 * Code for converting CSV format to Mahout's SVD format
 * @author Danny Bickson, CMU
 * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE SECOND FIELD).
 *
 */

public class Convert2SVD {


        public static int Cardinality;

        /**
         * 
         * @param args[0] - input csv file
         * @param args[1] - cardinality (length of vector)
         * @param args[2] - output file for svd
         */
        public static void main(String[] args){

try {
        Cardinality = Integer.parseInt(args[1]);
        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(args[2]), IntWritable.class, VectorWritable.class, CompressionType.BLOCK);

          final IntWritable key = new IntWritable();
          final VectorWritable value = new VectorWritable();

   
           String thisLine;
        
           BufferedReader br = new BufferedReader(new FileReader(args[0]));
           Vector vector = null;
           int from = -1,to  =-1;
           int last_to = -1;
           float val = 0;
           int total = 0;
           int nnz = 0;
           int e = 0;
           int max_to =0;
           int max_from = 0;

           while ((thisLine = br.readLine()) != null) { // while loop begins here
            
                 StringTokenizer st = new StringTokenizer(thisLine, ",");
                 while(st.hasMoreTokens()) {
                     from = Integer.parseInt(st.nextToken())-1; //convert from 1 based to zero based
                     to = Integer.parseInt(st.nextToken())-1; //convert from 1 based to zero basd
                     val = Float.parseFloat(st.nextToken());
                     if (max_from < from) max_from = from;
                     if (max_to < to) max_to = to;
                     if (from < 0 || to < 0 || from > Cardinality || val == 0.0)
                         throw new NumberFormatException("wrong data" + from + " to: " + to + " val: " + val);
                 }
              
                 //we are working on an existing column, set non-zero rows in it
                 if (last_to != to && last_to != -1){
                     value.set(vector);
                     
                     writer.append(key, value); //write the older vector
                     e+= vector.getNumNondefaultElements();
                 }
                 //a new column is observed, open a new vector for it
                 if (last_to != to){
                     vector = new SequentialAccessSparseVector(Cardinality); 
                     key.set(to); // open a new vector
                     total++;
                 }

                 vector.set(from, val);
                 nnz++;

                 if (nnz % 1000000 == 0){
                   System.out.println("Col" + total + " nnz: " + nnz);
                 }
                 last_to = to;

          } // end while 

           value.set(vector);
           writer.append(key,value);//write last row
           e+= vector.getNumNondefaultElements();
           total++;
           
           writer.close();
           System.out.println("Wrote a total of " + total + " cols " + " nnz: " + nnz);
           if (e != nnz)
                System.err.println("Bug:missing edges! we only got" + e);
          
           System.out.println("Highest column: " + max_to + " highest row: " + max_from );
        } catch(Exception ex){
                ex.printStackTrace();
        }
    }
}


A second option to compile this file is create a Makefile, with the following in it:
all:
        javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java
Note that you will have the change location of the jars to point to where your jars are stored.

Example for running this conversion for netflix data:
java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 netflixe.seq

Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Wrote a total of 241 rows, nnz: 1000000
Wrote a total of 381 rows, nnz: 2000000
Wrote a total of 571 rows, nnz: 3000000
Wrote a total of 789 rows, nnz: 4000000
Wrote a total of 1046 rows, nnz: 5000000
Wrote a total of 1216 rows, nnz: 6000000
Wrote a total of 1441 rows, nnz: 7000000

...

NOTE: You may want also to checkout GraphLab's collaborative filtering library: here. GraphLab has a 100% compatible SVD solver to Mahout, with performance gains up to x50 times faster. I have created Java code to convert Mahout sequence files into Graphlab's format and back. Email me and I will send you the code.

21 comments:

  1. Handy :)

    Am I misreading, or are you counting rows from 0, columns from 1?

    ReplyDelete
  2. Hi Dan!
    Rows and columns should start from the same zero index. Let me know if I have some bug!

    Would you mind sharing what are you working on and what is the size of the factorized matrix? The reason I am asking is because I am writing a very efficient matrix factorization in GraphLab, when assuming matrix fits into memory, is about 50 times faster. Also I have some comments about the way eigenvalues are computed in SVD. See my comment in ; https://issues.apache.org/jira/browse/MAHOUT-369

    Best,

    Danny

    ReplyDelete
  3. Hey Danny,

    So re zero index, I forgot to mention I was talking about your example input in the blog post, rather than the logic of the source code:

    "For example,
    The 3x3 matrix:
    0 1.0 2.1
    3.0 4.0 5.0
    -5.0 6.2 0"


    Also btw I found it useful to break out the failure conditions,

    if (from < 0)
    throw new NumberFormatException("from < 0 fail: " + from + " to: " + to + " val: " + val);

    if (to < 0)
    throw new NumberFormatException("to < 0 fail: " + from + " to: " + to + " val: " + val);

    if (to > Cardinality)
    throw new NumberFormatException("Exceeded cardinality fail: " + from + " to: " + to + " val: " + val);

    if (val == 0.0)
    throw new NumberFormatException("Zero value fail: " + from + " to: " + to + " val: " + val);

    ... as I didn't initially realise that zero values were a no-go. I started digging into SVD a while ago, and after finding that the Ruby setup in
    http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/ .... was both a bit fragile (re the underlying libraries being a pain in OSX), I found my way to Mahout, where I got distracted by other pieces (recommender etc) until finally today resolving to take a serious look at SVD.

    So I started testing by taking the toy example from the above blog post and putting it into your format. The zeros being the same as missing data wasn't something I'd completely absorbed before, but if that's the deal, fine.

    My actual usage plans - nothing running yet - are from the NoTube project (http://www.notube.tv/), which is about TV and the Web. So we have a pile of data about TV content (video archives) including subject classification codes. Not Web-scale scary big, but chunky enough. And I'm also (Pig/Hadoop) mining Twitter crawls for info about users (urls they post, celebs they follow), cross linked to dbpedia/wikipedia, ... which gives somewhat bulkier data. My initial thought was just to get my hands dirty with SVD by taking a table of content items (videos) and features (subject classification codes), then explore from there. Since we have also some explicit metadata about how the content items relate to each other, and how the subject codes relate too, there's plenty to investigate there in terms of making a sensible workflow and SVD representation.

    But as I say, early days. Having your tools to get things in/out of the Mahout format removes one of those obstacles that was putting me off. Next stop is the little bit of RTFM'ing to get Mahout talking to the real Hadoop cluster properly...

    cheers,

    Dan

    ReplyDelete
  4. Hi Dan,
    Your research sounds interesting!
    Is this academic venue, a hobby or some kind of a start up? I would consider using GraphLab PMF, which factorizes the matrix W~=UV to two smaller matrices. Let me know if you need help, I would love to help you test it - I have experimented with matrices with several billion non zeros.


    Hey Danny,

    So re zero index, I forgot to mention I was talking about your example input in the blog post, rather than the logic of the source code:

    "For example,
    The 3x3 matrix:
    0 1.0 2.1
    3.0 4.0 5.0
    -5.0 6.2 0"


    Also btw I found it useful to break out the failure conditions,

    if (from < 0)
    throw new NumberFormatException("from < 0 fail: " + from + " to: " + to + " val: " + val);

    if (to < 0)
    throw new NumberFormatException("to < 0 fail: " + from + " to: " + to + " val: " + val);

    if (to > Cardinality)
    throw new NumberFormatException("Exceeded cardinality fail: " + from + " to: " + to + " val: " + val);

    if (val == 0.0)
    throw new NumberFormatException("Zero value fail: " + from + " to: " + to + " val: " + val);

    ... as I didn't initially realise that zero values were a no-go. I started digging into SVD a while ago, and after finding that the Ruby setup in
    http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/ .... was both a bit fragile (re the underlying libraries being a pain in OSX), I found my way to Mahout, where I got distracted by other pieces (recommender etc) until finally today resolving to take a serious look at SVD.

    So I started testing by taking the toy example from the above blog post and putting it into your format. The zeros being the same as missing data wasn't something I'd completely absorbed before, but if that's the deal, fine.

    My actual usage plans - nothing running yet - are from the NoTube project (http://www.notube.tv/), which is about TV and the Web. So we have a pile of data about TV content (video archives) including subject classification codes. Not Web-scale scary big, but chunky enough. And I'm also (Pig/Hadoop) mining Twitter crawls for info about users (urls they post, celebs they follow), cross linked to dbpedia/wikipedia, ... which gives somewhat bulkier data. My initial thought was just to get my hands dirty with SVD by taking a table of content items (videos) and features (subject classification codes), then explore from there. Since we have also some explicit metadata about how the content items relate to each other, and how the subject codes relate too, there's plenty to investigate there in terms of making a sensible workflow and SVD representation.

    But as I say, early days. Having your tools to get things in/out of the Mahout format removes one of those obstacles that was putting me off. Next stop is the little bit of RTFM'ing to get Mahout talking to the real Hadoop cluster properly...

    cheers,

    Dan

    ReplyDelete
  5. Hi Dan,

    I think there is a small error that will cause some problems if the matrix is not squared.
    At line 67 we force the input matrix to have at most "cardinality" different column values, I think this condition should work on the row id's due to the fact that after, at line 80 we initialize the vector with dimension = "cardinality".
    So at this line we should change "to > Cardinality" to "from > Cardinality". Otherwise if the input matrix has size mxn, the program will load it as an mxm matrix producing garbage in the results.

    Thank you very much for your code, it has been very helpful, and thanks to my coleague Ana for reporting me this error.

    Cheers,

    ReplyDelete
  6. Hi Percha!
    You are definitely right, send Ana my thanks, I have fixed the code. Would you mind sharing what are you working on? Maybe you will be interested in testing my own SVD implementation in GraphLab?

    Best,

    DB

    ReplyDelete
  7. Hi Danny,

    We are working on a couple of text mining applications with several corpus that can have more than 1 million of documents and 200k terms (I can't provide more details...). Is GraphLab prepared for large scale factorization? In fact we are very surprise that we didn't find GraphLab until now since we have been evaluating several options.

    ReplyDelete
  8. Definitely. SVD GraphLab iteration takes 10 seconds on a single 8-core machine using 100,000,000 non zeros. We have tried it out up to 100,000,000,000 non zeros. But now we are working to scale to even larger settings. Email me if like to know more.

    Best,

    DB

    ReplyDelete
  9. Hi Danny, thank you for the code, it is a great help, but there are couple of things that I dont really understand.
    1) Cardinality is the number of columns in the input matrix, then why the failure condition is "from > cardinality" where from is the row index as evident by your 3x3 matrix at the start of this blog. (or is it because the netflix dataset is organized as movies (columns in the input matrix) appearing with all the users who rated it(rows in the input matrix).
    2) the 2nd is a minor thing: your System.out.print statement is outputing Columns where as the example output you have shown at the end is showing Rows.

    Ehtsham

    ReplyDelete
  10. Hi Eshtsham!
    I assume the matrix is sorted by the columns, and the 2nd column is the from field (the user) and the 1st column is the to field (the item). Cardinality is the total number of items. I apologize if this may confused you. It is quite easy to swap this ordering, if your dataset is sorted by the 1st column, which are the users, then you can simply swap between from and to.
    I have fixed the example. Thanks!

    ReplyDelete
  11. Thanks Danny for the prompt reply, but I still dont understand, I apologize for my ignorance. Let me clarify what I understand: The Netflix data is 480189 users and 17770 movies which I assume is arranged in a No.of.users(rows) by no.of.movies(columns) matrix. You are saying that 2nd column is the 'from' field but you read-in the first column as the 'from' field in your tokenizer code and 'to' is reading the 2nd column and val is reading the rating(third column).
    So your code would produce a Sequence file with 'key' being the column number and row number as the index of the value. Have I understood it correctly? Thanks,

    ReplyDelete
  12. In my case, to (the 2nd column) is the user, the first column is the movie, and the cardinality is 17770 in your example. Sorry for the confusion. Maybe I will rename from,to => movie, user ?

    ReplyDelete
  13. I used the code and the given example csv file,and got the following error:

    java.lang.NumberFormatException: wrong data0 to: -1 val: 3.0
    at Convert2SVD.main(Convert2SVD.java:69)

    ReplyDelete
    Replies
    1. Hi!
      I assume the row and column ids are 1,2,3,.., and it seems that in you case they are start with zero and up. So you need to remove the "-1" , namely change the code to:
      from = Integer.parseInt(st.nextToken());
      to = Integer.parseInt(st.nextToken());

      Best,

      DB

      Delete
    2. okay, many thanks for your reply.
      Sorry for missing it instead of publishing the following comment as I did not though I could got such a fast reply. You are a great blogger. -)

      Delete
  14. Got it ,I changed the code as follows to fix the error mentioned above:

    from = Integer.parseInt(st.nextToken())
    to = Integer.parseInt(st.nextToken());

    ReplyDelete
  15. Hi Danny,

    thanks for providing this, but I am sorry to say that your explanation is very very misleading.
    As ehtsham points out, it is common practice to have :

    ROWS = documents (users in ehtsham case)
    COLUMNS = tokens (movies in ehtsham case)

    You are reversing this common assumption, which will confuse many people, including me. I was puzzled that Mahout takes as input vectors of columns instead of vectors of rows, until I figured out that you inverted these two concepts.

    I highly suggest you to revise your (otherwise very useful!) blog entry, so that newbies won't get confused too much.

    Thanks for sharing your code,
    Marco

    ReplyDelete
    Replies
    1. Sorry for the confusion, you are right, the matrix is transposed so I treat the key as the column number. However, it is very easy to reverse...

      Delete
  16. hi Bickson,
    My project is on malware analysis in big data. can u please provide some input about how to use SVD in it. I am little bit confused. SVD as a recommender or Dimension reduction in mahout. the Post on SVD are about an recommendation engine. The factorizer in mahout are based on numerical inputs. how it works with data inputs. I am totally out of line. could you suggest me on this?

    ReplyDelete
    Replies
    1. Hi,
      I suggest a much easier way to start: http://graphlab.com/learn/notebooks/index.html

      Delete
    2. thank you bickson.
      my work on svd dimension reduction i converted a csv to sequence file using seqdirectory and den convert it in to vector seq2sparse i dont know how to identify the number if rows and column in vector can u help me out.

      Delete