Monday, February 27, 2012

Matrix Market Format

Matrix Market is a very simple format devised by NIST to store different types of matrices.

For GraphLab matrix libraries: linear solvers, matrix factorization and clustering we recommend using this format for the input file. Once this format is used for the input, the same format will be used for the output files.


Sparse matrices
Matrix market has a very simple format: 2 header lines. Here are some examples. Here is
example 3x4 matrix:

A =

0.8147         0    0.2785         0
0.9058    0.6324    0.5469    0.1576
0.1270    0.0975    0.9575         0

And here is the matching matrix market file:

%%MatrixMarket matrix coordinate real general
% Generated 27-Feb-2012
3 4 9
1 1  0.8147236863931789
2 1  0.9057919370756192
3 1  0.1269868162935061
2 2  0.6323592462254095
3 2  0.09754040499940952
1 3  0.2784982188670484
2 3  0.5468815192049838
3 3  0.9575068354342976
2 4  0.1576130816775483

The first line, include the header. coordinate means this is a sparse matrix.
The third row includes the matrix size: 3 4 9 means 3 rows, 4 columns and 9 non zero entries.
The rest of the rows include the edges. For example the 4th row is:
1 1 0.8147236863931789, namely means that it is the first row, first column and its value.

TIP: Sparse matrices should NOT include zero entries!
For example, the row 1 1 0 is not allowed!
TIP: First two numbers in each non-zero entry line should be integers and not double notation. For example 1e+2 is not a valid row/col number. It should be 100 instead.
TIP: Row/column number always start from one (and not from zero as in C!)
TIP: It is not legal to include the same non zero entry twice. In GraphLab it will result in a duplicate edge error. Note that the number of edges in GraphLab starts in zero, so you have to add one to source and target values to detect the edge in the matrix market file.

Dense matrices:
Here is an example on how to save the same matrix as dense matrix:

A =

0.8147         0    0.2785         0
0.9058    0.6324    0.5469    0.1576
0.1270    0.0975    0.9575         0


%%MatrixMarket matrix array real general
3 4
0.8147
0.9058
0.1270
0
0.6324
0.0975
0.2785
0.5469
0.9575
0
0.1576
0

Symmetric sparse matrices:
Here is an example for sparse symmetric matrix:
B =

    1.5652         0    1.4488
         0    2.0551    2.1969
    1.4488    2.1969    2.7814

And here is the matching matrix market file:
%%MatrixMarket matrix coordinate real symmetric
% Generated 27-Feb-2012
3 3 5
1 1  1.5652
3 1  1.4488
2 2  2.0551
3 2  2.1969
3 3  2.7813

Note that each non-diagonal edges is written only once.

Sparse vectors:
Here is an example for sparse vector:
v =

     1     0     0     1

%%MatrixMarket matrix coordinate real general
% Generated 27-Feb-2012
1 4 2
1 1 1
1 4 1

Working with matlab:
download the files http://graphlab.org/mmwrite.m and http://graphlab.org/mmread.m
In Matlab you can save a dense matrix using:
>> mmwrite('filename', full(matrixname));
And save a sparse matrix using:
>> mmwrite('filename', sparse(matrixname));
For reading a sparse or dense matrix you can:
>> A = mmread('filename');

Writing a conversion function in Java
This section explains how to convert Mahout 0.4 sequence vectors into matrix market format.

Create a file named Vec2mm.java with the following content:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

/**
 * Code for converting Hadoop seq vectors to matrix market format
 * @author Danny Bickson, CMU
 *
 */

public class Vec2mm{
 
 
 public static int Cardinality;
 
 /**
  * 
  * @param args[0] - input svd file
  * @param args[1] - output matrix market file
  * @param args[2] - number of rows
  * @param args[3] - number of columns
  * @param args[4] - number oi non zeros
  * @param args[5] - transpose
  */
 public static void main(String[] args){
 
   try {
     if (args.length != 6){
        System.err.println(Usage: java Vec2mm <input seq vec file> < output matrix market file> <number of rows> <number of cols> <number of non zeros> <transpose output>);
           System.exit(1);
        }

        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
        BufferedWriter br = new BufferedWriter(new FileWriter(args[1]));
        int rows = Integer.parseInt(args[2]);
        int cols = Integer.parseInt(args[3]);
        int nnz = Integer.parseInt(args[4]);
        boolean transpose = Boolean.parseBoolean(args[5]);
        IntWritable key = new IntWritable();
        VectorWritable vec = new VectorWritable();
        br.write("%%MatrixMarket matrix coordinate real general\n");    
        if (!transpose)
          br.write(rows + " " +cols + " " + nnz + "\n");
        else br.write(cols + " " + rows + " " +  nnz + "\n");
        while (reader.next(key, vec)) {
          //System.out.println("key " + key);
          SequentialAccessSparseVector vect = (SequentialAccessSparseVector)vec.get();
          //System.out.println("key " + key + " value: " + vect);
          Iterator iter = vect.iterateNonZero();

          while(iter.hasNext()){
            Vector.Element element = iter.next();
           //note: matrix market offsets starts from one and not from zero
           if (!transpose)
             br.write((element.index()+1) + " " + (key.get()+1)+ " " + vect.getQuick(element.index())+"\n");
           else 
             br.write((key.get()+1) + " " + (element.index()+1) + " " + vect.getQuick(element.index())+"\n");
           }
       }
  
       reader.close();
       br.close();
   } catch(Exception ex){
     ex.printStackTrace();
   }
 }
}
Compile this code using the command:
javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java

Now run it using the command:
java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/google-collections-1.0-rc2.jar Vec2mm A.seq A.mm 25 100 2500 false

Where A.seq is the input file in SequentialAccessSparseVector key/value store. A.mm
will be the resulting output file in matrix market format. 25 is the number of rows, 100 columns, and 2500 non zeros. false - not to transpose the resulting matrix.

Depends on the saved vector type, you may want to change my code from SequentialAccessSparseVector to the specific type you need to convert.

6 comments:

  1. In the post,
    "TIP: Sparse matrices should include zero entries!"
    maybe have typo. should --> should not

    Right?

    ReplyDelete
  2. For the dense matrix matrices, I think you miss the first two entries: 0.8147 and 0.9058.

    ReplyDelete
  3. I didn't get about dense matrix:

    I suppose (correct me if i'm wrong) that all elements (including zero elements) must follow in order of rows and columns then.

    So I don't understand why

    1) there are strange indents
    2) several elements have difference for 0.0001

    ReplyDelete
    Replies
    1. Hi Alex,
      1) Indents were a blog formatting issue - now fixed.
      2) I used matlab print code which did some rounding of the full numbers, I fixed it now to match.

      Thanks for detecting the typos!

      Delete