Sparse Matrix Libraries for C++: A Tour

In my last post I described my ideal sparse matrix library. In this post I’ll demonstrate the use of some real life libraries.

The Test Case

In days past I was a VLSI circuit designer, and later, an EDA software engineer. On-chip electrical circuits are naturally represented as sparse matrices, as there tend to be many circuit nodes but relatively few connections between them. Model Order Reduction is a technique that can further reduce the data requirements by generating a smaller model that is a good approximation to the original.

Of particular interest for me is an M.O.R. algorithm called Prima. In Matlab/Octave the first steps look like this:

G = [  ]                    # conductance matrix (large and sparse)

B = [  ]                    # source matrix (less large, but sparse)

A = sparse(G) \ sparse(B)   # calculate G^-1*B

[Q, R] = qr(A)              # extract Q (an "orthonormal basis" for A)

This amounts to creating two sparse matrices, solving for a third (with an LU decomposition), and then performing a QR decomposition on the result.

Eigen Implementation

Eigen comes the closest, in terms of expressiveness, to the original code. We start by collecting the sparse, i.e., nonzero entries:

using namespace Eigen;

vector<Triplet<double>> Gentries{...};     // 15x15, sparse
vector<Triplet<double>> Bentries{...};     // 15x3, sparse

// build sparse matrices from triplet lists
SparseMatrix<double> G(15, 15);
G.setFromTriplets(Gentries.begin(), Gentries.end());
SparseMatrix<double> B(15, 3);
B.setFromTriplets(Bentries.begin(), Bentries.end());
    
// calculate A = G^-1*B with Sparse LU
SparseLU<SparseMatrix<double>, COLAMDOrdering<int>> LU(G);
assert(LU.info() == Success);
SparseMatrix<double> A = LU.solve(B);

// run QR on A
SparseQR<SparseMatrix<double>, COLAMDOrdering<int>> QR(A);
assert(QR.info() == Success);

// extract thin Q as dense matrix
MatrixXd Q = QR.matrixQ() * MatrixXd::Identity(15, QR.rank());

It’s pretty straightforward, and has little boilerplate.

I find there are a few drawbacks to Eigen:

The API for decompositions is inconsistent between sparse and dense - and even within the dense decompositions
The implementations of the sparse decompositions are specific to Eigen and not accelerated. There are wrappers for some external libraries, but those lack some of the features of the native modules
The C++ is good but a bit dated - very C++98 - due to the age of the codebase and the desire to support legacy users
To reduce copying in expression template code, references are kept to inner expressions and results are calculated from them lazily. Users have to be very careful to convert to a concrete Matrix type before exiting a scope. In particular, the use of “auto” requires care.

Still, Eigen is my choice for prototyping in C++, under the “first, make it correct” philosophy.

CSparse Implementation

CSparse occupies the other end of the spectrum from Eigen. It is basically a C library with some grudging accommodation for C++ users. However, it also has some advantages:

High accuracy and decent performance, thanks to its authorship by a well-known expert in sparse direct methods, Tim Davis
It is very well documented in a book and a series of lectures (though less so elsewhere!). It’s a good place to start if you want to understand how this kind of code works.

It’s also quite a bit more verbose, thanks to its C heritage. Creating a sparse matrix is a two step process:

// create a triplet matrix
cs * TG = cs_spalloc(15, 15, Gentries.size(), 1, 1);

for ( size_t i = 0; i < Gentries.size(); ++i) {
    assert(cs_entry(TG, Gentries[i].row, Gentries[i].col,Gentries[i].value));
}

// create a "cs" structure from the triplet matrix
cs *G = cs_compress(TG); cs_spfree(TG);

Notice the manual memory management (cs_spalloc, cs_spfree). We can clean this up in C++ with unique_ptr and custom deleters, but it’s extra work. Next, we create a dense matrix to hold the solve result, and initialize it with the input B:

// initialize A (the eventual result) with the contents of B
vector<double> Adense(15*3);
fill(Adense.begin(), Adense.end(), 0);
for ( size_t i = 0; i < Bentries.size(); ++i) {
    Adense[15*Bentries[i].col+Bentries[i].row] = Bentries[i].value;
}

We have to supply some temporary space for the decompositions to use, and separately call symbolic (nonzero-tracing) and numeric (calculation) parts of the decomposition:

vector<double> workspace(15);
// symbolic factorization first
css * S = cs_sqr ( 3,   // order amd(A'*A)
                   G,
                   0 ); // for use by LU,  not QR
// then numeric
csn * N = cs_lu ( G, S, numeric_limits<double>::epsilon() );

There doesn’t seem to be a canned way to have a matrix (as opposed to vector) right hand side, so we iterate over the columns:

// now use the result to solve G^-1*B one column at a time
for ( int j = 0; j < 3; j++ ) {
    cs_ipvec ( N->pinv, Adense.data()+15*j, workspace.data(), 15 );
    cs_lsolve ( N->L, workspace.data() ) ;
    cs_usolve ( N->U, workspace.data() ) ;
    cs_ipvec ( S->q, workspace.data(), Adense.data()+15*j, 15 );
}

After converting A back to sparse we proceed with the QR decomposition:

    S = cs_sqr( 3,
                A,
                1 ); // for QR decomposition

    csn * result = cs_qr(A, S);

then manually apply the resulting Householder vectors to form a dense Q:

// create a *dense* identity matrix of the right size
double * I = (double *)cs_calloc (15*3, sizeof(double));
I[0*15+0] = 1;
I[1*15+1] = 1;
I[2*15+2] = 1;

// apply householder vectors to I to produce Q
double * Q = I;

for ( csi j = 0; j <= 2; j++) {
    double * col = Q + 15*j;
    // make x a copy of the jth column of I
    copy(col, col+15, x);

    // apply the Householder vectors that comprise Q
    for (csi k = j; k >= 0; k--) {
        cs_happly( V, k, result->B[k], x );
    }
    // apply the row permutation
    cs_ipvec( P, x, col, 15 );

}

That was quite tedious! On the plus side, it’s pretty easy to see what the code is doing underneath - the code is relatively flat and has no “magic” anywhere. If you don’t need acceleration, and the license (LGPL v2.1) is acceptable to you, this may be a good choice, though some good wrappers (and possibly additional code to avoid dense conversions) would be important.

SuiteSparse

SuiteSparse is a collection of libraries from the same author as CSparse, but is more advanced in functionality, performance, and C++ usage. I implemented my test case using the CHOLMOD, SPQR, and KLU components. The first two have built-in CUDA acceleration and the last is described as particularly suitable for electrical circuits, which is what I have. It begins like this:

cholmod_common Common, *cc ;
cc = &Common ;
cholmod_l_start (cc) ;

cholmod_triplet * Gct = cholmod_l_allocate_triplet(
    15, 15, Gentries.size(),
    0,  // stype: both upper and lower are stored
    CHOLMOD_REAL,
    cc);
for ( size_t i = 0; i < Gentries.size(); ++i) {
    reinterpret_cast<long *>(Gct->i)[i] = Gentries[i].row;
    reinterpret_cast<long *>(Gct->j)[i] = Gentries[i].col;
    reinterpret_cast<double *>(Gct->x)[i] = Gentries[i].value;
}
Gct->nnz = Gentries.size();

// convert triplet matrix to sparse
cholmod_sparse * G = cholmod_l_triplet_to_sparse(Gct, Gentries.size(), cc);

Notice the mandatory cholmod_common object that is passed to all methods. It has what amounts to a separate constructor (initializer?) cholmod_l_start and destructor cholmod_l_finish which we can take care of with an RAII wrapper. cholmod_l_allocate_triplet plays a constructor role as well, and has a complementary cholmod_l_free_triplet “destructor”. unique_ptr with a custom deleter can handle this as it did in CSparse, though the deleter has to retain a reference to the common object.

Also note the _l in the functions. This nomenclature distinguishes C functions that use long ints from those that use ints, something we would normally handle with a template argument in C++.

As in the CSparse case, I could not find a way to avoid using an intermediate dense matrix (although I did find one library, SuperLU, that can produce a sparse solve output).

// initialize A with contents of B
cholmod_dense * Adense = cholmod_l_zeros( 15, 3, CHOLMOD_REAL, cc );
for ( size_t i = 0; i < Bentries.size(); ++i) {
    reinterpret_cast<double*>(Adense->x)[15*Bentries[i].col+Bentries[i].row] = Bentries[i].value;
}

The LU decomposition with KLU goes much the same way as it did in CSparse. Like CHOLMOD, KLU uses a common object:

klu_l_common kcommon;
klu_l_defaults( &kcommon );
klu_l_symbolic * KS = klu_l_analyze( 15,
                                     reinterpret_cast<long*>(G->p),
                                     reinterpret_cast<long*>(G->i),
                                     &kcommon );
klu_l_numeric  * KN = klu_l_factor(  reinterpret_cast<long*>(G->p),
                                     reinterpret_cast<long*>(G->i),
                                     reinterpret_cast<double*>(G->x),
                                     KS,
                                     &kcommon );
klu_l_solve ( KS,          // Symbolic factorization
              KN,          // Numeric
              15,
              3,
              reinterpret_cast<double*>(Adense->x),
              &kcommon );

// convert to cholmod_sparse
cholmod_sparse * A = cholmod_l_dense_to_sparse( Adense, 1, cc );

Now we use SPQR to get the Q matrix:

cholmod_sparse *Q, *R;   // outputs
assert( SuiteSparseQR<double> ( SPQR_ORDERING_DEFAULT, SPQR_DEFAULT_TOL, 3, A,
                                &Q, &R, nullptr, cc ) >= 0);

Notice in this case the “thin” (15x3) Q result is directly available as a sparse matrix, a feature not offered by either Eigen or CSparse.

Aside: Implementation and Raw Loops

One place where I think Modern C++ could make a big impact in all of these libraries is in the use of standard algorithms. When you think about it, operations on sparse matrices come down to manipulating sorted sequences and efficiently handling containers - the bread and butter of the standard library. I suspect many of the core algorithms could be replaced with much more elegant code using <algorithm>, and with the Parallelism TS you’d have a natural upgrade path to acceleration.

Conclusions

If I were starting a new project I might find a way to re-use the architecture of a recent C++11/14 library, like Blaze, and add bindings for SuiteSparse to the back end. If a more permissive license than LGPL is needed, working to improve Eigen might be the better bet.

The full versions of the above code can be found on GitHub.