While investigating the implementation of an interesting EDA algorithm I became familiar with the functionality of the Eigen linear algebra library. This is the story of how I uncovered a potentially useful speedup in its code.

Background

The \(QR\) decomposition of an \(m \times n\) matrix \(A\) is a pair of matrices with special properties, whose product is the original matrix:

\[ A = QR \]

The matrix \(R\) is upper triangular, while the matrix \(Q\) is orthogonal, that is, it has the following properties:

  1. Each column is perpendicular to every other column
  2. The columns of \(Q\) are normalized: the norm (magnitude) of each is \(1\)

Further, thanks to the decomposition itself the columns of \(Q\) are a basis for \(\mathbb{R}^m\): every column in the original matrix is a linear combination of the columns of \(Q\).
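As a quick illustration of these properties, here is a small dense-Eigen snippet (separate from the sparse code this post is about) that computes a decomposition and checks that \(Q\) is orthogonal and that \(QR\) reproduces \(A\):

#include <Eigen/Dense>
#include <iostream>

int main()
{
  // A small random matrix, just to demonstrate the properties above.
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(5, 3);
  Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);

  Eigen::MatrixXd Q = qr.householderQ();   // the 5x5 orthogonal factor
  Eigen::MatrixXd R = qr.matrixQR().triangularView<Eigen::Upper>();

  // Both norms should be ~0, up to rounding error.
  std::cout << (Q.transpose() * Q - Eigen::MatrixXd::Identity(5, 5)).norm() << "\n"
            << (A - Q * R).norm() << "\n";
}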

The Decomposition

(Simplified for clarity)

The matrix \(R\) is created from the input \(A\) by introducing zeros below the diagonal of column \(k\) on each iteration, through multiplication by a Householder matrix \(H_k\). There is a procedure for producing this matrix, the important parts of which are a Householder vector \(h_k\) and a scalar \({\beta}_k\):

\[ H_k = I - {\beta}_k h_k {h_k}^* \]

Because the Householder matrix creates one fewer zero each iteration, the essential part (not \(0\) or \(1\)) of \(h_k\) is smaller each time, of length \(m - k\).
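Concretely, with the usual normalization \(h_k\) has \(k - 1\) leading zeros and a \(1\) in position \(k\), so only the trailing \(\times\) entries (the essential part) need to be stored:

\[ h_k = (\underbrace{0, \ldots, 0}_{k-1},\; 1,\; \times, \ldots, \times)^T \]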

We store the \({\beta}_k\) values and the \(h_k\) vectors as the factored form of \(Q\). Importantly, we do not explicitly calculate \(Q\). This saves some storage (roughly \(mn\) entries vs. \(m^2\) for a thin input), but also has a performance advantage.

Multiplying by Q

The matrix \(R\) is created by multiplying the original matrix \(A\) by each of the Householder matrices in turn:

\[ R = H_n \cdots H_2 H_1 A \]

And since \(A = (H_n \cdots H_2 H_1)^{-1} R\):

\[ A = QR, \quad \text{where } Q = (H_n \cdots H_2 H_1)^{-1} \]

The \(H_k\) matrices are their own inverses (because they are reflections), so:

\[ Q = H_1 H_2 \cdots H_n \]

To multiply some matrix \(C\) by \(Q\) involves multiplying by each of the \(H_k\) matrices:

\[ QC = H_1 H_2 \cdots H_n C \]

We use the factored form on each iteration:

\[ H_k C = C - {\beta}_k h_k ({h_k}^* C) \]

This is usually faster than multiplying by an explicitly formed \(Q\), especially for a thin (fewer columns than rows) \(A\).
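In code, multiplying by the factored \(Q\) might look something like the following sketch (my illustration of the technique, not Eigen's implementation; I'm assuming the essential parts of the \(h_k\) are stored below the diagonal of a matrix H, with their leading ones implicit, and the scalars in a vector beta):

#include <Eigen/Dense>

using Eigen::Index;
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Overwrite C with QC = H_1 H_2 ... H_n C, applying H_n first.
// Each H_k is applied as a rank-1 update; Q is never formed.
void applyFactoredQ(MatrixXd& C, const MatrixXd& H, const VectorXd& beta)
{
  const Index m = H.rows(), n = H.cols();
  for (Index k = n - 1; k >= 0; --k) {
    VectorXd h = VectorXd::Zero(m);
    h(k) = 1.0;                                   // implicit leading 1 of h_k
    h.tail(m - k - 1) = H.col(k).tail(m - k - 1); // essential part of h_k
    VectorXd w = C.transpose() * h;               // w^T = h_k^* C
    C -= beta(k) * h * w.transpose();             // C <- H_k C
  }
}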

Extracting Q

The orthogonal basis encoded in the \(Q\) matrix has many useful applications.

Eigen implements Q extraction by applying its multiplication functionality to an identity matrix. Specifically:

MatrixXd Q = sparseqr.matrixQ();    // get Q as a dense matrix

calls:

template< typename DstXprType, typename SparseQRType>
struct Assignment<DstXprType, SparseQRMatrixQReturnType<SparseQRType>,
                  internal::assign_op<typename DstXprType::Scalar,
                                      typename DstXprType::Scalar>,
                  Sparse2Dense>
{
  ...
  static void run(DstXprType &dst, const SrcXprType &src,
                  const internal::assign_op<Scalar,Scalar> &)
  {
    dst = src.m_qr.matrixQ() * DstXprType::Identity(src.m_qr.rows(), src.m_qr.rows());
  }
};

which in turn calls the general multiplication function for \(Q\). One day, as I was reading my favorite math book, “Matrix Computations” (Golub and Van Loan, 4th ed.), I noticed the following comment (page 238) on computing \(Q\):

Recall that the leading (j-1)-by-(j-1) portion of \(H_j\) is the identity… This pattern can be exploited to reduce the number of required flops.

It then gives an algorithm taking advantage of backward accumulation (applying the \(H_k\) in reverse order), plus the fact that the right-hand side is the identity, to work on a smaller portion of the matrix on each iteration. Since Eigen’s sparse multiplication is fully general, this seemed like an opportunity to speed up the code with a specialized algorithm.
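To make the textbook idea concrete, here is a dense sketch of backward accumulation (my paraphrase, not the eventual patch; it reuses the H/beta storage assumed in the earlier sketch):

// Accumulate Q = H_1 H_2 ... H_n by applying the H_k to the identity
// in reverse order. When H_k is applied, rows and columns 0..k-1 of
// the running result are still those of the identity and H_k leaves
// them unchanged, so each update touches only a shrinking trailing block.
MatrixXd accumulateQ(const MatrixXd& H, const VectorXd& beta)
{
  const Index m = H.rows(), n = H.cols();
  MatrixXd Q = MatrixXd::Identity(m, m);
  for (Index k = n - 1; k >= 0; --k) {
    VectorXd h = VectorXd::Zero(m - k);           // h_k restricted to rows k..m-1
    h(0) = 1.0;                                   // implicit leading 1
    h.tail(m - k - 1) = H.col(k).tail(m - k - 1); // essential part
    auto blk = Q.bottomRightCorner(m - k, m - k);
    VectorXd w = blk.transpose() * h;             // w^T = h_k^* (trailing block)
    blk -= beta(k) * h * w.transpose();           // update the trailing block only
  }
  return Q;
}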

Specializing for Identity RHS

Creating a special implementation of an algorithm for certain types is one of the big advantages of C++’s template system. In this case I needed to specialize the SparseQR_QProduct template to supply a different evalTo method when the right hand side is of the type returned by Matrix<>::Identity(). I came up with this:

template<typename SparseQRType, typename Derived>
struct SparseQR_QProduct<SparseQRType,
                         CwiseNullaryOp<internal::scalar_identity_op<typename Derived::Scalar>,
                                        Derived> >
...
{
  ...
  template<typename DesType>
  void evalTo(DesType& res) const
  {
    // textbook code

It’s not exactly pretty but I think it’s correct. C++11 and beyond would clean this up, but Eigen supports legacy compilers.

Update

An Eigen maintainer got back to me quite quickly with some good feedback. The biggest point was that a full class specialization was unnecessary, since most of the code was the same. The only difference was in a loop iteration limit, which we can calculate at compile time from the SparseQR_QProduct template parameters:

Index mincol = internal::is_identity<Derived>::value ? k : 0;

The compiler should optimize away the choice of limit, and we get all the benefit of specialization without code duplication.
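Schematically, the shared loop then looks something like this (my sketch of the idea, not the actual patch; res, h_k, beta, and n stand in for the real per-product state):

// When the right-hand side is the identity, column j of the running
// result is still e_j for j < k, and H_k e_j = e_j, so those columns
// can be skipped; otherwise every column must be processed.
for (Index k = n - 1; k >= 0; --k) {
  Index mincol = internal::is_identity<Derived>::value ? k : 0;
  for (Index j = mincol; j < res.cols(); ++j)
    res.col(j) -= beta(k) * h_k * h_k.dot(res.col(j));   // apply H_k
}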

Optimizing Q Multiplication in General

While comparing the algorithm implemented in Eigen with my textbook, I noticed another interesting difference: in Eigen, the outer loop is over the columns of the result, while in my textbook it’s over the Householder vectors \(h_k\). In Eigen, \(QC\) (for a \(C\) with \(p\) columns) looks like this:

R = C
for j = 1:p
    for k = min(m, n):1
        \(R_j = R_j - {\beta}_k h_k ({h_k}^* \cdot R_j) \)
    end
end

which seems reasonable. Reordering the loops has a couple of advantages, though:

  1. We can multiply \({\beta}_k\) and \(h_k\) once and reuse the product.
  2. Because \(h_k\) is generally dense, storing this intermediate product as a dense vector is faster to access.

So the new product code looks like:

R = C
for k = min(m, n):1
    beta_hk = \({\beta}_k h_k\)
    for j = 1:p
        \(R_j = R_j \; - \) beta_hk \(({h_k}^* \cdot R_j) \)
    end
end

Results

These changes produce a nice speedup, particularly for \(Q\) extraction. Results on my laptop for a random sparse matrix with 20% density are shown below:

[Figure: Performance]

Resources

My benchmarking code (based on Google Benchmark) and randomized testing code for this change can be found on GitHub.

The PR itself is on Bitbucket. Fingers crossed. Edit: results were mixed. Details in a new post.