While investigating the implementation of an interesting EDA algorithm I became familiar with the \(QR\) functionality of the Eigen linear algebra library. This is the story of how I uncovered a potentially useful speedup in its code.

Background

The \(QR\) decomposition of an \(m \times n\) matrix \(A\) is a pair of matrices with special properties, whose product is the original matrix:

\[A = \overbrace{\begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1m} \\ q_{21} & q_{22} & \cdots & q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ q_{m1} & q_{m2} & \cdots & q_{mm} \end{bmatrix}}^{\displaystyle Q} \quad \overbrace{\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ 0 & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_{mn} \end{bmatrix}}^{\displaystyle R}\]

The \(m \times n\) matrix \(R\) is upper triangular, while the \(m \times m\) matrix \(Q\) is orthogonal, that is, it has the following properties:

  1. Each column is perpendicular to every other column
  2. The columns of \(Q\) are normalized: the norm (magnitude) of each is \(1.0\)

Further, thanks to the decomposition itself, the columns of \(Q\) form an orthonormal basis for the column space of \(A\): every column of the original matrix is a linear combination of the columns of \(Q\).

The Decomposition

(Simplified for clarity)

The matrix \(R\) is created from the input \(A\) by introducing zeros below the diagonal, one column per iteration: on iteration \(k\), multiplication by a Householder matrix \(Q_k\) introduces \(m-k\) zeros below the diagonal of column \(k\). There is a procedure for producing this matrix, the important parts of which are a Householder vector \(h_k\) and a scalar \({\beta}_k\):

\[Q_k = I - {\beta}_k h_k {h_k}^*\]

Because each iteration needs to create one fewer zero, the essential part of \(h_k\) (the entries that are not \(1\) or \(0\)) shrinks each time, to length \(m-k\).

We store the \({\beta}_k\) values and the \(m-1\) \(h_k\) vectors as the factored form of \(Q\). Importantly, we do not explicitly calculate \(Q\). This saves some storage (\(\frac{m(m-1)}{2}\) vs. \(m^2\)), but also has a performance advantage.

Multiplying by Q

The matrix \(R\) is created by applying \(Q_1\) through \(Q_m\) to the original matrix \(A\) in turn:

\[R = Q_m Q_{m-1} \cdots Q_2 Q_1 A\]

And since \(A = QR\):

\[\begin{align*} Q &= (Q_m Q_{m-1} \cdots Q_2 Q_1)^{-1} \\ Q &= {Q_1}^{-1} {Q_2}^{-1} \cdots {Q_{m-1}}^{-1} {Q_m}^{-1} \end{align*}\]

The matrices \(Q_k\) are their own inverses (because they are reflections), so:

\[Q = Q_1 Q_2 \cdots Q_{m-1} Q_m\]

Multiplying some matrix \(C\) by \(Q\) therefore involves multiplying by each of the matrices \(Q_k\):

\[Q C = Q_1 \cdots Q_m C\]

We use the factored form on each iteration:

\[\begin{align*} C_k & = Q_k C_{k+1} \\ C_k & = (I - {\beta}_k h_k {h_k}^*) C_{k+1} \\ C_k & = C_{k+1} - {\beta}_k h_k {h_k}^* C_{k+1} \\ \end{align*}\]

Using the factored form is usually faster than forming \(Q\) explicitly, especially for thin \(C\) (fewer columns than rows).

Extracting Q

The orthogonal basis encoded in the \(Q\) matrix has many useful applications.

Eigen implements \(Q\) extraction by applying its \(Q\)-multiplication functionality to an identity matrix. Specifically:

MatrixXd Q = sparseqr.matrixQ();    // get Q as a dense matrix

calls:

template< typename DstXprType, typename SparseQRType>
struct Assignment<DstXprType, SparseQRMatrixQReturnType<SparseQRType>,
                  internal::assign_op<typename DstXprType::Scalar,
                                      typename DstXprType::Scalar>,
                  Sparse2Dense>
{
  ...
  static void run(DstXprType &dst, const SrcXprType &src,
                  const internal::assign_op<Scalar,Scalar> &)
  {
    dst = src.m_qr.matrixQ() * DstXprType::Identity(src.m_qr.rows(), src.m_qr.rows());
  }
};

which in turn calls the general multiplication function for \(Q\). One day, while reading my favorite math book, “Matrix Computations” (Golub and Van Loan, 4th ed.), I noticed the following comment (page 238) on computing \(Q\):

Recall that the leading (j-1)-by-(j-1) portion of \(Q_j\) is the identity… This pattern can be exploited to reduce the number of required flops.

It then gives an algorithm taking advantage of backward accumulation (applying \(Q_k\) in reverse order), plus the fact that the right hand side is the identity, to work on a smaller portion of the matrix on each iteration. Since Eigen’s sparse \(Q\) multiplication is fully general, this seemed like an opportunity to speed up the code with a specialized algorithm.

Specializing for Identity RHS

Creating a special implementation of an algorithm for certain types is one of the big advantages of C++’s template system. In this case I needed to specialize the SparseQR_QProduct template to supply a different evalTo method when the right hand side is of the type returned by Matrix<>::Identity(). I came up with this:

template<typename SparseQRType, typename Derived>
struct SparseQR_QProduct<SparseQRType,
                         CwiseNullaryOp<internal::scalar_identity_op<typename Derived::Scalar>,
                                        Derived> >
...
{
  ...
  template<typename DesType>
  void evalTo(DesType& res) const
  {
    // textbook code
  }
};

It’s not exactly pretty but I think it’s correct. C++11 and beyond would clean this up, but Eigen supports legacy compilers.

Update

An Eigen maintainer got back to me quite quickly with some good feedback. The most significant point was that a full class specialization was unnecessary, since most of the code was the same. The only difference is a loop iteration limit, which we can calculate at compile time from the SparseQR_QProduct template parameters:

Index mincol = internal::is_identity<Derived>::value ? k : 0;

The compiler should optimize away the choice of limit, and we get all the benefit of specialization without code duplication.

Optimizing Q Multiplication in General

While comparing the algorithm implemented in Eigen with my textbook I noticed another interesting difference: In Eigen, the outer loop is over the columns of the result, while in my textbook it’s over the Householder vectors \(h_k\). In Eigen, \(R = Q C\) (for \(C \in \mathbb{C}^{m \times p}\)) looks like this:

R = C
for j = 1:p
    for k = min(m, n):1
        \(R_j = R_j - {\beta}_k h_k ({h_k}^* \cdot R_j) \)
    end
end

which seems reasonable. Reordering the loops has a couple of advantages, though:

  1. We can multiply \({\beta_k}\) and \(h_k\) once and reuse the product.
  2. Because \(h_k\) is generally dense, storing this intermediate product as a dense vector is faster to access.

So the new product code looks like:

R = C
for k = min(m, n):1
    beta_hk = \({\beta}_k h_k\)
    for j = 1:p
        \(R_j = R_j \; - \) beta_hk \(({h_k}^* \cdot R_j) \)
    end
end

Results

These changes produce a nice speedup, particularly for \(Q\) extraction. Results on my laptop for a random \(1000 \times 1000\) matrix with 20% density are shown below:

(Figure: performance comparison of the original and optimized implementations)

Resources

My benchmarking (based on Google Benchmark) and random testing code for this change can be found on GitHub.

The PR itself is on Bitbucket. Fingers crossed. Edit: results were mixed. Details in a new post.