Numerical Linear Algebra
I took these notes in fall 2016 for CME 302, taught by Professor Eric Darve.
The Sherman-Morrison-Woodbury identity calculates the inverse of a matrix after a low-rank update:
Important vector norms include the -norms:
The induced norm from a vector norm is defined as They satisfy . In some cases, the induced norm has a special form: The Frobenius norm is defined as These norms are all submultiplicative, i.e. .
\todo Norm equivalence?
\todo \todo spectral radius: symmetric implies
The unit roundoff for floating-point arithmetic is half the gap between and the next largest floating-point number. It is for double-precision ( bits of precision). The relative error after an elementary floating point operation is bounded above by . Be aware of catastrophic cancellation: when adding two numbers of different magnitude, loss of precision may occur. The condition number of a matrix is defined as , with if is not invertible. Note that . When solving , the relative error in is bounded above by the times the sum of the relative errors in and , for small error (with negligible). The condition number is minimized for orthogonal matrices , with . The number of bits of accuracy lost is roughly . The relative error of with true value is . The forward error of a numerical algorithm is . The backward error is , where ; that is, is the perturbed such that the exact algorithm would compute the computed answer. If the relative backward error of an algorithm is , then the algorithm is said to be backward stable.
Singular value decomposition
where and are unitary, and . They satisfy Note that and .
\todo Properties of SVD?
\todo Thin SVD
For a square matrix , we can often compute its LU decomposition where is unit lower-triangular and is upper-triangular. This factorization is an algebraic description of Gaussian elimination. With the LU decomposition , solving becomes simple. We first solve by forward substitution, then by backsubstitution, both with complexity . To upper-triangularize , we repeatedly apply Gauss transformations, where the th Gauss transformation zeroes out the lower part of the th column. Thus, we will have where the first columns of are upper-triangular, and the first columns of are unit lower-triangular. The th Gauss transformation is an outer product update where the vector has components and is the th column of . The entry is called the pivot and therefore must be non-zero. Conveniently, has the form We repeat this, if possible, for to obtain the LU decomposition. The th pivot is non-zero if and only if the first -by- submatrix of is invertible. Therefore, the LU factorization exists if and only if this condition is satisfied for . If itself is also invertible, then the LU factorization is unique. Pivots that are equal to or close to zero present numerical difficulties (i.e. LU is not backward stable). To avoid this, before the th Gauss transformation, we can exchange rows and/or columns so that the pivot becomes larger (by magnitude). This computes the decomposition , where permutes rows and permutes columns. In this case, the decomposition always exists (if is invertible). There are three common strategies for choosing a row and/or column to swap the th row and/or column with: Partial pivoting: find the largest (by magnitude) entry within Complete pivoting: find the largest entry within Rook pivoting: search for an entry that is the largest within its row and column by starting from and alternately ascending within its row, then its column, etc. until convergence With any of these strategies, the entries of will have magnitude at most , leading to a stable algorithm. In the LU algorithm (with or without pivoting), the naive algorithm requires time for steps, the same as Gaussian elimination’s for steps. However, in partial or rook pivoting, we can reduce this to if we assume that the pivot is found within linear searches. In complete pivoting, we can stop once the pivot is approximately , indicating the rank of the matrix. A useful tool in analysis is the Schur complement. In the partial decomposition it will turn out that , the Schur complement of .
If is a symmetric positive-definite matrix, then there exists a unique lower-triangular matrix with positive diagonal entries such that . This is the Cholesky decomposition, easily calculable from the LU decomposition. No pivoting will be required (because is positive-definite), even for stability! Closely related is the LDL decomposition, , where is unit lower-triangular and is diagonal.
For an -by- matrix , we can always compute its QR decomposition where and are square, is orthogonal and is upper-triangular. The thin QR decomposition is unique if is full column-rank; that is, and are unique up to the sign of each column. To upper-triangularize , we can apply orthogonal transformations to :
A Householder reflection is an orthogonal transformation that has the ability to transform
If we define
for , then , which is exactly this property. Geometrically, it reflects vectors across the hyperplane determined by . Notice this is an outer product update; it takes time to compute and apply . Thus, a sequence of Householder reflections suffices to compute the QR decomposition, the th one choosing to be the last components of the th column. Thus it takes time to compute . Often it suffices to store only the vectors , thus storing in factored form.
A Givens rotation is an orthogonal transformation that has the ability to transform
It’s a Householder reflection restricted to two dimensions. The calculation and application of a Givens rotation only affects two rows and so is ; similarly, a Givens rotation only affects two columns and so is . Givens rotations are useful to compute the QR decomposition of sparse matrices.
The Gram-Schmidt process orthonormalizes a set of vectors by subtracting off components in already processed directions. To compute the th vector, we compute: with normalized to unit norm. To compute the QR decomposition, we apply this to the columns of . Then the entries form an upper-triangular matrix, because so if . This algorithm, classical Gram-Schmidt, is numerically unstable. Modified Gram-Schmidt has better numerical properties by orthonormalizing against any error introduced, using a recursive implementation rather than an iterative one. To compute the th vector, we compute These two algorithms require operations. Computing the QR decomposition with Gram-Schmidt has the advantage of producing orthonormal vectors after iterations, whereas computing the QR decomposition with Householder reflections only produces the full set of vectors after the whole process is complete.
Dense linear systems
To solve exactly, we calculate , and the problem becomes solving two easy triangular systems.
The least-squares problem finds the that minimizes . Geometrically, the error must be orthogonal to the range of , i.e. . We can try to solve directly by computing the Cholesky decomposition of . However, , so this is ill-conditioned if is large already. For a better-conditioned algorithm, we can write the thin QR decomposition and express the orthogonality condition as , or . The error has magnitude . If is not injective, then can only be minimized up to a vector in the kernel of . We can additionally minimize to uniquely specify a least-squares solution. In this case, we can write the thin SVD and express the orthogonality condition as , or . Similarly, the error has magnitude .
Dense eigenvalue problems
There are many methods to find the eigenvectors of a matrix . Once an eigenvector is estimated, an estimate for the eigenvalue can be found by for normalized. Power iteration finds the eigenvector with the largest eigenvalue by iterating on a random vector, normalizing the vector after each step. The error decays as . Inverse iteration finds the eigenvector with eigenvalue nearest a guess by applying power iteration to . This works because has an eigenvector with eigenvalue approximately , namely the relevant eigenvector of . The error decays as where is the next nearest eigenvalue. Rayleigh quotient iteration works only on symmetric matrices. We perform inverse iteration as usual, but at every step, we set the guess eigenvalue with the latest estimated eigenvalue . This yields cubic convergence, . For symmetric matrices, it converges almost always, but the same is not true for non-symmetric matrices. Orthogonal iteration is a batch version of power iteration. Let be an orthogonal basis for a subspace; then iterating on , we should obtain an invariant subspace. Note that we are also implicitly iterating on the nested subspaces for all . If we keep orthogonal at every step with the QR decomposition, we prevent the subspaces from collapsing into fewer dimensions. If has columns, the convergence to the correct subspace is . Since a fixed point of this iteration is for upper-triangular, we see that this algorithm computes the Schur decomposition of . Since is similar to , they have the same eigenvalues, and since it is triangular, they appear on the diagonal. Thus, we can interpret the entries on the diagonal of as approximations to the eigenvalues of , even though the columns of are not necessarily eigenvectors (unless is symmetric). Notice that we can compute directly from . From , obtain and . This motivates the QR algorithm. The QR algorithm computes directly from the QR decomposition of . (Note that here is from orthogonal iteration.) We see from the equation that the eigenvalues are preserved at each step. Again, converges to be upper-triangular with the eigenvalues of on the diagonal.
There are several optimizations that make this algorithm practical:
Reduction to upper-Hessenberg form: In the general case, each step takes time, prohibitively expensive. However, we can preconvert into upper-Hessenberg form by conjugating with Householder reflections, in operations. Then at each QR step, we can compute the QR decomposition with only Givens rotations, so that each step requires only operations. Each QR step preserves the upper-Hessenberg form (). Deflation: If at some point becomes unreduced upper-Hessenberg (that is, one of the subdiagonal entries becomes ), then the matrix becomes block upper-triangular, and the eigenvalues of the whole matrix can be found by finding the eigenvalues of the diagonal blocks. Bulge chasing: At each QR step, the usual method is to calculate Givens rotations and apply them first on the left and then on the right: This is wasteful as it requires storing the Givens rotations. It turns out that it is possible to achieve the same result with an alternate sequence of Givens rotations, multiplied from in to out This relies on the upper-Hessenberg form of the matrix. We conjugate by Givens rotations, where we calculate the sequence of Givens rotations as follows (rotating into ): These two methods turn out to be equivalent by the implicit Q theorem, which says that if and are both unreduced upper-Hessenberg for and orthogonal, then if the first columns of and are the same, then all the columns are the same up to sign. Shifted QR: If at each step we compute and , we arrive at shifted QR iteration. The eigenvalues are preserved as in usual QR step, except the rate of convergence, which now is , is accelerated if is close to an eigenvalue. Thus, in a single-shift strategy, we choose to be the last entry in , which is (likely the best) approximate eigenvalue. If the last -by- block has complex eigenvalues, however, this strategy doesn’t converge. In this case, we can use a double-shift strategy, which shifts by both eigenvalues (which are complex conjugate pairs) sequentially. We can remain in real arithmetic by instead computing the QR decomposition of , where is the trace and is the determinant of the subblock respectively, and setting . The cost of calculating can be avoided by appealing to the implicit theorem: we can calculate just the first column of and finish using Householder reflections in a bulge chasing pattern, in time. The error decays as or for symmetric matrices.
Sparse eigenvalue problems
When applying QR iteration to sparse matrices, reduction to upper-Hessenberg form using Householder reflections destroys sparsity. Instead, it is possible to achieve the reduction iteratively using the Gram-Schmidt process.
The th Krylov subspace of a matrix and a starting vector is Arnoldi iteration calculates an orthogonal basis that spans via modified Gram-Schmidt. is upper-Hessenberg by construction (see below), and we can use its eigenvalues to approximate the eigenvalues of . These eigenvalues are called Ritz eigenvalues and often converge to the extremal eigenvalues of . Therefore, after steps of Arnoldi iteration, we may apply the usual QR algorithm to the -by- upper-Hessenberg matrix to find the extremal eigenvalues of . To obtain , it suffices to orthonormalize instead of , since To see this, note that By induction, only the component can be orthogonal to , so we may consider only it instead of . Since is orthonormalized against , this means that is upper-Hessenberg, and indeed, so is . Note that , but is a projection into . Thus we may write . Lanczos iteration is the same process for symmetric matrices. In this case, is symmetric as well and thus tridiagonal. Remarkably, this means that in the Lanczos basis, only has components in the , , and directions. Hence, in each Gram-Schmidt step, we need to subtract the projections onto and . This works in theory, but when using inexact arithmetic, the calculated vectors do not tend to stay orthogonal to each other. Lanczos iteration terminates in at most steps, where is the number of distinct eigenvalues of . To see this, decompose into its components in the eigenspaces. The powers can be written as a linear combination of these components, so can grow to be at most -dimensional. Each iteration requires a matrix-vector product and orthogonalizing against vectors. Thus, the work per iteration is plus the work required for a matrix-vector product (which may be cheap, since is usually sparse).
Sparse linear systems
Sparse matrix formats
For sparse direct methods, it’s important how the matrices are stored. The compressed sparse row (CSR) format allow efficient matrix-vector multiplications and row access. It keeps a list of non-zero entries and their column numbers in row-major order and a list of row pointers that indicate the start of each row in the first list. The compressed sparse column (CSC) format is analogous for columns.
Sparse triangular solve
The most basic ingredient for solving sparse linear systems is solving a triangular system , where and are both sparse. This is done in two steps: determining which entries of can be non-zero and then computing the actual numerical solution. To compute which entries of can be non-zero, note that the solution via forward substitution is so we find two rules: if is non-zero, then may be non-zero; if and are both non-zero, then may be non-zero. The reach of a is the set of entries that may be non-zero if is non-zero. We can visualize these rules in two ways. Visually on the matrix , to find the reach of , first mark . Now trace down the th column of the matrix, and if the th entry is non-zero, then mark . Then repeat down the th column, for every entry so marked, recursively until out of columns to trace. More abstractly, we can consider the graph whose adjacency matrix is , directed low to high. Then the reach of is the set of nodes reachable from the node .
Cholesky decomposition for sparse matrices
If is symmetric positive-definite, we can solve linear systems using the Cholesky decomposition. To derive the up-looking Cholesky factorization, we note that it must satisfy From this, we find that Therefore: may be solved recursively from , is a sparse triangular solve involving the corresponding row , and involves a sparse dot product. Thus each row of is a sparse triangular solve involving the submatrix above it, allowing us to compute row-by-row. We can think of the resulting sparsity pattern of as a “filling in” of . When we calculate the sparsity pattern of the th row (according to ), we would normally compute the graph of and compute the reach of in it. However, here we can leverage the special structure of the problem. There are two optimizations: is a submatrix of , so when we build the graph of , we may simply take the graph of and add on the edges arising from the entries in . Instead of using the full graph, we may consider only the elimination tree, which is the subgraph of the full adjacency graph that indicates only the first non-diagonal entry of each column. A node in the elimination tree has the same reach as in the full graph of : suppose two entries in the same column, and , are non-zero. When computing the th row, exists, so the would force the fill-in of . If is the first entry in its column, then an edge exists, making the edge redundant. Otherwise, we may repeat the argument to show that a path exists, again making the edge redundant.
To solve , we can split for invertible. Then we can rewrite , so that the solution is a fixed point of the function , where is the iteration matrix. We easily see that the error at the th iteration . Thus, the iterative method converges (to ) if and only if the spectral radius of satisfies . One way to establish the bound is to show that for some matrix norm. Thus, for a method to be useful, it must be that for a nice class of matrices, and must be easily invertible. Several classical examples are the following, where , , and denote the strictly lower-triangular, diagonal, and strictly upper-triangular parts of : Jacobi iteration: We choose . If is strictly diagonally dominant, i.e. for all , , then Jacobi iteration converges. Gauss-Seidel iteration: We choose . If is symmetric positive-definite, then Gauss-Seidel iteration converges. Successive over-relaxation: We choose and , where we choose to minimize .
Krylov subspace methods
By the Cayley-Hamilton theorem, we can always find the solution in the th Krylov subspace. For a symmetric positive-definite matrix , conjugate gradient is an iterative method that finds the that minimizes , where is used as an inner product matrix (or, equivalently, ). The Hilbert space projection theorem says that the error is -orthogonal to the subspace , or If we transform to the Lanczos basis via and using the tridiagonal matrix , then we have To solve this efficiently, we develop a recurrence for the LDL factorization . Then we can develop a recurrence for , where solves . Then we conclude that , where has a particularly nice recurrence too. We can even easily calculate the residual . Each step requires work plus the time required for a matrix-vector product. It is possible to show that the total work is , where is the number of entries of , and is the condition number. To develop a convergence theory, notice that can be written as for some degree- polynomial . Thus, we can view the minimization of as the minimization of over degree polynomials , or the minimization of over degree- polynomials with . Thus, intuitively, we would like to be small for eigenvalues of . Roughly, convergence occurs in steps, where is the number of clusters of eigenvalues has. We can derive a bound on the error as
MINRES (“minimum residual”) is another Krylov subspace method similar to conjugate gradient, but it works for all symmetric invertible matrices, even indefinite. It instead minimizes the residual for (or ). In the Lanczos basis, this becomes for , which is a least-squares problem solvable using a QR decomposition that has a nice recurrence, like conjugate gradient does for LDL.
GMRES (“generalized minimum residual”) is another Krylov subspace method that works for general square invertible matrices. It is the same as MINRES but uses Arnoldi iteration instead of Lanczos iteration. The updates to the QR decompositions, however, requires the use of Givens rotations. This is much more expensive in terms of time and storage; there are ways to get around it, e.g. restarting after steps. It’s much more subtle! Using similar polynomial analysis as in conjugate gradient, this minimization problem is implicitly minimizing over degree- polynomials such that . If is diagonalizable, we get a bound on the residual
The problem is identical to the problem , where is a preconditioner. A preconditioner is good if is not too far from normal (so ) and its eigenvalues are clustered (so can be made small for small ).
Iterative refinement improves the accuracy of a solution method by repeatedly solving the problem on different scales. Suppose we use some method to find that approximately solves . Then we solve to receive the improved solution .