Casey Chu
• Numerical Linear Algebra
• I took these notes in fall 2016 for CME 302, taught by Professor Eric Darve.
• \tableofcontents
• Preliminaries
• The Sherman-Morrison-Woodbury identity calculates the inverse of a matrix after a low-rank update: $(A + UV^T)^{-1} = A^{-1} - A^{-1}U(I + V^TA^{-1}U)^{-1}V^T A^{-1}.$
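• The identity is easy to check numerically. A minimal sketch in numpy (the sizes, seed, and random matrices here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Numerical check of the Sherman-Morrison-Woodbury identity on a
# random rank-k update of a well-conditioned matrix A.
rng = np.random.default_rng(0)
n, k = 6, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally shifted, so invertible
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ V.T)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.eye(k) + V.T @ Ainv @ U) @ V.T @ Ainv
```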
• Vector norms
• Important vector norms include the $p$-norms: $||x||_p = \Big( \sum_i |x_i|^p \Big)^{1/p}, \qquad ||x||_\infty = \max_i |x_i|.$
• Matrix norms
• The induced norm from a vector norm is defined as $||A|| = \sup_{x \ne 0} \frac{||Ax||}{||x||} = \max_{||x|| = 1} ||Ax||.$ They satisfy $||Ax|| \le ||A||\, ||x||$. In some cases, the induced norm has a special form: $||A||_1 = \max_{j} \sum_i |a_{ij}|, \qquad ||A||_\infty = \max_i \sum_j |a_{ij}| .$
• The Frobenius norm is defined as $||A||_F = \Big( \sum_{ij} |a_{ij}|^2 \Big)^{1/2} = \sqrt{\mathrm{Tr}(A^\dagger A)}.$
• These norms are all submultiplicative, i.e. $||AB|| \le ||A||\, ||B||$.
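• A quick numerical check of the column-sum, row-sum, and Frobenius formulas against numpy's built-in norms (the matrix is arbitrary):

```python
import numpy as np

# Induced 1- and infinity-norms via the max column/row absolute sums,
# plus the Frobenius norm as the root sum of squares.
A = np.array([[1.0, -2.0, 3.0],
              [-4.0, 5.0, -6.0]])
norm1 = max(np.sum(np.abs(A), axis=0))    # max column sum
norminf = max(np.sum(np.abs(A), axis=1))  # max row sum
normF = np.sqrt(np.sum(A**2))             # Frobenius norm
```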
• \todo Norm equivalence?
• \todo $||A||_2 = \sigma_{\text{max}}$
• \todo spectral radius: symmetric implies $\rho(A) = ||A||_2$
• Rayleigh quotient
• The Rayleigh quotient of a nonzero vector $x$ is $R(x) = \frac{x^T A x}{x^T x}$. If $x$ is an eigenvector, $R(x)$ is the corresponding eigenvalue; for symmetric $A$, $R(x)$ always lies between $\lambda_{\min}$ and $\lambda_{\max}$.
• Error analysis
• The unit roundoff $u$ for floating-point arithmetic is half the gap between $1$ and the next largest floating-point number. It is $u = 2^{-53} \approx 10^{-16}$ for double-precision ($52$ bits of precision). The relative error after an elementary floating point operation is bounded above by $u$. Be aware of catastrophic cancellation: when subtracting two nearly equal numbers, the leading digits cancel, and the relative error of the result may be large.
• The condition number of a matrix is defined as $\kappa(A) = ||A||\, ||A^{-1}||$, with $\kappa(A) = \infty$ if $A$ is not invertible. Note that $\kappa_2(A) = \frac{\sigma_{\text{max}}(A)}{\sigma_{\text{min}}(A)}$. When solving $Ax = b$, the relative error in $x$ is bounded above by $\kappa(A)$ times the sum of the relative errors in $A$ and $b$, for small errors (with $O(\epsilon^2)$ terms negligible). The condition number is minimized for orthogonal matrices $A$, with $\kappa_2(A) = 1$. The number of bits of accuracy lost is roughly $\log_2 \kappa$.
• The relative error of $\tilde{x}$ with true value $x$ is $\frac{||\tilde{x} - x ||}{||x||}$. The forward error of a numerical algorithm $\tilde{f}$ is $|| f(x) - \tilde{f}(x) ||$. The backward error is $|| x - \tilde{x} ||$, where $f(\tilde{x}) = \tilde{f}(x)$; that is, $\tilde{x}$ is the perturbed $x$ such that the exact algorithm would compute the computed answer. If the relative backward error of an algorithm is $O(u)$, then the algorithm is said to be backward stable.
• Matrix decompositions
• Eigenvalue decomposition
• $A = P \Lambda P^{-1}$, which exists when $A$ is diagonalizable. When $A$ is normal, the eigenvectors may be chosen orthonormal, giving $A = \sum_{i=1}^n \lambda_i v_i v_i^*.$
• Singular value decomposition
• $A = U\Sigma V^\dagger = \sum_{i=1}^r \sigma_i u_i v_i^\dagger,$
• where $U = [u_1, \ldots, u_m]$ and $V = [v_1, \ldots, v_n]$ are unitary, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0)$. They satisfy $A^\dagger A v_i = \sigma_i^2 v_i, \qquad AA^\dagger u_i = \sigma_i^2 u_i.$ Note that $||A||_2 = \sigma_1$ and $||A||_F = \sqrt{\sum_{i} \sigma_i^2}$.
• \todo Properties of SVD?
• \todo Thin SVD
• LU decomposition
• For a square matrix $A$, we can often compute its LU decomposition
• $A = LU$
• where $L$ is unit lower-triangular and $U$ is upper-triangular. This factorization is an algebraic description of Gaussian elimination. With the LU decomposition $A = LU$, solving $Ax = b$ becomes simple. We first solve $Ly = b$ by forward substitution, then $Ux = y$ by backsubstitution, both with complexity $O(n^2)$.
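• The two triangular solves can be sketched as follows (a minimal, unoptimized implementation; the names `forward_sub` and `back_sub` are chosen here for illustration):

```python
import numpy as np

# Forward and back substitution, each O(n^2): given A = LU,
# solve Ly = b, then Ux = y.
def forward_sub(L, b):
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    n = len(y)
    x = np.zeros_like(y, dtype=float)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```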
• To upper-triangularize $A$, we repeatedly apply Gauss transformations, where the $k$th Gauss transformation $G_k$ zeroes out the lower part of the $k$th column. Thus, we will have $G_k \cdots G_1 A = U_k \quad \rightarrow \quad A = L_k U_k,$ where the first $k$ columns of $U_k$ are upper-triangular, and the first $k$ columns of $L_k = G_1^{-1} \cdots G_k^{-1}$ are unit lower-triangular.
• The $k$th Gauss transformation is an outer product update $G_k = I - g_k e_k^T$ where the vector $g_k$ has components $g_k[i] = \begin{cases} 0 & i \le k \\ \frac{v[i]}{v[k]} & i > k \end{cases},$
• and $v$ is the $k$th column of $G_{k-1} \cdots G_1 A = U_{k-1}$. The entry $v[k]$ is called the pivot and therefore must be non-zero. Conveniently, $L_k$ has the form $L_k = \begin{bmatrix} 1 & \\ | & 1 & & \\ | & | & 1 & & \\ | & | & | & 1\\ g_1 & \cdots & g_k & & \ddots\\ | & | & | & & & 1 \end{bmatrix}.$
• We repeat this, if possible, for $k = 1, \ldots, n-1$ to obtain the LU decomposition. The $k$th pivot is non-zero if and only if the leading $k$-by-$k$ principal submatrix of $A$ is invertible. Therefore, the LU factorization exists if and only if this condition is satisfied for $k = 1, \ldots, n-1$. If $A$ itself is also invertible, then the LU factorization is unique.
• Pivots that are equal to or close to zero present numerical difficulties (i.e. LU is not backward stable). To avoid this, before the $k$th Gauss transformation, we can exchange rows and/or columns so that the pivot becomes larger (by magnitude). This computes the decomposition $PAQ = LU$, where $P$ permutes rows and $Q$ permutes columns. In this case, the decomposition always exists (if $A$ is invertible).
• There are three common strategies for choosing a row and/or column to swap the $k$th row and/or column with:
• Partial pivoting: find the largest (by magnitude) entry within $A[k:n, k]$
• Complete pivoting: find the largest entry within $A[k:n, k:n]$
• Rook pivoting: search $A[k:n, k:n]$ for an entry that is the largest within its row and column by starting from $A[k,k]$ and alternately ascending within its row, then its column, etc. until convergence
• With any of these strategies, the entries of $L$ will have magnitude at most $1$. This makes the algorithm stable in practice, although with partial pivoting the entries of $U$ can still grow by a factor of $2^{n-1}$ in contrived worst cases.
• In the LU algorithm (with or without pivoting), the naive algorithm requires $O(n^2k)$ time for $k$ steps, the same as Gaussian elimination’s $O(n^3)$ for $n$ steps. However, in partial or rook pivoting, we can reduce this to $O(nk^2)$ if we assume that the pivot is found within $O(1)$ linear searches. In complete pivoting, we can stop once the pivot is approximately $0$, indicating the rank of the matrix.
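• The Gauss-transformation description above translates directly into code. A sketch of LU with partial pivoting, $PA = LU$ (unoptimized, and assuming $A$ is square and invertible so every pivot is non-zero):

```python
import numpy as np

# LU decomposition with partial pivoting: PA = LU.
def lu_partial_pivot(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    P = np.eye(n)
    L = np.eye(n)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))   # largest entry in A[k:n, k]
        A[[k, p]], P[[k, p]] = A[[p, k]].copy(), P[[p, k]].copy()
        L[[k, p], :k] = L[[p, k], :k].copy()  # swap earlier multipliers too
        g = A[k+1:, k] / A[k, k]              # Gauss vector; |g| <= 1 by the pivot choice
        L[k+1:, k] = g
        A[k+1:, k:] -= np.outer(g, A[k, k:])  # outer-product update
    return P, L, np.triu(A)
```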
• A useful tool in analysis is the Schur complement. In the partial decomposition $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix},$ it will turn out that $U_{22} = A_{22} - A_{21}A_{11}^{-1}A_{12}$, the Schur complement of $A_{11}$ in $A$.
• Cholesky decomposition
• If $A$ is a symmetric positive-definite matrix, then there exists a unique lower-triangular matrix $L$ with positive diagonal entries such that $A = LL^T$. This is the Cholesky decomposition, easily calculable from the LU decomposition. No pivoting will be required (because $A$ is positive-definite), even for stability! Closely related is the LDL decomposition, $A = LDL^T$, where $L$ is unit lower-triangular and $D$ is diagonal.
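• A minimal sketch of the Cholesky factorization, computed column by column (assuming $A$ is symmetric positive-definite, so the quantity under the square root stays positive):

```python
import numpy as np

# Cholesky factorization A = L L^T for symmetric positive-definite A.
def cholesky(A):
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        d = A[j, j] - L[j, :j] @ L[j, :j]
        L[j, j] = np.sqrt(d)   # positive-definiteness keeps d > 0
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L
```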
• QR decomposition
• For an $m$-by-$n$ matrix $A$, we can always compute its QR decomposition $A = QR = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R_1 \\ 0 \end{bmatrix} = Q_1 R_1,$ where $Q$ and $R_1$ are square, $Q$ is orthogonal and $R_1$ is upper-triangular. The thin QR decomposition is unique if $A$ is full column-rank; that is, $Q_1$ and $R_1$ are unique up to the sign of each column.
• To upper-triangularize $A$, we can apply orthogonal transformations to $A$: $Q_k \cdots Q_1 A = R \quad \rightarrow \quad A = Q R.$
• Householder reflections
• A Householder reflection is an orthogonal transformation that has the ability to transform
• $H \begin{bmatrix} \times \\ \times \\ \vdots \\ \times \end{bmatrix} = \begin{bmatrix} + \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$
• If we define
• $H = I - \frac{2}{v^T v} vv^T$
• for $v = x - ||x||_2 e_1$, then $Hx = ||x||_2 e_1$, which is exactly this property. (In practice, $v = x + \mathrm{sign}(x_1)\,||x||_2 e_1$ is preferred, to avoid cancellation when computing $v$.) Geometrically, $H$ reflects vectors across the hyperplane orthogonal to $v$. Notice this is an outer product update; it takes $O(mn)$ time to compute $v$ and apply $H$.
• Thus, for an $m$-by-$n$ matrix, a sequence of $n$ Householder reflections ($n-1$ if $A$ is square) suffices to compute the QR decomposition, the $k$th one choosing $x$ to be the last $m-k+1$ components of the $k$th column. Thus it takes $O(mn^2)$ time to compute $R$. Often it suffices to store only the vectors $v$, thus storing $Q$ in factored form.
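• A sketch of Householder QR following the formulas above, with $v = x - ||x||_2 e_1$ as in the text (production codes choose the sign of the shift to avoid cancellation):

```python
import numpy as np

# Householder QR: triangularize A with reflections, accumulating Q.
def householder_qr(A):
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k]
        v = x.copy()
        v[0] -= np.linalg.norm(x)           # v = x - ||x|| e_1
        if np.linalg.norm(v) > 0:
            v /= np.linalg.norm(v)
            R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])  # H R (outer-product update)
            Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)    # accumulate Q <- Q H
    return Q, R
```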
• Givens rotations
• A Givens rotation is an orthogonal transformation that has the ability to transform
• $\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}^T \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix},$
• with
• $\cos\theta = \frac{a}{\sqrt{a^2 + b^2}} \qquad \sin\theta = \frac{-b}{\sqrt{a^2 + b^2}}.$
• It’s a Householder reflection restricted to two dimensions. The calculation and application of a Givens rotation $G^T A$ only affects two rows and so is $O(n)$; similarly, a Givens rotation $AG$ only affects two columns and so is $O(m)$. Givens rotations are useful to compute the QR decomposition of sparse matrices.
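• A sketch of a single Givens rotation built from the formulas above:

```python
import numpy as np

# Build the Givens rotation from c = a/r, s = -b/r, with r = sqrt(a^2 + b^2).
def givens(a, b):
    r = np.hypot(a, b)
    c, s = a / r, -b / r
    return np.array([[c, s], [-s, c]])

# G^T rotates (a, b) to (r, 0), touching only those two components.
G = givens(3.0, 4.0)
rotated = G.T @ np.array([3.0, 4.0])
```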
• Gram-Schmidt process
• The Gram-Schmidt process orthonormalizes a set of vectors $v_1, \ldots, v_n$ by subtracting off components in already processed directions. To compute the $k$th vector, we compute: $q_{k} \propto v_{k} - \sum_{i=1}^{k-1} q_i q_i^T v_{k},$ with $q_k$ normalized to unit norm. To compute the QR decomposition, we apply this to the columns of $A$. Then the entries $r_{ij} = q_i^T v_j$ form an upper-triangular matrix, because $\mathrm{span}\{ v_1, \ldots, v_k \} = \mathrm{span}\{ q_1, \ldots, q_k \},$ so $q_i^T v_j = 0$ if $i > j$. This algorithm, classical Gram-Schmidt, is numerically unstable.
• Modified Gram-Schmidt has better numerical properties: instead of projecting the original vector onto each $q_i$, it subtracts each projection from the partially orthogonalized vector, so errors introduced by earlier subtractions are themselves orthogonalized against. To compute the $k$th vector, we compute \begin{aligned} &q_{k} \leftarrow v_{k} \\ &\text{for } i = 1, \ldots, k-1: \\ &\qquad q_{k} \leftarrow q_{k} - q_i q_i^T q_{k} \\ & q_{k} \leftarrow \frac{q_{k}}{||q_{k}||}. \end{aligned}
• These two algorithms require $O(mn^2)$ operations. Computing the QR decomposition with Gram-Schmidt has the advantage of producing $k$ orthonormal vectors after $k$ iterations, whereas computing the QR decomposition with Householder reflections only produces the full set of $n$ vectors after the whole process is complete.
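• A sketch of modified Gram-Schmidt producing the thin QR factors (note the projection coefficients are computed against the updated working vector, not the original column):

```python
import numpy as np

# Modified Gram-Schmidt QR: orthogonalize each column against the
# already-computed q_i one at a time, updating the working vector in place.
def mgs_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        q = A[:, k].astype(float).copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ q      # coefficient from the *updated* vector
            q -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(q)
        Q[:, k] = q / R[k, k]
    return Q, R
```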
• Dense linear systems
• LU decomposition
• To solve $Ax = b$ exactly, we calculate $A = LU$, and the problem becomes solving two easy triangular systems.
• Least-squares
• The least-squares problem finds the $x$ that minimizes $||b - Ax||_2$. Geometrically, the error must be orthogonal to the range of $A$, i.e. $A^T(b - Ax) = 0$.
• We can try to solve $A^T A x = A^T b$ directly by computing the Cholesky decomposition of $A^T A$. However, $\kappa(A^T A) = \kappa(A)^2$, so this is ill-conditioned if $\kappa(A)$ is large already.
• For a better-conditioned algorithm, we can write the thin QR decomposition $A = Q_1 R_1$ and express the orthogonality condition as $Q_1^T (b - Ax) = 0$, or $R_1 x = Q_1^T b$. The error has magnitude $|| Q_2^T b||_2$.
• If $A$ is not injective, then $||b - Ax||_2$ can only be minimized up to a vector in the kernel of $A$. We can additionally minimize $||x||_2$ to uniquely specify a least-squares solution. In this case, we can write the thin SVD $A = U_1 \Sigma_1 V_1^T$ and express the orthogonality condition as $U_1^T (b- Ax) = 0$, or $\Sigma_1 V_1^Tx = U_1^T b$; the minimum-norm solution is $x = V_1 \Sigma_1^{-1} U_1^T b$. Similarly, the error has magnitude $|| U_2^T b ||_2$.
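• A sketch of the QR-based least-squares solve (numpy's `np.linalg.qr` returns the thin factors by default; the matrix and right-hand side here are arbitrary):

```python
import numpy as np

# Least squares via the thin QR decomposition: solve R1 x = Q1^T b.
A = np.array([[1., 0.], [1., 1.], [1., 2.]])  # overdetermined, full column rank
b = np.array([1., 2., 2.])
Q1, R1 = np.linalg.qr(A)                      # thin QR
x = np.linalg.solve(R1, Q1.T @ b)
# Optimality: the residual is orthogonal to range(A), i.e. A^T (b - Ax) = 0.
```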
• Dense eigenvalue problems
• There are many methods to find the eigenvectors of a matrix $A$. Once an eigenvector $x$ is estimated, an estimate for the eigenvalue can be found by $x^T A x$ for $x$ normalized.
• Power iteration finds the eigenvector with the largest eigenvalue by iterating $x \mapsto Ax$ on a random vector, normalizing the vector after each step. The error decays as $O(|\frac{\lambda_2}{\lambda_1}|^k)$.
• Inverse iteration finds the eigenvector with eigenvalue nearest a guess $\mu$ by applying power iteration to $(A - \mu I)^{-1}$. This works because $A - \mu I$ has an eigenvector with eigenvalue approximately $0$, namely the relevant eigenvector of $A$. The error decays as $O(|\frac{\lambda - \mu}{\lambda' - \mu}|^k)$ where $\lambda'$ is the next nearest eigenvalue.
• Rayleigh quotient iteration works only on symmetric matrices. We perform inverse iteration as usual, but at every step, we set the guess eigenvalue $\mu$ to the latest estimated eigenvalue $\frac{x^T A x}{x^T x}$. This yields cubic convergence: the error satisfies $e_{k+1} = O(e_k^3)$. For symmetric matrices, it converges almost always, but the same is not true for non-symmetric matrices.
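• A sketch of power iteration with the Rayleigh quotient as the eigenvalue estimate (the iteration count and seed are arbitrary choices for illustration):

```python
import numpy as np

# Power iteration: iterate x -> Ax with normalization; the Rayleigh
# quotient x^T A x (for normalized x) estimates the dominant eigenvalue.
def power_iteration(A, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x @ A @ x, x   # (eigenvalue estimate, eigenvector estimate)
```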
• Orthogonal iteration is a batch version of power iteration. Let $Q$ be an orthogonal basis for a subspace; then iterating $A$ on $Q$, we should obtain an invariant subspace. Note that we are also implicitly iterating $A$ on the nested subspaces $Q[1:k]$ for all $k$. If we keep $Q$ orthogonal at every step with the QR decomposition, we prevent the subspaces from collapsing into fewer dimensions. If $Q_0$ has $r$ columns, the convergence to the correct subspace is $O(|\frac{\lambda_{r+1}}{\lambda_r}|^k)$.
• Since a fixed point of this iteration is $A = Q T Q^T$ for $T$ upper-triangular, we see that this algorithm computes the Schur decomposition of $A$. Since $T$ is similar to $A$, they have the same eigenvalues, and since it is triangular, they appear on the diagonal. Thus, we can interpret the entries on the diagonal of $Q_k^T AQ_k = T_k$ as approximations to the eigenvalues of $A$, even though the columns of $Q_k$ are not necessarily eigenvectors (unless $A$ is symmetric).
• Notice that we can compute $T_{k+1}$ directly from $T_k$. From $AQ_k = Q_{k+1}R_{k+1}$, obtain $T_k = (Q_k^T Q_{k+1}) R_{k+1}$ and $T_{k+1} = R_{k+1}(Q_k^T Q_{k+1})$. This motivates the QR algorithm.
• The QR algorithm computes $T_{k+1} = R_k Q_k$ directly from the QR decomposition of $T_k = Q_k R_k$. (Note that $Q_k$ here is $Q_k^T Q_{k+1}$ from orthogonal iteration.) We see from the equation $T_{k+1} = Q_k^T T_k Q_k$ that the eigenvalues are preserved at each step. Again, $T_{k}$ converges to be upper-triangular with the eigenvalues of $A$ on the diagonal.
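• A sketch of the unshifted QR algorithm; for a symmetric matrix with distinct eigenvalues, the iterates converge to a diagonal matrix holding the eigenvalues:

```python
import numpy as np

# Unshifted QR algorithm: T_{k+1} = R_k Q_k, similar to T_k at every step.
def qr_algorithm(A, iters=500):
    T = A.astype(float).copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(T)
        T = R @ Q              # T_{k+1} = Q_k^T T_k Q_k
    return T
```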
• There are several optimizations that make this algorithm practical:
• Reduction to upper-Hessenberg form: In the general case, each step takes $O(n^3)$ time, prohibitively expensive. However, we can preconvert $A = Q^T H Q$ into upper-Hessenberg form by conjugating with $n - 2$ Householder reflections, in $O(n^3)$ operations. Then at each QR step, we can compute the QR decomposition with only $n-1$ Givens rotations, so that each step requires only $O(n^2)$ operations. Each QR step preserves the upper-Hessenberg form ($T_{k+1} = R T_k R^{-1}$).
• Deflation: If at some point $T_k$ becomes reduced upper-Hessenberg (that is, one of the subdiagonal entries becomes $0$), then the matrix becomes block upper-triangular, and the eigenvalues of the whole matrix can be found by finding the eigenvalues of the diagonal blocks.
• Bulge chasing: At each QR step, the usual method is to calculate Givens rotations $G_1, \ldots, G_{n-1}$ and apply them first on the left and then on the right: $T_{k+1} = ((G_{n-1}^T \cdots (G_1^T T_k)) G_1) \cdots G_{n-1}.$ This is wasteful as it requires storing the Givens rotations. It turns out that it is possible to achieve the same result with an alternate sequence of Givens rotations, applied from the innermost outward: $T_{k+1} = G_{n-1}^T \cdots (G_1^T T_k G_1) \cdots G_{n-1}.$ This relies on the upper-Hessenberg form of the matrix. We conjugate by Givens rotations, where we calculate the sequence of Givens rotations as follows (rotating $\oplus$ into $+$): $\begin{bmatrix} + & \times & \times & \times \\ \oplus & \times & \times & \times \\ & \times & \times & \times \\ & & \times& \times \end{bmatrix} \xrightarrow{G^T X G} \begin{bmatrix} \times & \times & \times& \times \\ + & \times & \times & \times \\ \oplus & \times & \times & \times \\ & & \times& \times \end{bmatrix} \xrightarrow{G^T X G} \begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ & + & \times & \times \\ & \oplus & \times& \times \end{bmatrix} \to T_{k+1}.$ These two methods turn out to be equivalent by the implicit Q theorem, which says that if $Q_1^T H Q_1$ and $Q_2^T H Q_2$ are both unreduced upper-Hessenberg for $Q_1$ and $Q_2$ orthogonal, and the first columns of $Q_1$ and $Q_2$ are the same, then all the columns are the same up to sign.
• Shifted QR: If at each step we compute $T_k - \mu I = Q_{k+1}R_{k+1}$ and $T_{k+1} = R_{k+1}Q_{k+1} + \mu I$, we arrive at shifted QR iteration. The eigenvalues are preserved as in the usual QR step, but the rate of convergence, now $O((\frac{\lambda_{i+1} - \mu}{\lambda_i - \mu})^k)$, is accelerated if $\mu$ is close to an eigenvalue.
• Thus, in a single-shift strategy, we choose $\mu_k$ to be the last diagonal entry of $T_k$, which is likely the best available approximate eigenvalue. If the last $2$-by-$2$ block has complex eigenvalues, however, this strategy doesn’t converge.
• In this case, we can use a double-shift strategy, which shifts by both eigenvalues (which are complex conjugate pairs) sequentially. We can remain in real arithmetic by instead computing the QR decomposition $ZR$ of $(T_k-\mu I)(T_k - \bar\mu I) = T_k^2 - s T_k + t I$, where $s$ is the trace and $t$ is the determinant of the subblock respectively, and setting $T_{k+1} = Z^T T_k Z$. The cost of calculating $T_k^2$ can be avoided by appealing to the implicit $Q$ theorem: we can calculate just the first column of $T_k^2 - s T_k + t I$ and finish using Householder reflections in a bulge chasing pattern, in $O(n^2)$ time. Convergence is quadratic in general and cubic for symmetric matrices.
• Sparse eigenvalue problems
• When applying QR iteration to sparse matrices, reduction to upper-Hessenberg form using Householder reflections destroys sparsity. Instead, it is possible to achieve the reduction iteratively using the Gram-Schmidt process.
• The $k$th Krylov subspace of a matrix $A$ and a starting vector $b$ is $K_k(A,b) = \mathrm{span}\,\{ b, Ab, \ldots, A^{k-1}b \}.$
• Arnoldi iteration calculates an orthonormal basis $Q_k = [ q_1, \ldots, q_k ]$ that spans $K_k(A,b)$ via modified Gram-Schmidt. $Q_k^T A Q_k$ is upper-Hessenberg by construction (see below), and we can use its eigenvalues to approximate the eigenvalues of $A$. These eigenvalues are called Ritz eigenvalues and often converge to the extremal eigenvalues of $A$. Therefore, after $k$ steps of Arnoldi iteration, we may apply the usual QR algorithm to the $k$-by-$k$ upper-Hessenberg matrix to find the extremal eigenvalues of $A$.
• To obtain $q_{k+1}$, it suffices to orthonormalize $Aq_{k}$ instead of $A^{k}b$, since $K_{k+1}(A,b) = \mathrm{span}\,\{ b, Aq_1, \ldots, A q_{k} \}.$ To see this, note that $A^k b = A(A^{k-1}b) = A(a_1q_1 + \cdots + a_k q_k) = a_1 Aq_1 + \cdots + a_k Aq_k.$ By induction, only the $Aq_k$ term can have a component outside $K_k(A,b)$, so we may orthonormalize it instead of $A^k b$.
• Since $Aq_k$ lies in $\mathrm{span}\{q_1, \ldots, q_{k+1}\}$, we have $q_i^T A q_k = 0$ for $i > k+1$, so $H_k = Q_k^T A Q_k$ is upper-Hessenberg, and indeed, so is $\tilde{H}_k = Q_{k+1}^T A Q_k$. Note that $Q_k^T Q_k = I$, but $Q_k Q_k^T$ is a projection onto $K_k(A,b)$. Thus we may write $AQ_k = Q_{k+1} \tilde{H}_k$.
• Lanczos iteration is the same process for symmetric matrices. In this case, $H_k = Q_k^T A Q_k$ is symmetric as well and thus tridiagonal. Remarkably, this means that in the Lanczos basis, $Aq_k$ only has components in the $q_{k-1}$, $q_k$, and $q_{k+1}$ directions. Hence, in each Gram-Schmidt step, we need to subtract only the projections onto $q_{k-1}$ and $q_k$. This works in theory, but when using inexact arithmetic, the calculated vectors do not tend to stay orthogonal to each other.
• Lanczos iteration terminates in at most $p$ steps, where $p$ is the number of distinct eigenvalues of $A$. To see this, decompose $b$ into its components in the $p$ eigenspaces. The powers $A^k b$ can be written as linear combinations of these $p$ components, so $K_k(A,b)$ can grow to be at most $p$-dimensional.
• Each iteration requires a matrix-vector product and orthogonalizing against $2$ vectors. Thus, the work per iteration is $O(n)$ plus the work required for a matrix-vector product (which may be cheap, since $A$ is usually sparse).
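• A sketch of Lanczos iteration using the three-term recurrence (no reorthogonalization, so orthogonality degrades for large $k$ in floating point; the variable names are chosen here for illustration):

```python
import numpy as np

# Lanczos iteration for symmetric A: build an orthonormal basis of
# K_k(A, b) and the tridiagonal H_k = Q_k^T A Q_k.
def lanczos(A, b, k):
    n = len(b)
    Q = np.zeros((n, k))
    alpha = np.zeros(k)       # diagonal of H_k
    beta = np.zeros(k - 1)    # off-diagonal of H_k
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        w = A @ Q[:, j]
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]           # subtract projection onto q_j
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]  # ... and onto q_{j-1}
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    H = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, H
```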
• Sparse linear systems
• Sparse matrix formats
• For sparse direct methods, how the matrices are stored is important. The compressed sparse row (CSR) format allows efficient matrix-vector multiplications and row access. It keeps a list of non-zero entries and their column numbers in row-major order, and a list of row pointers that indicate the start of each row in the first list. The compressed sparse column (CSC) format is analogous for columns.
• Direct methods
• Sparse triangular solve
• The most basic ingredient for solving sparse linear systems is solving a triangular system $Lx = b$, where $L$ and $b$ are both sparse. This is done in two steps: determining which entries of $x$ can be non-zero and then computing the actual numerical solution.
• To compute which entries of $x$ can be non-zero, note that the solution via forward substitution is $x_i = \frac{1}{\ell_{ii}} \Big[ b_i - \sum_{j=1}^{i-1} \ell_{ij} x_j \Big],$ so we find two rules:
• if $b_i$ is non-zero, then $x_i$ may be non-zero;
• if $\ell_{ij}$ and $x_j$ are both non-zero, then $x_i$ may be non-zero.
• The reach of $b_i$ is the set of entries $x_j$ that may be non-zero if $b_i$ is non-zero. We can visualize these rules in two ways.
• Visually on the matrix $L$, to find the reach of $b_i$, first mark $x_i$. Now trace down the $i$th column of the matrix, and if the $k$th entry is non-zero, then mark $x_k$. Then repeat down the $k$th column, for every entry so marked, recursively until out of columns to trace.
• More abstractly, we can consider the graph whose adjacency matrix is $L$, directed low to high. Then the reach of $b_i$ is the set of nodes reachable from the node $i$.
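• The graph formulation translates directly into a depth-first search. A sketch, representing $L$ by its below-diagonal non-zero pattern (`L_cols` is a name chosen here for illustration):

```python
# Compute the reach: DFS on the graph whose edges j -> i come from
# non-zeros l_ij (i > j), started from the non-zero positions of b.
def reach(L_cols, b_nonzeros):
    """L_cols[j] = row indices of non-zeros below the diagonal in column j."""
    marked = set()
    def visit(j):
        if j not in marked:
            marked.add(j)
            for i in L_cols[j]:   # l_ij != 0 means x_i may become non-zero
                visit(i)
    for i in b_nonzeros:          # rule 1: b_i != 0 marks x_i
        visit(i)
    return marked
```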
• Cholesky decomposition for sparse matrices
• If $A$ is symmetric positive-definite, we can solve linear systems using the Cholesky decomposition.
• To derive the up-looking Cholesky factorization, we note that it must satisfy $\begin{bmatrix} A_{11} & a_{21} \\ a_{21}^T & a_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ \ell_{21}^T & \ell_{22} \end{bmatrix}\begin{bmatrix} L_{11}^T & \ell_{21} \\ 0 & \ell_{22} \end{bmatrix}.$ From this, we find that $L_{11} L_{11}^T = A_{11}, \qquad L_{11} \ell_{21} = a_{21}, \qquad \ell_{21}^T \ell_{21} + \ell_{22}^2 = a_{22}.$ Therefore:
• $L_{11}$ may be solved recursively from $A_{11}$,
• $\ell_{21}$ is a sparse triangular solve involving the corresponding row $a_{21}$, and
• $\ell_{22}$ involves a sparse dot product.
• Thus each row of $L$ is a sparse triangular solve involving the submatrix above it, allowing us to compute $L$ row-by-row. We can think of the resulting sparsity pattern of $L$ as a “filling in” of $A$.
• When we calculate the sparsity pattern of the $k$th row $\ell_k$ (according to $L_{k-1} \ell_k = a_k$), we would normally compute the graph of $L_{k-1}$ and compute the reach of $a_k$ in it. However, here we can leverage the special structure of the problem. There are two optimizations:
• $L_{k-1}$ is a submatrix of $L_k$, so when we build the graph of $L_k$, we may simply take the graph of $L_{k-1}$ and add on the edges arising from the entries in $\ell_{k}$.
• Instead of using the full graph, we may consider only the elimination tree, which is the subgraph of the full adjacency graph that keeps only the edge to the first subdiagonal non-zero of each column. A node in the elimination tree has the same reach as in the full graph of $L_k$: suppose two entries in the same column, $\ell_{ji}$ and $\ell_{ki}$, are non-zero. When computing the $k$th row, the edge $i \to j$ exists, so $\ell_{ki}$ would force the fill-in of $\ell_{kj}$. If $\ell_{kj}$ is the first subdiagonal entry in its column, then a $j \to k$ edge exists, making the $i \to k$ edge redundant. Otherwise, we may repeat the argument to show that a $j \rightsquigarrow k$ path exists, again making the $i \to k$ edge redundant.
• Splitting methods
• To solve $Ax = b$, we can split $A = M - N$ for $M$ invertible. Then we can rewrite $x = M^{-1}Nx + M^{-1}b$, so that the solution $x$ is a fixed point of the function $x \mapsto Gx + M^{-1}b$, where $G = M^{-1}N$ is the iteration matrix. We easily see that the error at the $k$th iteration $e_{k} = x_k - x = G^ke_0$. Thus, the iterative method converges (to $A^{-1}b$) if and only if the spectral radius of $G$ satisfies $\rho(G) < 1$. One way to establish the bound is to show that $||G|| < 1$ for some matrix norm.
• Thus, for a method to be useful, it must be that $\rho(G) < 1$ for a nice class of matrices, and $M$ must be easily invertible. Several classical examples are the following, where $L$, $D$, and $U$ denote the strictly lower-triangular, diagonal, and strictly upper-triangular parts of $A$:
• Jacobi iteration: We choose $M = D$. If $A$ is strictly diagonally dominant, i.e. for all $i$, $\sum_{j\ne i}^n |a_{ij}| < |a_{ii}|$, then Jacobi iteration converges.
• Gauss-Seidel iteration: We choose $M = D + L$. If $A$ is symmetric positive-definite, then Gauss-Seidel iteration converges.
• Successive over-relaxation: We choose $M_\omega = \frac{1}{\omega}D + L$ and $N_\omega = (\frac{1}{\omega} - 1)D - U$, where we choose $\omega$ to minimize $\rho(M_\omega^{-1}N_\omega)$.
• Krylov subspace methods
• By the Cayley-Hamilton theorem, we can always find the solution $A^{-1}b \in K_n(A, b)$ in the $n$th Krylov subspace. For a symmetric positive-definite matrix $A$, conjugate gradient is an iterative method that finds the $x_k \in K_k(A,b)$ that minimizes $||x_k - A^{-1}b||_A^2$, where $A$ is used as an inner product matrix (or, equivalently, $||b - Ax_k||_{A^{-1}}^2$).
• The Hilbert space projection theorem says that the error is $A$-orthogonal to the subspace $K_k(A,b)$, or $Q_k^TA(x_k - A^{-1}b) = 0.$ If we transform to the Lanczos basis via $y_k = Q_{k}^T x_k \in {\mathbb{R}^k}$ and $||b||_2 e_1 = Q_{k}^T b$ using the tridiagonal matrix $H_k = Q_k^T A Q_k$, then we have $H_k y_k = ||b||_2 e_1.$
• To solve this efficiently, we develop a recurrence for the LDL factorization $H_k = L_k D_k L_k^T$. Then we can develop a recurrence for $v_k$, where $v_k$ solves $L_k D_k v_k = ||b||e_1$. Then we conclude that $x_k = Q_k L_k^{-T} v_k$, where $Q_k L_k^{-T}$ has a particularly nice recurrence too. We can even easily calculate the residual norm $||b - Ax_k||_2 = |h_{k+1,k}\, e_k^T y_k|$, the magnitude of the last subdiagonal entry of $\tilde{H}_k$ times the last component of $y_k$.
• Each step requires $O(n)$ work plus the time required for a matrix-vector product. It is possible to show that the total work is $O(m\sqrt{\kappa_2(A)})$, where $m$ is the number of non-zero entries of $A$, and $\kappa_2$ is the condition number.
• To develop a convergence theory, notice that $x_k \in K_k(A,b)$ can be written as $q(A)b$ for some degree-$(k-1)$ polynomial $q$. Thus, we can view the minimization of $||x_k - A^{-1}b||_A^2$ as the minimization of $||(q(A)A - I)\,A^{-1}b||_A$ over degree $(k-1)$ polynomials $q$, or the minimization of $||p(A)\,A^{-1}b||_A$ over degree-$k$ polynomials $p$ with $p(0) = 1$. Thus, intuitively, we would like $p(\lambda)$ to be small for eigenvalues $\lambda$ of $A$. Roughly, convergence occurs in $k$ steps, where $k$ is the number of clusters of eigenvalues $A$ has. We can derive a bound on the error $e_k = x_k - A^{-1}b$ as $|| e_k ||_A \le 2\Big( \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1} \Big)^{k} ||e_0||_A.$
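• The usual short-recurrence form of conjugate gradient (equivalent to the Lanczos/LDL derivation above) can be sketched as:

```python
import numpy as np

# Conjugate gradient for symmetric positive-definite A.
def conjugate_gradient(A, b, tol=1e-10):
    x = np.zeros_like(b, dtype=float)
    r = b.copy()              # residual b - Ax
    p = r.copy()              # search direction
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # A-orthogonal to previous directions
        rs = rs_new
    return x
```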
• MINRES
• MINRES (“minimum residual”) is another Krylov subspace method similar to conjugate gradient, but it works for all symmetric invertible matrices, even indefinite ones. It instead minimizes the residual $||b - Ax_k ||_2$ for $x_k \in K_k(A, b)$ (or equivalently $||x_k - A^{-1}b||_{A^TA}$). In the Lanczos basis, writing $x_k = Q_k y_k$ for $y_k \in {\mathbb{R}^k}$, this becomes $||\,||b||e_1 - \tilde{H}_k y_k ||_2$, which is a least-squares problem solvable using a QR decomposition that has a nice recurrence, like conjugate gradient does for LDL.
• GMRES
• GMRES (“generalized minimum residual”) is another Krylov subspace method that works for general square invertible matrices. It is the same as MINRES but uses Arnoldi iteration instead of Lanczos iteration. The QR decomposition of the upper-Hessenberg $\tilde{H}_k$ can still be updated with Givens rotations, but all $k$ basis vectors must now be stored and orthogonalized against, making the method much more expensive in time and storage; there are ways to get around this, e.g. restarting after $m$ steps. It’s much more subtle!
• Using a similar polynomial analysis as in conjugate gradient, this minimization problem implicitly minimizes $||p(A)\,b||_2$ over degree-$k$ polynomials $p$ such that $p(0) = 1$. If $A = X\Lambda X^{-1}$ is diagonalizable, we get a bound on the residual $||b - Ax_k ||_2 = ||p(A)\,b||_2 \le ||b||_2\,\kappa_2(X) \inf_p \max_i |p(\lambda_i)|.$
• Preconditioning
• The problem $Ax = b$ is identical to the problem $M^{-1}Ax = M^{-1}b$, where $M$ is a preconditioner. A preconditioner $M$ is good if $M^{-1}A$ is not too far from normal (so $\kappa_2(X) \approx 1$) and its eigenvalues are clustered (so $p(\Lambda)$ can be made small for small $k$).
• Iterative refinement
• Iterative refinement improves the accuracy of a computed solution by repeatedly solving for a correction to it. Suppose we use some method to find that $x_*$ approximately solves $Ax = b$. Then we solve $Ad = b - Ax_*$ (often computing the residual in higher precision) for the correction $d_*$, obtaining the improved solution $x_* + d_*$.
• References
• http://epubs.siam.org/doi/book/10.1137/1.9780898718881