A Course on Numerical Linear Algebra¶
Instructor: Deniz Bilman
Textbook: Numerical Linear Algebra, by Trefethen and Bau.
Lecture 5: More on the SVD¶
At the end of last lecture, the following question was raised:
If the thin SVD does the job and it is the default in many implementations, why do we bother with appending orthonormal columns to $\hat{U}$ and getting our hands on a full SVD?
Here is a good reason. Let $A\in\mathbb{C}^{m\times n}$ with the SVD $$ A = U \Sigma V^*. $$
Since $U$ is an $m\times m$ unitary matrix, its columns form a basis for $\mathbb{C}^m$. Similarly, since $V$ is an $n\times n$ unitary matrix, its columns form a basis for $\mathbb{C}^n$.
So now think of $A$ as the linear transformation from $\mathbb{C}^n$ to $\mathbb{C}^m$. The SVD makes it possible to say that every matrix is diagonal (it acts as a pure dilation), provided one uses the proper bases for the domain and range spaces.
Here is how it works.
Change of Bases¶
Any $x\in\mathbb{C}^n$ can be expanded in the basis of the right singular vectors of $A$ (columns of $V$). Let $x'$ denote the coordinate vector for the expansion of $x$ in that basis. Then $$ x = V x' $$ which gives $$ x' = V^* x $$ since $V^{-1}=V^*$.
Similarly, any $b\in\mathbb{C}^m$ can be expanded in the basis of the left singular vectors of $A$ (columns of $U$). The coordinate vector $b'$ in that basis satisfies $$ b = U b' $$ and again $$ b' = U^* b. $$ Thus, $$ b=A x \quad \Longleftrightarrow \quad U^* b=U^* A x=U^* U \Sigma V^* x \quad \Longleftrightarrow \quad b^{\prime}=\Sigma x^{\prime} . $$
We see that $A$ reduces to the diagonal matrix $\Sigma$ when the range is expressed in the basis of columns of $U$ and the domain is expressed in the basis of columns of $V$.
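A small numerical check of this change of bases, written here with NumPy (the random test matrix and vector are my own choice, not from the text): with the full SVD $A=U\Sigma V^*$, the coordinate vectors $x'=V^*x$ and $b'=U^*b$ of $x$ and $b=Ax$ should satisfy $b'=\Sigma x'$.

```python
# Numerical check (arbitrary example): coordinates w.r.t. the singular vector
# bases turn b = A x into the diagonal relation b' = Sigma x'.
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))

U, s, Vh = np.linalg.svd(A)        # full SVD: U is m x m, Vh = V^* is n x n
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)         # m x n matrix with sigma_j on the diagonal

x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b = A @ x

x_prime = Vh @ x                   # x' = V^* x
b_prime = U.conj().T @ b           # b' = U^* b

print(np.allclose(b_prime, Sigma @ x_prime))   # True
```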
SVD vs. Eigenvalue Decomposition¶
The theme above is diagonalizing a matrix by expressing it in terms of a new basis. This underlies the study of eigenvalues of a square matrix.
Indeed, a non-defective $m\times m$ square matrix (meaning that the matrix has $m$ linearly independent eigenvectors) acts as the diagonal matrix of eigenvalues $\Lambda$ if the range and the domain are represented in a basis of eigenvectors.
If the columns of a matrix $X \in \mathbb{C}^{m \times m}$ contain linearly independent eigenvectors of $A \in \mathbb{C}^{m \times m}$, the eigenvalue decomposition of $A$ is
$$ A=X \Lambda X^{-1}, $$
where $\Lambda$ is an $m \times m$ diagonal matrix whose entries are the eigenvalues of $A$. This implies that if, for $b, x \in \mathbb{C}^m$ satisfying $b=A x$, we define the new coordinates
$$ b^{\prime}=X^{-1} b, \quad x^{\prime}=X^{-1} x $$
as above, then the new coordinate vectors satisfy $b^{\prime}=\Lambda x^{\prime}$.
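Here is an analogous sketch for the eigenvalue decomposition (again with an arbitrary random matrix, which is generically diagonalizable): with $x'=X^{-1}x$ and $b'=X^{-1}b$, we should find $b'=\Lambda x'$ up to rounding.

```python
# Numerical check (random, generically diagonalizable matrix): with
# coordinates taken in the eigenvector basis X, b = A x becomes b' = Lambda x'.
import numpy as np

rng = np.random.default_rng(1)
m = 4
A = rng.standard_normal((m, m))

lam, X = np.linalg.eig(A)          # columns of X are eigenvectors, A X = X Lambda
x = rng.standard_normal(m)
b = A @ x

x_prime = np.linalg.solve(X, x)    # x' = X^{-1} x
b_prime = np.linalg.solve(X, b)    # b' = X^{-1} b

print(np.allclose(b_prime, lam * x_prime))   # True: Lambda x' as an elementwise product
```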
Fundamental Differences¶
One difference is that the SVD uses two different bases (the sets of left and right singular vectors), whereas the eigenvalue decomposition uses just one (the eigenvectors).
Another is that the SVD uses orthonormal bases, whereas the eigenvalue decomposition uses a basis that generally is not orthogonal.
Not all matrices (even square ones) have an eigenvalue decomposition, but all matrices (even rectangular ones) have a singular value decomposition.
In applications, eigenvalues tend to be relevant to problems involving the behavior of iterated forms of $A$, such as matrix powers $A^k$ or exponentials $e^{t A}$, whereas singular vectors tend to be relevant to problems involving the behavior of $A$ itself, or its inverse.
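To make the last contrast concrete, here is a small illustrative example (the matrix is my own choice, not from the text): for a nonnormal matrix the eigenvalue magnitudes and the singular values can differ dramatically, and it is $\sigma_1$, not the spectrum, that measures the size of the action of $A$ itself.

```python
# Illustrative (nonnormal) example: both eigenvalues equal 1, yet ||A||_2 is
# about 100; the singular values, not the eigenvalues, measure the action of A.
import numpy as np

A = np.array([[1.0, 100.0],
              [0.0,   1.0]])

print(np.abs(np.linalg.eigvals(A)))         # [1. 1.]
print(np.linalg.svd(A, compute_uv=False))   # approx [100.01, 0.01]
```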
Courant-Fischer's Min-Max Principle¶
The Courant-Fischer Min-Max Principle provides a variational characterization of the eigenvalues of a Hermitian matrix. It describes each eigenvalue in terms of the maximum and minimum values of a quadratic form over subspaces of a given dimension.
Let $A$ be an $m \times m$ Hermitian matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Then one has the following two characterizations of the eigenvalues:
$$ \lambda_j = \min_{S: \operatorname{dim}(S)=m-j+1}\left( \max_{\substack{x\in S \\ \|x\|_2=1}} x^* A x \right), $$
and, alternatively,
$$ \lambda_j = \max_{S: \operatorname{dim}(S)=j}\left( \min_{\substack{x\in S \\ \|x\|_2=1}} x^* A x \right). $$
This means in particular for the largest eigenvalue $\lambda_1$:
Among all unit vectors $x$, the one that maximizes $x^* A x$ is the eigenvector corresponding to $\lambda_1$.
Likewise, for the smallest eigenvalue $\lambda_m$:
Among all unit vectors $x$, the one that minimizes $x^* A x$ is the eigenvector corresponding to $\lambda_m$.
For an intermediate eigenvalue $\lambda_j$, the same extremal characterization holds, but the optimization is restricted to subspaces $S$ of dimension $j$ (in the max-min form) or $m-j+1$ (in the min-max form).
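As a quick sanity check (not part of the lecture), one can verify the two extreme cases numerically for a random Hermitian matrix: the Rayleigh quotient $x^*Ax$ attains $\lambda_1$ and $\lambda_m$ at the corresponding eigenvectors, and any other unit vector gives a value in between.

```python
# Sanity check of the extreme cases of Courant-Fischer for a random Hermitian A.
import numpy as np

rng = np.random.default_rng(2)
m = 5
B = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
A = (B + B.conj().T) / 2                     # Hermitian test matrix

lam, V = np.linalg.eigh(A)                   # eigenvalues ascending, orthonormal eigenvectors
v_min, v_max = V[:, 0], V[:, -1]

print(np.isclose(np.real(v_max.conj() @ A @ v_max), lam[-1]))   # attains lambda_1 (largest)
print(np.isclose(np.real(v_min.conj() @ A @ v_min), lam[0]))    # attains lambda_m (smallest)

x = rng.standard_normal(m) + 1j * rng.standard_normal(m)
x /= np.linalg.norm(x)                       # any other unit vector lands in between
print(lam[0] <= np.real(x.conj() @ A @ x) <= lam[-1])            # True
```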
Matrix Properties via the SVD¶
Let $A \in \mathbb{C}^{m\times n}$. Let $p=\min(\{m,n\})$ and let $r\leq p$ denote the number of nonzero singular values of $A$.
Let $\operatorname{span}(\{ x, y, \ldots, z\})$ denote the space spanned by the vectors $x,y,\ldots,z$.
Theorem 5.1:¶
The rank of $A$ is $r$, the number of nonzero singular values.
Proof:¶
Done in class.
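A numerical illustration (not a substitute for the proof), using a test matrix of known rank: count the singular values above a small tolerance; this is also essentially what `np.linalg.matrix_rank` does.

```python
# Rank as the number of singular values above a tolerance (Theorem 5.1).
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))   # 6 x 5, rank 2 by construction

s = np.linalg.svd(A, compute_uv=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]                 # a standard choice of tolerance

print(np.sum(s > tol))             # 2
print(np.linalg.matrix_rank(A))    # 2 (NumPy uses the same idea internally)
```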
Theorem 5.2:¶
$\operatorname{range}(A) = \operatorname{span}(\{ u_1, u_2, \ldots, u_r\})$ and $\operatorname{null}(A) = \operatorname{span}(\{ v_{r+1}, \ldots, v_{n}\})$.
Proof:¶
Done in class.
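A numerical check of Theorem 5.2 (the rank-2 test matrix below is an arbitrary construction, not from the text): projecting the columns of $A$ onto $\operatorname{span}(\{u_1,\ldots,u_r\})$ leaves them unchanged, and $A$ annihilates $v_{r+1},\ldots,v_n$.

```python
# Range and nullspace from the singular vectors (Theorem 5.2).
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))   # 6 x 5, rank r = 2
U, s, Vh = np.linalg.svd(A)
r = 2

Ur = U[:, :r]                                        # u_1, ..., u_r span range(A):
print(np.allclose(Ur @ (Ur.conj().T @ A), A))        # projecting columns of A changes nothing

Vnull = Vh[r:, :].conj().T                           # v_{r+1}, ..., v_n span null(A):
print(np.allclose(A @ Vnull, 0, atol=1e-10))         # A sends them to (numerically) zero
```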
Theorem 5.3:¶
$\|A\|_2=\sigma_1$ and $\|A\|_F=\sqrt{\sigma_1^2+\sigma_2^2+\cdots+\sigma_r^2}$
Proof:¶
Done in class.
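A quick numerical check of Theorem 5.3 with a random test matrix (illustration only):

```python
# ||A||_2 = sigma_1 and ||A||_F = sqrt(sigma_1^2 + ... + sigma_r^2) (Theorem 5.3).
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)

print(np.isclose(np.linalg.norm(A, 2), s[0]))
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2))))
```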
Theorem 5.4:¶
The nonzero singular values of $A$ are the square roots of the nonzero eigenvalues of $A^* A$ or $A A^*$. (These matrices have the same nonzero eigenvalues.)
Proof:¶
Done in class.
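A quick numerical check of Theorem 5.4 with a random test matrix: the singular values agree with the square roots of the nonzero eigenvalues of $A^*A$ and of $AA^*$.

```python
# Singular values vs. eigenvalues of A*A and AA* (Theorem 5.4).
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)                        # three nonzero singular values

eig_AhA = np.sort(np.linalg.eigvalsh(A.conj().T @ A))[::-1]   # 3 eigenvalues, descending
eig_AAh = np.sort(np.linalg.eigvalsh(A @ A.conj().T))[::-1]   # 5 eigenvalues, 2 of them ~ 0

print(np.allclose(s, np.sqrt(eig_AhA)))        # sigma_j = sqrt(lambda_j(A*A))
print(np.allclose(s, np.sqrt(eig_AAh[:3])))    # same nonzero eigenvalues for AA*
```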
Theorem 5.5:¶
If $A=A^*$, then the singular values of $A$ are the absolute values of the eigenvalues of $A$.
Proof:¶
Done in class.
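A quick numerical check of Theorem 5.5 with a random Hermitian test matrix (illustration only):

```python
# For Hermitian A, the singular values are the |eigenvalues| (Theorem 5.5).
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = (B + B.conj().T) / 2                      # A = A*

s = np.linalg.svd(A, compute_uv=False)        # descending
lam = np.linalg.eigvalsh(A)                   # real eigenvalues, ascending

print(np.allclose(s, np.sort(np.abs(lam))[::-1]))   # True
```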
Theorem 5.6:¶
For $A \in \mathbb{C}^{m \times m}$, $\displaystyle|\operatorname{det}(A)|=\prod_{i=1}^m \sigma_i$.
Proof:¶
Done in class.
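A quick numerical check of Theorem 5.6 with a random square test matrix (illustration only):

```python
# |det(A)| equals the product of the singular values (Theorem 5.6).
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 4))
s = np.linalg.svd(A, compute_uv=False)

print(np.isclose(np.abs(np.linalg.det(A)), np.prod(s)))   # True
```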
Low-Rank Approximations¶
This relates to the applications we have seen in the previous lecture.
One way to view the SVD is that it describes how a matrix $A$ can be represented as a sum of rank-one matrices.
Theorem 5.7:¶
$A$ is the sum of $r$ rank-one matrices:
$$ A=\sum_{j=1}^r \sigma_j u_j v_j^* $$
Proof:¶
Done in class.
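A quick check that the rank-one sum of Theorem 5.7 reproduces $A$ (here via the thin SVD, which already contains $u_1,\ldots,u_r$ and $v_1,\ldots,v_r$; the test matrix is arbitrary):

```python
# Reassemble A from its rank-one pieces sigma_j u_j v_j^* (Theorem 5.7).
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 3))
U, s, Vh = np.linalg.svd(A, full_matrices=False)   # thin SVD is enough here

A_sum = sum(s[j] * np.outer(U[:, j], Vh[j, :]) for j in range(len(s)))
print(np.allclose(A, A_sum))                       # True
```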
There are of course many ways to express an $m\times n$ matrix $A$ as a sum of rank-one matrices.
For example, $A$ could be written as the sum of $m$ matrices containing its individual rows, or of $n$ matrices containing its individual columns, or of $mn$ matrices containing its individual entries (with zeros in the remaining positions).
What makes the SVD representation $$ A=\sum_{j=1}^r \sigma_j u_j v_j^* $$ special? Is it special, or is this just another such observation?
This formula represents a decomposition into rank-one matrices with a deeper property: the $\nu$th partial sum captures as much of the energy of $A$ as possible. This statement holds with "energy" defined by either the 2-norm or the Frobenius norm.
Let's make it precise by formulating a problem of best approximation of a matrix $A$ by matrices of lower rank.
Theorem 5.8:¶
For any $\nu$ with $0 \leq \nu \leq r$, define
$$ A_\nu=\sum_{j=1}^\nu \sigma_j u_j v_j^*; $$
if $\nu=p=\min(\{m, n\})$, define $\sigma_{\nu+1}=0$. Then
$$ \left\|A-A_\nu\right\|_2=\inf _{\substack{B \in \mathbb{C}^{m \times n} \\ \operatorname{rank}(B) \leq \nu}}\|A-B\|_2=\sigma_{\nu+1}. $$
Proof:¶
Done in class.
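A numerical illustration of Theorem 5.8 (not a proof, and with an arbitrary test matrix): the truncated SVD $A_\nu$ achieves error exactly $\sigma_{\nu+1}$ in the 2-norm, and randomly chosen rank-$\nu$ competitors do no better.

```python
# The truncated SVD is the best rank-nu approximation in the 2-norm (Theorem 5.8).
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((6, 4))
U, s, Vh = np.linalg.svd(A, full_matrices=False)

nu = 2
A_nu = U[:, :nu] @ np.diag(s[:nu]) @ Vh[:nu, :]          # nu-th partial sum
print(np.isclose(np.linalg.norm(A - A_nu, 2), s[nu]))    # error equals sigma_{nu+1}

for _ in range(5):                                       # random rank-nu competitors do worse
    B = rng.standard_normal((6, nu)) @ rng.standard_normal((nu, 4))
    print(np.linalg.norm(A - B, 2) >= s[nu] - 1e-12)     # True
```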
Computation of the SVD¶
So far we have not considered how the SVD can be computed. This is a subject on its own. See Lecture 31 in Trefethen & Bau.
Most of the theorems of this lecture have computational consequences. Here are some facts about SVD:
The best method for determining the rank of a matrix is to count the number of singular values greater than a judiciously chosen tolerance (Theorem 5.1).
The most accurate method for finding an orthonormal basis of a range or a nullspace is via Theorem 5.2. (For both of these examples, QR factorization provides alternative algorithms that are faster but not always as accurate.)
Theorem 5.3 represents the standard method for computing $\|A\|_2$.
Theorems 5.8 and 5.9 (the latter is the Frobenius-norm analogue, stated in Trefethen & Bau) are the standards for computing low-rank approximations with respect to $\|\cdot\|_2$ and $\|\cdot\|_F$, respectively.
Besides these examples, the SVD is also an ingredient in robust algorithms for least squares fitting, intersection of subspaces, regularization, and numerous other problems.
With this lecture we conclude the first part of the course.