A Course on Numerical Linear Algebra¶

Instructor: Deniz Bilman

Textbook: Numerical Linear Algebra, by Trefethen and Bau.

Lecture 11: Least Squares Problems¶

The problem here is the solution of an overdetermined system of equations:

$n$ unknowns, but $m>n$ equations.

We want to find a vector $x\in \mathbb{C}^n$ that satisfies $Ax = b$ where $A\in\mathbb{C}^{m\times n}$ and $b\in\mathbb{C}^m$.

In general, this problem has no solution. A suitable vector $x$ exists only if $b\in\operatorname{range}(A)$. But note that $b$ is an $m$-vector and $\operatorname{range}(A)$ is of dimension at most $n$, where $m>n$. So, $b\in\operatorname{range}(A)$ is possible only for exceptional choices of $b$.

Therefore, the vector known as the residual

$$ r=b-A x \in \mathbb{C}^m $$

can perhaps be made quite small by a suitable choice of $x$, but in general it cannot be made equal to zero.

So the aim is to make the residual as small as possible in some sense.

We choose to measure the smallness of $r$ in terms of the $2$-norm. Then the problem we are facing is of the following form:

Given $A \in \mathbb{C}^{m \times n}$, $m \geq n$, $b \in \mathbb{C}^m$, find $x \in \mathbb{C}^n$ such that $\|b-A x\|_2$ is minimized.

This is called a least squares problem. The choice of the 2-norm can be defended by various geometric and statistical arguments. It certainly leads to simple algorithms, ultimately because the derivative of a quadratic function, which must be set to zero for minimization, is linear.

Since the $2$-norm is the Euclidean distance, we are seeking a vector $x\in\mathbb{C}^n$ such that $Ax \in \mathbb{C}^m$ is the closest point in $\operatorname{range}(A)$ to $b$.
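As a brief preview (a minimal sketch, not part of the lecture's development): in Julia, the backslash operator applied to a full-rank rectangular matrix already returns the least-squares solution, computed internally via a pivoted QR factorization. The small system below is made-up data for illustration only.

using LinearAlgebra

# A small overdetermined system: m = 5 equations, n = 2 unknowns (made-up data)
A = [1.0 1.0; 1.0 2.0; 1.0 3.0; 1.0 4.0; 1.0 5.0]
b = [1.1, 1.9, 3.2, 3.9, 5.1]

x = A \ b        # least-squares solution
r = b - A * x    # residual: generally nonzero, but minimal in the 2-norm
norm(r)          # small, but not zero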

Example: Polynomial Data-Fitting¶

We are given $m$ distinct points $x_1,\ldots,x_m\in\mathbb{C}$ and data $y_1,\ldots,y_m\in\mathbb{C}$ at these points. Then there exists a unique polynomial interpolant of degree at most $m-1$ to these data points:

$$ p(x) = c_0 + c_1 x + \cdots + c_{m-1}x^{m-1} $$

with the property that $p(x_i)=y_i$.

The relationship of the data points to the coefficients $\{c_i\}$ can be expressed by the square Vandermonde system:

$$ \begin{bmatrix} 1 & x_1 & x_1^2 & & x_1^{m-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{m-1} \\ 1 & x_3 & x_3^2 & & x_3^{m-1} \\ & \vdots & & & \vdots \\ 1 & x_m & x_m^2 & \cdots & x_m^{m-1} \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{m-1} \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix} $$

So, to determine the coefficients $\{c_i\}$ for a given set of data, we can solve this system of equations, which is guaranteed to be nonsingular as long as the points $\{x_i\}$ are distinct.

In [6]:
using LinearAlgebra

function vandermonde_matrix(x)
    # Construct the Vandermonde matrix
    m = length(x)
    return [x[i]^(j-1) for i in 1:m, j in 1:m]  
end

# Function to solve the Vandermonde system for polynomial coefficients
function solve_vandermonde(x, y)
    V = vandermonde_matrix(x)  # Construct Vandermonde matrix
    c = V \ y  # Solve the linear system V * c = y for c
    return c  # Coefficients of the interpolating polynomial
end
Out[6]:
solve_vandermonde (generic function with 1 method)
In [8]:
# Example polynomial fitting:
using Plots
using Random

Random.seed!(42)  # Fix the random seed for reproducibility
x = collect(1.0:12)  # Sample x values
x|>display
y = 10*randn(12) # Sample y values
y|>display  

coefficients = solve_vandermonde(x,y)
function evaluate_polynomial(c, x)
    m = length(c)
    return sum(c[j] * x.^(j-1) for j in 1:m)  # Horner’s method could be used for efficiency
end

# Generate fine grid for smooth curve
x_fine = range(minimum(x), maximum(x), length=200)
y_fine = evaluate_polynomial(coefficients, x_fine)

# Plot the polynomial and data points with fixed y-axis limits
plt1 = plot(x_fine, y_fine, label="Interpolating Polynomial", lw=2, legend=false, ylims=(-45, 65))
scatter!(x, y, label="Data Points", markersize=6, color=:orange)
12-element Vector{Float64}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0
 11.0
 12.0
12-element Vector{Float64}:
  -3.6335748145177753
   2.517372155742292
  -3.1498797116895605
  -3.1125240132442067
   8.163067649323274
   4.76738379831878
  -8.595553820616212
 -14.692882055065464
  -2.0661311448119264
  -3.1074387308373415
  -0.40473400248354907
   1.047707917889211
Out[8]:
[Plot: the degree-11 interpolating polynomial through the 12 data points.]

See the wild oscillations near the ends? This poor performance is typical in polynomial interpolation on uniformly spaced points. We will say more about it later.

Now, an idea: one can actually do better by reducing the degree of the polynomial and forgoing the interpolant! Given the same $m$ pairs of data points, consider now a polynomial of degree $n-1$:

$$ p(x)=c_0+c_1 x+\cdots+c_{n-1} x^{n-1} $$

for some $n<m$.

Such a polynomial is a least squares fit to the data if it minimizes the sum of the squares of the deviations from the data:

$$ \sum_{i=1}^m\left|p\left(x_i\right)-y_i\right|^2. $$

Note that this is nothing but the norm-squared $\|r\|_2^2$ of the residual $r$ for the rectangular Vandermonde system:

$$ \begin{bmatrix} 1 & x_1 & & x_1^{n-1} \\ 1 & x_2 & \cdots & x_2^{n-1} \\ 1 & x_3 & & x_3^{n-1} \\ & \vdots & & \vdots \\ 1 & x_m & \cdots & x_m^{n-1} \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{n-1} \end{bmatrix} \approx \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix}. $$

Question: Well, how do we compute such $c_0,\ldots, c_{n-1}$?

In [10]:
# First the Vandermonde function with optional argument n
function vandermonde_matrix(x, n=length(x))
    m = length(x)
    @assert n ≤ m "n must be less than or equal to the length of x"
    return [x[i]^(j-1) for i in 1:m, j in 1:n]  
end

vandermonde_matrix(x, 6)
Out[10]:
12×6 Matrix{Float64}:
 1.0   1.0    1.0     1.0      1.0       1.0
 1.0   2.0    4.0     8.0     16.0      32.0
 1.0   3.0    9.0    27.0     81.0     243.0
 1.0   4.0   16.0    64.0    256.0    1024.0
 1.0   5.0   25.0   125.0    625.0    3125.0
 1.0   6.0   36.0   216.0   1296.0    7776.0
 1.0   7.0   49.0   343.0   2401.0   16807.0
 1.0   8.0   64.0   512.0   4096.0   32768.0
 1.0   9.0   81.0   729.0   6561.0   59049.0
 1.0  10.0  100.0  1000.0  10000.0  100000.0
 1.0  11.0  121.0  1331.0  14641.0  161051.0
 1.0  12.0  144.0  1728.0  20736.0  248832.0
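Although we develop the theory below, it is worth noting (as an aside, not from the text) that Julia's backslash applied to a rectangular matrix already computes the least-squares solution, so the degree-5 coefficients for the data above could be obtained directly. A sketch, assuming x, y, and vandermonde_matrix from the cells above:

V = vandermonde_matrix(x, 6)   # 12×6 rectangular Vandermonde matrix
c = V \ y                      # least-squares coefficients (pivoted QR internally)
norm(y - V * c)                # residual norm: nonzero, but minimized in the 2-norm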

Orthogonal Projections and the Normal Equations¶

Our goal is to find the closest point $A x$ in $\operatorname{range}(A)$ to $b$, so that the norm of the residual $r=b-A x$ is minimized.

It is clear geometrically that this occurs precisely when $A x=P b$, where $P \in \mathbb{C}^{m \times m}$ is the orthogonal projector (Lecture 6) that maps $\mathbb{C}^m$ onto $\operatorname{range}(A)$.

In other words, the residual $r=b-A x$ must be orthogonal to $\operatorname{range}(A)$. We formulate this condition as the following theorem.

Theorem:¶

Let $A \in \mathbb{C}^{m \times n}$ ($m \geq n$) and $b \in \mathbb{C}^m$ be given. A vector $x \in \mathbb{C}^n$ minimizes the residual norm $\|r\|_2=\|b-A x\|_2$, thereby solving the least squares problem, if and only if $r \perp \operatorname{range}(A)$, that is,

$$ A^* r=0, $$

or equivalently,

$$ A^* A x=A^* b, $$

or again equivalently,

$$ P b=A x, $$

where $P \in \mathbb{C}^{m \times m}$ is the orthogonal projector onto $\operatorname{range}(A)$. The $n \times n$ system of equations $A^* A x=A^* b$, known as the normal equations, is nonsingular if and only if $A$ has full rank. Consequently the solution $x$ is unique if and only if $A$ has full rank.

Proof:¶

Done in class.
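A quick numerical sanity check of the theorem (a sketch reusing the Vandermonde data above as an assumed example): the residual of a least-squares solution should be orthogonal to the columns of $A$.

# Assumes x, y, and vandermonde_matrix from the cells above
A = vandermonde_matrix(x, 6)
xls = A \ y            # least-squares solution
r = y - A * xls        # residual
norm(A' * r)           # ≈ 0 up to rounding: r ⟂ range(A), i.e. A* r = 0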

Pseudoinverse¶

So we know that if $A\in\mathbb{C}^{m\times n}$, $m>n$, has full rank, then the solution $x$ to the least squares problem for $A x = b$ is unique and is given by

$$ x=\left(A^* A\right)^{-1} A^* b. $$

For this reason, the matrix $\left(A^* A\right)^{-1} A^*$ is known as the pseudoinverse of $A$, denoted by $A^{+}$:

$$ A^{+}=\left(A^* A\right)^{-1} A^* \in \mathbb{C}^{n\times m}. $$

This matrix maps vectors $b \in \mathbb{C}^m$ to vectors $x \in \mathbb{C}^n$, which explains why it has dimensions $n \times m$ --- more columns than rows.
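In Julia, LinearAlgebra provides pinv; for a full-rank tall matrix it agrees, up to rounding, with the formula $(A^* A)^{-1} A^*$. A minimal check on a random full-rank matrix (an illustrative example, not from the text):

A = randn(8, 3)                 # a random full-rank 8×3 matrix
Aplus = (A' * A) \ A'           # pseudoinverse via the normal-equations formula
norm(Aplus - pinv(A))           # tiny: the two expressions agree for full-rank A
b = randn(8)
Aplus * b ≈ A \ b               # both give the least-squares solution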

So our goal is to compute one (or both) of the vectors

$$ x=A^{+} b, \qquad y=P b, $$

where $A^{+}$ is the pseudoinverse of $A$ and $P$ is the orthogonal projector onto $\operatorname{range}(A)$.

We now describe the three leading algorithms for doing this.

Normal Equations¶

The classical way to solve least squares problems is to solve the normal equations.

If $A$ has full rank, $A^* A x=A^* b$ is a Hermitian positive definite system of equations of dimension $n$.

The standard method of solving such a system is by Cholesky factorization (to be discussed later). This method constructs a factorization of the coefficient matrix of the normal equations

$$ A^* A = R^* R, $$

where $R$ is upper-triangular. So the normal equations turn into

$$ R^* R x=A^* b. $$

Algorithm: Least Squares via Normal Equations

  1. Form the matrix $A^* A$ and the vector $A^* b$.
  2. Compute the Cholesky factorization $A^* A=R^* R$.
  3. Solve the lower-triangular system $R^* w=A^* b$ for $w$.
  4. Solve the upper-triangular system $R x=w$ for $x$.

The work for this algorithm is $\sim m n^2 + \frac{1}{3} n^3$ flops.
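A sketch of this algorithm in Julia, assuming $A$ has full rank (the function name is ours, not from the text):

# Least squares via the normal equations and a Cholesky factorization (sketch)
function least_squares_normal(A, b)
    C = cholesky(Hermitian(A' * A))   # A'A = R'R, with C.U = R upper-triangular
    w = C.L \ (A' * b)                # forward substitution: R' w = A' b
    x = C.U \ w                       # back substitution:    R x = w
    return x
end

For the data above, least_squares_normal(vandermonde_matrix(x, 6), y) would reproduce the degree-5 fit coefficients up to rounding.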

QR Factorization¶

The "modern classical" (quoting Trefethen & Bau) method is based upon the thin QR factorization.

By either Gram-Schmidt or Householder triangularization, construct the thin QR factorization: $A=\hat{Q}\hat{R}$.

The orthogonal projector $P$ can then be written as $P=\hat{Q}\hat{Q}^*$. Then

$$ y=P b=\hat{Q} \hat{Q}^* b. $$

Since $y \in \operatorname{range}(A)$, the system $A x=y$ has an exact solution.

Combining the thin QR factorization with the formula for $y$ above gives

$$ \hat{Q} \hat{R} x=\hat{Q} \hat{Q}^* b. $$

Left-multiply by $\hat{Q}^*$ to get

$$ \hat{R} x= \hat{Q}^* b. $$

So, we can solve this upper-triangular system by backsubstitution.

Incidentally, note that multiplying by $\hat{R}^{-1}$ gives the formula $A^{+}=\hat{R}^{-1} \hat{Q}^*$ for the pseudoinverse $A^+$.

Algorithm: Least Squares via QR Factorization

  1. Compute the reduced QR factorization $A=\hat{Q} \hat{R}$.
  2. Compute the vector $\hat{Q}^* b$.
  3. Solve the upper-triangular system $\hat{R} x=\hat{Q}^* b$ for $x$.

The work for this is dominated by the cost of the QR factorization. If Householder reflections are used for this step, we have

Work for Algorithm: $ \sim 2 m n^2-\frac{2}{3} n^3$ flops.
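A corresponding sketch in Julia (again, the function name is ours); qr performs Householder triangularization and stores $Q$ in compact form, so we extract only the first $n$ entries of $Q^* b$:

# Least squares via the thin QR factorization (sketch; assumes full-rank A)
function least_squares_qr(A, b)
    F = qr(A)                        # Householder QR; F.R is the n×n upper-triangular factor
    n = size(A, 2)
    Qtb = (F.Q' * b)[1:n]            # Q̂* b: the first n entries of the full Q* b
    return UpperTriangular(F.R) \ Qtb  # back substitution with R̂
end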

SVD¶

We will also cover an algorithm to compute the thin SVD later.

Given $A=\hat{U} \hat{\Sigma} V^*$, $P$ is now represented in the form $P=\hat{U} \hat{U}^*$, resulting in

$$ y=P b=\hat{U} \hat{U}^* b. $$

Then $A x = P b$ reads

$$ \hat{U} \hat{\Sigma} V^* x=\hat{U} \hat{U}^* b $$

and left-multiplication by $\hat{U}^*$ yields

$$ \hat{\Sigma} V^* x= \hat{U}^* b. $$

Again we obtain a different formula for the pseudoinverse upon multiplying by $V \hat{\Sigma}^{-1}$: $A^{+}=V \hat{\Sigma}^{-1} \hat{U}^*$.

Algorithm: Least Squares via SVD

  1. Compute the reduced SVD $A=\hat{U} \hat{\Sigma} V^*$.
  2. Compute the vector $\hat{U}^* b$.
  3. Solve the diagonal system $\hat{\Sigma} w=\hat{U}^* b$ for $w$.
  4. Set $x=V w$.

Note that whereas QR factorization reduces the least squares problem to a triangular system of equations, the SVD reduces it to a diagonal system of equations, which is of course trivially solved. If $A$ has full rank, the diagonal system is nonsingular.

Work for Algorithm: $\sim 2 m n^2+11 n^3$ flops.

In [13]:
using LinearAlgebra

# Function to solve the least squares problem using SVD

function least_squares_svd(A, b)
    # Step 1: Compute reduced SVD A = Û Σ̂ V*
    U, S, V = svd(A)  # U is m×n, S is a vector of singular values, V is n×n
    
    # Step 2: Compute Û^* b
    Ub = U' * b
    
    # Step 3: Solve Σ̂ w = Û^* b (element-wise division since Σ̂ is diagonal)
    w = Ub[1:length(S)] ./ S  # Only take the first n components

    # Step 4: Compute x = V w
    x = V * w

    return x
end
Out[13]:
least_squares_svd (generic function with 1 method)
In [15]:
# Use this to revisit the polynomial fitting problem with the same data points x and y.
x|>display
y|>display
12-element Vector{Float64}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0
 11.0
 12.0
12-element Vector{Float64}:
  -3.6335748145177753
   2.517372155742292
  -3.1498797116895605
  -3.1125240132442067
   8.163067649323274
   4.76738379831878
  -8.595553820616212
 -14.692882055065464
  -2.0661311448119264
  -3.1074387308373415
  -0.40473400248354907
   1.047707917889211
In [17]:
n = 8
coefficients_least_squares = least_squares_svd(vandermonde_matrix(x, n), y)
Out[17]:
8-element Vector{Float64}:
 -180.95384934171895
  371.4626500562532
 -276.2511655190259
   99.16945195753624
  -19.02016209096471
    1.9904689306837795
   -0.10709889505600328
    0.0023176423686821234
In [19]:
# Generate fine grid for smooth curve
y_fine_least_squares = evaluate_polynomial(coefficients_least_squares, x_fine)

# Plot the polynomial and data points with fixed y-axis limits
plt2 = plot(x_fine, y_fine_least_squares, label="Least Squares Fit", lw=2, legend=false, ylims=(-45, 65))
scatter!(x, y, label="Data Points", markersize=6, color=:orange)
Out[19]:
[Plot: the degree-7 least squares fit with the 12 data points.]
In [21]:
# Compare with the interpolant:
display(plt1)

Thanks¶

Thanks go to

  • Ethan Davis

for finding typos in this lecture.
