A Course on Numerical Linear Algebra¶
Instructor: Deniz Bilman
Textbook: Numerical Linear Algebra, by Trefethen and Bau.
Recommended reading: Section 1.1 of Fundamentals of Numerical Computation, by Driscoll and Braun, for a more general and theoretical treatment of floating-point arithmetic systems.
Lecture 13: Floating Point Arithmetic¶
The subject is the representation of real numbers on a computer (on the floating-point unit (FPU), which is part of the processor core).
We avoid the general treatment referenced above and focus on what is used in practice in today's computers, following the IEEE standard.
There are two constraints when we represent real numbers on a computer.
- The represented numbers cannot be arbitrarily large or arbitrarily small.
- There must be gaps between them --- there is no continuum on a computer.
Modern computers represent numbers sufficiently large and small that the first constraint rarely poses difficulties.
The IEEE standard for double precision arithmetic permits numbers as large as $1.79 \times 10^{308}$ and as small as $2.23 \times 10^{-308}$.
This range is large enough for most of the problems one faces in numerical linear algebra, so overflow and underflow are usually not serious obstacles (warning: be careful if you are trying to evaluate a determinant, though, as the example below illustrates!).
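As a small illustration of that warning (not from the text; the matrix here is hypothetical), the determinant of even a well-scaled, perfectly representable matrix can overflow:

# The determinant of this 40x40 diagonal matrix is 1e400, which exceeds
# the largest Float64, so the computed value overflows to Inf.
using LinearAlgebra
A = Diagonal(fill(1.0e10, 40))
@show det(A)    # Inf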
The second constraint (gaps between the numbers represented) is a serious concern across scientific computing.
In IEEE double precision arithmetic, the interval $[1,2]$ is represented by the discrete subset:
$$ 1, \quad 1+2^{-52}, \quad 1+2 \times 2^{-52}, \quad 1+3 \times 2^{-52}, \quad \ldots, \quad 2. $$

As you can see, the numbers in $[1,2)$ which are represented on a computer are of the form

$$ 1 + k\times 2^{-52},\qquad k=0,1,2,\ldots, 2^{52}-1. $$

So there are only $2^{52} \approx 4.5\times 10^{15}$ numbers represented in $[1,2)$, and they are equally spaced. Each represented number has a gap of $2^{-52}$ with its nearest neighbor in $[1,2)$. The next greater element after $1$ is $1+2^{-52}$. For this reason $2^{-52}$ is called the machine epsilon, denoted by
$$ \epsilon_\text{M}:= 2^{-52} \approx 2.22 \times 10^{-16}. $$

@show eps()
eps() = 2.220446049250313e-16
2.220446049250313e-16
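A quick check of this gap (a small addition using Julia's built-in nextfloat):

# The next representable number after 1.0 is exactly 1 + 2^-52
@show nextfloat(1.0) == 1 + eps()   # true
@show nextfloat(1.0) - 1.0          # 2.220446049250313e-16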
From this point on, the coverage of the real axis follows a simple doubling procedure. The interval $[2,4)$ is represented by the same numbers represented in $[1,2)$ multiplied by 2:
$$ 2,\quad 2+2^{-51},\quad 2+2 \times 2^{-51},\quad 2+3 \times 2^{-51},\quad \ldots,\quad 4. $$

In other words, the numbers represented in $[2,4)$ are
$$ \left( 1+ k \times 2^{-52} \right)\times 2,\qquad k=0,1,2,\ldots,2^{52}-1. $$

This means that the number of real numbers represented in $[1,2)$ is the same as the number represented in $[2,4)$ --- so the represented numbers are more spread out! The gaps are now of size $2\epsilon_\text{M}$.
This is generalized naturally. The numbers that are represented in the interval $[2^j, 2^{j+1})$ are precisely
$$ \left( 1+ k \times 2^{-52} \right)\times 2^j,\qquad k=0,1,2,\ldots,2^{52}-1, $$

and the gaps are now of size $2^j \epsilon_\text{M}$.
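We can verify these gap sizes directly (a small added check):

# The gap just above 2^j is 2^j * eps()
for j in (1, 10, 52)
    @show nextfloat(2.0^j) - 2.0^j == 2.0^j * eps()   # true for each j
end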
As you can see, the represented numbers get more and more spread out as they increase. Thus, in IEEE double precision arithmetic, the gaps between adjacent numbers scale with the numbers themselves rather than being fixed (hence the term floating point, as opposed to fixed point). However, in the relative sense the gaps are never larger than $\epsilon_\text{M}=2^{-52}\approx 2.22\times 10^{-16}$.
For numbers between $0$ and $1$, the intervals (and hence the gaps) are halved in a similar manner.
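A quick check (not in the text):

# Below 1 the gaps shrink by factors of 2
@show nextfloat(0.5) - 0.5 == eps()/2      # true
@show nextfloat(0.25) - 0.25 == eps()/4    # true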
Floating Point Numbers¶
We cover the double-precision IEEE 754 standard.
Let $\mathbb{F}$ denote the set of floating point numbers --- the real numbers represented (exactly) on a computer via the IEEE 754 standard.
For $x\in\mathbb{F}$, we have the representation
$$ x = (-1)^s \times \left(1+f\right) \times 2^e, $$

where

- $s\in\{0,1\}$ is the sign and requires one bit to be encoded on the processor,
- $e\in\{-1022,\ldots,1023\}$ is an integer called the exponent and requires 11 bits to be encoded,
- $f=2^{-52}m$, where $m\in\{0,1,2,\ldots,2^{52}-1\}$ is the mantissa or significand and requires 52 bits to be encoded. $f$ is referred to as the fractional part.
So the total number of bits to represent $x$ is $1+11+52=64$. This is why double-precision floating point numbers are usually referred to as Float64.
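As a quick sanity check (an addition, using Julia's built-in significand and exponent functions):

# 3.0 = (+1) * (1 + 0.5) * 2^1, so the significand 1+f is 1.5 and the exponent e is 1
@show significand(3.0)    # 1.5
@show exponent(3.0)       # 1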
In single precision, the fractional part is of the form $f=2^{-23}m$ with $m\in\{0,1,\ldots,2^{23}-1\}$ (instead of $2^{-52}m$ in double precision), so it requires 23 bits to be encoded. The range of exponents allowed is also smaller, requiring only 8 bits to be encoded. The total number of bits then becomes $1+8+23=32$. This is why single-precision floating point numbers are usually referred to as Float32. Single-precision floats need less storage space and can be operated on more quickly than double-precision floats, so they can be useful in low-precision applications.
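A brief comparison in Julia (a small addition; Float32 literals carry an f0 suffix):

@show typeof(1.0f0)              # Float32
@show length(bitstring(1.0f0))   # 32 bits in total
@show eps(Float32)               # 2^-23 ≈ 1.1920929f-7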
Back to double-precision: we wrote

$$ x = (-1)^s \times \left(1+f\right) \times 2^e. $$

This floating point number is represented as a sequence of 64 bits encoding the quantities described above.
# Standard is double-precision
typeof(1.0)
Float64
# Be careful with typing integers
typeof(1)
Int64
# Look at the bitstring
bitstring(1.0)
"0011111111110000000000000000000000000000000000000000000000000000"
# Check the next number:
bitstring(1.0+eps())
"0011111111110000000000000000000000000000000000000000000000000001"
# And check the one after that:
bitstring(1.0+2*eps())
"0011111111110000000000000000000000000000000000000000000000000010"
# It is overkill, but we can unpack the IEEE 754 double-precision representation
# and access the relevant parts of the bit string.
function ieee_64(x::Float64)
# Get the bitstring and extract parts
b = bitstring(x)
s = (b[1:1], parse(Int, b[1:1]; base=2))           # sign bit s
e = (b[2:12], parse(Int, b[2:12]; base=2) - 1023)  # exponent e (bias 1023 removed)
f = (b[13:64], parse(Int, b[13:64]; base=2))       # mantissa bits as the integer m
return s, e, f
end
println(ieee_64(1.0+2*eps()))
println("The output returned is of the form (bitstring, value in base 10)")
(("0", 0), ("01111111111", 0), ("0000000000000000000000000000000000000000000000000010", 2)) The output returned is of the form (bitstring, value in base 10)
How About the Real Numbers in Between? Resolution of Floating Points¶
The resolution of $\mathbb{F}$ is traditionally summarized by the machine epsilon. (Trefethen & Bau takes $\frac{1}{2}\epsilon_\text{M}$ as the machine epsilon.)
For all $x \in \mathbb{R}$, there exists $x^{\prime} \in \mathbb{F}$ such that $\left|x-x^{\prime}\right| \leq \frac{1}{2} \epsilon_{\text{M}}|x|$.
Let $\operatorname{fl}: \mathbb{R} \rightarrow \mathbb{F}$ be a function giving the closest floating point approximation to a real number, i.e., its rounded equivalent in the floating point system.
Then the statement above can be written as:
For all $x \in \mathbb{R}$, there exists $\epsilon$ with $|\epsilon| \leq \frac{1}{2}\epsilon_{\text{M}}$ such that $\operatorname{fl}(x) = x(1+\epsilon)$.
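A quick numerical check of this statement (an added illustration; the real number $x=1/10$ is approximated here in 256-bit BigFloat precision, which is effectively exact for this purpose):

x = big"0.1"                 # 1/10 in BigFloat precision
flx = Float64(x)             # fl(x): rounding x to the nearest Float64
@show abs(x - flx) / abs(x) <= eps()/2    # true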
That is, the relative difference between a real number and its closest floating point approximation is never larger than $\frac{1}{2}\epsilon_{\text{M}}$. We have

$$ \frac{|x-\operatorname{fl}(x)|}{|x|} \leq \frac{1}{2} \epsilon_\text{M}. $$

In practice, the largest finite value in $\mathbb{F}$ is:
@show LARGE = (2.0^1023)*(1 + (2^52-1)/2^52)
ieee_64(LARGE)
LARGE = 2.0 ^ 1023 * (1 + (2 ^ 52 - 1) / 2 ^ 52) = 1.7976931348623157e308
(("0", 0), ("11111111110", 1023), ("1111111111111111111111111111111111111111111111111111", 4503599627370495))
Results larger than LARGE would be assigned Inf. This is called overflow.
nextfloat(LARGE)
Inf
Similarly, there is a smallest positive normalized number represented in $\mathbb{F}$. A quantity smaller than this results in underflow.
@show SMALL = 2.0^-1022
ieee_64(SMALL)
SMALL = 2.0 ^ -1022 = 2.2250738585072014e-308
(("0", 0), ("00000000001", -1022), ("0000000000000000000000000000000000000000000000000000", 0))
Floating Point Arithmetic¶
It is common practice to denote the floating point operations by $\oplus$, $\ominus$, $\otimes$, and $\oslash$.
A computer might be built on the following design principle.
Let $x$ and $y$ be arbitrary floating point numbers, that is, $x, y \in \mathbb{F}$. Let $\ast$ be one of the operations $+$, $-$, $\times$, or $\div$, and let $\circledast$ be its floating point analogue. Then $x \circledast y$ must be given exactly by
$$ x \circledast y=\operatorname{fl}(x \ast y). $$

If this holds, then we find that the computer has a simple and powerful property.
Fundamental Axiom of Floating Point Arithmetic¶
For all $x, y \in \mathbb{F}$, there exists $\epsilon$ with $|\epsilon| \leq \frac{1}{2}\epsilon_{\text{M}}$ such that
$$ x \circledast y=(x \ast y)(1+\epsilon). $$

This means that every operation of floating point arithmetic is exact up to a relative error of size at most $\frac{1}{2}\epsilon_\text{M}$.
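A quick illustration of the axiom for a single pair (an addition; the exact reference sum is computed in BigFloat):

# The computed sum 0.1 + 0.2 versus the exact sum of the two Float64 values
x, y = 0.1, 0.2
computed = x + y                # rounded to the nearest Float64
exact = big(x) + big(y)         # exact: BigFloat has more than enough precision here
@show abs(computed - exact) / abs(exact) <= eps()/2   # true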
There is more to this for complex floating point arithmetic (HW: read page 100 in Trefethen & Bau).