*If you enjoy this post, subscribe using the form to the left! I try to make new posts every weekend, though sometimes life gets in the way. This one took 3 months!*

Since my last post was on the fast Fourier transform, I thought I would continue the theme of signal processing and write a post on wavelet analysis. However, after looking over the subject for a couple of ~~weekends~~ months, I’ve decided to split this topic into several posts.

This post is heavily informed by Gilbert Strang and Truong Nguyen’s “Wavelets and Filter Banks”. To be honest, the book is hard to read, so I’ll try to digest it as best I can for you here. Maybe I’m being hypocritical. I hope you can read this!

# Introduction

Like the Fourier transform, the wavelet transform outputs a different *representation* of a set of data. The original data is usually represented as a series of measurements in time. For example, in an audio recording the data is a time series of voltages across the capacitor plate of a microphone.

```
## Time CapacitorVoltage
## 00:000000 0.1
## 00:000001 0.13
## 00:000002 0.15
## ... ...
```

The Fourier transform outputs a representation of the data as a combination of constant tones:

```
## Frequency Volume
## C-tone 0.1
## D-tone 0.01
## F-tone 0
## ... ...
```

Think of music. It might be more efficient to represent data as a set of tones than a series of millions of voltages. However, the Fourier Transform has a weird trait that it represents the signal as a series of *constant* tones that play the *entire length of the recording*. There is a math theorem that says any function can be fully represented in that way, but that’s not how people think of music. Instead, people think of “notes” that happen at specific times in the recording, like sheet music. That is what the wavelet transform intends to represent. It would look something like this:

```
## Note Time Volume
## C-note 0 1
## D-note 0 0
## D-note 1 1
## ... ... ...
```

Representing certain types of data in this way is very efficient. Think of how sheet music is much easier to store than spreadsheets of billions of capacitor voltages. This is why many compression techniques for the web, like JPEG 2000, use it to send data efficiently over the wire without losing the “gist” of the image.

So how do we actually transform data into this representation? We need some sort of operation that gives us a compromise between the perfect time resolution of the original representation and the perfect frequency resolution of the Fourier transform.

What if I told you there already exists an operation that has both high resolution in frequency and high resolution in time? It’s an elementary circuit in analog electronics, called the high-pass filter. The high-pass filter takes a signal and filters out its lower-frequency components. An example output is illustrated below. I’ve drawn the results of a high-pass filter, labeled “HPF”, and a complementary low-pass filter, labeled “LPF”, with input coming from the right (same for both) and output going out to the left:

Notice that the low-pass output is “smoothed out” compared to the original signal. The spiky details are all in the output of the high-pass filter, i.e. those spikes are the high-frequency components. We could give the output of the HPF a label (“C”), and its output would be the volume of that note at each moment in time. If this were a real music track and we wanted to compress and denoise it, we might then filter out any entry of the data below a certain volume, and potentially get a good compression ratio. Unfortunately, since this diagram contains a stream of random noise, nothing really stands out.

What about the other notes? Notice that there are still bumps in the output of the low-pass filter. What if we applied another high-pass filter on its output, with a lower frequency bound? That would give us another “band” of frequencies we could represent with a “note”. That’s what the wavelet transform is: applying successive high-pass filters until we’ve completely separated the signal into frequency bands. And that’s it. You understand wavelet analysis! (qualitatively).

# Quantitative description

To get a quantitative understanding of the wavelet transform, first let’s build a model of the input signal. Let’s say there is some signal with amplitude \(x(t)\) that starts at \(t=0\). In other words, \(x(t) = 0\) or is undefined for \(t < 0\). Let’s also assume that \(x\) comes in equally spaced measurements, 1 unit of \(t\) apart, so we have \(x(0), x(1), x(2),\dots\) as a stream of data and nothing else.

Our input stream \(x(t)\) has a Fourier transform we’ll call \(X(\omega)\). Note that since \(x\) has a sample spacing of 1 time unit, \(X\) is only meaningfully defined between \(-\pi\) and \(\pi\). This can be seen from the definition of the Fourier transform:

\[ X(\omega) = \sum_{t=0}^{\infty} x(t) e^{-i \omega t}\]

Extending \(\omega\) past some range \(R\):

\[ X(\omega + R) = \sum_{t=0}^{\infty} x(t) e^{-i (\omega + R) t} = \sum_{t=0}^{\infty} x(t) e^{-i \omega t} e^{-i R t},\]

we see that \(X(\omega) = X(\omega + R)\) whenever \(Rt\) is an integer multiple of \(2\pi\) for every sample time \(t\). Therefore, if \(t\) is an integer, \(X(\omega)\) is redundant beyond \(R = 2\pi\). In general, if \(t = Tt'\), where \(t'\) is a natural number and \(T\) is the sample spacing, \(X\) is redundant beyond \(R = 2\pi/T\). The smaller the sample spacing in time, the wider the range of frequencies \(X\) can represent.
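This periodicity is easy to confirm numerically. Here is a quick check (my own sketch, not from the post) that for unit-spaced integer samples, \(X(\omega)\) repeats with period \(R = 2\pi\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)   # x(0), x(1), ..., x(63): unit-spaced samples
t = np.arange(64)

def X(omega):
    # X(omega) = sum_t x(t) exp(-i * omega * t)
    return np.sum(x * np.exp(-1j * omega * t))

# Since exp(-i * 2*pi * t) = 1 for integer t, the spectrum repeats:
assert np.isclose(X(0.7), X(0.7 + 2 * np.pi))
```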

Define a *filter* as an operation on \(X(\omega)\) that produces some output signal \(Y(\omega)\). This is usually some window function, \(H(\omega)\), so that:

\[ Y(\omega) = H(\omega) X(\omega).\]

One example of a window function is

\[ H(\omega) = \begin{cases} 1 & |\omega| > \pi/2 \\ 0 & \mbox{otherwise} \end{cases}.\]

This window function zeroes out any frequencies below \(\pi/2\) and lets the rest through. This is an example of a *high-pass filter*. It’s not the one we will use, though, because the time-domain version of a discontinuous window has infinitely many nonzero components, and we don’t want that when we implement the filters in time-space.

Let’s look at the output and window functions in time-space. To do this, I’ll just use the definition of the Fourier transform:

\[ \begin{align} Y(\omega) &= H(\omega) X(\omega) \\ \sum_t y(t) e^{-i \omega t} &= \left ( \sum_k h(k) e^{-i \omega k} \right ) \left ( \ \sum_{t'} x(t') e^{-i \omega t'} \right ) \\ &= \sum_{t'}\sum_{k}h(k)x(t')e^{-i \omega (k + t')}. \end{align}\]

So now on the left we have some expression with \(y(t)\), and on the right we have an expression in \(h(k)\) and \(x(t')\), with some exponential components. Notice that if we let \(t' = t - k\), the exponential components become the same:

\[ \sum_t y(t) e^{-i \omega t} = \sum_{t}\sum_{k}h(k)x(t-k)e^{-i \omega t}. \]

Therefore we can isolate a relationship between \(y(t)\), \(h(k)\), and \(x(t)\):

\[ y(t) = \sum_{k}h(k)x(t-k). \]

How can we interpret this? First, let’s try an example. Since we can choose any filter that we want, let’s try a filter where \(h(0) = 1/2\) and \(h(1) = 1/2\), 0 otherwise. Then

\[ y(t) = \frac{1}{2}x(t) + \frac{1}{2} x(t-1). \]

This is a 2-point moving average. Going back to the functional form of \(y(t)\), you can see that the filter is just some *rolling statistic* on \(x\). This concept struck me when I first realized it.
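The rolling-statistic view is easy to check numerically. Since \(y(t) = \sum_k h(k) x(t-k)\) is exactly a convolution, here is a short sketch (my own, using NumPy; not from the post) of the 2-point moving average:

```python
import numpy as np

h = np.array([0.5, 0.5])                   # h(0) = h(1) = 1/2: 2-point moving average
x = np.array([0.1, 0.13, 0.15, 0.2, 0.1])  # toy stream of measurements

# y(t) = h(0) x(t) + h(1) x(t-1), with x(t) = 0 for t < 0
y = np.convolve(x, h)[:len(x)]

# Each entry is the average of a sample and its predecessor:
assert np.isclose(y[1], 0.5 * (0.13 + 0.1))
```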

## Moving average and difference as low and high pass filters

The 2-point moving average and difference, which I will call \(y_l\) and \(y_h\), are very simple:

\[ y_l(t) = \frac{1}{2}x(t) + \frac{1}{2} x(t-1),\mbox{ and} \] \[ y_h(t) = \frac{1}{2}x(t) - \frac{1}{2} x(t-1). \]

It turns out that they function as complementary filters.

### Moving average and difference in frequency space

Let’s see how these rolling statistics act in frequency space. Remember that we just need to take the Fourier transform on \(h\) to get to the function \(H\) that multiplies the frequency spectrum \(X\).

Let’s do that. For \(h_l(k)\), the Fourier transform is

\[ \begin{align} H_l(\omega)& = \sum_k h_l(k) e^{-i \omega k}\\ &= h_l(0) + h_l(1) e^{-i \omega} \\ &= \frac{1}{2} (1 + e^{-i \omega}) \\ &= \frac{1}{2}e^{-i\omega/2} (e^{i \omega/2} + e^{-i \omega/2}) \\ &= e^{-i \omega /2} \cos(\omega/2) . \end{align}\]

To see what frequencies are being selected, we plot the magnitude \(|H_l(\omega)| = \cos(\omega/2)\) from \(-\pi\) to \(\pi\):

The magnitude is near 1 for frequencies \(\omega\) near 0, and goes to zero near the limits, \(-\pi\) and \(\pi\).

We can do the same for \(h_h(k)\). The Fourier transform is:

\[ \begin{align} H_h(\omega)& = \sum_k h_h(k) e^{-i \omega k}\\ &= h_h(0) + h_h(1) e^{-i \omega} \\ &= \frac{1}{2} (1 - e^{-i \omega}) \\ &= \frac{1}{2}e^{-i\omega/2} (e^{i \omega/2} - e^{-i \omega/2}) \\ &= e^{-i \omega /2} i \sin(\omega/2) . \end{align}\]

The plot of the magnitude \(|H_h(\omega)| = |\sin(\omega/2)|\) appears to be the complement of \(|H_l(\omega)|\):

This function goes to 1 at the endpoints \(-\pi\) and \(\pi\), but goes to zero for frequencies near zero.

Notice that neither of these filters is anywhere close to an “ideal filter”, i.e. each still lets through some of the opposing frequencies. That’s okay, because this is just a simple example. It would be an interesting exercise to find different rolling statistics that act as much better filters; one could probably do so by reverse-engineering a desired frequency spectrum.
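We can also verify the complementary shapes numerically. A small check (mine, not from the book or post): the two magnitudes trace \(\cos(\omega/2)\) and \(|\sin(\omega/2)|\), and the pair is “power complementary” in the sense that \(|H_l|^2 + |H_h|^2 = 1\) across the whole band:

```python
import numpy as np

omega = np.linspace(-np.pi, np.pi, 201)
Hl = 0.5 * (1 + np.exp(-1j * omega))   # moving-average (low-pass) response
Hh = 0.5 * (1 - np.exp(-1j * omega))   # moving-difference (high-pass) response

assert np.allclose(np.abs(Hl), np.abs(np.cos(omega / 2)))
assert np.allclose(np.abs(Hh), np.abs(np.sin(omega / 2)))

# Power complementarity: together the two filters cover the full band.
assert np.allclose(np.abs(Hl)**2 + np.abs(Hh)**2, 1.0)
```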

### Complementarity

These two filters complement each other. Notice that we can reconstruct the full signal by taking the sum and the difference of the two filter outputs:

\[ y_l(t) + y_h(t) = \frac{1}{2}\left ( x(t) + x(t-1) + x(t) - x(t-1)\right ) = x(t). \] \[ y_l(t) - y_h(t) = \frac{1}{2}\left ( x(t) + x(t-1) - x(t) + x(t-1) \right )= x(t-1). \]

It is **very important** that the difference results in a series lagged by one time unit. It means we can reconstruct the full signal using only every second member of both \(y_l(t)\) and \(y_h(t)\): if we have \(N\) samples of \(x(t)\), we only need \(N/2\) samples of each of \(y_l\) and \(y_h\). Remember from the introduction how we planned to apply another high-pass filter to the low-pass filter output? That will give us two series with \(N/4\) samples each. Applying the filters to the low-pass output again will give us two with \(N/8\) samples, eventually getting down to 1 sample, at which point we’ll stop. The complementarity of these two filters is important because it lets us downsample without losing information, so the data halves at each stage and our algorithm will *terminate*.
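Here is a quick sketch (my own NumPy code, not from the post) of this reconstruction from only the kept samples \(t = 1, 3, 5, \dots\), using the sum and difference identities above:

```python
import numpy as np

x = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0])
t = np.arange(1, len(x), 2)        # keep only t = 1, 3, 5, 7

yl = 0.5 * (x[t] + x[t - 1])       # downsampled moving average
yh = 0.5 * (x[t] - x[t - 1])       # downsampled moving difference

x_rec = np.empty_like(x)
x_rec[t] = yl + yh                 # y_l(t) + y_h(t) = x(t)
x_rec[t - 1] = yl - yh             # y_l(t) - y_h(t) = x(t-1)

# Perfect reconstruction from N/2 + N/2 kept samples:
assert np.allclose(x_rec, x)
```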

## Building the Wavelet pyramid

So now we have a complementary low-pass filter and high-pass filter, and a plan to run a recursive algorithm: split a signal (blue) into a high-pass output (yellow) and a low-pass output (red), then downsample each output so that it contains only the even points, which is all we need to fully reproduce the signal:

and run the algorithm on the low-pass output,

until we are left with the ultimate low pass filter, the full average.

For 8 input signal points, we output 8 “wavelet” points: each level in the y-direction indicates an individual frequency band, and each point in the x-direction is a measurement of the volume of that frequency band. Notice that there are fewer measurements the lower the frequency goes; the lower bands have less “time-resolution”. That’s fine, because signals that vary slowly in time need less of it.

I hope these diagrams help you understand the concept; now I’m going to put them into matrix form. I’m **skipping** the pre-downsampling step in the interest of space. I’ll use the following notation: \(y_h\) is the high-pass output after downsampling, \(y_l\) is the low-pass output after downsampling, \(y_{lh}\) is the output after low-pass filtering, downsampling, high-pass filtering that, and downsampling again, and so on for \(y_{ll}\), \(y_{llh}\), and \(y_{lll}\). Assuming that \(x\) is 8 samples long, the matrix equation looks like this for the first step:

\[ \begin{bmatrix} y_h(1) \\ y_h(3) \\ y_h(5) \\ y_h(7) \\ y_l(1) \\y_l(3) \\ y_l(5) \\ y_l(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \]
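As a sanity check, here is a short sketch (my own, using NumPy; not from the post) confirming that the matrix above is just the two downsampled filters stacked on top of each other:

```python
import numpy as np

x = np.arange(8, dtype=float)           # x(0), ..., x(7)

# The downsampled filters, kept at t = 1, 3, 5, 7:
yh = 0.5 * (x[1::2] - x[0::2])          # y_h: moving difference
yl = 0.5 * (x[1::2] + x[0::2])          # y_l: moving average

# The 8x8 matrix from the first step of the pyramid:
M = 0.5 * np.array([
    [-1, 1,  0, 0,  0, 0,  0, 0],
    [ 0, 0, -1, 1,  0, 0,  0, 0],
    [ 0, 0,  0, 0, -1, 1,  0, 0],
    [ 0, 0,  0, 0,  0, 0, -1, 1],
    [ 1, 1,  0, 0,  0, 0,  0, 0],
    [ 0, 0,  1, 1,  0, 0,  0, 0],
    [ 0, 0,  0, 0,  1, 1,  0, 0],
    [ 0, 0,  0, 0,  0, 0,  1, 1],
])
assert np.allclose(M @ x, np.concatenate([yh, yl]))
```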

Now that we have 4 high-pass points and 4 low-pass points, what will the matrix look like after a recursive pass on the low-pass output? The problem looks the same, just using \(y_l\) as the starting point and half as big:

\[ \begin{bmatrix} y_{lh}(3) \\ y_{lh}(7) \\y_{ll}(3) \\ y_{ll}(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} y_l(1) \\ y_l(3) \\y_l(5) \\ y_l(7) \end{bmatrix} \]

We can plug in the definition of \(y_l\) to get \(y_{lh}\) and \(y_{ll}\) in terms of \(x\).

\[ \begin{align} \begin{bmatrix} y_{lh}(3) \\ y_{lh}(7) \\y_{ll}(3) \\ y_{ll}(7) \end{bmatrix} &= \frac{1}{4} \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ &= \frac{1}{4} \begin{bmatrix} -1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ \end{align} \]

You can see that \(y_{lh}(3)\) is \(\frac{1}{4}\left(x(3) + x(2) - x(1) - x(0)\right)\): a 4-point moving difference. It’s like a stretched-out version of the original high-pass filter. Likewise \(y_{ll}\) is a 4-point moving average, a stretched-out version of the original low-pass filter.

Let me recursively apply the algorithm one last time:

\[ \begin{align} \begin{bmatrix} y_{llh}(7) \\ y_{lll}(7) \end{bmatrix} &= \frac{1}{2} \begin{bmatrix} -1 & 1 \\ 1 & 1 \\ \end{bmatrix} \begin{bmatrix} y_{ll}(3) \\ y_{ll}(7) \end{bmatrix} \\ \begin{bmatrix} y_{llh}(7) \\ y_{lll}(7) \end{bmatrix} &= \frac{1}{8} \begin{bmatrix} -1 & 1 \\ 1 & 1 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ &= \frac{1}{8} \begin{bmatrix} -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ \end{align} \]

The third pass of the algorithm gives us an 8-point moving difference and average.
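The whole pyramid fits in a few lines of code. Here is a recursive sketch (my own NumPy code, not the book’s): keep each downsampled moving difference as a band and recurse on the downsampled moving average:

```python
import numpy as np

def haar_pyramid(x):
    """Repeatedly split into a downsampled moving difference (kept as a
    frequency band) and a downsampled moving average (recursed on)."""
    x = np.asarray(x, dtype=float)
    bands = []
    while len(x) > 1:
        yh = 0.5 * (x[1::2] - x[0::2])   # high-pass band, half the samples
        x = 0.5 * (x[1::2] + x[0::2])    # low-pass output, fed back in
        bands.append(yh)
    bands.append(x)                       # the ultimate low pass: the full average
    return bands

bands = haar_pyramid([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# 8 inputs produce 8 wavelet points across bands of size 4, 2, 1, plus the average:
assert [len(b) for b in bands] == [4, 2, 1, 1]
assert np.allclose(bands[-1], np.mean([1, 2, 3, 4, 5, 6, 7, 8]))
```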

The recursion stops here because you can no longer do any filters on one point of data. If we put all the high-pass results together in one matrix, it looks something like:

\[ \begin{bmatrix} y_h(1) \\ y_h(3) \\ y_h(5) \\ y_h(7) \\ y_{lh}(3) \\y_{lh}(7) \\ y_{llh}(7) \\ y_{lll}(7) \end{bmatrix} = \begin{bmatrix} -h & h & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -h & h & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h & h & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -h & h \\ -h^2 & -h^2 & h^2 & h^2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h^2 & -h^2 & h^2 & h^2 \\ -h^3 & -h^3 & -h^3 & -h^3 & h^3 & h^3 & h^3 & h^3 \\ h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \]

where \(h = \frac{1}{2}\). The matrix on the right is the normalized Haar matrix, corresponding to the Haar wavelet transform. The Haar wavelet transform was discovered in 1909 by Alfred Haar as just an example of an “orthogonal system” one could project a function onto. It wasn’t considered a “wavelet” until around 75 years later, but it is. At each level of filter, we have filtered into some band of frequency which you can think of as a “tone”. We could label \(y_h = C\), \(y_{lh} = B\), and \(y_{llh}= A\) (and \(y_{lll}\) would just be some baseline background voltage). We also have some notion of “when” they happen in time, more for higher-frequency notes than for the lower-frequency notes. To complete the music analogy, our transform looks like this:

\[ \begin{bmatrix} \mbox{C-note}(t=1) \\ \mbox{C-note}(t=3) \\ \mbox{C-note}(t=5) \\ \mbox{C-note}(t=7) \\ \mbox{B-note}(t=3) \\\mbox{B-note}(t=7) \\ \mbox{A-note}(t=7) \\ \mbox{Background} \end{bmatrix} = \begin{bmatrix} -h & h & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -h & h & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h & h & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -h & h \\ -h^2 & -h^2 & h^2 & h^2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h^2 & -h^2 & h^2 & h^2 \\ -h^3 & -h^3 & -h^3 & -h^3 & h^3 & h^3 & h^3 & h^3 \\ h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 \\ \end{bmatrix} \begin{bmatrix} \mbox{Voltage(0)} \\ \mbox{Voltage(1)} \\ \mbox{Voltage(2)} \\ \mbox{Voltage(3)} \\ \mbox{Voltage(4)} \\ \mbox{Voltage(5)} \\ \mbox{Voltage(6)} \\ \mbox{Voltage(7)} \end{bmatrix} \]

# What is a wavelet?

All this talk about wavelet transforms, and I haven’t said much about wavelets themselves. A wavelet is a function that becomes the new basis of a coordinate transform, like the complex exponentials in a Fourier series. Each row of the Haar matrix is a basis vector. So, in our case, the wavelet looks like the function:

\[ f(x) = \begin{cases} 1 & 0 \le x < 1/2 \\ -1 & 1/2 \le x < 1 \\ 0 & \mbox{otherwise} \end{cases},\]

taken at different scales and shifts. When an incoming signal equals a wavelet function, you should see the wavelet transform place it into the appropriate frequency bucket.

The following is an animation of the wavelet transform in action, sorting output into different wavelet frequencies. The input is 256 measurements rolling in from the left into a buffer that extends 256 entries to the left. This output is split into \(\log_2(N) - 1 = 7\) wavelet scales using the filters we described above.

You might notice that some of the lower-frequency filters update less often. That’s because the wavelet transform is not “shift-invariant”: if you shift the input signal by one unit of time, you don’t get the output shifted by one unit of time; you get a different answer. Think of the signal where \(x(t=1) = 1\) but 0 everywhere else. When I run the moving difference on this, \(y_h(1) = x(1) - x(0) = 1 - 0 = 1\). But let me shift the signal one step earlier, so that \(x(t=0) = 1\) now. Now \(y_h(1) = x(1) - x(0) = 0 - 1 = -1\). \(y_h(1)\) has flipped sign. You might say “hey, but now \(y_h(0) = x(0) - x(-1) = 1 - 0 = 1\)”, but *we don’t have \(y_h(0)\) because we downsampled*. The only way to get shift invariance is to shift the signal by one whole window length of the filter, in this case 2. **This is actually a huge problem and I really don’t like it. In any practical application, you aren’t going to be able to align your filter windows perfectly with the incoming signal, so you are going to pick up wavelet frequencies that are misleading. You can see examples of this phenomenon by unchecking the box “Align Sent Wavelet Window”: wavelets get aliased at other frequencies. If anyone knows how to make a wavelet transform time-invariant or shift-invariant, I am super interested in that!**
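The impulse example above is easy to reproduce. A small sketch (mine, keeping the \(1/2\) factor from our filter definition, so the coefficients come out as \(\pm 1/2\) rather than \(\pm 1\)):

```python
import numpy as np

def haar_detail(x):
    """Downsampled moving difference: y_h kept at t = 1, 3, 5, ... only."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (x[1::2] - x[0::2])

a = haar_detail([0, 1, 0, 0])   # unit impulse at t = 1
b = haar_detail([1, 0, 0, 0])   # same impulse shifted one step earlier

# Instead of the coefficients shifting, the first one flips sign:
assert np.allclose(a, [0.5, 0.0])
assert np.allclose(b, [-0.5, 0.0])
```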

**Comment from Matias:** Nice article! Please take a look at DTCWT/DTCWPT for shift-invariance:

https://github.com/neurobiofisica/gymnotools/tree/master/dtcwpt

https://github.com/hgomersall/dtcwt