I’m interrupting my series on wavelets to do a post on something I’ve been interested in doing for a while. I thought it would be a cool “spy device” to create something that can filter out voices from a recording of a crowded room, and I think I can do that with concepts I’ve previously developed in this blog. My intuition:
Every person has a unique set of vocal cords that outputs a certain signature of frequencies.
As that person talks, the volume at those frequencies will increase and decrease.
A person’s voice should be the biggest source of volume change at these frequencies.
Therefore, since principal component analysis (PCA) can pick out sets of frequencies that vary with each other, I believe that individuals’ voices can be isolated from a crowded recording by dividing the recording into equally-sized blocks, applying the Fourier transform to each block (these two steps together are commonly called the short-time Fourier transform), and then doing PCA in frequency space across all blocks. The individual voices should be the projection of the recording onto the principal components.
Note from the future: Turns out that I was wrong. But here’s how I did it:
I’ve downloaded a .wav file of a crowded bar from Freesound.org user lonemonk. The author notes that there are around 50 people in the recording; give it a listen to see if you can pick anything out.
Let’s talk about the storage format. A .wav file is basically raw sound data, meaning that the entries can be interpreted as voltages across a microphone’s sound plate. These voltages are stored in a preset number of bits, are sampled a certain number of times per second, and typically come in 1 or 2 channels for mono or stereo (two-eared) sound. I’ll use Python to figure out these parameters for this file, using the wave library. I’ve saved off the audio to a file called crowded_bar.wav:
import wave
import numpy
import struct
voices = wave.open('crowded_bar.wav', 'r')
print "Number of channels: " + str(voices.getnchannels())
print "Sample bytewidth: " + str(voices.getsampwidth())
print "Framerate: " + str(voices.getframerate())
print "Number of frames: " + str(voices.getnframes())
## Number of channels: 2
## Sample bytewidth: 2
## Framerate: 44100
## Number of frames: 15478784
So it looks like we have 2 audio channels; each sample is 2 bytes long, so we can represent it as a 16-bit integer; and there are 44100 samples per second for a total of 15478784 frames. That means our recording is \(15478784 \mbox{ frames}/44100 \mbox{ frames/s} \approx 351 \mbox{ seconds}\). This agrees with the runtime of the audio, 5 minutes and 51 seconds.
So let’s start coding this up. I’m separating the recording into 10,000 chunks, each chunk with 1547 samples.
num_chunks = 10000
chunk_width = voices.getnframes()/num_chunks
The samples are stored in 16-bit, little-endian format. Python has a nice function called struct.unpack for “unpacking” this data into an int:
## unpack little-endian ints
samples = [[struct.unpack("<h",
## read the first 2 bytes of 1 frame. The
## total frame is 4 bytes long, 2 for each
## sample, and since the output is a tuple
## take the 0th output.
voices.readframes(1)[0:2])[0]
## for all samples in the chunk
for i in range(chunk_width)]
## for all chunks in the recording.
for y in range(num_chunks)]
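As a side note, reading frames one at a time through struct.unpack is quite slow. A vectorized equivalent (a sketch, assuming Python 3 and numpy; the helper name is mine) reads the whole stream at once and keeps only the left channel:

```python
import wave
import numpy

def read_left_channel_chunks(path, num_chunks):
    with wave.open(path, 'r') as w:
        raw = w.readframes(w.getnframes())
    # 16-bit little-endian samples; stereo frames interleave L, R, L, R, ...
    data = numpy.frombuffer(raw, dtype='<i2')
    left = data[::2]  # keep the left channel only
    chunk_width = len(left) // num_chunks
    # drop any leftover samples so the chunks tile evenly
    return left[:num_chunks * chunk_width].reshape(num_chunks, chunk_width)
```

This produces the same num_chunks-by-chunk_width array of left-channel samples as the comprehension above, in a fraction of the time.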
Now apply the Fourier transform to each block:
print "Starting fft..."
def print_and_fft(x, i):
    print i
    return numpy.fft.fft(x)
transformed = numpy.matrix(map(print_and_fft, samples, range(len(samples))))
print "Done fft."
In the next step, I use only the 1st through \(N/2\)th elements of the Fourier transform. The input of this algorithm is real, which means the second half of the Fourier components are the complex conjugates of the first half in reverse order, and therefore redundant. I leave the 0th component out because those entries are the mean voltages of the chunks; if I leave them in, I’ve found in practice that when I project onto a principal component, the mean becomes an imaginary number. I don’t want that, so I just set them to zero, effectively filtering out the lowest frequency.
means = [0 for x in transformed[:,0]]
transformed = transformed[:,1:(transformed.shape[1]/2+1)]
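For reference, numpy can produce the non-redundant half directly: numpy.fft.rfft returns the \(N/2+1\) non-negative-frequency components of a real input, which matches the slicing above. A small sketch of the symmetry:

```python
import numpy

x = numpy.random.randn(16)          # a real-valued chunk
full = numpy.fft.fft(x)
half = numpy.fft.rfft(x)            # components 0 .. N/2 inclusive

# rfft matches the first N/2 + 1 components of the full transform,
assert numpy.allclose(half, full[:9])
# and the remaining components are their complex conjugates, reversed.
assert numpy.allclose(full[9:], numpy.conj(full[1:8])[::-1])
```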
The next step is the PCA, finding the eigenvectors of the covariance matrix. See my recent post on PCA for an explanation of this calculation.
mean_zero = numpy.matrix(numpy.apply_along_axis(lambda x: x - numpy.mean(x), 0, transformed))
cov = numpy.dot(mean_zero.getH(), mean_zero)
print "Starting eigenvector solve..."
w, v = numpy.linalg.eig(cov)
print "Done eigenvector solve..."
Now I want to project the spectrum of each chunk onto the principal components; here’s how I do that. It’s basically the formula for projecting a vector \(\vec v\) onto a unit vector \(\hat e\): \(\langle \vec v, \hat e \rangle \hat e\).
def projected_spectrum(component):
    return numpy.dot(numpy.dot(transformed.conj(), v[:,component]), v.transpose()[component,:])
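A toy example of that projection formula, with made-up numbers for illustration:

```python
import numpy

# Projection of v onto a unit vector e: <v, e> e
v = numpy.array([3.0, 4.0])
e = numpy.array([1.0, 0.0])        # already unit length

projection = numpy.dot(v, e) * e
assert numpy.allclose(projection, [3.0, 0.0])
```

For complex spectra the inner product conjugates one argument, which is why the code above applies .conj() before the dot product.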
Now we want to invert these projections, and see what the output audio is:
def print_and_ifft(x, i):
    print i
    return numpy.fft.ifft(x)
def write_component_to_file(component):
    projected = projected_spectrum(component)
    ## rebuild the full spectrum: zeroed mean, positive frequencies,
    ## then their conjugates in reverse order
    projected = numpy.column_stack((means, projected, numpy.fliplr(projected.conj())))
    inverted = map(print_and_ifft, projected.tolist(), range(len(projected)))
    first_component = wave.open(str(component) + '.wav', 'w')
    width = projected.shape[0]
    height = projected.shape[1]
    first_component.setparams(
        (1, # nchannels
         2, # sampwidth
         voices.getframerate(), # framerate
         width * height, # nframes
         voices.getcomptype(),
         voices.getcompname()))
    ## pack the real part of each inverted sample back into
    ## 16-bit, little-endian frames
    frames = "".join(["".join([struct.pack("<h", int(sample.real))
                               for sample in chunk])
                      for chunk in inverted])
    first_component.writeframes(frames)
    first_component.close()
Now let’s write the first 5 components to files. Are they voices?
write_component_to_file(0)
write_component_to_file(1)
write_component_to_file(2)
write_component_to_file(3)
write_component_to_file(4)
Nope!!
After fiddling around for a night with the settings, I’ve decided that my intuition was wrong. People’s voices don’t stick to a fixed set of frequencies; the frequencies change constantly as they speak. For example, a long, sinking oooooooohhh sound exhibits a pretty constant downward drift in frequency. Since everybody’s vocal frequencies are constantly changing and crossing paths with each other, I don’t think you can isolate one voice by looking at a constant cross section of frequencies. The problem is harder than that.
After doing some research I’ve discovered that this is indeed a pretty hard problem in signal processing, called blind signal separation, also known as the cocktail party problem. Apparently deep learning has been used to solve the problem. That’s interesting, but I somewhat resent a “black box” solution. I’m going to table my exploration of this for now, but it was interesting to work with audio data.
Wavelets – part 1
http://michaeljflynn.net/2017/06/26/waveletspart1/
Mon, 26 Jun 2017 04:21:45 +0000
If you enjoy this post, subscribe using the form to the left! I try to make new posts every weekend, though sometimes life gets in the way. This one took 3 months!
Since my last post was on the fast Fourier transform I thought I would continue the theme of signal processing and write a post on wavelet analysis. However, after looking over the subject for a couple of months I’ve decided to split this topic into several posts.
This post is heavily informed by Gilbert Strang and Truong Nguyen’s “Wavelets and Filter Banks”. To be honest, the book is hard to read, so I’ll try to digest it the best I can for you here. Maybe I’m being hypocritical. I hope you can read this!
Introduction
Like the Fourier transform, the wavelet transform outputs a different representation of a set of data. The original data is usually represented as a series of measurements in time. For example, in an audio recording the data is a time series of voltages across the capacitor plate of a microphone.
## Time       Capacitor Voltage
## 00:000000  0.1
## 00:000001  0.13
## 00:000002  0.15
## ...        ...
The Fourier transform outputs a representation of the data as a combination of constant tones:
## Frequency  Volume
## C-tone     0.1
## D-tone     0.01
## F-tone     0
## ...        ...
Think of music. It might be more efficient to represent data as a set of tones than as a series of millions of voltages. However, the Fourier transform has a weird trait: it represents the signal as a series of constant tones that play for the entire length of the recording. There is a math theorem that says any function can be fully represented in that way, but that’s not how people think of music. Instead, people think of “notes” that happen at specific times in the recording, like sheet music. That is what the wavelet transform intends to represent. It would look something like this:
## Note     Time  Volume
## C-note   0     1
## D-note   0     0
## D-note   1     1
## ...      ...   ...
Representing certain types of data in this way is very efficient. Think of how sheet music is much easier to store than spreadsheets of billions of capacitor voltages. This is why many compression techniques for the web, like JPEG 2000, use it to send data efficiently over the wire without losing the “gist” of the image.
So how do we actually transform data into this representation? We need some operation that gives us a compromise between the perfect time resolution of the original representation and the perfect frequency resolution of the Fourier transform.
What if I told you there already exists an operation that has both high resolution in frequency and high resolution in time? It’s an elementary circuit in analog electronics, called the high-pass filter. The high-pass filter takes a signal and filters out the lower-frequency components of that signal. An example output is illustrated below. I’ve drawn the results of a high-pass filter, labeled “HPF”, and a complementary low-pass filter, labeled “LPF”, with input coming from the right (same for both) and output going out to the left:
Notice that the low-pass output is “smoothed out” compared to the original signal. The spiky details are all in the output of the high-pass filter, i.e. those spikes are the high-frequency components. We could label the output of HPF “C”, and the output would be the volume of that note at each moment in time. If this were a real music track and we wanted to compress and denoise, we might then filter out any entry of the data that is below a certain volume, and potentially get a good compression ratio. Unfortunately, since this diagram contains a stream of random noise, nothing really stands out.
What about the other notes? Notice that there are still bumps in the output of the low-pass filter. What if we applied another high-pass filter to its output, with a lower frequency bound? That would give us another “band” of frequencies we could represent with a “note”. That’s what the wavelet transform is: applying successive high-pass filters until we’ve completely separated the signal into frequency bands. And that’s it. You understand wavelet analysis! (Qualitatively.)
Quantitative description
To get a quantitative understanding of the wavelet transform, first let’s build a model of the input signal. Let’s say there is some signal with amplitude \(x(t)\) that starts at \(t=0\). In other words, \(x(t) = 0\) or is undefined for \(t < 0\). Let’s also assume that \(x\) comes in equally spaced measurements, 1 unit of \(t\) apart, so we have \(x(0), x(1), x(2),\dots\) as a stream of data and nothing else.
Our input stream \(x(t)\) has a Fourier transform we’ll call \(X(\omega)\). Note that since \(x\) has a time resolution of 1 entry per second, \(X\) is only meaningfully defined between \(-\pi\) and \(\pi\). This can be seen from the definition of the Fourier transform:
\[ X(\omega) = \sum_{t=0}^{\infty} x(t) e^{i \omega t}\]
Extending \(\omega\) past some range \(R\):
\[ X(\omega + R) = \sum_{t=0}^{\infty} x(t) e^{i (\omega + R) t}
= \sum_{t=0}^{\infty} x(t) e^{i \omega t} e^{i R t},\]
we see that \(X(\omega) = X(\omega + R)\) whenever \(Rt\) is an integer multiple of \(2\pi\). Therefore, if \(t\) is an integer, \(X(\omega)\) is redundant beyond \(R = 2\pi\). In general, if \(t = Tt'\), where \(t'\) is a natural number and \(T\) is the sample spacing, \(X\) is redundant beyond \(R = 2\pi/T\). The smaller the sample spacing in time, the wider the range of distinct frequencies.
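This redundancy is easy to check numerically (a sketch): two tones whose angular frequencies differ by \(2\pi\) per sample produce identical samples, so they have identical transforms.

```python
import numpy

t = numpy.arange(32)                         # integer sample times (T = 1)
omega = 0.7
x1 = numpy.cos(omega * t)
x2 = numpy.cos((omega + 2 * numpy.pi) * t)   # frequency shifted by R = 2*pi

# sampled at integer times, the two tones are indistinguishable
assert numpy.allclose(x1, x2)
```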
Define a filter as an operation on \(X(\omega)\) that produces some output signal \(Y(\omega)\). This is usually some window function, \(H(\omega)\), so that:
\[ Y(\omega) = H(\omega) X(\omega).\]
One example of a window function is
\[ H(\omega) = \begin{cases} 1 & |\omega| > \pi/2 \\ 0 & \mbox{otherwise} \end{cases}.\]
This window function zeroes out any frequencies with magnitude less than \(\pi/2\) and lets the rest through. This is an example of a high-pass filter. It’s not the one we will use, though, because the Fourier transform of a discontinuous function has infinitely many components, and we don’t want that when we implement the filters in time space.
Let’s look at the output and window functions in time space. To do this, I’ll just use the definition of the Fourier transform:
\[ \begin{align} Y(\omega) &= H(\omega) X(\omega) \\
\sum_t y(t) e^{i \omega t} &= \left ( \sum_k h(k) e^{i \omega k} \right ) \left ( \ \sum_{t'} x(t') e^{i \omega t'} \right ) \\
&= \sum_{t'}\sum_{k}h(k)x(t')e^{i \omega (k + t')}.
\end{align}\]
So now on the left we have some expression with \(y(t)\), and on the right we have an expression of \(h(k)\) and \(x(t)\), with some exponential components. Notice that if we let \(t' = t - k\), the exponential components become the same:
\[
\sum_t y(t) e^{i \omega t} = \sum_{t}\sum_{k}h(k)x(t-k)e^{i \omega t}.
\]
Therefore we can isolate a relationship between \(y(t)\), \(h(k)\), and \(x(t)\):
\[
y(t) = \sum_{k}h(k)x(t-k).
\]
How can we interpret this? First, let’s try an example. Since we can choose any filter that we want, let’s try a filter where \(h(0) = 1/2\) and \(h(1) = 1/2\), 0 otherwise. Then
\[ y(t) = \frac{1}{2}x(t) + \frac{1}{2} x(t-1). \]
This is a 2-point moving average. Going back to the functional form of \(y(t)\), you can see that the filter is just some rolling statistic on \(x\). This concept struck me when I first realized it.
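This sum is exactly a discrete convolution, so any such filter can be computed with numpy.convolve (a sketch using the 2-point moving average above):

```python
import numpy

x = numpy.array([1.0, 3.0, 5.0, 7.0])
h = numpy.array([0.5, 0.5])        # h(0) = h(1) = 1/2

# y(t) = sum_k h(k) x(t - k): the 2-point moving average
y = numpy.convolve(x, h)
assert numpy.allclose(y, [0.5, 2.0, 4.0, 6.0, 3.5])
```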
Moving average and difference as low- and high-pass filters
The 2-point moving average and difference, which I will call \(y_l\) and \(y_h\), are very simple:
\[ y_l(t) = \frac{1}{2}x(t) + \frac{1}{2} x(t-1),\mbox{ and} \] \[ y_h(t) = \frac{1}{2}x(t) - \frac{1}{2} x(t-1). \]
It turns out that they function as complementary filters.
Moving average and difference in frequency space
Let’s see how these rolling statistics act in frequency space. Remember that we just need to take the Fourier transform of \(h\) to get the function \(H\) that multiplies the frequency spectrum \(X\).
Let’s do that. For \(h_l(k)\), the Fourier transform is
\[ \begin{align} H_l(\omega)& = \sum_k h_l(k) e^{i \omega k}\\
&= h_l(0) + h_l(1) e^{i \omega} \\
&= \frac{1}{2} (1 + e^{i \omega}) \\
&= \frac{1}{2}e^{i\omega/2} (e^{-i \omega/2} + e^{i \omega/2}) \\
&= e^{i \omega /2} \cos(\omega/2) . \end{align}\]
To see what frequencies are being selected, we plot the magnitude \(|H_l(\omega)| = |\cos(\omega/2)|\) from \(-\pi\) to \(\pi\):
The magnitude is near 1 for frequencies \(\omega\) near 0, and goes to zero near the limits, \(-\pi\) and \(\pi\).
We can do the same for \(h_h(k)\). The Fourier transform is:
\[ \begin{align} H_h(\omega)& = \sum_k h_h(k) e^{i \omega k}\\
&= h_h(0) + h_h(1) e^{i \omega} \\
&= \frac{1}{2} (1 - e^{i \omega}) \\
&= \frac{1}{2}e^{i\omega/2} (e^{-i \omega/2} - e^{i \omega/2}) \\
&= -e^{i \omega /2}\, i \sin(\omega/2) . \end{align}\]
The plot of the magnitude \(|H_h(\omega)|\) seems to be the complement of \(|H_l(\omega)|\):
This function goes to 1 at the endpoints \(-\pi\) and \(\pi\), but goes to zero for frequencies near zero.
Notice that neither of these filters is close to an “ideal filter”: each still lets through some of the opposing frequencies. That’s okay, because this is just a simple example. It would be an interesting exercise to find different rolling statistics that act as much better filters; one could probably do so by reverse-engineering a desired frequency spectrum.
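One nice property this pair does have: the squared magnitudes sum to one, \(\cos^2(\omega/2) + \sin^2(\omega/2) = 1\), so no frequency is lost between the two outputs. A quick numerical check (a sketch):

```python
import numpy

omega = numpy.linspace(-numpy.pi, numpy.pi, 101)
H_l = numpy.abs(0.5 * (1 + numpy.exp(1j * omega)))   # |cos(omega/2)|
H_h = numpy.abs(0.5 * (1 - numpy.exp(1j * omega)))   # |sin(omega/2)|

# the squared magnitudes sum to one at every frequency
assert numpy.allclose(H_l**2 + H_h**2, 1.0)
```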
Complementarity
These two filters complement each other. Notice that we can reconstruct the full signal by taking the sum or the difference of the two filter outputs.
\[ y_l(t) + y_h(t) = \frac{1}{2}\left ( x(t) + x(t-1) + x(t) - x(t-1)\right ) = x(t). \] \[ y_l(t) - y_h(t) = \frac{1}{2}\left ( x(t) + x(t-1) - x(t) + x(t-1) \right )= x(t-1). \]
It is very important that the difference results in a series lagged by one time unit: it means we can reconstruct the full signal using only every other member of \(y_l(t)\) and \(y_h(t)\). So if we have \(N\) samples of \(x(t)\), we only need \(N/2\) samples of each of \(y_l\) and \(y_h\) to reconstruct the signal. Remember from the introduction how we planned to apply another high-pass filter to the low-pass filter output? Then we’ll get two series with \(N/4\) samples each. Applying the filters to the low-pass output again will give us two with \(N/8\) samples, eventually getting down to 1 sample, at which point we’ll stop. The complementarity of these two filters is important because it means our algorithm will terminate.
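Here is that reconstruction in code (a sketch): keep only every other sample of \(y_l\) and \(y_h\), then recover the pairs \(x(t)\), \(x(t-1)\) from their sum and difference.

```python
import numpy

x = numpy.array([4.0, 2.0, 6.0, 0.0, 1.0, 7.0, 3.0, 5.0])

# moving average and moving difference, kept only at t = 1, 3, 5, 7
t = numpy.arange(1, len(x), 2)
y_l = 0.5 * (x[t] + x[t - 1])
y_h = 0.5 * (x[t] - x[t - 1])

# sum recovers x(t); difference recovers x(t-1)
rec = numpy.empty_like(x)
rec[t] = y_l + y_h
rec[t - 1] = y_l - y_h
assert numpy.allclose(rec, x)
```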
Building the Wavelet pyramid
So now we have a complementary low-pass filter and a high-pass filter. We have a plan to run a recursive algorithm: split a signal (blue) into a high-pass output (yellow) and a low-pass output (red):
downsample each output so it only contains every other point, which is all we need to fully reproduce the signal:
and run the algorithm again on the low-pass output,
until we are left with the ultimate low-pass filter, the full average.
For 8 input signal points, we have output 8 “wavelet” points. Each level in the y-direction indicates an individual frequency band, and each point in the x-direction is a measurement of the volume of that frequency band. Notice that there are fewer measurements the lower the frequency goes; those bands have less “time resolution”. Signals that vary more slowly in time need less time resolution.
I hope these diagrams are useful for you to understand the concept; now I’m going to put them into matrix form. I’m skipping the pre-downsampling step, in the interest of space. I’m going to use the following notation: \(y_h\) is the high-pass output after downsampling, \(y_l\) is the low-pass output after downsampling, \(y_{lh}\) is the output after the low-pass filter, downsampling, a high-pass filter on that, and downsampling again, and so on for \(y_{ll}\), \(y_{llh}\), \(y_{lll}\). Assuming that \(x\) is 8 samples long, the matrix equation is going to look something like this for the first step:
\[
\begin{bmatrix}
y_h(1) \\ y_h(3) \\ y_h(5) \\ y_h(7) \\ y_l(1) \\y_l(3) \\ y_l(5) \\ y_l(7)
\end{bmatrix}
=
\frac{1}{2}
\begin{bmatrix}
-1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7)
\end{bmatrix}
\]
Now we have 4 high-pass points and 4 low-pass points. What will this matrix look like after doing a recursive pass on the low-pass output? The problem looks the same, just using \(y_l\) as the starting point and half as big:
\[ \begin{bmatrix}
y_{lh}(3) \\ y_{lh}(7) \\y_{ll}(3) \\ y_{ll}(7)
\end{bmatrix} =
\frac{1}{2}
\begin{bmatrix}
-1 & 1 & 0 & 0 \\
0 & 0 & -1 & 1 \\
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
y_l(1) \\ y_l(3) \\y_l(5) \\ y_l(7)
\end{bmatrix}
\]
We can plug in the definition of \(y_l\) to get \(y_{lh}\) and \(y_{ll}\) in terms of \(x\).
\[ \begin{align} \begin{bmatrix}
y_{lh}(3) \\ y_{lh}(7) \\y_{ll}(3) \\ y_{ll}(7)
\end{bmatrix} &=
\frac{1}{4}
\begin{bmatrix}
-1 & 1 & 0 & 0 \\
0 & 0 & -1 & 1 \\
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7)
\end{bmatrix} \\
&=
\frac{1}{4}
\begin{bmatrix}
-1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7)
\end{bmatrix} \\
\end{align}
\]
You can see that \(y_{lh}(3)\) is \(\frac{1}{4}(x(3) + x(2) - x(1) - x(0))\), a 4-point moving difference. It’s like a stretched-out version of the original high-pass filter. Likewise \(y_{ll}\) is a 4-point moving average, a stretched-out version of the original low-pass filter.
Let me recursively apply the algorithm one last time:
\[ \begin{align}
\begin{bmatrix}
y_{llh}(7) \\ y_{lll}(7)
\end{bmatrix} &=
\frac{1}{2}
\begin{bmatrix}
-1 & 1 \\
1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
y_{ll}(3) \\ y_{ll}(7)
\end{bmatrix} \\
\begin{bmatrix}
y_{llh}(7) \\ y_{lll}(7)
\end{bmatrix} &=
\frac{1}{8}
\begin{bmatrix}
-1 & 1 \\
1 & 1 \\
\end{bmatrix}
\begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7)
\end{bmatrix} \\
&=
\frac{1}{8}
\begin{bmatrix}
-1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7)
\end{bmatrix} \\
\end{align}
\]
The third pass of the algorithm gives us an 8-point moving difference and average.
The recursion stops here because you can no longer apply any filters to one point of data. If we put all the high-pass results together in one matrix, it looks something like:
\[
\begin{bmatrix}
y_h(1) \\ y_h(3) \\ y_h(5) \\ y_h(7) \\ y_{lh}(3) \\y_{lh}(7) \\ y_{llh}(7) \\ y_{lll}(7)
\end{bmatrix}
=
\begin{bmatrix}
-h & h & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -h & h & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -h & h & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -h & h \\
-h^2 & -h^2 & h^2 & h^2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -h^2 & -h^2 & h^2 & h^2 \\
-h^3 & -h^3 & -h^3 & -h^3 & h^3 & h^3 & h^3 & h^3 \\
h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 \\
\end{bmatrix}
\begin{bmatrix}
x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7)
\end{bmatrix}
\]
where \(h = \frac{1}{2}\). The matrix on the right is the normalized Haar matrix, corresponding to the Haar wavelet transform. The Haar wavelet transform was discovered in 1909 by Alfred Haar, just as an example of an “orthogonal system” one could project a function onto. It wasn’t considered a “wavelet” until around 75 years later, but it is one. At each level of filtering, we have filtered into some band of frequency which you can think of as a “tone”. We could label \(y_h = C\), \(y_{lh} = B\), and \(y_{llh}= A\) (and \(y_{lll}\) would just be some baseline background voltage). We also have some notion of “when” they happen in time, more so for higher-frequency notes than for lower-frequency notes. To complete the music analogy, our transform looks like this:
\[
\begin{bmatrix}
\mbox{Cnote}(t=1) \\ \mbox{Cnote}(t=3) \\ \mbox{Cnote}(t=5) \\ \mbox{Cnote}(t=7) \\ \mbox{Bnote}(t=3) \\\mbox{Bnote}(t=7) \\ \mbox{Anote}(t=7) \\ \mbox{Background}
\end{bmatrix}
=
\begin{bmatrix}
-h & h & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -h & h & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -h & h & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -h & h \\
-h^2 & -h^2 & h^2 & h^2 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -h^2 & -h^2 & h^2 & h^2 \\
-h^3 & -h^3 & -h^3 & -h^3 & h^3 & h^3 & h^3 & h^3 \\
h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 \\
\end{bmatrix}
\begin{bmatrix}
\mbox{Voltage(0)} \\ \mbox{Voltage(1)} \\ \mbox{Voltage(2)} \\ \mbox{Voltage(3)} \\ \mbox{Voltage(4)} \\ \mbox{Voltage(5)} \\ \mbox{Voltage(6)} \\ \mbox{Voltage(7)}
\end{bmatrix}
\]
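The whole pyramid can be sketched in a few lines (assuming the conventions used here: \(h = 1/2\), with differences taken as the later sample minus the earlier one; the function name is mine):

```python
import numpy

def haar_pyramid(x):
    """Return wavelet coefficients [y_h..., y_lh..., y_llh, full average]."""
    x = numpy.asarray(x, dtype=float)
    out = []
    while len(x) > 1:
        even, odd = x[0::2], x[1::2]
        out.append(0.5 * (odd - even))   # high-pass: moving difference
        x = 0.5 * (odd + even)           # low-pass: moving average; recurse on it
    out.append(x)                        # the ultimate low-pass: the full average
    return numpy.concatenate(out)

# a constant signal has no detail: every high-pass band is zero,
# and the last coefficient is the overall mean
assert numpy.allclose(haar_pyramid([1.0] * 8), [0, 0, 0, 0, 0, 0, 0, 1.0])
```

For an impulse input, the output reads off one column of the Haar matrix above.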
What is a wavelet?
All this talk about wavelet transforms, and I haven’t talked about wavelets very much. A wavelet is a class of functions that become the new basis of a coordinate transform, like the complex exponential in Fourier series. Each of the rows of the Haar matrix is a basis vector. So, in our case, a wavelet looks like the function:
\[ f(x) = \begin{cases} 1 & 0 \le x < 1/2 \\
-1 & 1/2 \le x < 1 \\
0 & \mbox{otherwise} \end{cases},\]
at different scales and shifts. When an incoming signal is equal to a wavelet function, you should see the wavelet transform place it into the appropriate frequency bucket.
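A sketch of that function and its scaled, shifted copies (the function names are mine, and the scaling convention is one common choice):

```python
import numpy

def haar_wavelet(x):
    # f(x): +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere
    return numpy.where((x >= 0.0) & (x < 0.5), 1.0,
           numpy.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def scaled_shifted(x, j, k):
    # copy at scale 2**-j, shifted by k: f(2**j * x - k)
    return haar_wavelet(2.0**j * x - k)

assert haar_wavelet(0.25) == 1.0
assert haar_wavelet(0.75) == -1.0
```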
The following is an animation of the wavelet transform in action, sorting output into different wavelet frequencies. The input is 256 measurements rolling in from the left into a buffer of 256 entries. This output is split into \(\log_2(N) - 1 = 7\) wavelet scales using the filters we described above.
You might notice that some of the lower-frequency filters update less often. That’s because the wavelet transform is not “shift-invariant”: if you shift the input signal forward by one unit of time, you don’t get the output signal shifted forward by one unit of time; you actually get a different answer. Think of the signal where \(x(t=1) = 1\), but 0 everywhere else. When I apply the moving-difference algorithm to this, \(y_h(1) = x(1) - x(0) = 1 - 0 = 1\). But let me shift the signal forward one, so that \(x(t=0) = 1\) now. Now \(y_h(1) = x(1) - x(0) = 0 - 1 = -1\). \(y_h(1)\) has flipped sign. You might say “hey, but now \(y_h(0) = x(0) - x(-1) = 1 - 0 = 1\)”, but we don’t have \(y_h(0)\) because we downsampled. The only way to have shift invariance is to shift the signal by one window length of the filter, in this case 2. This is actually a huge problem and I really don’t like it. In any practical application, you aren’t going to be able to align your filter windows perfectly with the incoming signal, so you are going to pick up wavelet frequencies that are misleading. You can see examples of this phenomenon by unchecking the box “Align Sent Wavelet Window”: wavelets get aliased at other frequencies. If anyone knows how to make a wavelet transform time-invariant or shift-invariant, I am super interested in that!
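The impulse example above can be sketched directly (using the \(1/2\)-scaled moving difference, so the coefficients come out as \(\pm 1/2\) rather than \(\pm 1\)):

```python
import numpy

def haar_level1(x):
    # first-level downsampled moving difference: y_h(1), y_h(3), ...
    x = numpy.asarray(x, dtype=float)
    return 0.5 * (x[1::2] - x[0::2])

impulse_at_1 = numpy.array([0.0, 1.0, 0.0, 0.0])
impulse_at_0 = numpy.array([1.0, 0.0, 0.0, 0.0])  # the same impulse, shifted by one

# shifting the input does not shift the output: the coefficient flips sign instead
assert numpy.allclose(haar_level1(impulse_at_1), [0.5, 0.0])
assert numpy.allclose(haar_level1(impulse_at_0), [-0.5, 0.0])
```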
History and Derivation of the Fast Fourier Transform
http://michaeljflynn.net/2017/03/20/historyandderivationofthefastfouriertransform/
Mon, 20 Mar 2017 06:01:20 +0000
If you enjoy this post, subscribe using the form to the left! I try to make new posts every weekend, though sometimes life gets in the way (this one took 6 weekends).
This is the second post in my series on great algorithms. My last post was on the Metropolis-Hastings algorithm.
History and Motivation
The Fourier transform is a fundamental operation in applied math. The original Fourier transform is attributed to Joseph Fourier, a French mathematician and physicist, for solving partial differential equations involved in the transfer of heat [1]. Since then, the revelation that any function can be approximated by a series of sines and cosines has exploded far beyond the original application. The technique is used in applications from digital signal processing to medical imaging to detecting nuclear tests. Until the FFT, all applications used an algorithm that summed \(N\) terms for each of \(N\) output terms, and therefore had an asymptotic runtime of \(O(N^2)\). The \(O(N\log N)\) algorithm was first presented in a complete, packaged form in the 1965 paper by John Tukey and James Cooley: “An algorithm for the machine calculation of complex Fourier series” [2]. There is evidence that the FFT had been discovered before: Gauss used the algorithm in the early 1800s to interpolate the trajectory of the asteroid Pallas but never published his results (asymptotic runtime matters when you’re doing all the computation by hand). The Ex Libris blog gives an interesting analysis of Gauss’s approach here. Others claim to have discovered the fundamental principle behind the FFT first, but no one got it into a form that made users realize they could perform their \(O(N^2)\) computations in \(O(N\log N)\) until Tukey and Cooley, and for that they deserve credit.
The motivation behind the development of the FFT was not academic. This was the case for many algorithmic improvements of the 50’s and 60’s. In a time when business-scale mainframe computers like the IBM System/360 only had processing speeds of 35 kilohertz and memory sizes of 512 kilobytes, optimizing algorithms for memory and time performance had material economic benefits for companies. For the FFT there were national security reasons as well. James Cooley gives an interesting account of his development of the algorithm in a 1988 issue of Mikrochimica Acta: “The Re-Discovery of the Fast Fourier Transform Algorithm” [3]. The way he tells it, the (hydrogen-bomb) physicist Richard Garwin had a huge part in putting the effort together, for Cold-War purposes:
I was working on a research project of my own when Richard Garwin came to the computing center of the laboratory with some notes he made while with John Tukey at a meeting of President Kennedy’s Scientific Advisory Committee, of which they were both members. John Tukey showed that if \(N\) is composite, \(N = ab\), then the Fourier series can be expressed as an \(a\)-term series of subseries of \(b\) terms each. If one were computing all values of the series, this would reduce the number of operations from \(N^2\) to \(N(a+b)\). Tukey also said that if this were iterated, the number of operations would be proportional to \(N\log (N)\) instead of \(N^2\). Garwin knew that this was a very important calculation and he wanted to have this idea developed and applied.
Garwin described his problem of determining the periodicities of the spin orientations in a 3D crystal of He\(^3\). Later, I found out that he was far more interested in improving the ability to do remote seismic monitoring of nuclear explosions since the Russians would not agree to inspections within their borders thereby hindering efforts at obtaining a nuclear test ban treaty. He also saw a need for the capability of long range acoustical detection of submarines. Like many others, I did not see the significance in this improvement and gave the job a little less priority than my own research. However, I was told of Garwin’s reputation and, prodded by his occasional telephone calls (some of them to my manager), I produced a 3-dimensional FFT program. I put some effort into designing the algorithm so as to save storage and addressing by overwriting data and I spent some time working out a 3-dimensional indexing scheme that was combined with the indexing within the algorithm.
Garwin publicized the program at first through his many personal contacts, producing a small but increasing stream of requests for copies of it.
This is not to say that Garwin came up with the idea or wrote the code or paper, but his presence here is indisputable. I am impressed that Garwin was able to pose a dummy physics problem for Cooley to solve that he could then use to detect nuclear tests and track nuclear submarines – but that is the immense power of the FFT.
Cooley’s article has lots of historical tidbits in it, including notes about the computational limits of the day. One detail is the “record” Fourier transform on a dataset with 2048 samples:
Another member of our department, Lee Alsop, who was a geophysicist and adjunct professor at the Lamont Geophysical Laboratory of Columbia University decided to try the new algorithm on a record of 2048 samples of a strain seismograph of the Rat Island earthquake.
Another is an example of computation that was still infeasible, even with the algorithmic speedups. When approached by a colleague with spectrum data to analyze he recounts:
One extraordinary thing about this was that a single record of data was about 512 000 points and all values of the spectrum were needed. This was beyond the capacity of the high speed memory of existing machines.
He also mentions that the FFT does give a theoretical speedup by a factor of 12,800 in this case. Finally, the collaboration with Tukey was very limited (as was the editing):
Thus, the paper made only one round trip between me and Tukey and our only collaboration was in a few telephone conversations.
Perhaps that explains why it is not very easy to read…
Since then, there have been too many applications of the FFT to count. The algorithm goes so far as to be one of the fundamental operations in gatebased quantum computing. Therefore, it’s probably worth a look.
Introduction
The idea is to take advantage of symmetries in the complex exponential to “factor” the problem into several smaller problems recursively, yielding standard “divide and conquer” speedups. While the algorithm has been generalized to work for any composite number of data points, the presentation is easiest when the number of points is a power of 2, so that is a fundamental assumption I will make throughout this post.
I’m going to assume that readers are familiar with the Fourier series and its generalization, the Fourier transform, at a superficial level. To review, you can approximate any function over an interval of length \(L\) with sines and cosines with period equal to \(L\):
\[ f(x) = \frac{1}{2} a_0 + \sum_{n = 1}^{\infty} a_n \sin \left (\frac{2\pi}{L}nx \right ) + \sum_{n=1}^{\infty} b_n \cos \left (\frac{2\pi}{L}nx \right ).\]
Using the identities \(e^{i\theta} = \cos\theta + i\sin\theta\), \(\sin\theta = \frac{i}{2}(e^{-i\theta} - e^{i\theta})\), and \(\cos\theta = \frac{1}{2}(e^{i\theta} + e^{-i\theta})\), these terms can be represented as complex exponentials:
\[ \begin{align}
f(x) =& \frac{1}{2} a_0 + \sum_{n = 1}^{\infty} a_n \sin \left (\frac{2\pi}{L}nx \right ) + \sum_{n=1}^{\infty} b_n \cos \left (\frac{2\pi}{L}nx \right ) \\
=& \frac{1}{2} a_0 + \frac{1}{2}\sum_{n = 1}^{\infty} ia_n (e^{-2\pi i n x / L} - e^{2\pi i n x / L}) + \frac{1}{2}\sum_{n=1}^{\infty} b_n (e^{2\pi i n x / L} + e^{-2\pi i n x / L}) \\
=& \frac{1}{2} a_0 e^{2\pi i 0 x / L} + \frac{1}{2}\sum_{n = 1}^{\infty} (b_n + ia_n) e^{-2\pi i n x / L} + \frac{1}{2}\sum_{n=1}^{\infty} (b_n - i a_n) e^{2\pi i n x / L}
\end{align} \]
Notice that if we define \(c_0 = \frac{1}{2} a_0\) and \(c_n = \frac{1}{2} (b_{|n|} - \mbox{sign}(n) i a_{|n|})\) for \(n \neq 0\), we can simplify the above to the final form:
\[ f(x) = \sum_{n = -\infty}^{\infty} c_n e^{2\pi i n x / L}.\]
The coefficients can be found by taking advantage of an identity of the complex exponential: it integrates to zero over any whole number of full periods, i.e. \(\int_0^L e^{2\pi i (m-n) x / L} dx = L\) if \(m=n\) and \(0\) if \(m \neq n\). This can be used to isolate \(c_m\):
\[ \int_0^L f(x) e^{-2 \pi i m x / L} dx = \sum_{n = -\infty}^{\infty} c_n \int_0^L e^{2 \pi i n x / L} e^{-2 \pi i m x / L} dx = \sum_{n = -\infty}^{\infty} c_n \int_0^L e^{2 \pi i (n - m) x / L} dx = Lc_m\]
Therefore \[c_m = \frac{1}{L} \int_0^L f(x) e^{-2\pi imx/L} dx.\]
This is the basis of the complex Fourier series. The output of the Fourier transform is the set of \(c_m\)‘s.
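Before moving on, the coefficient formula is easy to sanity-check numerically. Here is a quick sketch in R that approximates the integral with a Riemann sum for \(f(x) = \sin(2\pi x / L)\), which should give \(c_{\pm 1} = \mp i/2\) and zero for every other coefficient:

```r
## Riemann-sum check of c_m = (1/L) * integral of f(x) e^(-2*pi*i*m*x/L) dx
L <- 1
N <- 1000
x <- (0:(N - 1)) * L / N                 # sample grid on [0, L)
f <- sin(2 * pi * x / L)
c.m <- function(m) sum(f * exp(complex(imaginary = -2 * pi * m * x / L))) / N
c.m(1)    # approximately -0.5i
c.m(-1)   # approximately  0.5i
c.m(3)    # approximately  0
```

The sum is exact up to floating-point error here because the grid covers whole periods of every exponential involved.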
The Discrete Fourier Series
In practice, we cannot compute each of the infinite \(c_m\), and it turns out we do not have enough information to do so. In almost all applications, we do not have \(f(x)\); rather, we have \(N\) evenly-spaced samples from \(f(x)\) stored in a computer. Unfortunately, this limits how far out in frequency space we can get.
For example, consider what happens when analyzing the coefficient of the frequency \(N\). The function goes through one full period before it reaches the next sample point, so you are effectively applying a constant displacement to each point, the same effect as the frequency \(n=0\). In general, each of the points will be affected in the same way for frequency \(n\) as for the frequency \(n + N\), and we get no new information from analyzing beyond \(n = N-1\).
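This aliasing is easy to see numerically: with \(N = 8\) samples, frequencies \(n = 1\) and \(n + N = 9\) are indistinguishable at the sample points.

```r
## Frequencies n and n + N agree at every integer sample point
x <- 0:7
max(abs(sin(2 * pi * 1 * x / 8) - sin(2 * pi * 9 * x / 8)))  # ~0
```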
Therefore, in our output we are only looking for frequencies \(0\) through \(N-1\). If we have \(N\) points, we are looking for \(N\) frequencies, each computed via the discrete integral:
\[ c(n) = \frac{1}{N} \sum_{x = 0}^{N-1} f(x) e^{-2\pi i n x /N}\]
This is effectively computing the integral with a rectangular approximation \(\frac{1}{L} \int_0^L f(x) e^{-2\pi i nx/L} dx \approx \frac{1}{L} \frac{L}{N} \sum_{x'=0}^{N-1} f(x') e^{-2\pi i n x' / N},\) where \(x = \frac{L}{N} x'\).
From here, I’m going to drop the normalizing term \(\frac{1}{N}\) and assume it will be applied at the end. Also, it’s much easier from this point to look at the Fourier transform using a picture. We can imagine the discrete Fourier transform as a circuit taking in the \(N\) samples as input and giving the \(N\) Fourier coefficients as output. This is depicted below, with the inputs \(f(x)\) coming in from the left, and the outputs \(c(n)\) coming out to the right. Notice that each of the \(N\) terms of the output needs to compute a sum of \(N\) terms, and therefore the runtime of this circuit is \(O(N^2)\).
\[
\begin{array}{llcccllll}
\hline
f(0) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) & = & c(0)\\
f(1) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i x /8} & =& c(1) \\
f(2) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 2 x /8} & =& c(2)\\
f(3) & \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 3 x /8} & =& c(3) \\
f(4) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 4 x /8} & =& c(4)\\
f(5) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 5 x /8} & =& c(5)\\
f(6) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 6 x /8} & =& c(6)\\
f(7) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 7 x /8} & =& c(7)\\ \hline
\end{array}
\]
However, there seems to be redundancy here. Each of these terms is a sum of the \(f(x)\) terms, each with a complex exponential slapped on, but the complex exponential is periodic. Take \(c(0)\) and \(c(4)\) for example.
\[ c(0) = \sum_{x = 0}^7 f(x). \]
\[ c(4) = \sum_{x = 0}^7 f(x) e^{-2\pi i 4 x /8} = \sum_{x = 0}^7 f(x) e^{-\pi i x} \]
They are the same sum, except each term of \(c(4)\) has an extra factor of \(e^{-\pi i x}\) attached. This factor is equal to \(1\) for even values of \(x\) and \(-1\) for odd values of \(x\). This behavior is independent of \(n\): if we pair up any Fourier coefficient with the coefficient that is \(N/2\) indices up from it (for example 1 and 5, 2 and 6, and 3 and 7), we will find the same relationship, since
\[ e^{-2\pi i (n + N/2) x / N} = e^{-2\pi i n x / N} e^{-\pi i x} .\]
This suggests a strategy: there are really only \(N\) terms to work out between \(c(n)\) and \(c(n + N/2)\) (instead of \(2N\)); we just need to remember to attach a \(-1\) to the odd terms for \(c(n+N/2)\). This leads us to the next symmetry we can exploit. We can split each of these sums into the form “full sum” = “even terms” + “odd terms”. I will do this out for the \(N=8\) case, but you will see how it easily generalizes.
For \(c(n)\) this looks like:
\[\begin{align} c(n) = \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i n x /8} &= \overbrace{\sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n (2x) /8}}^{\mbox{even terms}} + \overbrace{\sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n (2x+1) /8}}^{\mbox{odd terms}} \\
&= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} + \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4} e^{-\pi i n /4}.
\end{align}\]
For \(c(n + N/2)\) this looks like:
\[\begin{align} c(n + N/2) = \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i n x /8}e^{-\pi i x} &= \overbrace{\sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n (2x) /8}}^{\mbox{even terms}} - \overbrace{\sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n (2x+1) /8}}^{\mbox{odd terms}} \\
&= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} - \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4} e^{-\pi i n /4}.
\end{align}\]
The two terms in the resulting equation have a remarkable symmetry with the original problem: each is the Fourier transform of the function using only \(N/2\) samples, one using the even-indexed samples, the other using the odd-indexed samples. Define two new variables for each \(n\) accordingly:
\[ \begin{align}
c_e(n) &= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} \\
c_o(n) &= \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4},
\end{align}\]
we immediately see that
\[ \begin{align} c(n)& = c_e(n) + c_o(n) e^{-\pi i n/4} \quad \mbox{ and} \\
c(n + N/2)& = c_e(n) - c_o(n) e^{-\pi i n/4}.
\end{align}\]
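These relations can be checked directly against R's built-in fft, which uses the same \(e^{-2\pi i n x / N}\) sign convention. A quick sketch for a random length-8 signal:

```r
## Split a signal into even- and odd-indexed halves, transform each half,
## and recombine with the twiddle factors e^(-pi*i*n/4).
set.seed(1)
f <- rnorm(8)
c.full <- fft(f)
c.e <- fft(f[c(1, 3, 5, 7)])  # even-indexed samples f(0), f(2), f(4), f(6)
c.o <- fft(f[c(2, 4, 6, 8)])  # odd-indexed samples  f(1), f(3), f(5), f(7)
n <- 0:3
twiddle <- exp(complex(imaginary = -pi * n / 4))
max(Mod(c.full[1:4] - (c.e + c.o * twiddle)))  # ~0
max(Mod(c.full[5:8] - (c.e - c.o * twiddle)))  # ~0
```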
This relationship underlies the “butterfly diagram” that the FFT algorithm is known for. Updating our circuit with \(f_e(x) = f(2x)\) and \(f_o(x) = f(2x+1)\):
\[\begin{array}{llllccclllll}
\hline
f(0) & = & f_e(0) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) & = & c_e(0) & c(0) = c_e(0) + c_o(0)\\
f(2) & = & f_e(1) & \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i x /4} & =& c_e(1) & c(1) = c_e(1) + c_o(1) e^{-\pi i / 4}\\
f(4) & = & f_e(2) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i 2 x /4} & =& c_e(2) & c(2) = c_e(2) + c_o(2) e^{-2 \pi i / 4}\\
f(6) & = & f_e(3) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i 3 x /4} & =& c_e(3) & c(3) = c_e(3) + c_o(3) e^{-3 \pi i /4} \\ \hline
f(1) & = & f_o(0) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) & =& c_o(0) & c(4) = c_e(0) - c_o(0)\\
f(3) & = & f_o(1)& \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i x /4} & =& c_o(1) & c(5) = c_e(1) - c_o(1)e^{-\pi i / 4} \\
f(5) & = & f_o(2) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i 2 x /4} & =& c_o(2) & c(6) = c_e(2) - c_o(2) e^{-2\pi i /4} \\
f(7) & = & f_o(3) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i 3 x /4} & =& c_o(3) & c(7) = c_e(3) - c_o(3) e^{-3\pi i /4} \\ \hline
\end{array}\]
We can do this process recursively until we hit the trivial case of a Fourier transform with one sample. The number of times we recursively call the algorithm is equal to \(\log N\), because the number of samples in each subproblem is divided by two at every level. At each level of the division, we do \(N\) constant-time recombinations of the terms from the previous level. Therefore the asymptotic performance of this algorithm should be \(O(N \log N)\), much faster than \(O(N^2)\).
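The recursion can be transcribed almost literally into a short function (a sketch; the iterative version developed below avoids the recursion overhead and the repeated copying):

```r
## Radix-2 decimation-in-time FFT, written recursively. Assumes length(f)
## is a power of 2, like everything else in this post.
recursive.fft <- function(f) {
  N <- length(f)
  if (N == 1) return(as.complex(f))
  c.e <- recursive.fft(f[seq(1, N, by = 2)])  # even-indexed samples f(0), f(2), ...
  c.o <- recursive.fft(f[seq(2, N, by = 2)])  # odd-indexed samples  f(1), f(3), ...
  twiddle <- exp(complex(imaginary = -2 * pi * (0:(N/2 - 1)) / N))
  c(c.e + twiddle * c.o, c.e - twiddle * c.o) # the butterfly recombination
}
max(Mod(recursive.fft(1:8) - fft(1:8)))  # ~0
```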
Let’s code this up and compare runtimes. Again we are going to assume that our input is always going to be a power of two in length. At level \(l\) back, there are going to be \(2^l\) Fourier transforms to compute, each with \(N/2^l\) samples. Indexing the output is a lot of fun, in the Dwarf Fortress sense, as you can see by the following table:
\[\begin{array}{lll}
\hline
\mbox{Level} & x & \mbox{Indices} \\ \hline
0 & 0:7 & \{x\} \\
1 & 0:3 & \{2x\}, \{2x+1\} \\
2 & 0:1 & \{2(2x)\}, \{2(2x) + 1\}, \{2(2x + 1)\}, \{2(2x+1) + 1\} \\ \hline
\end{array}\]
In general, the set of indices at any level of the recursion is going to be \(\{2^l x + n\}\). Each separate Fourier transform has a different value of \(n\), and within it \(x\) ranges from \(0\) to \(N/2^l - 1\). I suggest working this out yourself to truly grok it.
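For \(N = 8\) these index sets can be enumerated directly; each level-\(l\) sub-transform takes every \(2^l\)-th sample, starting at offset \(n\):

```r
## Enumerate the index sets {2^l * x + n} from the table above
N <- 8
for (l in 0:2) {
  for (n in 0:(2^l - 1)) {
    x <- 0:(N / 2^l - 1)
    cat("level", l, "transform", n, ": indices", 2^l * x + n, "\n")
  }
}
```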
Implementation
fast.fourier.transform <- function(f) {
  N <- length(f)
  levels <- log(N, 2)
  prev.level <- f
  level <- complex(real = numeric(length(f)))
  for(l in (levels - 1):0) {
    for(n in 0:(2^l - 1)) {
      half.num.indices <- 2^(levels - l - 1)
      for(x in 0:(half.num.indices - 1)) {
        ## One catch here to be aware of: in the circuit diagram above, the two
        ## separate Fourier transforms have been placed as if their indices are
        ## together, for ease of reading. In the implementation, these indices
        ## are going to be interleaved, i.e. $c_e(0)$ at $i=0$, $c_o(0)$ at $i=1$
        ## and so on. Therefore we vary x up to half of the indices, with evens=2x
        ## and odds=2x+1.
        level[2^l * x + n + 1] = prev.level[2^l * (2*x) + n + 1] + prev.level[2^l * (2*x+1) + n + 1] * exp(complex(imaginary = -pi*x/half.num.indices))
        level[2^l * (x + half.num.indices) + n + 1] = prev.level[2^l * (2*x) + n + 1] - prev.level[2^l * (2*x + 1) + n + 1] * exp(complex(imaginary = -pi*x/half.num.indices))
      }
    }
    prev.level <- level
  }
  level
}
slow.fourier.transform <- function(f){
  N <- length(f)
  result <- numeric(N)
  for(n in 0:(N - 1)) {
    for(x in 0:(N - 1)) {
      result[n + 1] = result[n + 1] + f[x + 1] * exp(complex(imaginary = -2*pi * x * n / N))
    }
  }
  result
}
Testing
Let’s test these two functions. Do they give the correct result? Is the fast Fourier transform really faster? For the first question, I’m going to test the two new functions against very simple Fourier series compositions, and also against R’s fft function. For the second question, I will test the runtimes with several different length transforms and then compute the relationship between log time and log length. The slow Fourier transform should have a slope of 2; the fast Fourier transform should have a slope that is only slightly higher than 1.
What are 4 simple series to test? How about, for \(x = 0, \dots, 15\):
- Function 1: \(f(x) = \sin(\frac{2\pi}{16} x)\)
- Function 2: \(f(x) = \sin(\frac{2\pi}{16} x) - \cos(\frac{2\pi}{16}3 x) + \sin(\frac{2\pi}{16} 6 x)\)
- Function 3: \(f(x) = \sum_{n = 0}^{15} n \sin(\frac{2 \pi }{16}n x)\)
- Function 4: \(f(x) = \sum_{n = 0}^{15} \sin(\frac{2\pi}{16} n )\sin(\frac{2 \pi}{16}n x)\)
library(ggplot2)
library(dplyr)
library(tidyr)
plot_theme <- theme(panel.background = element_blank(),
                    panel.grid.major = element_line(color = "grey90", size = 0.2),
                    panel.grid.minor = element_line(color = "grey95", size = 0.5),
                    strip.text.x = element_text(color = "white"),
                    strip.background = element_rect(fill = "black"),
                    axis.text = element_text(color = "black"),
                    axis.title = element_text(face = "bold"),
                    title = element_text(face = "bold"))
## Here are the functions we want to test; they should have very clear Fourier transforms
function.1 <- function(x) sin(2*pi*x/16)
function.2 <- function(x) sin(2*pi*x/16) - cos(2*pi*3*x/16) + sin(2*pi*6 * x/16)
function.3 <- function(x) sapply(x, function(s) sum(sapply(0:15, function(n) n * sin(2*pi*n*s/16))))
function.4 <- function(x) sapply(x, function(s) sum(sapply(0:15, function(n) sin(2*pi*n/16) * sin(2*pi*n*s/16))))
tests <- list("Function 1" = function.1, "Function 2" = function.2, "Function 3" = function.3, "Function 4" = function.4)
## for each of these functions, apply the function and each transform
test.data <- bind_rows(lapply(1:length(tests), function(i) {
  fun <- tests[[i]]
  name <- names(tests)[[i]]
  x <- 0:15
  bind_rows(
    data.frame(x = x, y = fun(x), func = name, transform = "None"),
    data.frame(x = x, y = abs(fft(fun(x))), func = name, transform = "fft"),
    data.frame(x = x, y = abs(slow.fourier.transform(fun(x))), func = name, transform = "slow.fourier.transform"),
    data.frame(x = x, y = abs(fast.fourier.transform(fun(x))), func = name, transform = "fast.fourier.transform"))
}))
test.data$transform <- factor(test.data$transform, levels = c("None", "slow.fourier.transform", "fast.fourier.transform", "fft"))
ggplot(data = test.data, aes(x=x, y = y)) + geom_point() + facet_grid(func~transform, scale = "free") + theme_bw()
As you can see via the above plot, the code that we’ve written agrees with R’s built-in fft function, which is good. Also good is that it gives the output that we expect.
You might be wondering why the first row of plots contains two nonzero dots each: wasn’t there just one frequency in function 1? Remember that to construct this function we used \(\sin(x) = (e^{ix} - e^{-ix})/2i\); the other point that shows up is the “negative” frequency, which appears on the other side due to the periodic symmetry of the complex exponential.
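To see the two spikes concretely (remembering that R's indices are 1-based, so index \(k\) corresponds to frequency \(n = k - 1\)):

```r
## The spectrum of sin(2*pi*x/16): spikes at n = 1 and the "negative"
## frequency n = N - 1 = 15, i.e. at R indices 2 and 16.
x <- 0:15
spectrum <- Mod(fft(sin(2 * pi * x / 16)))
which(spectrum > 1e-8)  # indices 2 and 16
```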
Timing
Now let’s time the functions against several different length inputs and see how they behave as the length of the sequences get large.
sequence.lengths <- sapply(0:13, function(x) 2^x)
timing.test <- data.frame(
  length = sequence.lengths,
  slow = sapply(sequence.lengths, function(l) { dat <- 0:(l - 1); system.time({slow.fourier.transform(dat)})[["elapsed"]]}),
  fast = sapply(sequence.lengths, function(l) { dat <- 0:(l - 1); system.time({fast.fourier.transform(dat)})[["elapsed"]]}),
  fft = sapply(sequence.lengths, function(l) { dat <- 0:(l - 1); system.time({fft(dat)})[["elapsed"]]}))
timing.test <- gather(timing.test, func, time, slow, fast, fft)
Normal plot:
ggplot(data = timing.test, aes(x = length, y = time, color = func)) + geom_point() + theme_bw()
Log-log plot:
ggplot(data = timing.test, aes(x = log(length), y= log(time), color = func)) + geom_point() + theme_bw() + geom_smooth(method = "lm") + coord_fixed(ratio = 1)
## Warning: Removed 22 rows containing nonfinite values (stat_smooth).
Linear models on the log-log data:
slow <- subset(timing.test, func %in% "slow" & time != 0)
fast <- subset(timing.test, func %in% "fast" & time != 0)
fft <- subset(timing.test, func %in% "fft" & time != 0)
lm.slow <- lm(log(time) ~ log(length), data = slow)
lm.slow
##
## Call:
## lm(formula = log(time) ~ log(length), data = slow)
##
## Coefficients:
## (Intercept) log(length)
## -11.889 1.974
lm.fast <- lm(log(time) ~ log(length), data = fast)
lm.fast
##
## Call:
## lm(formula = log(time) ~ log(length), data = fast)
##
## Coefficients:
## (Intercept) log(length)
## -10.334 1.132
lm.fft <- lm(log(time) ~ log(length), data = fft)
lm.fft
##
## Call:
## lm(formula = log(time) ~ log(length), data = fft)
##
## Coefficients:
## (Intercept) log(length)
## -9.855 1.078
Via the coefficient on log(length) we see that the slow version has a slope approximately equal to 2, and the fast version has a slope slightly greater than one. I’m especially surprised that my version in plain R seems to perform nearly as fast as R’s native fft function, which I had assumed was programmed in optimized C. Thus we have derived a way in which the discrete Fourier transform can be computed in \(O(N\log N)\) time.
Further Reading
Gilbert Strang gives a really good lecture on the FFT using matrix factorization, which I didn’t cover here. (Unfortunately, it lost me around 37:25.) link
Citations
[1] Fourier, Joseph (1822). Théorie analytique de la chaleur (in French). Paris: Firmin Didot Père et Fils. OCLC 2688081. link
[2] Cooley, James W.; Tukey, John W. (1965). “An algorithm for the machine calculation of complex Fourier series”. Math. Comput. 19: 297-301. doi:10.2307/2003354. link
[3] Cooley, J. W. (1987). The re-discovery of the fast Fourier transform algorithm. Microchimica Acta, 93(1-6), 33-45. link
A Tutorial on Principal Component Analysis
http://michaeljflynn.net/2017/02/06/atutorialonprincipalcomponentanalysis/
Mon, 06 Feb 2017 05:37:20 +0000
Introduction
Machine learning is a buzzword these days. Just dropping it can impress people without requiring that you elaborate, perhaps because it feeds their fantasies of a robot takeover. Investors are particularly vulnerable because they get to act on these fantasies. The following exchange approximates one that actually happened in real life:
“We’ve got a couple cutting edge projects coming down the pipeline: cloud computing and machine learning!”
“Wow! I’m terribly excited about all this! Hail our new overlords! Take my money!”
Notice that money changed hands without inquiry into what these two concepts were being used for.
All this means is that, whoever you are, it is especially important to become familiar with the real tools of machine learning and data science, what problems they are meant to solve, and how they solve them.
This is a tutorial on “principal component analysis”, or PCA, a technique used to reduce redundancy in your data. In any given dataset, you might have two or more measurements that actually measure the same thing. One example would be a dataset measuring students’ hours spent doing schoolwork and their GPA. These 2 variables could be interpreted as measuring 1 underlying quality of a student: how much of a “good student” they are. We might want to reduce these 2 factors into this “good student” measure because this greatly simplifies statistical analysis. That’s where PCA comes in. PCA rotates your data’s coordinate system to one that aligns with the important questions, i.e. whether a student is a good student or not (GPA + hours worked), or how naturally smart or efficient they are (GPA - hours worked). It also gives a score on how important these factors are.
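The student example can be made concrete with a tiny simulation (invented data; hours and gpa here are standardized stand-ins, not realistic scales):

```r
## Two correlated measurements of one hidden quality, reduced via prcomp
set.seed(42)
good.student <- rnorm(100)                    # hidden underlying quality
hours <- good.student + rnorm(100, sd = 0.3)  # noisy measurement 1
gpa <- good.student + rnorm(100, sd = 0.3)    # noisy measurement 2
pca <- prcomp(cbind(hours, gpa))
pca$rotation[, 1]  # first component: roughly (1, 1)/sqrt(2), up to sign
summary(pca)       # the first component carries most of the variance
```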
In this tutorial I am going to set up a situation where there is one underlying phenomenon: the motion of a spring in one direction. I’m going to duplicate and obfuscate this phenomenon by having a fictional group of scientists measure it with several different cameras. Then I’m going to use PCA to isolate the signal.
Setting up redundant data
Let’s say we have a ball attached to a spring. The ball’s motion might be described as \[x(t) = \cos(t)\] around the equilibrium point. I can model this in R.
library(ggplot2)
library(tidyr)
library(dplyr)
theme_set(theme_bw()) # set default theme to black and white
t <- seq(0, 6*pi, by = .1)
x <- cos(t)
qplot(x = t, y = x)
The above plot is how the position \(x\) varies with time. Of course, in the real world the ball has 3 spatial coordinates and every measurement has noise. To account for these, I’m going to add \(y\) and \(z\) coordinates, which are both 0, and I’m also going to add noise in every direction.
positions <- data.frame(t = seq(0, 6*pi, by = .1),
                        x = x + rnorm(length(x), sd = .1),
                        y = rnorm(length(x), sd = .1),
                        z = rnorm(length(x), sd = .1))
ggplot(data = gather(positions, variable, value, x:z), aes(x = t, y = value)) +
geom_point() +
facet_wrap(~ variable, nrow = 1)
Now you can see 3 relationships, one of mostly signal, 2 of noise.
Now let’s say a team of scientists have set up 3 cameras to observe this process. They do not have privileged access to the coordinate system that we know of, where the \(x\) coordinate contains all the signal. Rather, for each camera they get two coordinates: the \(x\) and \(y\) of the ball’s location in the camera’s image plane. For simplicity’s sake, I’m going to assume each camera is pointed directly at the midpoint of the ball’s motion. I’m going to assert, and you can prove to yourself, that a camera’s image plane coordinates can be expressed by rotating and scaling the original coordinate system, and then projecting the data onto two of the resulting dimensions.
One side note: In the process of writing this post, I discovered that generating a random 3Drotation is actually super annoying. I spent a little bit of time trying to tie the concept down and this is what I’ve come up with: a rotation can be uniquely specified by an axis and an angle to rotate around that axis. You can read about all that here and here, or you can just steal my code.
## random 3d rotation matrix
random.rotation.matrix <- function(seed) {
  set.seed(seed)
  ## angle around vector to rotate
  psi <- runif(1, max = 2*pi)
  ## select a random point on S^2
  phi <- runif(1, max = 2*pi)
  theta <- acos(2 * runif(1) - 1)
  ## construct axis from random S^2 point
  axis <- c(cos(theta)*cos(phi),
            cos(theta)*sin(phi),
            sin(theta))
  ## cross product matrix for formula
  axis.cp <- matrix(c(0, -axis[3], axis[2],
                      axis[3], 0, -axis[1],
                      -axis[2], axis[1], 0), nrow = 3, byrow = TRUE)
  ## create rotation matrix using wikipedia formula
  R <- cos(psi) * diag(c(1,1,1)) +
    sin(psi) * axis.cp +
    (1 - cos(psi)) * outer(axis, axis)
  R
}
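One way to convince yourself the axis-angle (Rodrigues) formula works: any matrix it produces must be orthogonal with determinant \(+1\). A quick check, sketched with a fixed axis and angle rather than random ones:

```r
## Rodrigues formula with a fixed unit axis and angle, then check that the
## result is a proper rotation: t(R) %*% R = I and det(R) = 1.
axis <- c(1, 2, 2) / 3  # a unit vector
psi <- 0.7              # an arbitrary angle
axis.cp <- matrix(c(0, -axis[3], axis[2],
                    axis[3], 0, -axis[1],
                    -axis[2], axis[1], 0), nrow = 3, byrow = TRUE)
R <- cos(psi) * diag(3) + sin(psi) * axis.cp + (1 - cos(psi)) * outer(axis, axis)
max(abs(t(R) %*% R - diag(3)))  # ~0: R is orthogonal
det(R)                          # ~1: R is a proper rotation
```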
I’ve tested the above function and it does seem to create random rotations. So let’s go ahead and find the new coordinates!
new.camera.data <- function(positions, seed) {
  set.seed(seed)
  ## original data
  original.coordinates <- t(as.matrix(select(positions, x, y, z)))
  ## get scale
  scale <- 1/rexp(1, 1/3)
  ## get rotation
  rotation <- random.rotation.matrix(seed)
  ## new points
  new.points <- t(scale * rotation %*% original.coordinates)
  ## project
  image.plane.projection <- new.points[, 1:2]
  list(scale = scale, rotation = rotation, data = image.plane.projection)
}
camera.1 <- new.camera.data(positions, 1)
camera.2 <- new.camera.data(positions, 2)
camera.3 <- new.camera.data(positions, 3)
camera.data = data.frame(t = seq(0, 6*pi, by = .1),
x1 = camera.1$data[,1],
y1 = camera.1$data[,2],
x2 = camera.2$data[,1],
y2 = camera.2$data[,2],
x3 = camera.3$data[,1],
y3 = camera.3$data[,2])
ggplot(data = gather(camera.data, variable, value, x1:y3), aes(x = t, y = value)) +
geom_point() +
facet_wrap(~ variable, nrow = 2)
It now looks like there are several signals here, but the scientists are suspicious because they seem very correlated. How will they isolate the signal?
Finding the Principal Components
PCA is based on the hypothesis that signals are things that cause your data to vary. Therefore, directions that possess large signals will also have data that varies in that direction. This leads us to the fundamental assumption of PCA:
Fundamental Assumption of PCA: The most important directions are the directions in which the data varies most.
Note that this implicitly assumes that the signal-to-noise ratio \[\frac{\sigma_{signal}^2}{\sigma_{noise}^2} > 1.\] Only then will the signal directions be expected to be the directions of greatest variation. In many real-world applications, for example stock price forecasting, the noise overwhelms the signal. It would be an interesting exercise to see how this simple example breaks down as you add noise.
Thus, PCA becomes an optimization problem. Let the matrix of data, with each column translated so that it has mean 0 (\(a \to a - \mbox{mean}(a)\)), be denoted \(\mathbf X\). You will soon see why this translation is important. Since translation does not affect variance, we are trying to find some unit vector \(\mathbf u\) such that the variance of \(\mathbf X \mathbf u\) is maximized:
\[ \mbox{Maximize Var(}{\mathbf X \mathbf u} \mbox{)} \quad \mbox{ subject to } \quad \|\mathbf u\| = 1,\]
repeatedly such that each \(\mathbf u_i \cdot \mathbf u_j = 0\) for \(i \neq j\) until our directions span the space of the data.
To solve this, let’s talk about variance for a bit. Consider two sets of measurements \[ \mathbf a = \{a_1, a_2, \dots, a_n\} \quad \mbox{ and } \quad \mathbf b = \{b_1, b_2, \dots, b_n\} \] with means of \(\mu_a\) and \(\mu_b\), respectively. The variance and covariance are defined as average squared differences from the mean: \[ \mbox{Var}(\mathbf a) = \sigma_{\mathbf a}^2 = \frac{1}{n} \sum_i(a_i - \mu_a)^2 = \frac{1}{n} \sum_i a_i^2 = \frac{1}{n} \mathbf a \cdot \mathbf a \quad \mbox{if } \mu_a = 0,\] \[ \mbox{Cov}(\mathbf a, \mathbf b) = \sigma_{\mathbf{ab}}^2 = \frac{1}{n} \sum_i(a_i - \mu_a)(b_i - \mu_b) = \frac{1}{n} \sum_i a_ib_i = \frac{1}{n}\mathbf a \cdot \mathbf b \quad \mbox{if } \mu_a = 0 \mbox{ and } \mu_b = 0.\]
So the variance is equal to the covariance of a variable with itself, \(\mbox{Var}(\mathbf{a}) = \mbox{Cov}(\mathbf a, \mathbf a)\). Since all of \(\mathbf X\)’s columns \(\mathbf x_i\) are mean zero, we can define a matrix \(\mathbf C\) whose \([i,j]\)th entry is the covariance of \(\mathbf x_i\) and \(\mathbf x_j\).
\[ \frac{1}{n} \mathbf X^T \mathbf X =
\begin{bmatrix}
\frac{1}{n} \mathbf x_1 \cdot \mathbf x_1 & \cdots & \frac{1}{n} \mathbf x_1 \cdot \mathbf x_r \\
\vdots & \ddots & \vdots \\
\frac{1}{n} \mathbf x_r \cdot \mathbf x_1 & \cdots & \frac{1}{n} \mathbf x_r \cdot \mathbf x_r \end{bmatrix} = \mathbf C.\]
Let’s do this now with our camera measurements in R.
options(width = 120)
measurements <- select(camera.data, -t)
measurements <- mutate_each(measurements, funs(. - mean(.)))
measurements <- as.matrix(measurements)
covariance.matrix <- t(measurements) %*% measurements / nrow(measurements)
covariance.matrix
## x1 y1 x2 y2 x3 y3
## x1 0.0021234862 0.005043854 0.0002620748 0.002255248 0.0005278387 0.002325504
## y1 0.0050438535 0.092671234 0.0166668085 0.034679534 0.0217014469 0.034022588
## x2 0.0002620748 0.016666808 0.0032414307 0.006126479 0.0041287510 0.005953620
## y2 0.0022552479 0.034679534 0.0061264792 0.013067542 0.0079891875 0.012826169
## x3 0.0005278387 0.021701447 0.0041287510 0.007989187 0.0053130622 0.007799908
## y3 0.0023255041 0.034022588 0.0059536201 0.012826169 0.0077999083 0.012612169
Let’s also reframe our optimization problem in terms of this nice, square, symmetric matrix:
\[ \mbox{Var}(\mathbf{X} \mathbf u) = \frac{1}{n}(\mathbf{X} \mathbf u)^T (\mathbf{X} \mathbf u) = \mathbf u^T (\frac{1}{n} \mathbf{X}^T \mathbf{X}) \mathbf u = \mathbf u^T \mathbf{C} \mathbf u. \]
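This identity is easy to check numerically on a small mean-zero matrix (a quick sketch with made-up data):

```r
## Var(Xu) computed directly vs. via the covariance matrix
set.seed(3)
X <- scale(matrix(rnorm(40), ncol = 2), center = TRUE, scale = FALSE)  # mean-zero columns
C <- t(X) %*% X / nrow(X)
u <- c(3, 4) / 5                # a unit vector
v1 <- mean((X %*% u)^2)         # Var(Xu): the columns are mean zero
v2 <- drop(t(u) %*% C %*% u)
abs(v1 - v2)                    # ~0
```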
Therefore we are trying to find a \(\mathbf u\) that maximizes the matrix product with the covariance matrix, \(\mathbf u^T \mathbf C \mathbf u\), such that \(\mathbf u\) is a unit vector and perpendicular to the other solutions we have found. Remember that \(\mathbf C\) is a square, symmetric matrix, and its set of unit eigenvectors \(\{\mathbf v_1, \mathbf v_2,\dots, \mathbf v_r\}\) are orthogonal, i.e. \(\mathbf v_i \cdot \mathbf v_j = 0\) for \(i \neq j\), and span its column and row spaces. Therefore we can express \(\mathbf u\) as some combination of the orthogonal unit eigenvectors of \(\mathbf C\):
\[\mathbf u = a_1 \mathbf v_1 + a_2 \mathbf v_2 + \dots + a_r \mathbf v_r,\]
where \(\sum a_i^2 = 1\) to make the result length 1. What happens when we make this substitution into the quantity we are trying to optimize?
\[\begin{align} \mathbf u^T \mathbf C \mathbf u &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T) \mathbf C
(a_1 \mathbf v_1 + a_2 \mathbf v_2 + \dots + a_r \mathbf v_r) \\
&= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T)
(a_1 \mathbf C \mathbf v_1 + a_2 \mathbf C \mathbf v_2 + \dots + a_r \mathbf C \mathbf v_r) \\
&= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T)
(a_1 \lambda_1 \mathbf v_1 + a_2 \lambda_2 \mathbf v_2 + \dots + a_r \lambda_r \mathbf v_r) \\
&= a_1^2 \lambda_1 + a_2^2 \lambda_2 + \dots + a_r^2 \lambda_r \quad \mbox{because the } \mathbf v \mbox{'s are orthogonal.}\end{align}\]
This is very interesting. Letting \[\{\alpha_1, \alpha_2, ..., \alpha_r \} = \{a_1^2, a_2^2, ..., a_r^2\}, \] we are trying to solve an optimization of the form \[\begin{align}
\mbox{Maximize} \quad \alpha_1 \lambda_1 + &\alpha_2 \lambda_2 + \dots + \alpha_r \lambda_r \\
\mbox{Subject to} \quad \sum_i \alpha_i &= 1 \\
\mbox{All } \alpha_i & \geq 0
\end{align}\]
This is a linear programming problem, where all solutions will be on the corners of the feasible region. On the corners, the inequalities are tight, so the solutions are where \(r-1\) of the \(\alpha\)’s are 0 and exactly one \(\alpha_i = 1\), so \(a_i = \sqrt{\alpha_i} = 1\).
So \(\mathbf u_i = \mathbf v_i\), where the \(i\)th most important direction has the \(i\)th largest eigenvalue. The solutions are the eigenvectors of the covariance matrix. Since the eigenvectors are orthogonal, our coordinate system is orthogonal. Pretty cool result, if you ask me.
Isolating the signal
Some people use the singular value decomposition (SVD) to get the eigenvectors and eigenvalues. If there is a deeper reason to use the SVD, feel free to email me! I am curious. Here, I’m just going to use the R function eigen to grab the eigenvectors.
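For what it’s worth, the two routes agree: prcomp computes the components via the SVD of the centered data matrix, and its directions match the eigenvectors of the covariance matrix up to sign. A quick check on made-up data with well-separated variances:

```r
## eigen(covariance) vs. prcomp (SVD-based): same directions, up to sign
set.seed(7)
X <- cbind(rnorm(200, sd = 3), rnorm(200, sd = 1), rnorm(200, sd = 0.2))
X <- scale(X, center = TRUE, scale = FALSE)  # mean-zero columns
C <- t(X) %*% X / nrow(X)
ours <- eigen(C)$vectors
theirs <- prcomp(X)$rotation
max(abs(abs(ours) - abs(theirs)))  # ~0
```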
First I’m going to look at the eigenvalues. I expect to see 1 large eigenvalue, and 5 much smaller values. What I’m going to do then is get the eigenvectors, project the data into that basis, and plot the results.
important.directions <- eigen(covariance.matrix)
important.directions$values
## [1] 1.265056e01 2.491410e03 3.188713e05 1.083847e17 3.857996e18 2.696033e18
The first eigenvalue is about 50x bigger than the second, which is about 100x bigger than the third, and the rest are numerically zero. This is exactly what I hoped to see.
projected <- as.data.frame(measurements %*% important.directions$vectors)
colnames(projected) <- paste0("component", 1:6)
projected$t <- camera.data$t
plot.data <- gather(projected, component, value, component1:component6)
ggplot(data = plot.data, aes(x = t, y = value)) + geom_point() +
facet_wrap(~component)
We’ve not only isolated the signal, we’ve isolated the noise as well! PCA has saved the scientists from the complicated analysis of 6 signals and given them only one to work with. Problem solved.
Alice, The Alien, and the Illusion of the Twin Paradox
http://michaeljflynn.net/2017/01/17/alicethealienandtheillusionofthetwinparadox/
Tue, 17 Jan 2017 16:16:17 +0000
Alice is traveling through deep space with her alien navigator. They are delivering a shipment of water to Proxima Centauri at a solid fraction of light speed. They had been traveling in deep space at constant speed for so long that it was as if they were standing still. Suddenly, and with a flash of light, they pass another interstellar traveler, one Robert, heading back to Earth. Sensor readings on Alice’s ship have him traveling at 3/5 light speed relative to them.
“Bob’s foot is made of solid lead,” the Alien commented.
“It better be when he has to slow down.” Alice took a moment to consider the significance of this event. “Reality is so strange. Bob is speeding past us right now. According to special relativity, time should be passing slower for him. However, special relativity also says that he has every right to say that he is stationary and we are moving, and therefore time is passing slower for us. Who is right?”
“Both.” The Alien blinked.
Alice continued. “One of us could start accelerating towards the other to meet up and find out which one was aging faster, but whoever does the acceleration will find that they had aged as if they were the ones moving the whole time! It’s almost like acceleration breaks the symmetry.”
The Alien squinted his big black eyes. “Well, I guess.”
“Don’t you think that’s weird? That’s how the real world works!”
The Alien shrugged, looking back at his console. “Not really.”
Alice sighed. She wasn’t particularly surprised that the Alien wasn’t playing along. He had a habit of shooting down her conversation topics.
She went on. “I guess the core of the paradox is that we both have an equal right to say that we are the more aged person while we are both moving at constant speed. Special relativity says that we are both right, but how could that be possible? When we arrive at the station, I will say that Bob has aged less than us, and at that same moment he will say that we have aged less than him.”
The Alien turned. “What do you mean when we arrive at the station?”
“I mean the moment in time when our ship’s coordinates become coincident with the station’s coordinates.” said Alice, frustrated.
“What do you mean moment in time?” asked the Alien, with a grin. “I’m going to go rehydrate some noodles.” He clapped his hands excitedly and went to go pull out some noodles from the freezer.
Now the Alien was just being obtuse. Alice was determined to get through his philosophical meanderings.
“I mean the slice of all points in spacetime where every synced clock traveling parallel to us at the same speed would read the same number as our clock.”
“But–” the Alien began, but Alice interrupted him, anticipating his question, being a gifted experimentalist. “You could easily sync clocks using a laser. All you need to do is fire a pulse at the clock. When the reflection comes back, you know that the clock is exactly half of the elapsed light-seconds away and it has advanced by exactly half the number of seconds the total journey took. Add that number to the reflected value on the clock and then you are synced.”
Alice drew a diagram.
“Here’s a chart with time on the y-axis and distance in line with the direction to the clock on the x-axis. I’ve scaled the axes so that light moves at 45 degree angles. Here you can see how the mechanism works. The red line is the laser pulse. All the spacetime points on the horizontal dotted line make up the moment in time, the present, when we hit the station, because any laser pulse reflected off a clock at any of those points makes a 45-45-90 triangle, whose altitude from the 90° vertex bisects the base. Do you get it?”
The Alien nodded.
Alice continued. “In fact, we don’t even need a clock. We know that when we receive the reflection of the laser, whatever it bounced off of must have been simultaneous with whatever was happening on our ship exactly halfway through the round trip. We could even do this with Bob. I know his speed relative to me is 3/5 light speed. In 30 months, when we arrive at the station, he will be 18 light-months away. I can send out a laser pulse in 12 months that will hit him exactly at (30 months, 18 light-months), which is when I arrive at the station. It will return to me. What will the reflection read out? Well, I remember a special trick for measuring straight-line clock-time from special relativity class: \(\tau = \sqrt{t^2 - x^2}\), where \(x\) is in light-units. So his clock reads \(\sqrt{30^2 - 18^2} = 24\) months greater than the time when we crossed paths, confirming that he has aged less than the 30 months we will have aged!”
“But how does that make any sense?” asked Alice, still stumped. “Perhaps Bob’s reference frame is in a different universe. Maybe we are now split into multiverses.”
“Hold on a second,” pleaded the Alien. “Bob can do the same to us. He can wait 12 months, which due to time dilation is 15 months for us, and send a laser pulse towards us. When he receives the reflection, he knows he can go back half the round-trip time to the exact moment on his timeline where the laser hit us. And it seems the laser hits us at… 24 months, which is perfectly symmetrical, but…” He edited the diagram.
“Bob knows that the laser pulse hit you when your clock had elapsed 24 months, and that must be at the same time as when his clock read 30 months. However, these two events do not happen at the same time in our reference frame. There is no paradox, but Bob’s present is tilted relative to ours. That’s what causes the illusion of contradiction. At the event at which you have accused Bob of being younger than us, to him you are not even close to reaching the station.” The Alien said, translating Bob’s present line down.
The Alien plopped a bowl of noodles in front of Alice. She was shocked. “So we are both right, in our own reference frames, and there isn’t any contradiction because the present is relative?” she asked. She paused for a moment, tired of thinking. “Thanks for the soup.”
“You’re welcome.” said the Alien. They sipped in silence.
All diagrams made using the diagrams library by Brent Yorgey.
Note: This post closely follows the approach of Tim Maudlin in his book: Philosophy of Physics: Space and Time (Princeton Foundations of Contemporary Philosophy). It is an excellent book. Maudlin is concrete where many philosophers are abstract and fluffy. He also understands physics better than many physicists. The great Richard Feynman gave an incorrect, acceleration-symmetry-breaking explanation of this phenomenon.
Simple RNA folding in 130 lines of Haskell
http://michaeljflynn.net/2017/01/09/simplernafoldingin130linesofhaskell/
Mon, 09 Jan 2017 00:56:41 +0000
How do we predict how RNA folds? First of all, we need to define what we mean by folding. RNA is a polymer made up of nucleic acids: adenine, guanine, cytosine, and uracil. There are 3 levels of RNA structure that are traditionally considered:
- primary structure: the sequence of bases, ‘A’, ‘G’, ‘C’, and ‘U’, that forms the strand of RNA, ‘AAAAGGGGCCCCUUUU’ for example,
- secondary structure: which bases are hydrogen-bonded together, ‘A’ to ‘U’ and ‘G’ to ‘C’, which can be specified by a sequence of pairs corresponding to which indices are paired together, (1,24), (2,23), …, and
- tertiary structure: how the molecule bends on itself on a larger scale.
Usually, primary structure is already known or given as an input and tertiary structure is hard to specify and compute the energy for. Secondary structure is easy to specify using a sequence of pairs and the energy is roughly dominated by hydrogen bonding between bases which is discrete. Since RNA functions mostly by binding to things using exposed (nonpaired) bases, the secondary structure is also the most relevant when discerning function. For these reasons, biophysicists usually focus on predicting the secondary structure of RNA.
From the laws of thermal physics, if RNA is modeled as a system in thermal equilibrium then the probability of any state \(s\) with free energy \(E(s)\) is equal to
\[P(s) = \frac{e^{-E(s)/k_B T}}{\sum_{s' \in \Omega} e^{-E(s')/k_B T}},\]
where \(\Omega\) is the full set of states. The denominator is called the partition function, usually denoted \(Z\); it normalizes the probability distribution, and with it we can compute the probability of any structure. How many terms are in the partition function? How many states are in \(\Omega\)? It depends on what assumptions you make. In full generality, even only allowing Watson-Crick base pairs, the number of terms in that sum grows exponentially in the strand length. However, if you assume that there are no crossing pairs, i.e. arcs that would intersect in the above diagram, or pairs \((i,j)\) and \((k,l)\) such that \(i < k < j < l\), the sum still runs over exponentially many structures, but it turns out it is actually computable in polynomial time via dynamic programming. This assumption is called the no-pseudoknot assumption, and it means that if I have an \((i,j)\) pair, then for all bases in between them, I can ignore all the bases outside of them when enumerating structures. This raises the question of whether it is a physically valid assumption. The answer is NO. There are many RNAs found in nature that have “crossing” pairs, such as Group I and Group II introns. However, without this assumption, computing the partition function is intractable, so most RNA folding software packages make it.
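To make the combinatorics concrete, the number of no-pseudoknot structures obeys a first-base recursion: the first base is either unpaired or paired to some later base \(k\), which splits the strand in two. A small Python sketch, purely for illustration: it lets any base pair with any other and ignores minimum loop sizes, so these are idealized counts rather than counts for a real sequence.

```python
from functools import lru_cache

def count_structures(n):
    """Count non-crossing pairings (including the empty one) of n bases,
    ignoring base identity and minimum loop sizes."""
    @lru_cache(maxsize=None)
    def c(i, j):
        # Number of no-pseudoknot structures on bases i..j inclusive
        if j - i < 1:
            return 1  # empty stretch or single base: exactly one structure
        total = c(i + 1, j)  # case 1: base i is unpaired
        for k in range(i + 1, j + 1):  # case 2: base i pairs with base k
            total += c(i + 1, k - 1) * c(k + 1, j)
        return total
    return c(0, n - 1)

for n in [5, 10, 20]:
    print(n, count_structures(n))
```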
While it is tempting to do analysis on the single structure with the minimum free energy and highest probability, RNA is a thermal system with an exponential multiplicity of states: even the probability of the most probable state is extremely low, and any quantity computed on that one state makes a negligible contribution to the physical expected value of that quantity. However, I will show here that not only can we compute the partition function, but we can also sample from the Boltzmann distribution. Once we are able to do that, we can compute arbitrary expected values over the full Boltzmann distribution via Monte Carlo integration. (In fact, Monte Carlo integration is not even needed if you can express your quantity in terms of partition function values, which gives you exact expected values without statistical error! But that is outside the scope of this post.)
Putting this all together, to predict the folding and general properties of RNA we need 4 things:
- a specification of a state of RNA, \(s\),
- an energy function \(E(s)\), which takes a state and outputs a physical energy,
- the partition function \(Z\), the sum of the Boltzmann factors \(e^{-E(s)/k_B T}\) over every possible state \(s \in \Omega\), and
- an algorithm for sampling structures from the Boltzmann distribution, \(P(s) = e^{-E(s)/k_B T}/Z\).
Quick Code Note
All code going forward will be in Haskell. You should be able to extract all the code here into a file and it should compile. To set up our environment, make sure to have this header at the top of the file:
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeFamilies #-}
import Data.Array
import Data.Time.Clock.POSIX
import Data.Colour
import System.Random
import Data.List
import Diagrams.Prelude
import Diagrams.Backend.SVG.CmdLine
States of RNA
Because of the nopseudoknot assumption, we can actually specify the states using a recursive data structure.
data Structure = Paired Char Char (Structure, Structure)
               | Slip Char Structure
               | None
               deriving Show
The thought behind this definition is that we are specifying the structure base by base. If base \(i\) is unpaired, then we move on to the next base. If base \(i\) is paired to some base \(k\), then we fork the structure into the bases underneath the pair and the bases after the pair. We carry along the letter of the base for convenience and to compute the energies of any pairing that might happen.
Energy Model
A fully realistic energy model of RNA would be very difficult to compute. There are exposed ions everywhere on the strand which interact with each other, and there is tension in the backbone. Many of these interactions are nonlinear, which breaks the recursion and therefore the computability of the problem. In practice, there are several levels of complexity that are implemented to account for the nonlinearities in a computable fashion (this actually is the only difference between professional RNA secondary structure prediction and an Algorithms homework problem). Here I will implement a very simple energy model: adding the free energies of each hydrogen bond. A hydrogen bond contributes about \(2.3\,k_B T\) of binding free energy. Adenine-uracil pairs have 2 hydrogen bonds, guanine-cytosine pairs have 3. Therefore I will assign energies of \(-4.6\) for AU pairs and \(-6.9\) for GC pairs (in units of \(k_B T\)), so that 4.6 and 6.9 appear in the exponent of the Boltzmann factor. Anything else is represented by infinite energy, meaning that it will never happen.
baseEnergy :: Char -> Char -> Double
baseEnergy a b = case (a, b) of
  ('A', 'U') -> -4.6
  ('G', 'C') -> -6.9
  ('U', 'A') -> -4.6
  ('C', 'G') -> -6.9
  _          -> 1.0/0.0 -- non-allowed pairs treated like an infinite hill
energyModel :: Structure -> Double
energyModel (Paired a b (s1, s2)) =
  (baseEnergy a b) +  -- energy of pair
  (energyModel s1) +  -- energy of substructure under pair
  (energyModel s2)    -- energy of substructure after pair
energyModel (Slip _ s) = energyModel s
energyModel None = 0
Partition Function
The sum initially seems daunting. However, from the recursive definition of our energy model, we can easily see that the partition function can be defined recursively as well. To be precise, if we define \(Z(i,j)\) as the sub-partition-function from base \(i\) to base \(j\), it must be equal to the sum of the Boltzmann terms of all structures where \(i\) is paired and all structures where \(i\) is unpaired:
\[Z(i,j) = Z(i+1,j) + \sum_{k=i+1}^{j} e^{-E(i,k)}\, Z(i+1,k-1)\, Z(k+1,j),\]
where \(E(i,k)\) is the pair energy of bases \(i\) and \(k\), and \(Z(i,i) = 1\). This becomes a standard dynamic programming problem. We can turn \(Z\) into a table of values with \(n \times n\) cells. If cell \((0,0)\) is on the bottom left and \((n-1,n-1)\) on the top right, each cell of the table only depends on the entries below and to the left. We can compute the full partition function by starting from the bottom left and working out to the top right in \(O(n^3)\) time. This is even parallelizable along the diagonals (I’m not going to parallelize it here).
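Before the Haskell version, here is a minimal Python cross-check of the same \(Z(i,j)\) recursion, using the simple pair energies from above (in units of \(k_B T\)); disallowed pairs are simply skipped, the equivalent of giving them infinite energy:

```python
import math

# Per-pair free energies in units of kT, matching the simple model above
ENERGY = {('A', 'U'): -4.6, ('U', 'A'): -4.6,
          ('G', 'C'): -6.9, ('C', 'G'): -6.9}

def boltz(e):
    return math.exp(-e)

def partition_function(strand):
    n = len(strand)
    # Z[i][j] = sub-partition-function on bases i..j; 1 for empty/single stretches
    Z = [[1.0] * n for _ in range(n)]
    for span in range(1, n):          # fill by increasing stretch length j - i
        for i in range(n - span):
            j = i + span
            total = Z[i + 1][j]       # base i unpaired
            for k in range(i + 1, j + 1):  # base i paired with base k
                e = ENERGY.get((strand[i], strand[k]))
                if e is None:
                    continue          # disallowed pair: zero Boltzmann weight
                inner = Z[i + 1][k - 1] if k - i > 2 else 1.0
                outer = Z[k + 1][j] if j - k > 1 else 1.0
                total += boltz(e) * inner * outer
            Z[i][j] = total
    return Z

# For 'GCAU' the only allowed pairs are (G,C) at (0,1) and (A,U) at (2,3),
# so Z should factorize as (1 + e^4.6)(1 + e^6.9)
Z = partition_function("GCAU")
print(Z[0][3])
```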
type PFunArray = Array (Int, Int) Double

boltz :: Double -> Double
boltz x = exp (-x)

partitionFunction :: String -> PFunArray
partitionFunction strand = arr
  where
    len = length strand
    arr = array ((0,0), (len - 1, len - 1))
          [((i,j), (pfCell i j strand arr)) | i <- [0..len-1], j <- [i..len-1]]

pfCell :: Int -> Int -> String -> PFunArray -> Double
pfCell i j strand arr = if (i==j) then 1 else (sum pairTerms) + slipTerm
  where
    pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                   (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                   (if j-k > 1 then arr ! (k+1, j) else 1)
    pairTerms = [pairTerm i k | k <- [(i+1)..j]]
    slipTerm = if i == j then 1 else arr ! (i+1, j)
Sampling RNA Structures
To actually predict the structure of RNA molecules in nature, we can work backward through the partition function table to sample structures from the Boltzmann distribution. Consider the stretch of the strand from base \(i\) to base \(j\):
- the probability that \(i\) is unpaired is equal to the sum of the probabilities of every structure along \(i\) to \(j\) with \(i\) unpaired, or \(Z(i+1,j)/Z(i,j)\),
- the probability that \(i\) is paired to base \(k\) is equal to the sum of the probabilities of every structure along \(i\) to \(j\) where \(i\) and \(k\) are paired, or \(e^{-E(i,k)}\, Z(i+1,k-1)\, Z(k+1,j)/Z(i,j)\), and
- completing the rest of the structure can be achieved by calling the sampling function recursively for every stretch of the strand left unspecified: \(i+1\) to \(j\) in the unpaired case, and \(i+1\) to \(k-1\) and \(k+1\) to \(j\) in the paired case.
I encode this logic in Haskell below. At each level I roll a random number from 0 to \(Z(i,j)\) and sort through the cases. When I pass a cumulative probability greater than the number I've rolled, I choose that case and recurse.
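The roll-and-compare selection at the heart of the sampler can be isolated in a few lines. A Python sketch (with made-up weights standing in for the slip term and the pair terms) of choosing a case in proportion to its Boltzmann weight using a single uniform roll:

```python
import random

def choose_case(weights, total, rng):
    """Pick index i with probability weights[i]/total by walking
    cumulative sums against one uniform roll."""
    roll = rng.uniform(0, total)
    cumulative = 0.0
    for idx, w in enumerate(weights):
        cumulative += w
        if roll < cumulative:
            return idx
    return len(weights) - 1  # guard against floating-point round-off

rng = random.Random(0)
# Hypothetical weights: [unpaired term, pair with k=i+1, pair with k=i+2]
weights = [1.0, 3.0, 6.0]
counts = [0, 0, 0]
for _ in range(10000):
    counts[choose_case(weights, sum(weights), rng)] += 1
print(counts)  # roughly proportional to 1 : 3 : 6
```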
sampleStructure :: Int -> Int -> PFunArray -> String -> StdGen -> Structure
sampleStructure i j arr strand gen = if (i > j) then None else struct
  where
    -- set up pair terms
    pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                   (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                   (if j-k > 1 then arr ! (k+1, j) else 1)
    pairTerms = [pairTerm i k | k <- [(i+1)..j]]
    -- structure functions
    slipStructure genr = Slip (strand !! i) (sampleStructure (i+1) j arr strand genr)
    innerPairStructure k genr = if k - i > 1 then sampleStructure (i+1) (k-1) arr strand genr else None
    outerPairStructure k genr = if j - k > 0 then sampleStructure (k+1) j arr strand genr else None
    pairStructure k genr = Paired (strand !! i) (strand !! k) (innerPairStructure k genr, outerPairStructure k genr)
    -- (cumulative probability, structure) pairs
    slipCase = (0, slipStructure) -- starting at offset of 0
    pairCases = [(sum (take (k-i-1) pairTerms) + arr ! (i+1, j), pairStructure k) | k <- [(i+1)..j]]
    allCases = slipCase:pairCases
    -- sample a random number according to the current sub-partition-function
    (roll, newgen) = randomR (0, arr ! (i, j)) gen
    -- if roll >= cumulative probability, switch candidates; else keep the current one
    checkCase current candidate = if roll >= (fst candidate) then (snd candidate) newgen else current
    struct = foldl' checkCase None allCases
Drawing Structures
One last thing to do is to draw these structures, which I do using Brent Yorgey's cool diagrams library. Every base is a circle with a letter underneath, color coded by base, with backbone and arcs denoting bonds. This depicts RNA with the bonds fanned out and visible, whereas other depictions of RNA will try to keep the length of each bond pretty constant, which is a more physical picture.
colorMap a = case a of
  'G' -> sRGB24read "#107896"
  'C' -> sRGB24read "#829356"
  'U' -> sRGB24read "#F26D21"
  'A' -> sRGB24read "#C02F1D"
  _   -> white

base :: Char -> Diagram B
base b = circle 1 # fc (colorMap b) <>
         text [b] # fontSize (local 2) # translateY (-2.6)
backbone = hrule 1
isNone :: Structure -> Bool
isNone s = case s of
  None -> True
  _    -> False
structLength :: Structure > Int
structLength None = 0
structLength (Slip _ s) = 1 + structLength s
structLength (Paired _ _ (s1, s2)) = 2 + structLength s1 + structLength s2
renderStructure :: Structure -> Diagram B
renderStructure (Paired a b (s1, s2)) =
  base a <>
  translateX 1.5 backbone <>
  translateX 3 innerElement <>
  translateX (3.0*k') (base b) <>
  translateX (3.0*k'+3) outerElement <>
  arc xDir (0.5 @@ turn) # scale (1.5*k') # translateX (1.5*k')
  where
    k = structLength s1 + 1
    k' = fromIntegral k
    innerElement = if isNone s1 then mempty else renderStructure s1 ||| backbone
    outerElement = if isNone s2 then mempty
                   else translateX (-1.5) backbone <>
                        renderStructure s2
renderStructure (Slip c s) = if isNone s then base c else base c <>
  translateX 1.5 backbone <>
  translateX 3 (renderStructure s)
renderStructure None = mempty
Example Output
Let's try it out:
main = do
  let gen = mkStdGen 42
  let rna = "GAAAAAAGGGGAAACCAAAGCCCAAUUUGCUUUUAAAAGGCCAA"
  let pf = partitionFunction rna
  let struct = sampleStructure 0 (length rna - 1) pf rna gen
  let diagram = renderStructure struct
  mainWith diagram
In shell:
ghc RNAfolding.hs
./RNAfolding -o RNAt.svg -w 600 -h 300
Cool!
Further Reading
There is a long history of work on this problem going back to the 1980s, even before John McCaskill first formulated a partition function approach to RNA secondary structure prediction. Below are a couple of significant papers in this series, or at least those that are known to me. I'll also put my undergraduate thesis here, where I wrote extensively on a slightly more complicated version of this problem.
Partition Functions:
McCaskill, J.S., 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7), pp.1105-1119. pdf
Stochastic Sampling:
Ding, Y. and Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Research, 31(24), pp.7280-7301. pdf
More complicated energy models:
Mathews, D.H., Sabina, J., Zuker, M. and Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288(5), pp.911-940. pdf
Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19), pp.7287-7292. pdf
Some clear formulations, useful to me:
Dirks, R.M. and Pierce, N.A., 2003. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, 24(13), pp.1664-1677. pdf
RNAbows:
Aalberts, D.P. and Jannen, W.K., 2013. Visualizing RNA base-pairing probabilities with RNAbow diagrams. RNA, 19(4), pp.475-478. pdf
My thesis:
Flynn, M.J. and Aalberts, D.P., 2015. RNA Macrostates and Computational Tools. Williams College. github pdf
My Favorite Algorithm: MetropolisHastings
http://michaeljflynn.net/2015/06/01/myfavoritealgorithmmetropolishastings/
Mon, 01 Jun 2015 06:05:11 +0000
I’m writing this post because I feel like any computational scientist should have a favorite algorithm, and in my research I have run into this particular algorithm again and again, and every time it impresses me. I am also writing this because I googled “What is your favorite algorithm” and was surprised to find that no one mentioned it in the top results. Few algorithms in computational physics have the scope and power of the Metropolis-Hastings algorithm. Briefly, given a domain \(D\) and a function \(f\) with bounded integral over that domain, the MH algorithm samples points from the probability distribution proportional to \(f\). This subtle technique can solve very difficult problems in the design of engines, rockets, generators, nuclear weapons and reactors, plasma physics, the simulation of thermodynamic systems and transport of light in Computer Graphics, the computation of partition functions of RNA macrostates and their transition states, and the computation of posterior distributions in Bayesian statistics (as well as a family of statistical tools that I know relatively less about). In this post I plan to describe the algorithm, Monte Carlo integration, and some examples of applications.
The Algorithm
Let’s say we have some function \(f\) over some domain \(D\) such that
\[\int_D f(x)\,dx = c < \infty,\]
for example a bell curve over the real line \(\mathbb{R}\), or a constant function over a finite interval, or some higher-dimensional function with a corresponding higher-dimensional domain. It turns out we can generate a list of samples from the probability distribution proportional to this function,
\[p(x) = \frac{f(x)}{\int_D f(x')\,dx'},\]
by following a simple procedure: start at some initial state \(x_0\) and generate the output, a list of samples \(x_1, x_2, \dots\), by repeatedly sampling from some transition probability distribution function \(T(y \mid x)\) (that you choose) set by the current state \(x\).
This list of samples will converge to \(p\) provided some conditions on \(T\) are satisfied, namely ergodicity, which means that every state can eventually reach any other state, and detailed balance. If there are \(N_x\) samples in state \(x\) at some point in the chain, the number of samples in state \(y\) created by transitions from these states is expected to be \(N_x T(y \mid x)\). The difference \(N_x T(y \mid x) - N_y T(x \mid y)\) is called the transition rate from \(x\) to \(y\), because it is the expected net flow of samples from \(x\) to \(y\) (or the other way, if it is negative). Detailed balance is said to hold if the transition rates between all states are zero once the samples have reached the correct probability distribution. This way the net flow of probability is zero and the distribution is expected to stay the same. Therefore if we have \(N\) sampled states distributed according to \(p\), so that \(N_x = N p(x)\), and we want the distribution to stay \(p\), our transition probabilities should be constrained by:
\[p(x)\, T(y \mid x) = p(y)\, T(x \mid y).\]
From here we can break \(T\) down into the product of 2 different distributions, \(T(y \mid x) = g(y \mid x)\, A(y \mid x)\): \(g(y \mid x)\), the probability of generating a transition from \(x\) to \(y\), and \(A(y \mid x)\), the probability of accepting that transition. We do this to let the definition of \(g\) be flexible by having \(A\) handle the constraint. This is done by solving for \(A\) with a general \(g\):
\[\frac{A(y \mid x)}{A(x \mid y)} = \frac{p(y)\, g(x \mid y)}{p(x)\, g(y \mid x)}.\]
One choice that satisfies this equation is
\[A(y \mid x) = \min\!\left(1, \frac{p(y)\, g(x \mid y)}{p(x)\, g(y \mid x)}\right).\]
Provided that \(A\) satisfies these constraints, we can construct a mutation function that samples from \(g\) to give us a new element, create the chain by accepting or rejecting the new element, and the samples’ distribution will converge to \(p\).
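Putting the pieces together, here is a minimal Python sketch of the algorithm with a symmetric Gaussian proposal, for which the \(g\) factors cancel and the acceptance probability reduces to \(\min(1, f(y)/f(x))\); the target here is an unnormalized standard normal:

```python
import math
import random

def metropolis_hastings(f, x0, n_samples, step=1.0, seed=0):
    """Sample from the density proportional to f using a symmetric
    Gaussian proposal, so A(x -> y) = min(1, f(y)/f(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        y = x + rng.gauss(0.0, step)           # propose a move
        if rng.random() < min(1.0, f(y) / f(x)):
            x = y                               # accept
        samples.append(x)                       # a rejection repeats the old state
    return samples

# Target: unnormalized standard normal, f(x) = exp(-x^2 / 2)
samples = metropolis_hastings(lambda x: math.exp(-x * x / 2), 0.0, 100_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should be close to 0 and 1
```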
Example
Let’s say we want to sample from a difficult distribution on the plane, perhaps the one proportional to
\[f(x, y) = \frac{\sin^2\!\sqrt{x^2 + y^2}}{(x^2 + y^2)^{3/2}}.\]
Since this distribution is pretty difficult to sample directly, instead we use Metropolis-Hastings. Let’s construct a mutation function that chooses from all directions uniformly and moves in that direction a distance \(r\) drawn from the exponential density \(2e^{-2r}\). Because this proposal is symmetric, \(g(y \mid x) = g(x \mid y)\), so our acceptance probability should therefore be:
\[A\big((x_1, y_1) \to (x_2, y_2)\big) = \min\!\left(1, \frac{f(x_2, y_2)}{f(x_1, y_1)}\right).\]
This is all we need to know to implement a sampling algorithm in R:
## Metropolis Hastings demonstration by Mike Flynn
## Warning: This code has a runtime of around 4 hours on my machine
samples = matrix(nrow = 10^6, ncol = 2)
currentSample = c(0.001,0.001)
set.seed(44)
for(i in 1:(10^8)) {
## choose direction and distance for transition
direction = runif(1, 0, 2*pi)
distance = rexp(1, 2)
## current point
x1 = currentSample[1]
y1 = currentSample[2]
## compute next point
x2 = x1 + distance*cos(direction)
y2 = y1 + distance*sin(direction)
## accept?
accept = min(sin(sqrt(x2^2 + y2^2))^2*(x1^2 + y1^2)^(3/2)/
(sin(sqrt(x1^2 + y1^2))^2*(x2^2 + y2^2)^(3/2)),1)
if(accept > runif(1, 0, 1)) {
currentSample = c(x2, y2)
} else {
currentSample = c(x1, y1)
}
## only take 1 of every 100 to reduce autocorrelation
if(i %% 100 == 0) {
samples[i/100,] = currentSample
}
}
library(ggplot2)
library(ggthemes)
plotdat = data.frame(x = samples[,1], y = samples[,2])
## display p to plot
p = ggplot(data = plotdat, aes(x = x, y = y)) + geom_point(alpha = .05, size = .01) + theme_bw() +
xlim(-6*pi, 6*pi) + ylim(-6*pi, 6*pi)
which outputs this picture:
This plot looks a lot like interference fringes to me, which pleases me as a physicist. Of course, this example is contrived, though interference fringes do sometimes roughly follow this distribution. More importantly, the samples produced by the Metropolis-Hastings algorithm do seem to match the probability distribution proportional to \(f\).
Monte Carlo Integration and Importance Sampling
Monte Carlo integration is the process of evaluating an integral by sampling randomly from the domain and averaging. Often Monte Carlo integration of \(\int_D f(x)\,dx\) will make use of the expectation value identity for samples \(x_i\) from a probability distribution \(p\) on the same domain:
\[\int_D f(x)\,dx = \int_D \frac{f(x)}{p(x)}\, p(x)\,dx = \mathbb{E}_p\!\left[\frac{f(x)}{p(x)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}.\]
For example, we might evaluate an integral over \([0,1]\) by randomly sampling from the uniform density on \([0,1]\), letting \(p(x) = 1\), and averaging:
\[\int_0^1 f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i).\]
The closer \(p\) gets to the shape of the integrated function, the faster the convergence will be. For example, if \(p(x) = f(x) / \int_D f(x')\,dx'\), then the above derivation will yield
\[\frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)} = \frac{1}{N} \sum_{i=1}^{N} \int_D f(x')\,dx' = \int_D f(x')\,dx'.\]
No matter what \(N\) is, it will converge after the first iteration. This is cheating because it assumes we already know the answer, \(\int_D f\). In general we won’t have access to this knowledge, or else we wouldn’t be doing the integral in the first place. However, we do have the Metropolis-Hastings algorithm to let us sample from \(p \propto f\) without explicitly knowing the normalization \(\int_D f\). We can use this in many creative ways, stemming from the fact that the expectation value of any statistic \(g(x)\) on these samples is \(\int_D g(x)\, p(x)\,dx\).
Integrating the Example
Luckily enough, our example function can be integrated over the entire xy-plane by hand:
\[\int_{\mathbb{R}^2} \frac{\sin^2\!\sqrt{x^2 + y^2}}{(x^2 + y^2)^{3/2}}\,dx\,dy = 2\pi \int_0^\infty \frac{\sin^2 r}{r^2}\,dr = \pi^2.\]
(The integral can be done via some complex analysis.) Does our Metropolis-Hastings result corroborate this? Consider a function \(f\) and the results of the MH algorithm on that function: samples \(x_i\) from the distribution \(p = f / \int_D f\). We are looking for \(\int_D f\). Let’s say we come up with some function \(g\) and take the following statistic:
\[S = \frac{1}{N} \sum_{i=1}^{N} \frac{g(x_i)}{f(x_i)}, \qquad \mathbb{E}[S] = \int_D \frac{g(x)}{f(x)} \cdot \frac{f(x)}{\int_D f}\,dx = \frac{\int_D g}{\int_D f}.\]
As long as \(g\) integrates to 1 over the domain, the answer comes out to be \(1 / \int_D f\), so we can just take the reciprocal to get our answer. On the plane, a good function that integrates to 1 and matches the shape of our function pretty well is \(g(x,y) = e^{-x^2 - y^2}/\pi\), so our statistic becomes:
\[S = \frac{1}{N\pi} \sum_{i=1}^{N} e^{-x_i^2 - y_i^2}\, \frac{(x_i^2 + y_i^2)^{3/2}}{\sin^2\!\sqrt{x_i^2 + y_i^2}}.\]
We hope the answer comes out close to \(1/\pi^2 \approx 0.1013\). I compute the statistic with 1 line of R code:
estimate = sum(apply(samples, 1, function(row) exp(-row[1]^2 - row[2]^2) * (row[1]^2 + row[2]^2)^(3/2)/(sin(sqrt(row[1]^2 + row[2]^2))^2)))/(nrow(samples) * pi)
When running this line on the data generated by the previous code block, I get the answer 0.1016102, less than 1% away from the true value \(1/\pi^2\). Taking the square root of the reciprocal I get 3.137121, not a bad estimate of \(\pi\) for a seemingly arbitrary sum. Pretty remarkable.
Applications
The applications of Metropolis-Hastings are truly where the algorithm shines. Many integrals are not solvable in “closed form” via the Fundamental Theorem of Calculus because they have no antiderivative that can be expressed in terms of elementary functions (apparently this is provable), for example the integral \(\int e^{-x^2}\,dx\).
For these problems, if you really need to know the solution you can use numerical methods of estimating the integral, such as the trapezoidal rule (bringing back fond memories of AP calculus). For some integrals we cannot even use those techniques. This could be because either there are too many dimensions we are integrating over, which forces us to use exponentially many “trapezoids”, or it could be that the domain is really large (perhaps infinitely large) and contributions from different parts are uneven. For these problems Monte Carlo integration is the best option.
There are many examples of such problems in statistical mechanics. The spread of heat through conduction, convection, and radiation is essential for the design of engines, rockets, and electric generators. The many pathways of heat transfer must be integrated over, a large space for which Monte Carlo integration is the only practical option. Neutron transport for the design of nuclear weapons and reactors is another similar problem. A third example would be the transport of light or photons through a scene for the purpose of rendering a physically accurate image in Computer Graphics, for which the application of the MH algorithm has its own name: Metropolis Light Transport.
To illustrate how the MHalgorithm makes hard integrals solvable, here is the luminosity equation from Computer Graphics:
A quick description: the is the light outgoing from a flat surface in direction , is the light emitted by that surface in that direction, is the light incoming to that surface from direction , is the chance that light incoming from will scatter in direction , and is integrated over the positive hemisphere . In English terms, what this equation is saying is that to find the amount of light that is bouncing off a surface towards you, you have to add up the amount of light being bounced towards that surface from all other surfaces, and find all the light that is being bounced towards them, and so on. This integral is defined recursively! It is actually an infinite recursion of infinite integrals. This qualifies as a very hard integral. Luckily a surface absorbs some of the light bouncing off of it so it converges and we don’t go blind, but how do we integrate this?
A slick way to do it is to expand the integral into an integral over paths through each pixel instead of a recursive integral. That way we would have that the light received by pixel , , could be expressed as:
Where $\Omega$ is the space of all paths of light through the camera, $h_j(\bar{x})$ is the proportion of light of each color that path $\bar{x}$ contributes to pixel $j$, $f(\bar{x})$ is the total amount of light flowing along that path, and $d\mu$ is a measure on the space of paths. We can then treat this as a Monte Carlo integration problem, sample paths $\bar{x}_1, \ldots, \bar{x}_N$ from the distribution $p(\bar{x}) \propto f(\bar{x})$ using the MH algorithm, and use the expectation value identity:

$$I_j = b\, \mathbb{E}_{p}\!\left[h_j(\bar{X})\right] \approx \frac{b}{N} \sum_{i=1}^{N} h_j(\bar{x}_i), \qquad b = \int_{\Omega} f(\bar{x})\, d\mu(\bar{x})$$
to get the value of the integral, multiplied by a normalization constant (which can be quickly found, and in practice simply set to whatever brightness looks good at the end of the process). Because Metropolis Light Transport samples the most important paths first, it is among the fastest-converging general-purpose unbiased Monte Carlo methods. Additionally, it is trivially parallelizable.
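To make the sampling strategy concrete, here is a hedged Python sketch of the same idea in one dimension: a random-walk Metropolis-Hastings sampler targeting a density proportional to f, used to estimate an expectation. The Gaussian target and the quantity h are stand-ins of my own choosing for the path-space functions above:

```python
import math
import random

def f(x):
    # unnormalized target density (stand-in for the path contribution);
    # MH only ever needs ratios of f, so the normalization is irrelevant
    return math.exp(-x * x / 2.0)

def h(x):
    # quantity whose expectation under p = f / b we want
    return x * x

def mh_expectation(n=200_000, step=1.0):
    x, total = 0.0, 0.0
    for _ in range(n):
        proposal = x + random.gauss(0.0, step)     # symmetric random-walk proposal
        if random.random() < min(1.0, f(proposal) / f(x)):
            x = proposal                            # accept; otherwise keep x
        total += h(x)
    return total / n

# For a standard normal target, E[X^2] = 1
print(mh_expectation())
```

Averaging h over the chain gives the expectation directly; multiplying by the normalization constant of f would recover the original integral, exactly as described above.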
The Metropolis-Hastings algorithm makes normally impossible integrals solvable with relative efficiency, and because of its wide range of applications to very cool problems, I consider it my favorite algorithm.
References
Much of what I have written here is knowledge that I built up over summer research on sampling methods and a Computer Graphics class I took this fall. The luminosity equation comes from The Graphics Codex by Morgan McGuire, and from the Metropolis Light Transport paper and Ph.D. thesis of Eric Veach. In addition I learned much about Monte Carlo methods from Computer Graphics: Principles and Practice by Hughes et al.
Chib, Siddhartha, and Edward Greenberg. "Understanding the Metropolis-Hastings Algorithm." The American Statistician 49, no. 4 (1995): 327–335.
Hastings, W. K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika 57 (1): 97–109.
Hughes, John F.; van Dam, Andries; McGuire, Morgan; Sklar, David F.; Foley, James D.; Feiner, Steven K.; and Akeley, Kurt. Computer Graphics: Principles and Practice. Pearson Education, 2013.
McGuire, Morgan. The Graphics Codex. Online book at www.graphicscodex.com.
Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines." Journal of Chemical Physics 21 (6): 1087–1092.
Veach, Eric. "Robust Monte Carlo Methods for Light Transport Simulation." PhD diss., Stanford University, 1997.
Veach, Eric, and Leonidas J. Guibas. "Metropolis Light Transport." In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 65–76. ACM Press/Addison-Wesley Publishing Co., 1997.