This is the second post in my series on great algorithms. My last post was on the Metropolis-Hastings algorithm.

The Fourier transform is a fundamental operation in applied math. The original Fourier transform is attributed to Joseph Fourier, a French mathematician and physicist, who developed it to solve partial differential equations describing the transfer of heat [1]. Since then, the revelation that any function can be approximated by a series of sines and cosines has exploded far beyond the original application. The technique is used in applications from digital signal processing to medical imaging to detecting nuclear tests. Until the FFT, all applications used an algorithm that summed \(N\) terms for each of \(N\) output terms, and therefore had an asymptotic runtime of \(O(N^2)\). The \(O(N\log N)\) algorithm was first presented in a complete, packaged form in the 1965 paper by James Cooley and John Tukey: “An algorithm for the machine calculation of complex Fourier series” [2]. There is evidence that the FFT had been discovered before. Gauss used the algorithm in the early 1800s to interpolate the trajectory of the asteroid Pallas but never published his results (asymptotic runtime matters when you’re doing all the computation by hand). The Ex Libris blog gives an interesting analysis of Gauss’s approach here. Others claim to have discovered the fundamental principle behind the FFT first, but no one presented it in a form that made users realize they could perform their \(O(N^2)\) computations in \(O(N\log N)\) until Cooley and Tukey, and for that they deserve credit.

The motivation behind the development of the FFT was not academic. This was the case for many algorithmic improvements of the 50’s and 60’s. In a time when business-scale mainframe computers like the IBM System/360 could execute only around 35 thousand operations per second and had memory sizes of 512 kilobytes, optimizing algorithms for memory and time performance had material economic benefits for companies. For the FFT there were national security reasons as well. James Cooley gives an interesting account of his development of the algorithm in a 1987 issue of *Microchimica Acta*: “The Re-Discovery of the Fast Fourier Transform Algorithm” [3]. The way he tells it, the (hydrogen-bomb) physicist Richard Garwin had a huge part in putting the effort together, for Cold-War purposes:

I was working on a research project of my own when Richard Garwin came to the computing center of the laboratory with some notes he made while with John Tukey at a meeting of President Kennedy’s Scientific Advisory Committee, of which they were both members. John Tukey showed that if \(N\) is a composite, \(N = ab\), then the Fourier series can be expressed as an \(a\)-term series of subseries of \(b\) terms each. If one were computing all values of the series, this would reduce the number of operations from \(N^2\) to \(N(a+b)\). Tukey also said that if this were iterated, the number of operations would be proportional to \(N\log (N)\) instead of \(N^2\). Garwin knew that this was a very important calculation and he wanted to have this idea developed and applied.

Garwin described his problem of determining the periodicities of the spin orientations in a 3-D crystal of He\(^3\). Later, I found out that he was far more interested in improving the ability to do remote seismic monitoring of nuclear explosions since the Russians would not agree to inspections within their borders thereby hindering efforts at obtaining a nuclear test ban treaty. He also saw a need for the capability of long range acoustical detection of submarines. Like many others, I did not see the significance in this improvement and gave the job a little less priority than my own research. However, I was told of Garwin’s reputation and, prodded by his occasional telephone calls (some of them to my manager), I produced a 3-dimensional FFT program. I put some effort into designing the algorithm so as to save storage and addressing by over-writing data and I spent some time working out a 3-dimensional indexing scheme that was combined with the indexing within the algorithm.

Garwin publicized the program at first through his many personal contacts, producing a small but increasing stream of requests for copies of it.

This is not to say that Garwin came up with the idea or wrote the code or paper, but his presence here is indisputable. I am impressed that Garwin was able to pose a dummy physics problem for Cooley to solve that he could then use to detect nuclear tests and track nuclear submarines – but that is the immense power of the FFT.

Cooley’s article has lots of historical tidbits in it, including notes about the computational limits of the day. One detail is the “record” Fourier transform on a dataset with 2048 samples:

Another member of our department, Lee Alsop, who was a geophysicist and adjunct professor at the Lamont Geophysical Laboratory of Columbia University decided to try the new algorithm on a record of 2048 samples of a strain seismograph of the Rat Island earthquake.

Another is an example of a computation that was *still* infeasible, even with the algorithmic speedups. When approached by a colleague with spectrum data to analyze, he recounts:

One extraordinary thing about this was that a single record of data was about 512 000 points and all values of the spectrum were needed. This was beyond the capacity of the high speed memory of existing machines.

He also mentions that the FFT does give a theoretical speedup by a factor of 12,800 in this case. Finally, the collaboration with Tukey was very limited (as was the editing):

Thus, the paper made only one round trip between me and Tukey and our only collaboration was in a few telephone conversations.

Perhaps that explains why it is not very easy to read…

Since then, there have been too many applications of the FFT to count. The algorithm goes so far as to be one of the fundamental operations in gate-based quantum computing. Therefore, it’s probably worth a look.

The idea is to take advantage of symmetries in the complex exponential to “factor” the problem into several smaller problems recursively, yielding standard “divide and conquer” speedups. While the algorithm has been generalized to work for any composite number of datapoints, the presentation is easiest when the number of points is a power of 2, so that is a fundamental assumption I will make throughout this post.

I’m going to assume that readers are familiar, at least at a superficial level, with the Fourier series and its generalization, the Fourier transform. To review: you can approximate any function over an interval of length \(L\) with sines and cosines that complete a whole number of periods in \(L\):

\[ f(x) = \frac{1}{2} a_0 + \sum_{n = 1}^{\infty} a_n \sin \left (\frac{2\pi}{L}nx \right ) + \sum_{n=1}^{\infty} b_n \cos \left (\frac{2\pi}{L}nx \right ).\]

Using the identities \(e^{i\theta} = \cos\theta + i\sin\theta\), \(\sin\theta = \frac{1}{2i}(e^{i\theta} - e^{-i\theta})\), and \(\cos\theta = \frac{1}{2}(e^{i\theta} + e^{-i\theta})\), these terms can be represented as complex exponentials:

\[ \begin{align}
f(x) =& \frac{1}{2} a_0 + \sum_{n = 1}^{\infty} a_n \sin \left (\frac{2\pi}{L}nx \right ) + \sum_{n=1}^{\infty} b_n \cos \left (\frac{2\pi}{L}nx \right ) \\
=& \frac{1}{2} a_0 - \frac{1}{2}\sum_{n = 1}^{\infty} ia_n (e^{2\pi i n x / L} - e^{-2\pi i n x / L}) + \frac{1}{2}\sum_{n=1}^{\infty} b_n (e^{2\pi i n x / L} + e^{-2\pi i n x / L}) \\
=& \frac{1}{2} a_0 e^{2\pi i 0 x / L} + \frac{1}{2}\sum_{n = 1}^{\infty} (b_n - ia_n) e^{2\pi i n x / L} + \frac{1}{2}\sum_{n=1}^{\infty} (b_n + i a_n) e^{-2\pi i n x / L}
\end{align} \]

Notice that if we define \(c_0 = \frac{1}{2} a_0\) and \(c_n = \frac{1}{2} (b_{|n|} - \mbox{sign}(n) i a_{|n|})\) for \(n \neq 0\), we can simplify the above to the final form:

\[ f(x) = \sum_{n = -\infty}^{\infty} c_n e^{2\pi i n x / L}.\]

The coefficients can be found by taking advantage of an identity of the complex exponential: it integrates to zero over any whole number of full periods, i.e. \(\int_0^L e^{2\pi i (m-n) x / L} \, dx = L\) if \(m=n\) and \(0\) if \(m \neq n\). This can be used to isolate \(c_m\):

\[ \int_0^L f(x) e^{-2 \pi i m x / L} dx = \sum_{n = -\infty}^{\infty} c_n \int_0^L e^{2 \pi i n x / L} e^{-2 \pi i m x / L} dx = \sum_{n = -\infty}^{\infty} c_n \int_0^L e^{2 \pi i (n - m) x / L} dx = Lc_m\]

Therefore \[c_m = \frac{1}{L} \int_0^L f(x) e^{-2\pi imx/L} dx.\]

This is the basis of the complex Fourier series. The output of the Fourier transform is the set of \(c_m\)‘s.
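As a quick numerical sanity check of the coefficient formula (a sketch using only base `R`, with an arbitrary choice of \(L = 2\)), we can approximate the integral with a Riemann sum for \(f(x) = \cos(2\pi x/L)\); since \(\cos\theta = \frac{1}{2}(e^{i\theta} + e^{-i\theta})\), we should find \(c_1 = c_{-1} = \frac{1}{2}\) and everything else zero:

```r
## Approximate c_m = (1/L) * integral of f(x) exp(-2 pi i m x / L) over [0, L]
## with a Riemann sum on a fine grid.
L <- 2
n.grid <- 10000
xs <- (0:(n.grid - 1)) * L / n.grid  # grid over [0, L), step dx
dx <- L / n.grid
coefficient <- function(m, f) {
  sum(f(xs) * exp(complex(imaginary = -2 * pi * m * xs / L))) * dx / L
}
f <- function(x) cos(2 * pi * x / L)
Re(coefficient(1, f))    # approximately 0.5
Re(coefficient(-1, f))   # approximately 0.5
Mod(coefficient(2, f))   # approximately 0
```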

In practice, we cannot compute each of the infinite \(c_m\), and it turns out we do not have enough information to do so. In almost all applications, we do not have \(f(x)\), rather we have \(N\) evenly-spaced samples from \(f(x)\) stored in a computer. Unfortunately, this limits how far out in frequency space we can get.

For example, consider what happens when analyzing the coefficient of the frequency \(N\). The complex exponential goes through one full period between each sample and the next, so at every sample point it takes the value 1, the same effect as the frequency \(n=0\). In general, each of the samples will be affected in the same way by the frequency \(n\) as by the frequency \(n + N\), and we get no new information from analyzing beyond \(n = N-1\).
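This aliasing is easy to check numerically; here is a quick sketch in base `R` showing that the sampled exponentials for frequency \(n\) and frequency \(n + N\) are identical:

```r
## At integer sample points x = 0, ..., N-1, frequencies n and n + N coincide,
## because exp(-2 pi i (n + N) x / N) = exp(-2 pi i n x / N) * exp(-2 pi i x),
## and exp(-2 pi i x) = 1 for integer x.
N <- 8
x <- 0:(N - 1)
wave <- function(n) exp(complex(imaginary = -2 * pi * n * x / N))
max(Mod(wave(3) - wave(3 + N)))  # 0, up to floating point error
```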

Therefore, in our output we are only looking for frequencies \(0\) through \(N-1\). If we have \(N\) points, we are looking for \(N\) frequencies, each computed via the discrete integral:

\[ c(n) = \frac{1}{N} \sum_{x = 0}^{N-1} f(x) e^{-2\pi i n x /N}\]

This is effectively computing the integral with a rectangular approximation, \(\frac{1}{L} \int_0^L f(x) e^{-2\pi i nx/L} dx \approx \frac{1}{L} \frac{L}{N} \sum_{x'=0}^{N-1} f(x') e^{-2\pi i n x' / N},\) where \(x = (L/N) x'\).

From here, I’m going to drop the normalizing term \(\frac{1}{N}\) and assume it will be applied at the end. Also, it’s much easier from this point to look at the Fourier transform using a picture. We can imagine the discrete Fourier transform as a circuit taking in the \(N\) samples as input and giving the \(N\) Fourier coefficients as output. This is depicted below, with the inputs \(f(x)\) coming in from the left, and the outputs \(c(n)\) coming out to the right. Notice that each of the \(N\) terms of the output needs to compute a sum of \(N\) terms, and therefore the runtime of this circuit is \(O(N^2)\).

\[

\begin{array}{ll|ccc|llll}

\hline

f(0) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) & = & c(0)\\

f(1) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i x /8} & =& c(1) \\

f(2) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 2 x /8} & =& c(2)\\

f(3) & \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 3 x /8} & =& c(3) \\

f(4) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 4 x /8} & =& c(4)\\

f(5) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 5 x /8} & =& c(5)\\

f(6) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 6 x /8} & =& c(6)\\

f(7) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 7 x /8} & =& c(7)\\ \hline

\end{array}

\]

However, there seems to be redundancy here. Each of these terms is a sum of the \(f(x)\) terms, each with a complex exponential slapped on, but the complex exponential is *periodic*. Take \(c(0)\) and \(c(4)\) for example.

\[ c(0) = \sum_{x = 0}^7 f(x) \]

\[ c(4) = \sum_{x = 0}^7 f(x) e^{-2\pi i 4 x /8} = \sum_{x = 0}^7 f(x) e^{-\pi i x} \]

They are the same sum, except each term of \(c(4)\) has an extra factor of \(e^{-\pi i x}\) slapped on. This factor is equal to 1 for even values of \(x\) and equal to \(-1\) for odd values of \(x\). This behavior is independent of \(n\): if we pair up any Fourier coefficient with the coefficient that is \(N/2\) indices up from it (for example 1 and 5, 2 and 6, and 3 and 7), we will find the same relationship, since

\[ e^{-2\pi i (n + N/2) x / N} = e^{-2\pi i n x / N} e^{-\pi i x} = e^{-2\pi i n x / N} (-1)^x .\]
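A quick sketch in base `R` to confirm this pairing numerically:

```r
## Pairing coefficient n with n + N/2 only flips the sign on the odd samples:
## exp(-2 pi i (n + N/2) x / N) = exp(-2 pi i n x / N) * (-1)^x for integer x.
N <- 8
x <- 0:(N - 1)
n <- 3
lhs <- exp(complex(imaginary = -2 * pi * (n + N/2) * x / N))
rhs <- exp(complex(imaginary = -2 * pi * n * x / N)) * (-1)^x
max(Mod(lhs - rhs))  # 0, up to floating point error
```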

This suggests a strategy: there are really only \(N\) terms to work out between \(c(n)\) and \(c(n + N/2)\) (instead of \(2N\)); we just need to remember to slap a \(-1\) onto the odd terms for \(c(n+N/2)\). This leads us to the next symmetry we can exploit. We can divide each of these sums into two sums of the form “full sum” = “even terms” + “odd terms”. I will do this out for the \(N=8\) case, but you will see how it easily generalizes.

For \(c(n)\) this looks like:

\[\begin{align} c(n) = \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i n x /8} &= \overbrace{\sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n (2x) /8}}^{\mbox{even terms}} + \overbrace{\sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n (2x+1) /8}}^{\mbox{odd terms}} \\

&= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} + \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4} e^{-\pi i n /4}.

\end{align}\]

For \(c(n + N/2)\) this looks like:

\[\begin{align} c(n + N/2) = \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i n x /8}e^{-\pi i x} &= \overbrace{\sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n (2x) /8}}^{\mbox{even terms}} - \overbrace{\sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n (2x+1) /8}}^{\mbox{odd terms}} \\

&= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} - \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4} e^{-\pi i n /4}.

\end{align}\]

The two resulting terms have a remarkable symmetry with the original problem: each is a Fourier transform of the function using only \(N/2\) samples, one using the even-indexed samples, the other using the odd-indexed samples. Defining two new variables for each \(n\) accordingly:

\[ \begin{align}

c_e(n) &= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} \\

c_o(n) &= \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4},

\end{align}\]

we immediately see that

\[ \begin{align} c(n)& = c_e(n) + c_o(n) e^{-\pi i n/4} \quad \mbox{ and} \\

c(n + N/2)& = c_e(n) – c_o(n) e^{-\pi i n/4}.

\end{align}\]
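Before drawing the updated circuit, we can check this recombination numerically against `R`’s built-in `fft` (which computes the unnormalized sums, matching the convention with the \(\frac{1}{N}\) dropped); this is just a sanity-check sketch on random data:

```r
## Verify c(n) = c_e(n) + c_o(n) * w^n and c(n + N/2) = c_e(n) - c_o(n) * w^n,
## where w = exp(-2 pi i / N), using fft on the even- and odd-indexed samples.
set.seed(1)
N <- 8
f <- rnorm(N)
c.full <- fft(f)                     # unnormalized DFT of all N samples
c.even <- fft(f[seq(1, N, by = 2)])  # f(0), f(2), f(4), f(6)
c.odd  <- fft(f[seq(2, N, by = 2)])  # f(1), f(3), f(5), f(7)
n <- 0:(N/2 - 1)
twiddle <- exp(complex(imaginary = -2 * pi * n / N))
max(Mod(c.full[n + 1] - (c.even + c.odd * twiddle)))        # ~0
max(Mod(c.full[n + N/2 + 1] - (c.even - c.odd * twiddle)))  # ~0
```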

This relationship underlies the “butterfly diagram” that the FFT algorithm is known for. Updating our circuit with \(f_e(x) = f(2x)\) and \(f_o(x) = f(2x+1)\):

\[\begin{array}{llll|ccc|llll||l}

\hline

f(0) & = & f_e(0) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) & = & c_e(0) & c(0) = c_e(0) + c_o(0)\\

f(2) & = & f_e(1) & \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i x /4} & =& c_e(1) & c(1) = c_e(1) + c_o(1) e^{-\pi i / 4}\\

f(4) & = & f_e(2) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i 2 x /4} & =& c_e(2) & c(2) = c_e(2) + c_o(2) e^{-2 \pi i / 4}\\

f(6) & = & f_e(3) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i 3 x /4} & =& c_e(3) & c(3) = c_e(3) + c_o(3) e^{-3 \pi i /4} \\ \hline

f(1) & = & f_o(0) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) & =& c_o(0) & c(4) = c_e(0) – c_o(0)\\

f(3) & = & f_o(1)& \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i x /4} & =& c_o(1) & c(5) = c_e(1) - c_o(1)e^{-\pi i / 4} \\

f(5) & = & f_o(2) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i 2 x /4} & =& c_o(2) & c(6) = c_e(2) – c_o(2) e^{-2\pi i /4} \\

f(7) & = & f_o(3) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i 3 x /4} & =& c_o(3) & c(7) = c_e(3) – c_o(3) e^{-3\pi i /4} \\ \hline

\end{array}\]

We can do this process recursively until we hit the trivial case of a Fourier transform with one sample. The number of levels of recursion is \(\log_2 N\), because the number of samples in each subproblem is halved at every level. At each level of the division, we do \(N\) constant-time recombinations of the terms from the previous level. Therefore the asymptotic performance of this algorithm should be \(O(N \log N)\), much faster than \(O(N^2)\).
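Put differently, if \(T(N)\) denotes the cost of a transform on \(N\) samples, the recombination step gives the standard divide-and-conquer recurrence

\[ T(N) = 2\,T(N/2) + O(N) \quad \Longrightarrow \quad T(N) = O(N\log N), \]

since the recursion is \(\log_2 N\) levels deep and each level does \(O(N)\) work.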

Let’s code this up and compare runtimes. Again we are going to assume that our input is always a power of two in length. At level \(l\) back, there are going to be \(2^l\) Fourier transforms to compute, each with \(N/2^l\) samples. Indexing the output is a lot of fun, in the Dwarf Fortress sense, as you can see by the following table:

| Level | x | Indices |
|---|---|---|
| 0 | 0:7 | \(\{x\}\) |
| 1 | 0:3 | \(\{2x\}\), \(\{2x+1\}\) |
| 2 | 0:1 | \(\{2(2x)\}\), \(\{2(2x) + 1\}\), \(\{2(2x + 1)\}\), \(\{2(2x+1) + 1\}\) |

In general, the set of indices at any level of the recursion is going to be \(\{2^l x + n\}\). Each separate Fourier transform has a different value of \(n\), and within each transform \(x\) ranges from \(0\) to \(N/2^l - 1\). I suggest working this out yourself to truly grok it.
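To make the indexing concrete before writing the full implementation, here is a small sketch that enumerates the index sets \(\{2^l x + n\}\) for \(N = 8\); it should reproduce the table above:

```r
## Enumerate the indices {2^l * x + n} at each level of the recursion for N = 8.
## Each value of n labels one sub-transform; x indexes the samples within it.
N <- 8
for (l in 0:2) {
  for (n in 0:(2^l - 1)) {
    x <- 0:(N / 2^l - 1)
    cat("level", l, "n =", n, ": indices", 2^l * x + n, "\n")
  }
}
```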

```r
fast.fourier.transform <- function(f) {
  N <- length(f)
  levels <- log(N, 2)
  prev.level <- f
  level <- complex(real = numeric(length(f)))
  for (l in (levels - 1):0) {
    for (n in 0:(2^l - 1)) {
      half.num.indices <- 2^(levels - l - 1)
      for (x in 0:(half.num.indices - 1)) {
        ## One catch here to be aware of: in the circuit diagram above, the two
        ## separate Fourier transforms have been placed as if their indices are
        ## together, for ease of reading. In the implementation, these indices
        ## are going to be interleaved, i.e. $c_e(0)$ at $i=0$, $c_o(0)$ at $i=1$
        ## and so on. Therefore we vary x up to half of the indices, with evens=2x
        ## and odds=2x+1.
        level[2^l * x + n + 1] <-
          prev.level[2^l * (2*x) + n + 1] +
          prev.level[2^l * (2*x + 1) + n + 1] *
            exp(complex(imaginary = -pi * x / half.num.indices))
        level[2^l * (x + half.num.indices) + n + 1] <-
          prev.level[2^l * (2*x) + n + 1] -
          prev.level[2^l * (2*x + 1) + n + 1] *
            exp(complex(imaginary = -pi * x / half.num.indices))
      }
    }
    prev.level <- level
  }
  level
}

slow.fourier.transform <- function(f) {
  N <- length(f)
  result <- numeric(N)
  for (n in 0:(N - 1)) {
    for (x in 0:(N - 1)) {
      result[n + 1] <- result[n + 1] +
        f[x + 1] * exp(complex(imaginary = -2 * pi * x * n / N))
    }
  }
  result
}
```

Let’s test these two functions. Do they give the correct result? Is the fast Fourier transform really faster? For the first question, I’m going to test the two new functions against very simple Fourier series compositions, and also against `R`’s `fft` function. For the second question, I will test the runtimes with several different length transforms and then compute the relationship between log time and log length. The slow Fourier transform should have a slope of 2; the fast Fourier transform should have a slope that is only slightly higher than 1.

What are 4 simple series to test? How about, for \(x=0..15\):

**Function 1:** \(f(x) = \sin(\frac{2\pi}{16} x)\)

**Function 2:** \(f(x) = \sin(\frac{2\pi}{16} x) - \cos(\frac{2\pi}{16}3 x) + \sin(\frac{2\pi}{16} 6 x)\)

**Function 3:** \(f(x) = \sum_{n = 0}^{15} n \sin(\frac{2 \pi }{16}n x)\)

**Function 4:** \(f(x) = \sum_{n = 0}^{15} \sin(\frac{2\pi}{16} n )\sin(\frac{2 \pi}{16}n x)\)

```r
library(ggplot2)
library(dplyr)
library(tidyr)

plot_theme <- theme(panel.background = element_blank(),
                    panel.grid.major = element_line(color = "grey90", size = 0.2),
                    panel.grid.minor = element_line(color = "grey95", size = 0.5),
                    strip.text.x = element_text(color = "white"),
                    strip.background = element_rect(fill = "black"),
                    axis.text = element_text(color = "black"),
                    axis.title = element_text(face = "bold"),
                    title = element_text(face = "bold"))

## Here are the functions we want to test; they should have very clear Fourier transforms
function.1 <- function(x) sin(2*pi*x/16)
function.2 <- function(x) sin(2*pi*x/16) - cos(2*pi*3*x/16) + sin(2*pi*6*x/16)
function.3 <- function(x) sapply(x, function(s) sum(sapply(0:15, function(n) n * sin(2*pi*n*s/16))))
function.4 <- function(x) sapply(x, function(s) sum(sapply(0:15, function(n) sin(2*pi*n/16) * sin(2*pi*n*s/16))))

tests <- list("Function 1" = function.1,
              "Function 2" = function.2,
              "Function 3" = function.3,
              "Function 4" = function.4)

## for each of these functions, apply every transform and collect the results
test.data <- bind_rows(lapply(1:length(tests), function(i) {
  fun <- tests[[i]]
  name <- names(tests)[[i]]
  x <- 0:15
  bind_rows(
    data.frame(x = x, y = fun(x), func = name, transform = "None"),
    data.frame(x = x, y = abs(fft(fun(x))), func = name, transform = "fft"),
    data.frame(x = x, y = abs(slow.fourier.transform(fun(x))), func = name,
               transform = "slow.fourier.transform"),
    data.frame(x = x, y = abs(fast.fourier.transform(fun(x))), func = name,
               transform = "fast.fourier.transform"))
}))

test.data$transform <- factor(test.data$transform,
                              levels = c("None", "slow.fourier.transform",
                                         "fast.fourier.transform", "fft"))

ggplot(data = test.data, aes(x = x, y = y)) + geom_point() +
  facet_grid(func ~ transform, scale = "free") + theme_bw()
```

As you can see via the above plot, the code that we’ve written agrees with R’s built-in `fft` function, which is good. Also good is that it gives the output that we expect.

You might be wondering why the first row of plots contains two non-zero dots each – wasn’t there just one frequency included in function 1? Remember that to put in this function, we used \(\sin(x) = (e^{ix} - e^{-ix})/2i\); the other point that shows up is the “negative” frequency, which shows up on the other side due to the periodic symmetry of the complex exponential.
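You can see both frequencies directly by asking which bins of the transform are non-zero; here is a quick sketch with `R`’s `fft` (remember that R’s vectors are 1-indexed, so bin \(k+1\) holds frequency \(k\)):

```r
## The DFT of sin(2*pi*x/16) puts mass at frequencies 1 and 15 (= -1 mod 16),
## since sin contributes a positive- and a negative-frequency exponential.
x <- 0:15
coeffs <- Mod(fft(sin(2 * pi * x / 16)))
which(coeffs > 1e-8)  # bins 2 and 16, i.e. frequencies 1 and 15
```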

Now let’s time the functions against several different length inputs and see how they behave as the length of the sequences get large.

```r
sequence.lengths <- sapply(0:13, function(x) 2^x)
timing.test <- data.frame(
  length = sequence.lengths,
  slow = sapply(sequence.lengths, function(l) {
    dat <- 0:(l - 1)  # l samples, so the input length stays a power of two
    system.time({slow.fourier.transform(dat)})[["elapsed"]]}),
  fast = sapply(sequence.lengths, function(l) {
    dat <- 0:(l - 1)
    system.time({fast.fourier.transform(dat)})[["elapsed"]]}),
  fft = sapply(sequence.lengths, function(l) {
    dat <- 0:(l - 1)
    system.time({fft(dat)})[["elapsed"]]}))
timing.test <- gather(timing.test, func, time, slow, fast, fft)
```

Normal plot:

```r
ggplot(data = timing.test, aes(x = length, y = time, color = func)) + geom_point() + theme_bw()
```

Log-log plot:

```r
ggplot(data = timing.test, aes(x = log(length), y = log(time), color = func)) +
  geom_point() + theme_bw() + geom_smooth(method = "lm") + coord_fixed(ratio = 1)
```

```
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
```

Linear models on the log-log data:

```r
slow <- subset(timing.test, func %in% "slow" & time != 0)
fast <- subset(timing.test, func %in% "fast" & time != 0)
fft <- subset(timing.test, func %in% "fft" & time != 0)

lm.slow <- lm(log(time) ~ log(length), data = slow)
lm.slow
```

```
## 
## Call:
## lm(formula = log(time) ~ log(length), data = slow)
## 
## Coefficients:
## (Intercept)  log(length)  
##     -11.889        1.974
```

```r
lm.fast <- lm(log(time) ~ log(length), data = fast)
lm.fast
```

```
## 
## Call:
## lm(formula = log(time) ~ log(length), data = fast)
## 
## Coefficients:
## (Intercept)  log(length)  
##     -10.334        1.132
```

```r
lm.fft <- lm(log(time) ~ log(length), data = fft)
lm.fft
```

```
## 
## Call:
## lm(formula = log(time) ~ log(length), data = fft)
## 
## Coefficients:
## (Intercept)  log(length)  
##      -9.855        1.078
```

Via the coefficient on `log(length)` we see that the slow version has a slope approximately equal to 2, and the fast version has a slope slightly greater than one. I’m especially surprised that my version in plain R seems to perform nearly as fast as R’s native `fft` function, which I had assumed was programmed in optimized C. Thus we have derived a way in which the discrete Fourier transform can be computed in \(O(N\log N)\).

Gilbert Strang gives a really good lecture on the FFT using matrix factorization, which I didn’t cover here (it unfortunately lost me around 37:25). link

[1] Fourier, Joseph (1822). Théorie analytique de la chaleur (in French). Paris: Firmin Didot Père et Fils. OCLC 2688081. link

[2] Cooley, James W.; Tukey, John W. (1965). “An algorithm for the machine calculation of complex Fourier series”. Math. Comput. 19: 297–301. doi:10.2307/2003354. link

[3] Cooley, J. W. (1987). “The re-discovery of the fast Fourier transform algorithm”. Microchimica Acta 93 (1–6): 33–45. link

Machine learning is a buzzword these days. Just dropping it can impress people without requiring that you elaborate, perhaps because it feeds their fantasies of a robot takeover. Investors are particularly vulnerable because they get to act on these fantasies. The following exchange approximates one that actually happened:

“We’ve got a couple cutting edge projects coming down the pipeline: cloud computing and *machine learning*!”

“Wow! I’m terribly excited about all this! Hail our new overlords! Take my money!”

Notice that money changed hands without inquiry into what these two concepts were being used for.

All this means is that, whoever you are, it is especially important to become familiar with the real tools of machine learning and data science, what problems they are meant to solve, and how they solve them.

This is a tutorial on “principal component analysis”, or PCA, a technique used to reduce *redundancy* in your data. In any given dataset, you might have two or more measurements that actually measure the same thing. One example would be a dataset measuring students’ hours spent doing schoolwork and their GPA. These 2 variables could be interpreted as measuring 1 underlying quality of a student: how much of a “good student” they are. We might want to reduce these 2 factors into this “good student” measure because this greatly simplifies statistical analysis. That’s where PCA comes in. PCA rotates your data’s coordinate system to one that aligns with the important questions, i.e. whether a student is a good student or not (GPA + hours worked), or how naturally smart or efficient they are (GPA - hours worked). It also gives a score on how important these factors are.

In this tutorial I am going to set up a situation where there is one underlying phenomenon: the motion of a spring in one direction. I’m going to duplicate and obfuscate this phenomenon by having a fictional group of scientists measure it with several different cameras. Then I’m going to use PCA to isolate the signal.

Let’s say we have a ball attached to a spring. The ball’s motion might be described as \[x(t) = \cos(t)\] around the equilibrium point. I can model this in R.

```r
library(ggplot2)
library(tidyr)
library(dplyr)
theme_set(theme_bw())  # set default theme to black-white

t <- seq(0, 6*pi, by = .1)
x <- cos(t)
qplot(x = t, y = x)
```

The above plot is how the position \(x\) varies with time. Of course, in the real world the ball has 3 spatial coordinates and every measurement has noise. To account for these, I’m going to add \(y\) and \(z\) coordinates, which are both 0, and I’m also going to add noise in every direction.

```r
positions <- data.frame(t = seq(0, 6*pi, by = .1),
                        x = x + rnorm(length(x), sd = .1),
                        y = rnorm(length(x), sd = .1),
                        z = rnorm(length(x), sd = .1))
ggplot(data = gather(positions, variable, value, x:z), aes(x = t, y = value)) +
  geom_point() + facet_wrap(~ variable, nrow = 1)
```

Now you can see 3 relationships: one of mostly signal, two of noise.

Now let’s say a team of scientists have set up 3 cameras to observe this process. They do not have privileged access to the coordinate system that we know of, where the \(x\) coordinate contains all the signal. Rather, for each camera they get two coordinates: the \(x\) and \(y\) of the ball’s location in the camera’s image plane. For simplicity’s sake, I’m going to assume each camera is pointed directly at the midpoint of the ball’s motion. I’m going to assert, and you can prove to yourself, that a camera’s image plane coordinates can be expressed by rotating and scaling the original coordinate system, and then projecting the data onto two of the resulting dimensions.

One side note: In the process of writing this post, I discovered that generating a random 3D-rotation is actually super annoying. I spent a little bit of time trying to tie the concept down and this is what I’ve come up with: a rotation can be uniquely specified by an axis and an angle to rotate around that axis. You can read about all that here and here, or you can just steal my code.

```r
## random 3d rotation matrix
random.rotation.matrix <- function(seed) {
  set.seed(seed)
  ## angle to rotate around the axis
  psi <- runif(1, max = 2*pi)
  ## select a random point on S^2 (azimuth uniform, cosine of polar angle uniform)
  phi <- runif(1, max = 2*pi)
  theta <- acos(2 * runif(1) - 1)
  ## construct axis from random S^2 point
  axis <- c(sin(theta)*cos(phi), sin(theta)*sin(phi), cos(theta))
  ## cross product matrix for formula
  axis.cp <- matrix(c(0, -axis[3], axis[2],
                      axis[3], 0, -axis[1],
                      -axis[2], axis[1], 0),
                    nrow = 3, byrow = TRUE)
  ## create rotation matrix using the axis-angle (Rodrigues) formula from wikipedia
  R <- cos(psi) * diag(c(1, 1, 1)) + sin(psi) * axis.cp + (1 - cos(psi)) * outer(axis, axis)
  R
}
```

I’ve tested the above function and it does seem to create random rotations. So let’s go ahead and find the new coordinates!

```r
new.camera.data <- function(positions, seed) {
  set.seed(seed)
  ## original data
  original.coordinates <- t(as.matrix(select(positions, x, y, z)))
  ## get scale
  scale <- 1/rexp(1, 1/3)
  ## get rotation
  rotation <- random.rotation.matrix(seed)
  ## new points
  new.points <- t(scale * rotation %*% original.coordinates)
  ## project onto the image plane
  image.plane.projection <- new.points[, 1:2]
  list(scale = scale, rotation = rotation, data = image.plane.projection)
}

camera.1 <- new.camera.data(positions, 1)
camera.2 <- new.camera.data(positions, 2)
camera.3 <- new.camera.data(positions, 3)

camera.data = data.frame(t = seq(0, 6*pi, by = .1),
                         x1 = camera.1$data[,1], y1 = camera.1$data[,2],
                         x2 = camera.2$data[,1], y2 = camera.2$data[,2],
                         x3 = camera.3$data[,1], y3 = camera.3$data[,2])

ggplot(data = gather(camera.data, variable, value, x1:y3), aes(x = t, y = value)) +
  geom_point() + facet_wrap(~ variable, nrow = 2)
```

It now looks like there are several signals here, but the scientists are suspicious because the signals seem very correlated. How will they isolate the signal?

PCA is based on the hypothesis that signals are things that cause your data to vary. Therefore, directions that possess large signals will also have data that varies in that direction. This leads us to the fundamental assumption of PCA:

**Fundamental Assumption of PCA:** The most important directions are the directions in which the data *varies most*.

Note that this implicitly assumes that the signal-to-noise ratio \[\frac{\sigma_{signal}^2}{\sigma_{noise}^2} > 1.\] Only then will the signal directions be expected to be the directions of greatest variation. In many real-world applications, for example stock price forecasting, the noise overwhelms the signal. It would be an interesting exercise to see how this simple example breaks down as you add noise.

Thus, PCA becomes an optimization problem. Let the matrix of data, with each column translated so that it has mean 0 (\(a \to a - \mbox{mean}(a)\)), be denoted \(\mathbf X\). You will soon see why this translation is important. Since translation does not affect variance, we are trying to find some unit vector \(\mathbf u\) such that the variance of \(\mathbf X \mathbf u\) is maximized:

\[ \mbox{maximize } \mbox{Var}(\mathbf X \mathbf u) \quad \mbox{subject to} \quad \|\mathbf u\| = 1,\]

repeatedly, requiring \(\mathbf u_i \cdot \mathbf u_j = 0\) for \(i \neq j\), until our directions span the space of the data.

To solve this, let’s talk about variance for a bit. Consider two sets of measurements \[ \mathbf a = \{a_1, a_2, \dots, a_n\} \quad \mbox{and} \quad \mathbf b = \{b_1, b_2, \dots, b_n\} \] with means of \(\mu_a\) and \(\mu_b\), respectively. The variance and covariance are defined as average squared differences from the mean:

\[ \mbox{Var}(\mathbf a) = \sigma_{\mathbf a}^2 = \frac{1}{n} \sum_i (a_i - \mu_a)^2 = \frac{1}{n} \sum_i a_i^2 = \frac{1}{n} \mathbf a \cdot \mathbf a \quad \mbox{if } \mu_a = 0,\]

\[ \mbox{Cov}(\mathbf a, \mathbf b) = \sigma_{\mathbf{ab}}^2 = \frac{1}{n} \sum_i (a_i - \mu_a)(b_i - \mu_b) = \frac{1}{n} \sum_i a_i b_i = \frac{1}{n} \mathbf a \cdot \mathbf b \quad \mbox{if } \mu_a = 0 \mbox{ and } \mu_b = 0.\]

So the variance is equal to the covariance of a variable with itself, \(\mbox{Var}(\mathbf{a}) = \mbox{Cov}(\mathbf a, \mathbf a)\). Since all of \(\mathbf X\)’s columns \(\mathbf x_i\), are mean zero, we can define a matrix \(\mathbf C\) where the \([i,j]\)th member is the covariance of \(\mathbf x_i\) and \(\mathbf x_j\).

\[ \frac{1}{n} \mathbf X^T \mathbf X = \begin{bmatrix} \frac{1}{n} \mathbf x_1 \cdot \mathbf x_1 & \cdots & \frac{1}{n} \mathbf x_1 \cdot \mathbf x_r \\ \vdots & \ddots & \vdots \\ \frac{1}{n} \mathbf x_r \cdot \mathbf x_1 & \cdots & \frac{1}{n} \mathbf x_r \cdot \mathbf x_r \end{bmatrix} = \mathbf C.\]
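As a sanity check on this identity, here is a tiny Python sketch with made-up numbers: centering two toy columns and forming the dot products \(\frac{1}{n} \mathbf x_i \cdot \mathbf x_j\) reproduces the pairwise variances and covariances.

```python
# Verify that entry [i][j] of (1/n) X^T X is Cov(x_i, x_j) for centered columns.
cols = [[2.0, 4.0, 6.0, 8.0],   # toy column x1
        [1.0, 3.0, 2.0, 6.0]]   # toy column x2
n = len(cols[0])

# center each column to mean zero
centered = [[v - sum(c) / n for v in c] for c in cols]

def cov(a, b):
    # population covariance of two mean-zero columns = (1/n) a . b
    return sum(x * y for x, y in zip(a, b)) / n

C = [[cov(a, b) for b in centered] for a in centered]
print(C)  # [[Var(x1), Cov(x1,x2)], [Cov(x2,x1), Var(x2)]]
```

The diagonal entries are the variances, and the matrix comes out symmetric, just as in the display above.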

Let’s do this now with our camera measurements in R.

```r
options(width = 120)
measurements <- select(camera.data, -t)
measurements <- mutate_each(measurements, funs(. - mean(.)))
measurements <- as.matrix(measurements)
covariance.matrix <- t(measurements) %*% measurements / nrow(measurements)
covariance.matrix
```

```
##               x1           y1            x2           y2            x3           y3
## x1  0.0021234862 -0.005043854 -0.0002620748 -0.002255248 -0.0005278387 -0.002325504
## y1 -0.0050438535  0.092671234  0.0166668085  0.034679534  0.0217014469  0.034022588
## x2 -0.0002620748  0.016666808  0.0032414307  0.006126479  0.0041287510  0.005953620
## y2 -0.0022552479  0.034679534  0.0061264792  0.013067542  0.0079891875  0.012826169
## x3 -0.0005278387  0.021701447  0.0041287510  0.007989187  0.0053130622  0.007799908
## y3 -0.0023255041  0.034022588  0.0059536201  0.012826169  0.0077999083  0.012612169
```

Let’s also reframe our optimization problem in terms of this nice, square, symmetric matrix:

\[ \mbox{Var}(\mathbf{X} \mathbf u) = \frac{1}{n}(\mathbf{X} \mathbf u)^T (\mathbf{X} \mathbf u) = \mathbf u^T (\frac{1}{n} \mathbf{X}^T \mathbf{X}) \mathbf u = \mathbf u^T \mathbf{C} \mathbf u. \]

Therefore we are trying to find a \(\mathbf u\) that maximizes the matrix product with the covariance matrix, \(\mathbf u^T \mathbf C \mathbf u\), such that \(\mathbf u\) is a unit vector and perpendicular to the other solutions we have found. Remember that \(\mathbf C\) is a square, symmetric matrix, and its set of unit eigenvectors \(\{\mathbf v_1, \mathbf v_2,\dots, \mathbf v_r\}\) are orthogonal, i.e. \(\mathbf v_i \cdot \mathbf v_j = 0\) for \(i \neq j\), and span its column and row spaces. Therefore we can express \(\mathbf u\) as some combination of the orthogonal unit eigenvectors of \(\mathbf C\):

\[\mathbf u = a_1 \mathbf v_1 + a_2 \mathbf v_2 + \dots + a_r \mathbf v_r,\]

where \(\sum a_i^2 = 1\) to make the result length 1. What happens when we make this substitution into the quantity we are trying to optimize?

\[\begin{align} \mathbf u^T \mathbf C \mathbf u &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T)\, \mathbf C\, (a_1 \mathbf v_1 + a_2 \mathbf v_2 + \dots + a_r \mathbf v_r) \\ &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T) (a_1 \mathbf C \mathbf v_1 + a_2 \mathbf C \mathbf v_2 + \dots + a_r \mathbf C \mathbf v_r) \\ &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T) (a_1 \lambda_1 \mathbf v_1 + a_2 \lambda_2 \mathbf v_2 + \dots + a_r \lambda_r \mathbf v_r) \\ &= a_1^2 \lambda_1 + a_2^2 \lambda_2 + \dots + a_r^2 \lambda_r \quad \mbox{because the } \mathbf v \mbox{'s are orthogonal.}\end{align}\]

This is very interesting. Letting \[\{\alpha_1, \alpha_2, \dots, \alpha_r \} = \{a_1^2, a_2^2, \dots, a_r^2\},\] we are trying to solve an optimization of the form \[\begin{align} \mbox{Maximize} \quad & \alpha_1 \lambda_1 + \alpha_2 \lambda_2 + \dots + \alpha_r \lambda_r \\ \mbox{Subject to} \quad & \sum_i \alpha_i = 1 \\ & \mbox{all } \alpha_i \geq 0. \end{align}\]

This is a linear programming problem, where all solutions will be on the *corners* of the space. On the corners, the inequalities are tight, so the solutions are where \(r-1\) \(\alpha\)’s are 0, and exactly one \(\alpha_i = 1\), so \(a_i = \sqrt{\alpha_i} = 1\).

So \(\mathbf u_i = \mathbf v_i\), where the \(i\)th most important direction has the \(i\)th largest eigenvalue. The solutions are the eigenvectors of the covariance matrix. Since the eigenvectors are orthogonal, our new coordinate system is orthogonal. Pretty cool result, if you ask me.
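A minimal Python illustration of this result (toy data of my own, with the 2x2 eigenproblem solved analytically): projecting the data onto the top eigenvector yields a variance equal to its eigenvalue, and no other direction does better.

```python
import math

# toy mean-centered data: (x, y) pairs
data = [(-3.0, -2.0), (-1.0, 0.0), (1.0, -1.0), (3.0, 3.0)]
n = len(data)
cxx = sum(x * x for x, _ in data) / n
cyy = sum(y * y for _, y in data) / n
cxy = sum(x * y for x, y in data) / n

# leading eigenpair of the 2x2 covariance matrix, analytically
lam = ((cxx + cyy) + math.sqrt((cxx - cyy) ** 2 + 4 * cxy * cxy)) / 2
vx, vy = cxy, lam - cxx          # satisfies (C - lam I) v = 0
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

def proj_var(ux, uy):
    # variance of the data projected onto unit vector (ux, uy)
    return sum((x * ux + y * uy) ** 2 for x, y in data) / n

# Var(X v) equals the top eigenvalue...
print(proj_var(vx, vy), lam)
# ...and sweeping over all directions finds nothing better (Rayleigh quotient)
best_other = max(proj_var(math.cos(t / 100), math.sin(t / 100)) for t in range(628))
print(best_other <= lam + 1e-9)
```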

Some people use the singular value decomposition (SVD) to get the eigenvectors and eigenvalues. If there is a deeper reason to use SVD, feel free to email me! I am curious. Here, I’m just going to use the R function `eigen` to grab the eigenvectors.

First I’m going to look at the eigenvalues. I expect to see 1 large eigenvalue, and 5 much smaller values. What I’m going to do then is get the eigenvectors, project the data into that basis, and plot the results.

```r
important.directions <- eigen(covariance.matrix)
important.directions$values
```

```
## [1] 1.265056e-01 2.491410e-03 3.188713e-05 1.083847e-17 3.857996e-18 -2.696033e-18
```

The first eigenvalue is roughly 50 times bigger than the second, which is roughly 100 times bigger than the next, and the remaining three are numerically zero. This is exactly what I hoped to see.

```r
projected <- as.data.frame(measurements %*% important.directions$vectors)
colnames(projected) <- paste0("component", 1:6)
projected$t <- camera.data$t
plot.data <- gather(projected, component, value, component1:component6)
ggplot(data = plot.data, aes(x = t, y = value)) +
    geom_point() +
    facet_wrap(~component)
```

We’ve not only isolated the signal, we’ve isolated the noise as well! PCA has saved the scientists from a complicated analysis of 6 signals and given them only one to work with. Problem solved.

“Bob’s foot is made of solid lead,” the Alien commented.

“It better be when he has to slow down.” Alice took a moment to consider the significance of this event. “Reality is so strange. Bob is speeding past us right now. According to special relativity, time should be passing slower for him. However, special relativity also says that he has every right to say that he is stationary and we are moving, and therefore time is passing slower for us. Who is right?”

“Both.” The Alien blinked.

Alice continued. “One of us could start accelerating towards the other to meet up and find out which one was aging faster, but whoever does the acceleration will find that *they* had aged as if they were the ones moving the whole time! It’s almost like acceleration breaks the symmetry.”

The Alien squinted his big black eyes. “Well, I guess.”

“Don’t you think that’s weird? That’s how the *real world* works!”

The Alien shrugged, looking back at his console. “Not really.”

Alice sighed. She wasn’t particularly surprised that the Alien wasn’t playing along. He had a habit of shooting down her conversation topics.

She went on. “I guess the core of the paradox is that we both have an equal right to say that we are the more aged person while we are both moving at constant speed. Special relativity says that we are both right, but how could that be possible?”

The Alien turned. “What do you mean *when* we arrive at the station?”

“I mean the moment in time when our ship’s coordinates become coincident with the station’s coordinates,” said Alice, frustrated.

“What do you mean *moment in time*?” asked the Alien, with a grin. “I’m going to go re-hydrate some noodles.” He clapped his hands excitedly and went to go pull out some noodles from the freezer.

Now the Alien was just being obtuse. Alice was determined to get through his philosophical meanderings.

“*I mean* the slice of all points in space-time where every synced clock traveling parallel to us at the same speed would read the same number as our clock.”

“But-” the Alien began, but Alice, anticipating his question and being a gifted experimentalist, interrupted him. “You could easily sync clocks using a laser. All you need to do is fire a pulse at the clock. When the reflection comes back, you know that the clock is exactly half of the elapsed lightseconds away and it has advanced by exactly half the number of seconds the total journey took. Add that number to the reflected value on the clock and then you are synced.”

Alice drew a diagram.

“Here’s a chart with time on the y-axis and distance in line with the direction to the clock on the x-axis. I’ve scaled the axes so that light moves at 45 degree angles. Here you can see how the mechanism works. The red line is the laser pulse. All the spacetime points on the horizontal dotted line make up the moment in time, the *present*, when we hit the station, because any laser pulse reflected off a clock at any of those points makes a 45-45-90 triangle, whose altitude from the right angle bisects the base. Do you get it?”

The Alien nodded.

Alice continued. “In fact, we don’t even need a clock. We know that when we receive the reflection of the laser, whatever it bounced off of must have been simultaneous with whatever was happening on our ship exactly halfway through the round trip. We could even do this with Bob. I know his speed relative to me is 3/5 light speed. In 30 months, when we arrive at the station, he will be 18 light months away. I can send out a laser pulse in 12 months that will hit him exactly at (30 months, 18 lightmonths), which is when I arrive at the station. It will return to me. What will the reflection read out? Well, I remember a special trick for measuring straight-line clock-time from special relativity class: \(\tau = \sqrt{t^2 - x^2}\), where \(x\) is in light-units. So his clock reads \(\sqrt{30^2 - 18^2} = 24\) months greater than the time when we crossed paths, confirming that he has aged less than the 30 months we will have aged!”

“But how does that make any sense?” asked Alice, still stumped. “Perhaps Bob’s reference frame is in a different universe. Maybe we are now split into *multiverses*.”

“Hold on a second,” pleaded the Alien. “Bob can do the same to us. He can wait 12 months, which due to time dilation is 15 months for us, and send a laser pulse towards us. When he receives the reflection, he knows he can go back half the round-trip time to the exact moment on his timeline where the laser hit us. And it seems the laser hits us at… 24 months, which is perfectly symmetrical, but…” He edited the diagram.

“Bob knows that the laser pulse hit you when your clock had elapsed 24 months, and that *must* be at the same time as when his clock read 30 months. However, these two events do not happen at the same time in our reference frame. There is no paradox; Bob’s present is simply *tilted* relative to ours. That’s what causes the illusion of contradiction. At the event at which you have accused Bob of being younger than us, to him you are not even close to reaching the station,” the Alien said, translating Bob’s present line down.

The Alien plopped a bowl of noodles in front of Alice. She was shocked. “So we are both right, in our own reference frames, and there isn’t any contradiction because the present is relative?” she asked. She paused for a moment, tired of thinking. “Thanks for the soup.”

“You’re welcome.” said the Alien. They sipped in silence.

All diagrams made using the diagrams library by Brent Yorgey.

*Note: This post closely follows the approach of Tim Maudlin in his book: Philosophy of Physics: Space and Time (Princeton Foundations of Contemporary Philosophy). It is an excellent book. Maudlin is concrete where many philosophers are abstract and fluffy. He also understands physics better than many physicists. The great Richard Feynman gave an incorrect, acceleration symmetry-breaking explanation of this phenomenon.*

How do we predict how RNA folds? First of all, we need to define what we mean by folding. RNA is a polymer made up of nucleic acids: adenine, guanine, cytosine, and uracil. There are 3 levels of RNA structure that are traditionally considered:

- **primary structure:** the sequence of bases ‘A’, ‘G’, ‘C’, and ‘U’ that form the strand of RNA, ‘AAAAGGGGCCCCUUUU’ for example,
- **secondary structure:** which bases are hydrogen bonded together, ‘A’ to ‘U’ and ‘G’ to ‘C’, which can be specified by a sequence of pairs corresponding to which indices are paired together, (1,24), (2,23), …, and
- **tertiary structure:** how the molecule bends on itself on a larger scale.

Usually, primary structure is already known or given as an input and tertiary structure is hard to specify and compute the energy for. Secondary structure is easy to specify using a sequence of pairs and the energy is roughly dominated by hydrogen bonding between bases which is discrete. Since RNA functions mostly by binding to things using exposed (non-paired) bases, the secondary structure is also the most relevant when discerning function. For these reasons, biophysicists usually focus on predicting the secondary structure of RNA.

From the laws of thermal physics, if RNA is modeled as a system in thermal equilibrium, then the probability of any state \(s\) with free energy \(E(s)\) is equal to \[P(s) = \frac{e^{-E(s)/k_B T}}{\sum_{s' \in \Omega} e^{-E(s')/k_B T}},\]

where \(\Omega\) is the full set of states. The denominator is called the *partition function*, usually denoted \(Z\); it normalizes the probability distribution, and with it we can compute the probability of any structure. How many terms are in the partition function? How many states are in \(\Omega\)? It depends on what assumptions you make. In full generality, even allowing only Watson-Crick base pairs, the number of terms in that sum grows exponentially with the length of the strand. However, if you assume that there are no *crossing* pairs, i.e. arcs that would intersect in the above diagram, or pairs \((i, j)\) and \((k, l)\) such that \(i < k\), \(k < j\), and \(j < l\), there are far fewer terms in the sum. This is still an exponential quantity, but it turns out it is actually computable in sub-exponential time via dynamic programming. This assumption is called the *no-pseudoknot assumption*, and it means that if I have an \((i, j)\) pair, then for all bases in between them, I can ignore all the bases outside of them when enumerating structures. This raises the question of whether it is a physically valid assumption. The answer is NO. There are many RNAs found in nature that have “crossing” pairs, such as Group I and Group II introns. However, without this assumption, computing the partition function is intractable, so most RNA folding software packages make it.
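To get a feel for how fast this restricted state space still grows, here is a short Python sketch (my own illustration, ignoring base identities and minimum loop sizes) that counts non-crossing structures with the same "base \(i\) unpaired, or paired to some \(k\)" recursion the partition function will use; the counts are the Motzkin numbers.

```python
from functools import lru_cache

def count_structures(n):
    """Count non-crossing secondary structures on n bases,
    allowing any base to pair with any other (no pseudoknots)."""
    @lru_cache(maxsize=None)
    def N(i, j):
        if i >= j:               # empty or single-base stretch: one structure
            return 1
        total = N(i + 1, j)      # base i unpaired
        for k in range(i + 1, j + 1):   # base i paired with base k
            total += N(i + 1, k - 1) * N(k + 1, j)
        return total
    return N(0, n - 1)

print([count_structures(n) for n in range(1, 7)])
# -> [1, 2, 4, 9, 21, 51], the Motzkin numbers: exponential growth,
#    yet the recursion itself fills only O(n^2) cells
```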

While it is tempting to do analysis on a structure with the minimum free energy and highest probability, RNA is a thermal system with an exponential multiplicity of states: even the probability of the most probable state is extremely low, and any quantity computed on that single state makes a negligible contribution to the physical expected value of that quantity. However, I will show here that not only can we compute the partition function, we can also sample from the Boltzmann distribution. Once we are able to do that, we can compute arbitrary expected values over the full Boltzmann distribution via Monte Carlo integration. (In fact, Monte Carlo integration is not even needed if you can express your quantity in terms of partition function values, which can give you full expected values without statistical error! But that is outside the scope of this post.)

Putting this all together, to predict the folding and general properties of RNA we need 4 things:

- a specification of a state \(s\) of RNA,
- an energy function \(E(s)\), which takes a state and outputs a physical energy,
- the partition function \(Z\), the sum of the Boltzmann factors over every possible state, and
- an algorithm for sampling structures from the Boltzmann distribution \(P(s)\).

**Quick Code Note**

All code going forward will be in Haskell. You should be able to extract all the code here into a file and it should compile. To set up our environment, make sure to have this header at the top of the file:

```haskell
{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Array
import Data.Time.Clock.POSIX
import Data.Colour
import System.Random
import Data.List
import Diagrams.Prelude
import Diagrams.Backend.SVG.CmdLine
```

**States of RNA**

Because of the no-pseudoknot assumption, we can actually specify the states using a recursive data structure.

```haskell
data Structure = Paired Char Char (Structure, Structure)
               | Slip Char Structure
               | None
               deriving Show
```

The thought behind this definition is that we are specifying the structure base by base. If base \(i\) is unpaired, then we move on to the next base. If base \(i\) is paired, then we fork the structure into the bases *underneath* the pair and the bases *after* the pair. We carry along the letter of the base for convenience, to compute the energies of any pairing that might happen.

**Energy Model**

A fully realistic energy model of RNA would be very difficult to compute. There are exposed ions everywhere on the strand which interact with each other, and there is tension in the backbone. Many of these interactions are non-linear, which breaks the recursion and therefore the computability of the problem. In practice, there are several levels of complexity that are implemented which try to account for the non-linearities in a computable fashion (this actually is the only difference between professional RNA secondary structure prediction and an Algorithms homework problem). Here I will implement a very simple energy model: adding the free energies of each hydrogen bond. A hydrogen bond contributes about \(-2.3\, k_B T\). Adenine-uracil pairs have 2 hydrogen bonds, guanine-cytosine pairs have 3. Therefore, in the exponent of the Boltzmann factor, I will put -4.6 for A-U bonds and -6.9 for G-C bonds (in units of \(k_B T\)). Anything else is represented by infinite energy, meaning that it will never happen.

```haskell
baseEnergy :: Char -> Char -> Double
baseEnergy a b = case (a, b) of
  ('A', 'U') -> -4.6
  ('G', 'C') -> -6.9
  ('U', 'A') -> -4.6
  ('C', 'G') -> -6.9
  _          -> 1.0/0.0 -- non-allowed pairs treated like an infinite hill

energyModel :: Structure -> Double
energyModel (Paired a b (s1, s2)) =
  (baseEnergy a b) + -- energy of pair
  (energyModel s1) + -- energy of substructure under pair
  (energyModel s2)   -- energy of substructure after pair
energyModel (Slip _ s) = energyModel s
energyModel None = 0
```

**Partition Function**

The sum initially seems daunting. However, from the recursive definition of our energy model, we can easily see that the partition function can be defined recursively as well. To be precise, if we define \(Z(i, j)\) as the sub-partition-function from base \(i\) to base \(j\), it must be equal to the sum of the Boltzmann terms of all structures where \(i\) is paired and all structures where \(i\) is unpaired: \[Z(i, j) = Z(i+1, j) + \sum_{k=i+1}^{j} e^{-E(i, k)}\, Z(i+1, k-1)\, Z(k+1, j),\]

where \(E(i, k)\) is the pairing energy and any sub-partition-function on an empty or single-base stretch equals 1. This becomes a standard dynamic programming problem. We can turn \(Z\) into a table of values with \(O(n^2)\) cells. If cell \((0, 0)\) is on the bottom left and \((n-1, n-1)\) on the top right, each cell of the table only depends on the entries below and to the left. We can compute the full partition function by starting from the bottom left and working out to the top right in \(O(n^3)\) time. This is even parallelizable along the diagonals (I’m not going to parallelize it here).

```haskell
type PFunArray = Array (Int, Int) Double

boltz :: Double -> Double
boltz x = exp (-x)

partitionFunction :: String -> PFunArray
partitionFunction strand = arr
  where
    len = length strand
    arr = array ((0,0), (len - 1, len - 1))
                [((i,j), (pfCell i j strand arr)) | i <- [0..len-1], j <- [i..len-1]]

pfCell :: Int -> Int -> String -> PFunArray -> Double
pfCell i j strand arr = if (i == j) then 1 else (sum pairTerms) + slipTerm
  where
    pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                   (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                   (if j-k > 1 then arr ! (k+1, j) else 1)
    pairTerms = [pairTerm i k | k <- [(i+1)..j]]
    slipTerm = if i == j then 1 else arr ! (i+1, j)
```
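The same recursion is easy to cross-check in Python (my own transcription of the logic, not the post's code, using the hydrogen-bond energies from the energy model above in \(k_B T\) units, with memoized recursion in place of an explicit table):

```python
import math
from functools import lru_cache

def base_energy(a, b):
    # hydrogen-bond free energies in kT units; disallowed pairs get +infinity
    pairs = {('A', 'U'): -4.6, ('U', 'A'): -4.6,
             ('G', 'C'): -6.9, ('C', 'G'): -6.9}
    return pairs.get((a, b), float('inf'))

def partition_function(strand):
    @lru_cache(maxsize=None)
    def Z(i, j):
        if i >= j:
            return 1.0            # empty or single-base stretch
        total = Z(i + 1, j)       # base i unpaired
        for k in range(i + 1, j + 1):   # base i paired with base k
            e = base_energy(strand[i], strand[k])
            if e != float('inf'):
                total += math.exp(-e) * Z(i + 1, k - 1) * Z(k + 1, j)
        return total
    return Z(0, len(strand) - 1)

print(partition_function('AAAA'))   # no pairs possible -> 1.0
print(partition_function('AU'))     # 1 + e^{4.6}
```

For a strand with no allowed pairs the sum collapses to the single empty structure, and for `'AU'` it is \(1 + e^{4.6}\): one unpaired structure plus one paired structure.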

**Sampling RNA Structures**

To actually predict the structure of RNA molecules in nature, we can work backward through the partition function table to sample structures from the Boltzmann distribution. Consider the stretch of the strand from base \(i\) to base \(j\):

- the probability that \(i\) is unpaired is equal to the sum of the probabilities of every structure along \(i\) to \(j\) with \(i\) unpaired, or \(Z(i+1, j)\, /\, Z(i, j)\),
- the probability that \(i\) is paired to base \(k\) is equal to the sum of the probabilities of every structure along \(i\) to \(j\) where \(i\) and \(k\) are paired, or \(e^{-E(i, k)}\, Z(i+1, k-1)\, Z(k+1, j)\, /\, Z(i, j)\), and
- completing the rest of the structure can be achieved by calling the sampling function recursively on every stretch of the strand left unspecified: \(i+1\) to \(j\) for the unpaired case, and \(i+1\) to \(k-1\) and \(k+1\) to \(j\) for the paired case.

I encode this logic in Haskell below. At each level I roll a random number from 0 to \(Z(i, j)\) and sort through the cases. When I pass cumulative probability greater than the number I’ve rolled, I choose that case and recurse.

```haskell
sampleStructure :: Int -> Int -> PFunArray -> String -> StdGen -> Structure
sampleStructure i j arr strand gen = if (i > j) then None else struct
  where
    -- setup pair terms
    pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                   (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                   (if j-k > 1 then arr ! (k+1, j) else 1)
    pairTerms = [pairTerm i k | k <- [(i+1)..j]]
    -- structure functions
    slipStructure genr = Slip (strand !! i) (sampleStructure (i+1) j arr strand genr)
    innerPairStructure k genr = if k - i > 1
                                then sampleStructure (i+1) (k-1) arr strand genr
                                else None
    outerPairStructure k genr = if j - k > 0
                                then sampleStructure (k+1) j arr strand genr
                                else None
    pairStructure k genr = Paired (strand !! i) (strand !! k)
                                  (innerPairStructure k genr, outerPairStructure k genr)
    -- (cumulative probability, structure) pairs
    slipCase = (0, slipStructure) -- starting at offset of 0
    pairCases = [(sum (take (k-i-1) pairTerms) + arr ! (i+1, j), pairStructure k) | k <- [(i+1)..j]]
    allCases = slipCase:pairCases
    -- sample a random number in [0, Z(i,j)]
    (roll, newgen) = randomR (0, arr ! (i, j)) gen
    -- if roll >= cumulative probability, switch candidates; else keep the current one
    checkCase current candidate = if roll >= (fst candidate)
                                  then (snd candidate) newgen
                                  else current
    struct = foldl' checkCase None allCases
```

**Drawing Structures**

One last thing to do is to draw these structures, which I do using Brent Yorgey’s cool diagrams library. Every base is a circle with a letter underneath, color-coded by base, with the backbone and arcs denoting bonds. This depicts RNA with the bonds fanned out and visible, whereas other depictions of RNA try to keep the length of each bond roughly constant, which is a more physical picture.

```haskell
colorMap a = case a of
  'G' -> sRGB24read "#107896"
  'C' -> sRGB24read "#829356"
  'U' -> sRGB24read "#F26D21"
  'A' -> sRGB24read "#C02F1D"
  _   -> white

base :: Char -> Diagram B
base b = circle 1 # fc (colorMap b)
         <> text [b] # fontSize (local 2) # (translateY (-2.6))

backbone = hrule 1

isNone :: Structure -> Bool
isNone s = case s of
  None -> True
  _    -> False

structLength :: Structure -> Int
structLength None = 0
structLength (Slip _ s) = 1 + structLength s
structLength (Paired _ _ (s1, s2)) = 2 + structLength s1 + structLength s2

renderStructure :: Structure -> Diagram B
renderStructure (Paired a b (s1, s2)) =
    base a <>
    translateX 1.5 backbone <>
    translateX 3 innerElement <>
    translateX (3.0*k') (base b) <>
    translateX (3.0*k'+3) outerElement <>
    arc xDir (0.5 @@ turn) # scale (1.5*k') # translateX (1.5*k')
  where
    k = structLength s1 + 1
    k' = fromIntegral k
    innerElement = if isNone s1 then mempty else renderStructure s1 ||| backbone
    outerElement = if isNone s2 then mempty else translateX (-1.5) backbone <> renderStructure s2
renderStructure (Slip c s) =
    if isNone s
    then base c
    else base c <> translateX 1.5 backbone <> translateX 3 (renderStructure s)
renderStructure None = mempty
```

**Example Output**

Let’s try it out:

```haskell
main = do
  let gen = mkStdGen 42
  let rna = "GAAAAAAGGGGAAACCAAAGCCCAAUUUGCUUUUAAAAGGCCAA"
  let pf = partitionFunction rna
  let struct = sampleStructure 0 (length rna - 1) pf rna gen
  let diagram = renderStructure struct
  mainWith diagram
```

In shell:

```
ghc RNAfolding.hs
./RNAfolding -o RNA.svg -w 600 -h 300
```

Cool!

**Further Reading**

There is a long history of work on this problem going back to the 1980s, even before John McCaskill first formulated a partition function approach to RNA secondary structure prediction. Below are a couple of significant papers in this line, or at least those that are known to me. I’ll also put my undergraduate thesis here, where I wrote extensively on a slightly more complicated version of this problem.

Partition Functions:

McCaskill, J.S., 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6‐7), pp.1105-1119. pdf

Stochastic Sampling:

Ding, Y. and Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic acids research, 31(24), pp.7280-7301. pdf

More complicated energy models:

Mathews, D.H., Sabina, J., Zuker, M. and Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of molecular biology, 288(5), pp.911-940. pdf

Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19), pp.7287-7292. pdf

Some clear formulations, useful to me:

Dirks, R.M. and Pierce, N.A., 2003. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of computational chemistry, 24(13), pp.1664-1677. pdf

RNAbows:

Aalberts, D.P. and Jannen, W.K., 2013. Visualizing RNA base-pairing probabilities with RNAbow diagrams. RNA, 19(4), pp.475-478. pdf

My thesis:

Flynn, M.J. and Aalberts, D.P., 2015. RNA Macrostates and Computational Tools. Williams College. github pdf

**The Algorithm**

Let’s say we have some function \(f\) over some domain \(D\) such that \[f(x) \geq 0 \ \mbox{ for all } x \in D \quad \mbox{ and } \quad \int_D f(x)\, dx < \infty.\]

For example, \(f\) could be a Gaussian and \(D\) the real line \(\mathbb R\), or \(f\) the uniform distribution and \(D\) an interval, or some higher dimensional function with a corresponding higher dimensional domain. It turns out we can generate a list of samples from the probability distribution proportional to this function, \[p(x) = \frac{f(x)}{\int_D f(x')\, dx'}\] (for our examples, the correspondingly normalized densities), by following a simple procedure: start at some initial state \(x_0\) and generate the output, a list of samples \(\{x_0, x_1, x_2, \dots\}\), by continuously sampling from some probability distribution function (that you choose) set by the current state: \[x_{t+1} \sim T(x_t \to x).\]

This list of samples will converge to \(p\) provided some conditions on \(T\) are satisfied, namely *ergodicity*, which means that every state can eventually reach any other state, and *detailed balance*. If there are \(N_x\) samples in state \(x\) at some point in the chain, the number of samples in state \(y\) created by transitions from these states is expected to be \(N_x T(x \to y)\). The difference \(N_x T(x \to y) - N_y T(y \to x)\) is called the *transition rate* from \(x\) to \(y\) because it is the expected flow of states from \(x\) to \(y\) (or the other way, if it is negative). Detailed balance is said to hold if the transition rates between all states are zero when the sample has reached the correct probability distribution. This way the net flow of probability is zero and the distribution is expected to stay the same. Therefore, if we have sampled states distributed as \(p\) and we want them to stay distributed as \(p\), our transition probabilities should be constrained by: \[p(x)\, T(x \to y) = p(y)\, T(y \to x).\]

From here we can break \(T(x \to y)\) down into the product of 2 different distributions: \(g(x \to y)\), the probability of generating a transition from \(x\) to \(y\), and \(A(x \to y)\), the probability of accepting that transition. We do this to let the definition of \(g\) be flexible by having \(A\) handle the constraint. This is done by solving for \(A\) with a general \(g\): \[\frac{A(x \to y)}{A(y \to x)} = \frac{p(y)\, g(y \to x)}{p(x)\, g(x \to y)}.\]

One choice of \(A\) that satisfies this equation is \[A(x \to y) = \min\left(1,\ \frac{p(y)\, g(y \to x)}{p(x)\, g(x \to y)}\right).\]

Provided that the function \(A\) satisfies these constraints, we can construct a *mutation function* that samples from \(g\) to give us a new element, create the chain by accepting or rejecting the new element with probability \(A\), and the samples’ distribution will converge to \(p\).
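Here is a minimal one-dimensional sketch of the whole procedure in Python (my own toy example, not from the post): the target is an unnormalized Gaussian, and the mutation function is a symmetric uniform step, so the \(g\) terms cancel in the acceptance ratio.

```python
import math, random

random.seed(1)

def f(x):
    # unnormalized target density: proportional to a standard normal
    return math.exp(-x * x / 2)

samples = []
x = 0.0
for step in range(200000):
    y = x + random.uniform(-1, 1)      # symmetric mutation function g
    accept = min(1.0, f(y) / f(x))     # acceptance probability A
    if random.random() < accept:
        x = y                          # accept the proposed move
    if step % 10 == 0:                 # thin the chain to reduce autocorrelation
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)   # drift toward 0 and 1, the target's mean and variance
```

With a symmetric proposal this reduces to the original Metropolis rule; note that the normalizing constant of the target never appears.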

**Example**

Let’s say we want to sample from a difficult distribution on the \(xy\)-plane, perhaps the one proportional to \[f(x, y) = \frac{\sin^2\left(\sqrt{x^2 + y^2}\right)}{(x^2 + y^2)^{3/2}}.\]

Since this distribution is pretty difficult to sample directly, instead we use Metropolis-Hastings. Let’s construct a mutation function that chooses from all directions uniformly and moves in that direction a distance \(d\) drawn from an exponential distribution with density \(2 e^{-2d}\). Since this proposal depends only on the distance between the two points, it is symmetric and the \(g\) terms cancel. Our acceptance probability should therefore be: \[A(\mathbf x_1 \to \mathbf x_2) = \min\left(1,\ \frac{f(x_2, y_2)}{f(x_1, y_1)}\right) = \min\left(1,\ \frac{\sin^2\left(\sqrt{x_2^2 + y_2^2}\right) (x_1^2 + y_1^2)^{3/2}}{\sin^2\left(\sqrt{x_1^2 + y_1^2}\right) (x_2^2 + y_2^2)^{3/2}}\right).\]

This is all we need to know to implement a sampling algorithm in R:

```r
## Metropolis Hastings demonstration by Mike Flynn
## Warning: This code has a runtime of around 4 hours on my machine
samples = matrix(nrow = 10^6, ncol = 2)
currentSample = c(0.001, 0.001)
set.seed(44)
for(i in 1:(10^8)) {
    ## choose direction and distance for transition
    direction = runif(1, 0, 2*pi)
    distance = rexp(1, 2)
    ## current point
    x1 = currentSample[1]
    y1 = currentSample[2]
    ## compute next point
    x2 = x1 + distance*cos(direction)
    y2 = y1 + distance*sin(direction)
    ## accept?
    accept = min(sin(sqrt(x2^2 + y2^2))^2*(x1^2 + y1^2)^(3/2)/
                 (sin(sqrt(x1^2 + y1^2))^2*(x2^2 + y2^2)^(3/2)), 1)
    if(accept > runif(1, 0, 1)) {
        currentSample = c(x2, y2)
    } else {
        currentSample = c(x1, y1)
    }
    ## only take 1 of every 100 to reduce autocorrelation
    if(i %% 100 == 0) {
        samples[i/100,] = currentSample
    }
}

library(ggplot2)
library(ggthemes)
plotdat = data.frame(x = samples[,1], y = samples[,2])
## display p to plot
p = ggplot(data = plotdat, aes(x = x, y = y)) +
    geom_point(alpha = .05, size = .01) +
    theme_bw() +
    xlim(-6*pi, 6*pi) + ylim(-6*pi, 6*pi)
```

which outputs this picture:

This plot looks a lot like interference fringes to me, which pleases me as a physicist. Of course, the example is completely contrived: interference fringes only sometimes roughly follow this distribution. More importantly, the samples produced by the Metropolis-Hastings algorithm do seem to match the probability distribution proportional to \(\sin^2\left(\sqrt{x^2 + y^2}\right) / (x^2 + y^2)^{3/2}\).

**Monte Carlo Integration and Importance Sampling**

Monte Carlo integration is the process of evaluating an integral by sampling randomly from the domain and averaging. Often Monte Carlo integration of \(\int_D g(x)\, dx\) will make use of the expectation-value identity for samples \(x_i\) drawn from a probability distribution \(p\) on the same domain: \[\int_D g(x)\, dx = \mbox{E}_p\!\left[\frac{g(x)}{p(x)}\right] \approx \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_i)}{p(x_i)}.\]

For example, we might evaluate an integral over \([0, 1]\) by randomly sampling from the uniform density on \([0, 1]\), letting \(p(x_i) = 1\), and averaging the values \(g(x_i)\).

The closer \(p\) gets to the shape of the integrated function, the faster the convergence will be. For example, if \(p(x) = g(x) / \int_D g(x')\, dx'\), then the above derivation will yield \[\frac{g(x_i)}{p(x_i)} = \int_D g(x')\, dx' \quad \mbox{for every sample } x_i.\]

No matter what \(x_i\) is, it will converge after the first iteration. This is cheating because it assumes we already know the answer, \(\int_D g\). In general we won’t have access to this knowledge, or else we wouldn’t be doing the integral in the first place. However, we do have the Metropolis-Hastings algorithm to let us sample from \(p \propto g\) without *explicitly* knowing \(\int_D g\). We can use this in many creative ways, stemming from the fact that the expectation value of any statistic \(h\) over these samples is \(\mbox{E}_p[h] = \int_D h(x)\, p(x)\, dx\).
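A concrete toy illustration in Python (my own example, not from the post): estimating \(\int_0^1 e^x\, dx = e - 1\), first with uniform samples, then with an importance density \(p(x) = (1+x)/1.5\) that better matches the shape of the integrand.

```python
import math, random

random.seed(2)
n = 200000
true_val = math.e - 1   # integral of e^x on [0, 1], the toy target

# plain Monte Carlo: sample from the uniform density, p(x) = 1
est_uniform = sum(math.exp(random.random()) for _ in range(n)) / n

# importance sampling: p(x) = (1 + x) / 1.5 roughly matches e^x on [0, 1];
# draw by inverting its CDF: (x^2/2 + x) / 1.5 = u  =>  x = sqrt(1 + 3u) - 1
def draw():
    return math.sqrt(1 + 3 * random.random()) - 1

est_importance = sum(math.exp(x) / ((1 + x) / 1.5)
                     for x in (draw() for _ in range(n))) / n

print(est_uniform, est_importance, true_val)
```

Both estimates land near \(e - 1 \approx 1.718\); the importance-sampled one fluctuates less because the ratio \(g/p\) is closer to constant.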

**Integrating the Example**

Luckily enough, our example function can be integrated over the entire \(xy\)-plane by hand: \[\int\!\!\int \frac{\sin^2\left(\sqrt{x^2 + y^2}\right)}{(x^2 + y^2)^{3/2}}\, dx\, dy = 2\pi \int_0^\infty \frac{\sin^2 r}{r^2}\, dr = \pi^2.\]

(The radial integral can be done via some complex analysis.) Does our Metropolis-Hastings result corroborate this? Consider a function \(f\) and the results of the MH-algorithm on that function: samples from the distribution \(p = f / \int\!\!\int f\). We are looking for \(\int\!\!\int f\). Let’s say we come up with some function \(w\) and take the following statistic: \[\frac{1}{n} \sum_i \frac{w(x_i, y_i)}{f(x_i, y_i)} \approx \mbox{E}_p\!\left[\frac{w}{f}\right] = \int\!\!\int \frac{w}{f}\, \frac{f}{\int\!\!\int f}\, dx\, dy = \frac{\int\!\!\int w\, dx\, dy}{\int\!\!\int f\, dx\, dy}.\]

As long as \(w\) integrates to 1 over the domain, the answer comes out to be \(1 / \int\!\!\int f\), so we can just take the reciprocal to get our answer. On the plane, a good function that integrates to 1 and matches the shape of our function pretty well is \(w(x, y) = \frac{1}{\pi} e^{-x^2 - y^2}\), so our statistic becomes: \[\frac{1}{n \pi} \sum_i e^{-x_i^2 - y_i^2}\, \frac{(x_i^2 + y_i^2)^{3/2}}{\sin^2\left(\sqrt{x_i^2 + y_i^2}\right)}.\]

We hope the answer comes out close to \(1 / \pi^2 \approx 0.101\). I compute the statistic with 1 line of R code:

```r
estimate = sum(apply(samples, 1, function(row)
    exp(-row[1]^2 - row[2]^2) *
    (row[1]^2 + row[2]^2)^(3/2) / (sin(sqrt(row[1]^2 + row[2]^2))^2)
)) / (nrow(samples) * pi)
```

When running this line on the data generated by the previous code block, I get the answer 0.1016102, less than 1% away from the true value \(1 / \pi^2\). Taking the square root of the reciprocal, I get 3.137121, not a bad estimate of \(\pi\) for a seemingly arbitrary sum. Pretty remarkable.
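As an independent numerical sanity check (my own sketch, not part of the original analysis): switching to polar coordinates, the area element \(r\, dr\, d\theta\) cancels one power of \(r\), and the double integral reduces to \(2\pi \int_0^\infty \sin^2(r)/r^2\, dr\), which a crude midpoint rule in Python can evaluate.

```python
import math

def integrand(r):
    # sin(r)^2 / r^3 times the polar area element r  ->  sin(r)^2 / r^2
    return math.sin(r) ** 2 / r ** 2

# midpoint rule on (0, 1000]; the neglected tail contributes about 1/(2*1000)
dr = 0.001
total = sum(integrand((k + 0.5) * dr) for k in range(1000000)) * dr

plane_integral = 2 * math.pi * total
print(plane_integral)       # close to pi^2 = 9.8696...
print(1 / plane_integral)   # close to the Monte Carlo estimate above
```

The reciprocal of the result is what the Monte Carlo statistic is estimating.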

**Applications**

The applications of Metropolis-Hastings are truly where the algorithm shines. Many integrals are not solvable in “closed form” via the Fundamental Theorem of Calculus because they have no antiderivative that can be expressed in terms of elementary functions (apparently this is provable), for example the integral \[\int e^{-x^2}\, dx.\]

For these problems, if you really need to know the solution you can use numerical methods of estimating the integral, such as the trapezoidal rule (bringing back fond memories of AP calculus). For some integrals we cannot even use those techniques. This could be because there are too many dimensions we are integrating over, which forces us to use exponentially many “trapezoids”, or because the domain is really large (perhaps infinitely large) and contributions from different parts are uneven. For these problems Monte Carlo integration is the best option.
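The “exponentially many trapezoids” point can be made precise. A grid-based rule with \(k\) points per axis needs \(k^d\) function evaluations in \(d\) dimensions, while the plain Monte Carlo estimator

\[
\int_V f\,dV \;\approx\; \frac{V}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \sim \mathrm{Uniform}(V),
\]

has a standard error that shrinks like \(O(N^{-1/2})\) regardless of \(d\). Metropolis-Hastings improves on uniform sampling further by concentrating the \(x_i\) where the integrand actually contributes.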

There are many examples of such problems in statistical mechanics. The spread of heat through conduction, convection, and radiation is essential for the design of engines, rockets, and electric generators. The many pathways of heat transfer must be integrated over, a space so large that Monte Carlo integration is the only practical option. Neutron transport for the design of nuclear weapons and reactors is a similar problem. A third example is the transport of light through a scene for the purpose of rendering a physically accurate image in Computer Graphics, for which the application of the MH-algorithm has its own name: Metropolis Light Transport.

To illustrate how the MH-algorithm makes *hard* integrals solvable, here is the luminosity equation from Computer Graphics:

\[
L_o(x, \omega_o) \;=\; L_e(x, \omega_o) \;+\; \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, (\omega_i \cdot n)\, d\omega_i
\]

A quick description: \(L_o(x, \omega_o)\) is the light outgoing from a flat surface at point \(x\) in direction \(\omega_o\), \(L_e(x, \omega_o)\) is the light emitted by that surface in that direction, \(L_i(x, \omega_i)\) is the light incoming to that surface from direction \(\omega_i\), \(f_r(x, \omega_i, \omega_o)\) is the chance that light incoming from \(\omega_i\) will scatter in direction \(\omega_o\), \((\omega_i \cdot n)\) accounts for the angle between the incoming light and the surface normal \(n\), and the whole thing is integrated over the positive hemisphere \(\Omega\). In English terms, what this equation is saying is that to find the amount of light bouncing off a surface towards you, you have to add up the amount of light being bounced towards that surface from *all* other surfaces, and to find that, all the light being bounced towards *them*, and so on. This integral is defined recursively! It is actually an infinite recursion of infinite integrals. This qualifies as a very hard integral. Luckily a surface absorbs some of the light bouncing off of it, so the recursion converges and we don't go blind. But how do we integrate this?
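One standard way to see why the recursion stays finite (an expansion I am adding for clarity, not spelled out in the original): write the bounce integral as an operator \(T\) acting on the light field, so the equation reads \(L = L_e + TL\). Unrolling it gives the Neumann series

\[
L \;=\; L_e + T L_e + T^2 L_e + \cdots \;=\; \sum_{k=0}^{\infty} T^k L_e,
\]

where \(T^k L_e\) is the light that has bounced exactly \(k\) times. Because every bounce absorbs some energy, \(T\) shrinks its input and the series converges. Each term is itself an integral over one more direction, which is exactly the "infinite recursion of infinite integrals" above.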

A slick way to do it is to expand the integral into an integral over paths of light instead of a recursive integral. That way the light received by pixel \(j\), \(I_j\), can be expressed as:

\[
I_j \;=\; \int_{\Omega} h_j(\bar{x})\, f(\bar{x})\, d\mu(\bar{x})
\]

Where \(\Omega\) is the space of all paths of light through the camera, \(h_j(\bar{x})\) is what proportion of light of each color path \(\bar{x}\) contributes to pixel \(j\), \(f(\bar{x})\) is the total amount of light flowing along that path, and \(\mu\) is a measure on the space of paths. We can then treat this as a Monte Carlo integration problem, sample paths \(\bar{x}_1, \dots, \bar{x}_N\) from the distribution \(f/\int f\) using the MH-algorithm, and use the expectation value identity:

\[
E\!\left[h_j(\bar{x})\right] \;=\; \int_{\Omega} h_j(\bar{x})\,\frac{f(\bar{x})}{\int f}\,d\mu(\bar{x}) \;=\; \frac{I_j}{\int f}
\]

to get the value of the integral, multiplied by a normalization constant \(\int f\) (but this can be quickly estimated, or simply set to whatever brightness looks good at the end of the process). Because Metropolis Light Transport spends its samples on the paths that carry the most light, it achieves some of the fastest general convergence of any unbiased Monte Carlo method. Additionally, it parallelizes naturally, since independent chains can be run side by side.

The Metropolis-Hastings algorithm makes normally impossible integrals solvable with relative efficiency, and because of its wide range of applications to very cool problems, I consider it my favorite.

**References**

Much of what I have written here is knowledge that I built up over summer research on sampling methods and a Computer Graphics class I took this fall. The luminosity equation I got from The Graphics Codex by Morgan McGuire and from the Metropolis Light Transport paper and Ph.D. thesis of Eric Veach. In addition I learned much about Monte Carlo methods from Computer Graphics: Principles and Practice by Hughes, van Dam, et al.

Chib, Siddhartha, and Edward Greenberg. “Understanding the Metropolis-Hastings Algorithm.” *The American Statistician* 49, no. 4 (1995): 327-335.

Hastings, W.K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. *Biometrika* **57** (1): 97–109.

Hughes, John F.; van Dam, Andries; McGuire, Morgan; Sklar, David F.; Foley, James D.; Feiner, Steven K.; and Akeley, Kurt. *Computer graphics: principles and practice*. Pearson Education, 2013.

McGuire, Morgan. *The Graphics Codex*. Online book at www.graphicscodex.com.

Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. (1953). “Equations of State Calculations by Fast Computing Machines”. *Journal of Chemical Physics* **21** (6): 1087–1092.

Veach, Eric. “Robust Monte Carlo Methods for Light Transport Simulation.” PhD diss., Stanford University, 1997.

Veach, Eric, and Leonidas J. Guibas. “Metropolis light transport.” In *Proceedings of the 24th annual conference on Computer graphics and interactive techniques*, pp. 65-76. ACM Press/Addison-Wesley Publishing Co., 1997.
