Wavelets – part 1

June 26, 2017

If you enjoy this post, subscribe using the form to the left! I try to make new posts every weekend, though sometimes life gets in the way. This one took 3 months!

Since my last post was on the fast Fourier transform, I thought I would continue the theme of signal processing and write a post on wavelet analysis. However, after looking over the subject for a couple of months, I've decided to split this topic into several posts. This post is heavily informed by Gilbert Strang and Truong Nguyen's "Wavelets and Filter Banks". To be honest, the book is hard to read, so I'll try to digest it the best I can for you here. Maybe I'm being hypocritical. I hope you can read this!

Introduction

Like the Fourier transform, the wavelet transform outputs a different representation of a set of data. The original data is usually represented as a series of measurements in time. For example, in an audio recording the data is a time series of voltages across the capacitor plate of a microphone.

## Time        CapacitorVoltage
## 00:000000   0.1
## 00:000001   0.13
## 00:000002   0.15
## ...         ...

The Fourier transform outputs a representation of the data as a combination of constant tones:

## Frequency   Volume
## C-tone      0.1
## D-tone      0.01
## F-tone      0
## ...         ...

Think of music. It might be more efficient to represent data as a set of tones than as a series of millions of voltages. However, the Fourier transform has the peculiar trait that it represents the signal as a series of constant tones that play for the entire length of the recording. There is a math theorem that says any function can be fully represented in that way, but that's not how people think of music. Instead, people think of "notes" that happen at specific times in the recording, like sheet music. That is what the wavelet transform intends to represent. It would look something like this:

## Note     Time   Volume
## C-note   0      1
## D-note   0      0
## D-note   1      1
## ...      ...    ...
Representing certain types of data in this way is very efficient. Think of how sheet music is much easier to store than spreadsheets of billions of capacitor voltages. This is why many compression techniques for the web, like JPEG 2000, use it to send data efficiently over the wire without losing the "gist" of the image.

So how do we actually transform data into this representation? We need some sort of operation that gives us a compromise between the perfect time resolution of the original representation and the perfect frequency resolution of the Fourier transform. What if I told you there already exists an operation that has both high resolution in frequency and high resolution in time? It's an elementary circuit in analog electronics, called the high-pass filter. The high-pass filter takes a signal and filters out the lower-frequency components of that signal. An example output is illustrated below. I've drawn the results of a high-pass filter, labeled "HPF", and a complementary low-pass filter, labeled "LPF", with input coming in from the right (the same for both) and output going out to the left:

[Animated figure: a random input signal streams into a high-pass filter ("HPF") and a low-pass filter ("LPF"), and the two filtered outputs are drawn in real time.]

Notice that the low-pass output is "smoothed out" compared to the original signal. The spiky details are all in the output of the high-pass filter, i.e. those spikes are the high-frequency components. We could give the output of the HPF a label ("C"), and its value would be the volume of that note at each moment in time.
If this were a real music track and we wanted to compress and denoise it, we might then filter out any entry of the data that is below a certain volume, and potentially get a good compression ratio. Unfortunately, since this diagram contains a stream of random noise, nothing really stands out.

What about the other notes? Notice that there are still bumps in the output of the low-pass filter. What if we applied another high-pass filter to its output, with a lower frequency bound? That would give us another "band" of frequencies we could represent with a "note". That's what the wavelet transform is: applying successive high-pass filters until we've completely separated the signal into frequency bands. And that's it. You understand wavelet analysis! (Qualitatively.)

Quantitative description

To get a quantitative understanding of the wavelet transform, first let's build a model of the input signal. Let's say there is some signal with amplitude $$x(t)$$ that starts at $$t=0$$. In other words, $$x(t) = 0$$ or is undefined for $$t < 0$$. Let's also assume that $$x$$ comes in equally spaced measurements, 1 unit of $$t$$ apart, so we have $$x(0), x(1), x(2),\dots$$ as a stream of data and nothing else.

Our input stream $$x(t)$$ has a Fourier transform we'll call $$X(\omega)$$. Note that since $$x$$ has a time resolution of one sample per unit of $$t$$, $$X$$ is only meaningfully defined between $$-\pi$$ and $$\pi$$. This can be seen from the definition of the Fourier transform:

$X(\omega) = \sum_{t=0}^{\infty} x(t) e^{-i \omega t}$

Extending $$\omega$$ past some range $$R$$:

$X(\omega + R) = \sum_{t=0}^{\infty} x(t) e^{-i (\omega + R) t} = \sum_{t=0}^{\infty} x(t) e^{-i \omega t} e^{-i R t},$

we see that $$X(\omega) = X(\omega + R)$$ whenever $$Rt$$ is an integer multiple of $$2\pi$$ for every $$t$$ in the sum. Therefore, since $$t$$ is an integer, $$X(\omega)$$ is redundant beyond $$R = 2\pi$$.
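This periodicity is easy to check numerically. Here's a minimal Python sketch (the helper name `dtft` is mine, not anything standard) that evaluates the sum at $$\omega$$ and at $$\omega + 2\pi$$ for a short integer-sampled signal:

```python
import cmath

def dtft(x, w):
    """Evaluate X(w) = sum_t x(t) e^{-i w t} for samples x at integer times t."""
    return sum(xt * cmath.exp(-1j * w * t) for t, xt in enumerate(x))

x = [0.3, -1.2, 0.7, 2.0, -0.5]
w = 0.9
# Integer sample times make X periodic with period 2*pi.
assert abs(dtft(x, w) - dtft(x, w + 2 * cmath.pi)) < 1e-9
```

Any choice of real signal and frequency gives the same agreement, since $$e^{-i 2\pi t} = 1$$ for every integer $$t$$.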
In general, if $$t = Tt'$$, where $$t'$$ is a natural number and $$T$$ is the sample spacing, $$X$$ is redundant beyond $$R = 2\pi/T$$. The smaller the sample spacing in time, the wider the range of frequencies the transform can represent.

Define a filter as an operation on $$X(\omega)$$ that produces some output signal $$Y(\omega)$$. This is usually multiplication by some window function, $$H(\omega)$$, so that:

$Y(\omega) = H(\omega) X(\omega).$

One example of a window function is

$H(\omega) = \begin{cases} 1 \quad |\omega| > \pi/2 \\ 0 \quad \mbox{otherwise} \end{cases}.$

This window function flattens any frequencies with magnitude less than $$\pi/2$$ and lets the rest through. This is an example of a high-pass filter. It's not the one we will use, though, because the Fourier transform of a discontinuous function has infinitely many nonzero components, and we don't want that when we implement the filters in time-space.

Let's look at the output and window functions in time-space. To do this, I'll just use the definition of the Fourier transform:

\begin{align} Y(\omega) &= H(\omega) X(\omega) \\ \sum_t y(t) e^{-i \omega t} &= \left ( \sum_k h(k) e^{-i \omega k} \right ) \left ( \sum_{t'} x(t') e^{-i \omega t'} \right ) \\ &= \sum_{t'}\sum_{k}h(k)x(t')e^{-i \omega (k + t')}. \end{align}

So now on the left we have some expression with $$y(t)$$, and on the right we have an expression in $$h(k)$$ and $$x(t')$$, with some exponential components. Notice that if we let $$t' = t - k$$, the exponential components become the same:

$\sum_t y(t) e^{-i \omega t} = \sum_{t}\sum_{k}h(k)x(t-k)e^{-i \omega t}.$

Therefore we can isolate a relationship between $$y(t)$$, $$h(k)$$, and $$x(t)$$:

$y(t) = \sum_{k}h(k)x(t-k).$

How can we interpret this? First, let's try an example. Since we can choose any filter that we want, let's try a filter where $$h(0) = 1/2$$ and $$h(1) = 1/2$$, 0 otherwise. Then

$y(t) = \frac{1}{2}x(t) + \frac{1}{2} x(t-1).$

This is a 2-point moving average.
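To make the convolution formula concrete, here's a small Python sketch (the function name `apply_filter` is mine) of $$y(t) = \sum_k h(k)x(t-k)$$, treating $$x$$ as zero before $$t = 0$$:

```python
def apply_filter(h, x):
    """y(t) = sum_k h[k] * x[t-k], with x treated as 0 for t < 0."""
    return [sum(hk * (x[t - k] if t - k >= 0 else 0.0)
                for k, hk in enumerate(h))
            for t in range(len(x))]

x = [1.0, 3.0, 5.0, 7.0]
h_avg = [0.5, 0.5]             # h(0) = h(1) = 1/2: the 2-point moving average
print(apply_filter(h_avg, x))  # [0.5, 2.0, 4.0, 6.0]
```

Each output is just the average of the current sample and the previous one, which is exactly the "rolling statistic" reading of the formula.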
Going back to the functional form of $$y(t)$$, you can see that the filter is just some rolling statistic on $$x$$. This concept struck me when I first realized it.

Moving average and difference as low and high pass filters

The 2-point moving average and difference, which I will call $$y_l$$ and $$y_h$$, are very simple:

$y_l(t) = \frac{1}{2}x(t) + \frac{1}{2} x(t-1),\mbox{ and}$

$y_h(t) = \frac{1}{2}x(t) - \frac{1}{2} x(t-1).$

It turns out that they function as complementary filters.

Moving average and difference in frequency space

Let's see how these rolling statistics act in frequency space. Remember that we just need to take the Fourier transform of $$h$$ to get the function $$H$$ that multiplies the frequency spectrum $$X$$. Let's do that. For $$h_l(k)$$, the Fourier transform is

\begin{align} H_l(\omega)& = \sum_k h_l(k) e^{-i \omega k}\\ &= h_l(0) + h_l(1) e^{-i \omega} \\ &= \frac{1}{2} (1 + e^{-i \omega}) \\ &= \frac{1}{2}e^{-i\omega/2} (e^{i \omega/2} + e^{-i \omega/2}) \\ &= e^{-i \omega /2} \cos(\omega/2) . \end{align}

To see what frequencies are being selected, we plot the magnitude $$|H_l(\omega)| = \cos(\omega/2)$$ from $$-\pi$$ to $$\pi$$: The magnitude is near 1 for frequencies $$\omega$$ near 0, and goes to zero near the limits, $$-\pi$$ and $$\pi$$. We can do the same for $$h_h(k)$$. The Fourier transform is:

\begin{align} H_h(\omega)& = \sum_k h_h(k) e^{-i \omega k}\\ &= h_h(0) + h_h(1) e^{-i \omega} \\ &= \frac{1}{2} (1 - e^{-i \omega}) \\ &= \frac{1}{2}e^{-i\omega/2} (e^{i \omega/2} - e^{-i \omega/2}) \\ &= e^{-i \omega /2}\, i \sin(\omega/2) . \end{align}

The plot of the magnitude $$|H_h(\omega)| = \sin(|\omega|/2)$$ is the complement of $$|H_l(\omega)|$$: This function goes to 1 at the endpoints $$-\pi$$ and $$\pi$$, but goes to zero for frequencies near zero. Notice that neither of these filters is even close to an "ideal filter", i.e. each still lets through some of the opposing frequencies. That's okay, because this is just a simple example.
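You can verify these magnitudes numerically. A minimal Python check (the helper name `freq_response` is mine) evaluates $$H(\omega) = \sum_k h(k) e^{-i\omega k}$$ for both filters and compares against the closed forms:

```python
import cmath
import math

def freq_response(h, w):
    """H(w) = sum_k h[k] e^{-i w k} for a finite filter h."""
    return sum(hk * cmath.exp(-1j * w * k) for k, hk in enumerate(h))

h_low = [0.5, 0.5]    # 2-point moving average
h_high = [0.5, -0.5]  # 2-point moving difference

for w in [0.1, 1.0, 2.5]:  # a few frequencies in (0, pi)
    assert abs(abs(freq_response(h_low, w)) - math.cos(w / 2)) < 1e-9
    assert abs(abs(freq_response(h_high, w)) - math.sin(w / 2)) < 1e-9
```

The assertions confirm $$|H_l(\omega)| = \cos(\omega/2)$$ and $$|H_h(\omega)| = \sin(\omega/2)$$ on $$(0, \pi)$$.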
It would be an interesting exercise to find different rolling statistics that act as much better filters. One could probably do so by reverse-engineering a desired frequency spectrum.

Complementarity

These two filters complement each other. Notice that we can reconstruct the full signal by taking the sum or the difference of these two filter outputs.

$y_l(t) + y_h(t) = \frac{1}{2}\left ( x(t) + x(t-1) + x(t) - x(t-1)\right ) = x(t).$

$y_l(t) - y_h(t) = \frac{1}{2}\left ( x(t) + x(t-1) - x(t) + x(t-1) \right )= x(t-1).$

It is very important that the difference results in a series lagged by one time unit. This means that we can reconstruct the full signal using only every other member of both $$y_l(t)$$ and $$y_h(t)$$. So if we have $$N$$ samples of $$x(t)$$, we only need $$N/2$$ samples of each of $$y_l$$ and $$y_h$$ to reconstruct the signal. Remember from the introduction how we planned to apply another high-pass filter to the low-pass filter output? Then we'll get two series with $$N/4$$ samples. Applying to the low-pass output again will give us two with $$N/8$$ samples, eventually getting down to 1 sample, at which point we'll stop. The complementarity of these two filters is important because it means no information is lost at each split, and the repeated halving means our algorithm will terminate.

Building the Wavelet pyramid

So now we have a complementary low-pass filter and a high-pass filter. We have a plan to run a recursive algorithm: split a signal (blue) into a high-pass output (yellow) and a low-pass output (red):

[figure: signal = high-pass output + low-pass output]

downsample each output so it contains only every other point, all we need to fully reproduce the signal:

[figure: downsampled signal = downsampled high-pass + downsampled low-pass]

and run the algorithm again on the low-pass output,

[figure: signal = high-pass band + finer split of the low-pass output]

until we are left with the ultimate low-pass filter, the full average.

[figure: signal = three high-pass bands + the full average]

For 8 input signal points, we output 8 "wavelet" points: each level in the y-direction indicates an individual frequency band, and each point in the x-direction is a measurement of the volume of that frequency band at a moment in time.
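A short Python sketch (the names `analyze` and `synthesize` are mine) makes the claim concrete: keeping only every other sample of $$y_l$$ and $$y_h$$ still lets us rebuild $$x$$ exactly, because each kept pair gives us both $$x(t)$$ and $$x(t-1)$$.

```python
def analyze(x):
    """Downsampled low-pass and high-pass outputs (kept at odd t)."""
    yl = [(x[t] + x[t - 1]) / 2 for t in range(1, len(x), 2)]
    yh = [(x[t] - x[t - 1]) / 2 for t in range(1, len(x), 2)]
    return yl, yh

def synthesize(yl, yh):
    """Invert using x(t-1) = yl - yh and x(t) = yl + yh."""
    x = []
    for l, h in zip(yl, yh):
        x += [l - h, l + h]
    return x

x = [2.0, 4.0, -1.0, 3.0, 0.0, 6.0]
yl, yh = analyze(x)
assert synthesize(yl, yh) == x  # perfect reconstruction from half the samples
```

Six input samples become three low-pass and three high-pass samples, and the round trip is exact.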
Notice that there are fewer measurements the lower the frequency goes; those bands have less "time resolution". That's fine, because components that vary more slowly in time need less time resolution. I hope these diagrams help you understand the concept; now I'm going to put them into matrix form. I'm skipping the pre-downsampling step, in the interest of space. I'm going to use the following notation: $$y_h$$ is the high-pass output after downsampling; $$y_l$$ is the low-pass output after downsampling; $$y_{lh}$$ is the output after applying the low-pass filter, downsampling, applying a high-pass filter to that, and downsampling again; and so on for $$y_{ll}$$, $$y_{llh}$$, and $$y_{lll}$$. Assuming that $$x$$ is 8 samples long, the matrix equation is going to look something like this for the first step:

$\begin{bmatrix} y_h(1) \\ y_h(3) \\ y_h(5) \\ y_h(7) \\ y_l(1) \\y_l(3) \\ y_l(5) \\ y_l(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix}$

Now we have 4 high-pass points and 4 low-pass points. What will this matrix look like after a recursive pass on the low-pass output? The problem looks the same, just using $$y_l$$ as the starting point and half as big:

$\begin{bmatrix} y_{lh}(3) \\ y_{lh}(7) \\y_{ll}(3) \\ y_{ll}(7) \end{bmatrix} = \frac{1}{2} \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} y_l(1) \\ y_l(3) \\y_l(5) \\ y_l(7) \end{bmatrix}$

We can plug in the definition of $$y_l$$ to get $$y_{lh}$$ and $$y_{ll}$$ in terms of $$x$$.
\begin{align} \begin{bmatrix} y_{lh}(3) \\ y_{lh}(7) \\y_{ll}(3) \\ y_{ll}(7) \end{bmatrix} &= \frac{1}{4} \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ &= \frac{1}{4} \begin{bmatrix} -1 & -1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ \end{align}

You can see that $$y_{lh}(3)$$ is $$\frac{1}{4}\left( x(3) + x(2) - x(1) - x(0) \right)$$: a 4-point moving difference. It's like a stretched-out version of the original high-pass filter. Likewise $$y_{ll}$$ is a 4-point moving average, a stretched-out version of the original low-pass filter. Let me recursively apply the algorithm one last time:

\begin{align} \begin{bmatrix} y_{llh}(7) \\ y_{lll}(7) \end{bmatrix} &= \frac{1}{2} \begin{bmatrix} -1 & 1 \\ 1 & 1 \\ \end{bmatrix} \begin{bmatrix} y_{ll}(3) \\ y_{ll}(7) \end{bmatrix} \\ \begin{bmatrix} y_{llh}(7) \\ y_{lll}(7) \end{bmatrix} &= \frac{1}{8} \begin{bmatrix} -1 & 1 \\ 1 & 1 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ &= \frac{1}{8} \begin{bmatrix} -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix} \\ \end{align}

The third pass of the algorithm gives us an 8-point moving difference and average. The recursion stops here because you can no longer apply any filters to one point of data.
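The whole recursion fits in a few lines of Python. This sketch of mine (the name `haar_pyramid` is not standard) repeatedly applies the $$\frac{1}{2}$$-normalized moving difference and moving average, keeping every other output, until one sample remains:

```python
def haar_pyramid(x):
    """Recursively split x into high-pass bands plus a final overall average.
    Returns [finest band, ..., coarsest band, overall average]."""
    bands = []
    while len(x) > 1:
        # Moving difference and average at odd t, downsampled by 2.
        bands.append([(x[t] - x[t - 1]) / 2 for t in range(1, len(x), 2)])
        x = [(x[t] + x[t - 1]) / 2 for t in range(1, len(x), 2)]
    bands.append(x)  # the ultimate low-pass output: the full average
    return bands

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(haar_pyramid(x))  # [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0], [2.0], [4.5]]
```

For this ramp input, the coarsest band gives $$y_{llh}(7) = \frac{1}{8}(-1-2-3-4+5+6+7+8) = 2$$ and $$y_{lll}(7) = 4.5$$, matching the matrix rows above.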
If we put all the high-pass results together in one matrix, it looks something like:

$\begin{bmatrix} y_h(1) \\ y_h(3) \\ y_h(5) \\ y_h(7) \\ y_{lh}(3) \\y_{lh}(7) \\ y_{llh}(7) \\ y_{lll}(7) \end{bmatrix} = \begin{bmatrix} -h & h & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -h & h & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h & h & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -h & h \\ -h^2 & -h^2 & h^2 & h^2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h^2 & -h^2 & h^2 & h^2 \\ -h^3 & -h^3 & -h^3 & -h^3 & h^3 & h^3 & h^3 & h^3 \\ h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 \\ \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ x(4) \\ x(5) \\ x(6) \\ x(7) \end{bmatrix}$

where $$h = \frac{1}{2}$$. The matrix on the right-hand side is the Haar matrix (up to normalization), corresponding to the Haar wavelet transform. The Haar wavelet transform was discovered in 1909 by Alfred Haar as just an example of an "orthogonal system" onto which one could project a function. It wasn't considered a "wavelet" until around 75 years later, but it is one. At each level of filtering, we have filtered out some band of frequencies, which you can think of as a "tone". We could label $$y_h = C$$, $$y_{lh} = B$$, and $$y_{llh} = A$$ (and $$y_{lll}$$ would just be some baseline background voltage). We also have some notion of "when" the notes happen in time, with more precision for the higher-frequency notes than for the lower-frequency ones.
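As a sanity check (a sketch of my own, not from the book), here is the matrix written out in Python and applied to the ramp signal $$x = (1, \dots, 8)$$; the finest rows are the 2-point differences and the last two rows are the 8-point moving difference and average:

```python
h = 0.5
H = [
    [-h,  h,  0,  0,  0,  0,  0,  0],
    [ 0,  0, -h,  h,  0,  0,  0,  0],
    [ 0,  0,  0,  0, -h,  h,  0,  0],
    [ 0,  0,  0,  0,  0,  0, -h,  h],
    [-h**2, -h**2,  h**2,  h**2, 0, 0, 0, 0],
    [0, 0, 0, 0, -h**2, -h**2,  h**2,  h**2],
    [-h**3] * 4 + [h**3] * 4,
    [h**3] * 8,
]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [sum(Hij * xj for Hij, xj in zip(row, x)) for row in H]
print(y)  # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0, 4.5]
```

The single matrix multiply reproduces exactly what the three recursive filter-and-downsample passes produce.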
To complete the music analogy, our transform looks like this:

$\begin{bmatrix} \mbox{C-note}(t=1) \\ \mbox{C-note}(t=3) \\ \mbox{C-note}(t=5) \\ \mbox{C-note}(t=7) \\ \mbox{B-note}(t=3) \\\mbox{B-note}(t=7) \\ \mbox{A-note}(t=7) \\ \mbox{Background} \end{bmatrix} = \begin{bmatrix} -h & h & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & -h & h & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h & h & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & -h & h \\ -h^2 & -h^2 & h^2 & h^2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & -h^2 & -h^2 & h^2 & h^2 \\ -h^3 & -h^3 & -h^3 & -h^3 & h^3 & h^3 & h^3 & h^3 \\ h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 & h^3 \\ \end{bmatrix} \begin{bmatrix} \mbox{Voltage(0)} \\ \mbox{Voltage(1)} \\ \mbox{Voltage(2)} \\ \mbox{Voltage(3)} \\ \mbox{Voltage(4)} \\ \mbox{Voltage(5)} \\ \mbox{Voltage(6)} \\ \mbox{Voltage(7)} \end{bmatrix}$

What is a wavelet?

All this talk about wavelet transforms, and I haven't talked about wavelets very much. Wavelets are a class of functions that form the new basis of a coordinate transform, like the complex exponentials in the Fourier series. Each of the rows of the Haar matrix is a basis vector. So, in our case, the wavelet is the function

$f(x) = \begin{cases} 1 & 0 \le x < 1/2 \\ -1 & 1/2 \le x < 1 \\ 0 & \mbox{otherwise} \end{cases},$

at different scales and shifts. When an incoming signal is equal to a wavelet function, you should see the wavelet transform place it into the appropriate frequency bucket.

The following is an animation of the wavelet transform in action, sorting output into different wavelet frequencies. The input is a stream of measurements rolling in from the left into a buffer holding the most recent 256 entries. This buffer is split into $$\log_2(N) - 1 = 7$$ wavelet scales using the filters we described above. You might notice that some of the lower-frequency filters update less often.
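Feeding a sampled Haar wavelet through the filter bank shows exactly this bucketing. The following is a quick sketch of my own, using the same $$\frac{1}{2}$$-normalized filters as above:

```python
# Sampled Haar wavelet of width 4: +1 on its first half, -1 on its second.
x = [1.0, 1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0]

bands = []
while len(x) > 1:
    bands.append([(x[t] - x[t - 1]) / 2 for t in range(1, len(x), 2)])
    x = [(x[t] + x[t - 1]) / 2 for t in range(1, len(x), 2)]
bands.append(x)

# Exactly one coefficient fires: the width-4 band, at the wavelet's position.
# (It is -1 rather than +1 because the filter computes "later minus earlier".)
print(bands)  # [[0.0, 0.0, 0.0, 0.0], [-1.0, 0.0], [0.0], [0.0]]
```

Every other band, and the overall average, is exactly zero: the wavelet input lands entirely in its own frequency bucket.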
That's because the wavelet transform is not "shift-invariant": if you shift the input signal by one unit of time, you don't get the output signal shifted by one unit of time; you actually get a different answer. Think of the signal where $$x(1) = 1$$ but 0 everywhere else. When I run the moving-difference filter on this, $$y_h(1) = \frac{1}{2}(x(1) - x(0)) = \frac{1}{2}$$. But let me shift the signal one step earlier, so that $$x(0) = 1$$ now. Now $$y_h(1) = \frac{1}{2}(x(1) - x(0)) = -\frac{1}{2}$$. $$y_h(1)$$ has flipped sign. You might say "hey, but now $$y_h(0) = \frac{1}{2}(x(0) - x(-1)) = \frac{1}{2}$$", but we don't have $$y_h(0)$$ because we downsampled. The only way to get shift invariance is to shift the signal by a whole window length of the filter, in this case 2.

This is actually a huge problem, and I really don't like it. In any practical application you aren't going to be able to align your filter windows perfectly with the incoming signal, so you are going to pick up wavelet frequencies that are misleading. You can see examples of this phenomenon by unchecking the box "Align Sent Wavelet Window": wavelets get aliased into other frequencies. If anyone knows how to make a wavelet transform time-invariant or shift-invariant, I am super interested in that!
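The sign flip is easy to reproduce. Here's a sketch of mine, using the $$\frac{1}{2}$$-normalized moving difference kept at odd $$t$$:

```python
def haar_high(x):
    """Downsampled high-pass output: y_h(t) = (x(t) - x(t-1)) / 2 at odd t."""
    return [(x[t] - x[t - 1]) / 2 for t in range(1, len(x), 2)]

spike = [0.0, 1.0, 0.0, 0.0]          # x(1) = 1
shifted_spike = [1.0, 0.0, 0.0, 0.0]  # the same spike, one step earlier

print(haar_high(spike))          # [0.5, 0.0]
print(haar_high(shifted_spike))  # [-0.5, 0.0] -- the coefficient flips sign
```

A one-sample shift of the input does not shift the output; it changes the output's values, which is exactly the shift-variance complained about above.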
[Interactive demo: a streaming input signal is split into wavelet scales 0–7 plus a background trace, with buttons to send a wavelet at each scale and an "Align Sent Wavelet Window" checkbox.]

Posted in: Signal Processing Tagged: Fourier, Wavelets

History and Derivation of the Fast Fourier Transform

March 20, 2017 / 3 Comments

If you enjoy this post, subscribe using the form to the left! I try to make new posts every weekend, though sometimes life gets in the way (this one took 6 weekends). This is the second post in my series on great algorithms. My last post was on the Metropolis-Hastings algorithm.

History and Motivation

The Fourier transform is a fundamental operation in applied math. The original Fourier transform is attributed to Joseph Fourier, a French mathematician and physicist, for solving partial differential equations involved in the transfer of heat [1]. Since then, the revelation that any function can be approximated by a series of sines and cosines has exploded far beyond the original application. The technique is used in applications from digital signal processing to medical imaging to detecting nuclear tests.

Until the FFT, all applications used an algorithm that summed $$N$$ terms for each of $$N$$ output terms, and therefore had an asymptotic runtime of $$O(N^2)$$. The $$O(N\log N)$$ algorithm was first presented in a complete, packaged form in the 1965 paper by James Cooley and John Tukey: "An algorithm for the machine calculation of complex Fourier series" [2]. There is evidence that the FFT had been discovered before. Gauss used the algorithm in the early 1800s to interpolate the trajectory of the asteroid Pallas but never published his results (asymptotic runtime matters when you're doing all the computation by hand). The Ex Libris blog gives an interesting analysis of Gauss's approach here.
Others claim to have discovered the fundamental principle behind the FFT first, but no one got it into a form that made users realize they could perform their $$O(N^2)$$ computations in $$O(N\log N)$$ until Cooley and Tukey, and for that they deserve credit.

The motivation behind the development of the FFT was not academic. This was the case for many algorithmic improvements of the '50s and '60s. In a time when business-scale mainframe computers like the IBM System/360 only had processing speeds of 35 kilohertz and memory sizes of 512 kilobytes, optimizing algorithms for memory and time performance had material economic benefits for companies. For the FFT there were national security reasons as well.

James Cooley gives an interesting account of his development of the algorithm in a 1988 issue of Mikrochimica Acta: "The Re-Discovery of the Fast Fourier Transform Algorithm" [3]. The way he tells it, the (hydrogen-bomb) physicist Richard Garwin had a huge part in putting the effort together, for Cold-War purposes:

I was working on a research project of my own when Richard Garwin came to the computing center of the laboratory with some notes he made while with John Tukey at a meeting of President Kennedy's Scientific Advisory Committee, of which they were both members. John Tukey showed that if $$N$$ is a composite, $$N = ab$$, then the Fourier series can be expressed as an a-term series of subseries of b terms each. If one were computing all values of the series, this would reduce the number of operations from $$N^2$$ to $$N(a + b)$$. Tukey also said that if this were iterated, the number of operations would be proportional to $$N\log (N)$$ instead of $$N^2$$. Garwin knew that this was a very important calculation and he wanted to have this idea developed and applied. Garwin described his problem of determining the periodicities of the spin orientations in a 3-D crystal of He$$^3$$.
Later, I found out that he was far more interested in improving the ability to do remote seismic monitoring of nuclear explosions since the Russians would not agree to inspections within their borders thereby hindering efforts at obtaining a nuclear test ban treaty. He also saw a need for the capability of long range acoustical detection of submarines. Like many others, I did not see the significance in this improvement and gave the job a little less priority than my own research. However, I was told of Garwin’s reputation and, prodded by his occasional telephone calls (some of them to my manager), I produced a 3-dimensional FFT program. I put some effort into designing the algorithm so as to save storage and addressing by over-writing data and I spent some time working out a 3-dimensional indexing scheme that was combined with the indexing within the algorithm. Garwin publicized the program at first through his many personal contacts, producing a small but increasing stream of requests for copies of it. This is not to say that Garwin came up with the idea or wrote the code or paper, but his presence here is indisputable. I am impressed that Garwin was able to pose a dummy physics problem for Cooley to solve that he could then use to detect nuclear tests and track nuclear submarines – but that is the immense power of the FFT. Cooley’s article has lots of historical tidbits in it, including notes about the computational limits of the day. One detail is the “record” Fourier transform on a dataset with 2048 samples: Another member of our department, Lee Alsop, who was a geophysicist and adjunct professor at the Lamont Geophysical Laboratory of Columbia University decided to try the new algorithm on a record of 2048 samples of a strain seismograph of the Rat Island earthquake. Another is an example of computation that was still infeasible, even with the algorithmic speedups. 
When approached by a colleague with spectrum data to analyze, he recounts:

One extraordinary thing about this was that a single record of data was about 512 000 points and all values of the spectrum were needed. This was beyond the capacity of the high speed memory of existing machines.

He also mentions that the FFT would give a theoretical speedup by a factor of 12,800 in this case. Finally, the collaboration with Tukey was very limited (as was the editing):

Thus, the paper made only one round trip between me and Tukey and our only collaboration was in a few telephone conversations. Perhaps that explains why it is not very easy to read...

Since then, there have been too many applications of the FFT to count. The algorithm goes so far as to be one of the fundamental operations in gate-based quantum computing. Therefore, it's probably worth a look.

Introduction

The idea is to take advantage of symmetries in the complex exponential to "factor" the problem into several smaller problems recursively, yielding standard "divide and conquer" speedups. While the algorithm has been generalized to work for any composite number of data points, the presentation is easiest when the number of points is a power of 2, so that is a fundamental assumption I will make throughout this post. I'm going to assume that readers are familiar with the Fourier series and its generalization, the Fourier transform, at a superficial level.
To review, you can approximate any function over an interval of length $$L$$ with sines and cosines with period equal to $$L$$: $f(x) = \frac{1}{2} a_0 + \sum_{n = 1}^{\infty} a_n \sin \left (\frac{2\pi}{L}nx \right ) + \sum_{n=1}^{\infty} b_n \cos \left (\frac{2\pi}{L}nx \right ).$ Using the identities $$e^{i\theta} = \cos\theta + i\sin\theta$$, $$\sin\theta = \frac{1}{2i}(e^{i\theta} - e^{-i\theta})$$, and $$\cos\theta = \frac{1}{2}(e^{i\theta} + e^{-i\theta})$$, these terms can be represented as complex exponentials: \begin{align} f(x) =& \frac{1}{2} a_0 + \sum_{n = 1}^{\infty} a_n \sin \left (\frac{2\pi}{L}nx \right ) + \sum_{n=1}^{\infty} b_n \cos \left (\frac{2\pi}{L}nx \right ) \\ =& \frac{1}{2} a_0 - \frac{i}{2}\sum_{n = 1}^{\infty} a_n (e^{2\pi i n x / L} - e^{-2\pi i n x / L}) + \frac{1}{2}\sum_{n=1}^{\infty} b_n (e^{2\pi i n x / L} + e^{-2\pi i n x / L}) \\ =& \frac{1}{2} a_0 e^{2\pi i 0 x / L} + \frac{1}{2}\sum_{n = 1}^{\infty} (b_n - ia_n) e^{2\pi i n x / L} + \frac{1}{2}\sum_{n=1}^{\infty} (b_n + i a_n) e^{-2\pi i n x / L} \end{align} Notice that if we define $$c_0 = \frac{1}{2} a_0$$ and $$c_n = \frac{1}{2} (b_{|n|} - \mbox{sign}(n) i a_{|n|})$$ for $$n \neq 0$$, we can simplify the above to the final form: $f(x) = \sum_{n = -\infty}^{\infty} c_n e^{2\pi i n x / L}.$ The coefficients can be found by taking advantage of an identity of the complex exponential: over one full period it integrates to zero unless the frequencies match, i.e. $$\int_0^L e^{2\pi i (n-m) x / L} dx = L$$ if $$m=n$$ and $$0$$ if $$m \neq n$$. This can be used to isolate $$c_m$$: $\int_0^L f(x) e^{-2 \pi i m x / L} dx = \sum_{n = -\infty}^{\infty} c_n \int_0^L e^{2 \pi i n x / L} e^{-2 \pi i m x / L}dx = \sum_{n = -\infty}^{\infty} c_n \int_0^L e^{2 \pi i (n - m) x / L} dx = Lc_m$ Therefore $c_m = \frac{1}{L} \int_0^L f(x) e^{-2\pi imx/L} dx.$ This is the basis of the complex Fourier series. The output of the Fourier transform is the set of $$c_m$$‘s.
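The coefficient formula lends itself to a quick numerical sanity check. Below is a small sketch (my own toy example) that approximates $$c_m$$ with a rectangular sum for $$f(x) = \sin(2\pi x)$$ on $$L = 1$$; the only nonzero coefficients should be $$c_{\pm 1}$$, with magnitude $$1/2$$:

```r
## Approximate c_m = (1/L) * integral of f(x) e^{-2 pi i m x / L} over [0, L]
## with a rectangular sum, for f(x) = sin(2*pi*x) on the interval L = 1.
N <- 1024
x <- (0:(N - 1)) / N                  # N evenly spaced sample points
f <- sin(2 * pi * x)
coef <- function(m) sum(f * exp(complex(imaginary = -2 * pi * m * x))) / N
Mod(coef(1))                          # magnitude 1/2, as expected
Mod(coef(2))                          # essentially zero
```

Since $$\sin(2\pi x) = \frac{1}{2i}(e^{2\pi i x} - e^{-2\pi i x})$$, the formula above gives $$c_1 = -i/2$$, and the rectangular sum recovers it to machine precision.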
The Discrete Fourier Series In practice, we cannot compute each of the infinite $$c_m$$, and it turns out we do not have enough information to do so. In almost all applications, we do not have $$f(x)$$; rather we have $$N$$ evenly-spaced samples from $$f(x)$$ stored in a computer. Unfortunately, this limits how far out in frequency space we can get. For example, consider what happens when analyzing the coefficient of the frequency $$N$$. The complex exponential goes through one full period between consecutive samples, so it takes the same value, 1, at every sample point, which is exactly the same effect as the frequency $$n=0$$. In general, each of the points will be affected in the same way for frequency $$n$$ as for the frequency $$n +N$$, and we get no new information from analyzing beyond $$n = N-1$$. Therefore, in our output we are only looking for frequencies $$0$$ through $$N-1$$. If we have $$N$$ points, we are looking for $$N$$ frequencies, each computed via the discrete sum: $c(n) = \frac{1}{N} \sum_{x = 0}^{N-1} f(x) e^{-2\pi i n x /N}$ This is effectively computing the integral with a rectangular approximation $$\frac{1}{L} \int_0^L f(x) e^{-2\pi i nx/L} dx \approx \frac{1}{L} \frac{L}{N} \sum_{x’=0}^{N-1} f(x’) e^{-2\pi i n x’ / N}.$$ where $$x = (L/N) x’$$. From here, I’m going to drop the normalizing term $$\frac{1}{N}$$ and assume it will be applied at the end. Also, it’s much easier from this point to look at the Fourier transform using a picture. We can imagine the discrete Fourier transform as a circuit taking in the $$N$$ samples as input and giving the $$N$$ Fourier coefficients as output. This is depicted below, with the inputs $$f(x)$$ coming in from the left, and the outputs $$c(n)$$ coming out to the right. Notice that each of the $$N$$ terms of the output needs to compute a sum of $$N$$ terms, and therefore the runtime of this circuit is $$O(N^2)$$.
$\begin{array}{ll|ccc|llll} \hline
f(0) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) & = & c(0)\\
f(1) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i x /8} & =& c(1) \\
f(2) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 2 x /8} & =& c(2)\\
f(3) & \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 3 x /8} & =& c(3) \\
f(4) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 4 x /8} & =& c(4)\\
f(5) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 5 x /8} & =& c(5)\\
f(6) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 6 x /8} & =& c(6)\\
f(7) & \rightarrow & & & & \rightarrow & \frac{1}{8} \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i 7 x /8} & =& c(7)\\
\hline \end{array}$ However, there seems to be redundancy here. Each of these terms is a sum of the $$f(x)$$ terms, each with a complex exponential slapped on, but the complex exponential is periodic. Take $$c(0)$$ and $$c(4)$$ for example. $c(0) = \sum_{x = 0}^7 f(x).$ $c(4) = \sum_{x = 0}^7 f(x) e^{-2\pi i 4 x /8} = \sum_{x = 0}^7 f(x) e^{-\pi i x}$ They are the same sum, except each term of $$c(4)$$ has an extra factor of $$e^{-\pi i x}$$ slapped on. This factor is equal to 1 for even values of $$x$$ and equal to $$-1$$ for odd values of $$x$$.
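Here is a quick sketch of that redundancy with made-up sample values: $$c(4)$$ really is the $$c(0)$$ sum with the signs of the odd terms flipped, and both agree with R's built-in (unnormalized) fft:

```r
## c(0) and c(4) for N = 8, computed straight from the definitions
f <- c(4, 2, 7, 1, 9, 3, 5, 8)        # arbitrary toy sample values
x <- 0:7
c0 <- sum(f)
c4 <- sum(f * exp(complex(imaginary = -pi * x)))
## e^{-pi i x} is just (-1)^x at the integers, so c(4) alternates the signs
Mod(c4 - sum(f * (-1)^x))             # ~0
## R's fft agrees: entries 1 and 5 hold c(0) and c(4)
Mod(fft(f)[c(1, 5)] - c(c0, c4))      # ~0 ~0
```

Note that R's `fft` leaves out the $$\frac{1}{N}$$ normalization, matching the convention adopted above.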
This behavior is independent of $$n$$: if we pair up any Fourier coefficient with the coefficient that is $$N/2$$ indices up from it (for example 0 and 4, 1 and 5, 2 and 6, and 3 and 7), we will find the same relationship since $e^{-2\pi i (n + N/2) x / N} = e^{-2\pi i n x / N} e^{-\pi i x} .$ This suggests a strategy: there are really only $$N$$ products to work out between $$c(n)$$ and $$c(n + N/2)$$ (instead of $$2N$$); we just need to remember to slap a $$-1$$ onto the odd terms for $$c(n+N/2)$$. This leads us to the next symmetry we can exploit. We can divide these two sums into 2 sums of the form “full sum” = “even terms” + “odd terms”. I will do this out for the $$N=8$$ case but you will see how it can easily generalize. For $$c(n)$$ this looks like: \begin{align} c(n) = \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i n x /8} &= \overbrace{\sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n (2x) /8}}^{\mbox{even terms}} + \overbrace{\sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n (2x+1) /8}}^{\mbox{odd terms}} \\ &= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} + \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4} e^{-\pi i n /4}. \end{align} For $$c(n + N/2)$$ this looks like: \begin{align} c(n + N/2) = \sum\limits_{x = 0}^{7} f(x) e^{-2\pi i n x /8}e^{-\pi i x} &= \overbrace{\sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n (2x) /8}}^{\mbox{even terms}} - \overbrace{\sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n (2x+1) /8}}^{\mbox{odd terms}} \\ &= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} - \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4} e^{-\pi i n /4}. \end{align} The two sums in the resulting equations have a remarkable similarity to the original problem: each is itself a Fourier transform of the function using only $$N/2$$ samples, one using the even-indexed samples, the other using the odd-indexed samples.
Defining two new variables for each $$n$$ accordingly, \begin{align} c_e(n) &= \sum\limits_{x = 0}^{3} f(2x) e^{-2\pi i n x /4} \\ c_o(n) &= \sum\limits_{x = 0}^{3} f(2x + 1) e^{-2\pi i n x /4}, \end{align} we immediately see that \begin{align} c(n)& = c_e(n) + c_o(n) e^{-\pi i n/4} \quad \mbox{ and} \\ c(n + N/2)& = c_e(n) - c_o(n) e^{-\pi i n/4}. \end{align} This relationship underlies the “butterfly diagram” that the FFT algorithm is known for. Updating our circuit with $$f_e(x) = f(2x)$$ and $$f_o(x) = f(2x+1)$$:
$\begin{array}{llll|ccc|llll||l} \hline
f(0) & = & f_e(0) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) & = & c_e(0) & c(0) = c_e(0) + c_o(0)\\
f(2) & = & f_e(1) & \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i x /4} & =& c_e(1) & c(1) = c_e(1) + c_o(1) e^{-\pi i / 4}\\
f(4) & = & f_e(2) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i 2 x /4} & =& c_e(2) & c(2) = c_e(2) + c_o(2) e^{-2 \pi i / 4}\\
f(6) & = & f_e(3) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_e(x) e^{-2\pi i 3 x /4} & =& c_e(3) & c(3) = c_e(3) + c_o(3) e^{-3 \pi i /4} \\ \hline
f(1) & = & f_o(0) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) & =& c_o(0) & c(4) = c_e(0) - c_o(0)\\
f(3) & = & f_o(1)& \rightarrow & & \mbox{Discrete Fourier Transform} & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i x /4} & =& c_o(1) & c(5) = c_e(1) - c_o(1)e^{-\pi i / 4} \\
f(5) & = & f_o(2) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i 2 x /4} & =& c_o(2) & c(6) = c_e(2) - c_o(2) e^{-2\pi i /4} \\
f(7) & = & f_o(3) & \rightarrow & & & & \rightarrow & \sum\limits_{x = 0}^{3} f_o(x) e^{-2\pi i 3 x /4} & =& c_o(3) & c(7) = c_e(3) - c_o(3) e^{-3\pi i /4} \\
\hline \end{array}$ We can do this process recursively until we hit the trivial case of a Fourier transform with one sample.
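To see the split in action before coding the full recursion, here is a sketch (using R's built-in fft and made-up sample values) verifying the combination rules $$c(n) = c_e(n) + c_o(n)e^{-\pi i n/4}$$ and $$c(n+4) = c_e(n) - c_o(n)e^{-\pi i n/4}$$ for $$N = 8$$:

```r
f <- c(4, 2, 7, 1, 9, 3, 5, 8)                 # arbitrary N = 8 samples
N <- length(f)
c.full <- fft(f)                               # unnormalized DFT, like our sums
c.e <- fft(f[seq(1, N, by = 2)])               # transform of f(0), f(2), f(4), f(6)
c.o <- fft(f[seq(2, N, by = 2)])               # transform of f(1), f(3), f(5), f(7)
n <- 0:(N/2 - 1)
twiddle <- exp(complex(imaginary = -2 * pi * n / N))   # e^{-pi i n / 4}
max(Mod(c.full[n + 1]       - (c.e + c.o * twiddle)))  # ~0
max(Mod(c.full[n + N/2 + 1] - (c.e - c.o * twiddle)))  # ~0
```

Two half-size transforms plus $$N$$ multiply-adds reproduce the full transform exactly, which is the entire trick.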
The number of times we will recursively call the algorithm is equal to $$\log_2 N$$, because each call halves the number of samples in each transform. At each level of the division, we do $$N$$ constant-time recombinations of the terms in the previous level. Therefore the asymptotic performance of this algorithm should be $$O(N \log N)$$, much faster than $$O(N^2)$$. Let’s code this up and compare runtimes. Again we are going to assume that our input is always going to be a power of two in length. At level $$l$$ back, there are going to be $$2^l$$ Fourier transforms to compute, each with $$N/2^l$$ samples. Indexing the output is a lot of fun, in the dwarf fortress sense, as you can see by the following table:

Level   x     Indices
0       0:7   $$\{x\}$$
1       0:3   $$\{2x\}$$, $$\{2x+1\}$$
2       0:1   $$\{2(2x)\}$$, $$\{2(2x) + 1\}$$, $$\{2(2x + 1)\}$$, $$\{2(2x+1) + 1\}$$

In general the set of indices at any level of the recursion are going to be $$\{2^lx +n\}$$. Each separate Fourier transform has a different value of $$n$$, and for it $$x$$ ranges between $$0$$ and $$N/2^l - 1$$. I suggest working this out yourself to truly grok it.

Implementation

fast.fourier.transform <- function(f) {
  N <- length(f)
  levels <- log(N, 2)
  prev.level <- f
  level <- complex(real = numeric(length(f)))
  for(l in (levels-1):0) {
    for(n in 0:(2^l - 1)) {
      half.num.indices <- 2^(levels - l - 1)
      for(x in 0:(half.num.indices - 1)) {
        ## One catch here to be aware of: in the circuit diagram above, the two
        ## separate Fourier transforms have been placed as if their indices are
        ## together, for ease of reading. In the implementation, these indices
        ## are going to be interleaved, i.e. $c_e(0)$ at $i=0$, $c_o(0)$ at $i=1$
        ## and so on. Therefore we vary x up to half of the indices, with evens=2x
        ## and odds=2x+1.
        level[2^l * x + n + 1] =
          prev.level[2^l * (2*x) + n + 1] +
          prev.level[2^l * (2*x+1) + n + 1] * exp(complex(imaginary = -pi*x/half.num.indices))
        level[2^l * (x + half.num.indices) + n + 1] =
          prev.level[2^l * (2*x) + n + 1] -
          prev.level[2^l * (2*x + 1) + n + 1] * exp(complex(imaginary = -pi*x/half.num.indices))
      }
    }
    prev.level <- level
  }
  level
}

slow.fourier.transform <- function(f){
  N <- length(f)
  result <- numeric(N)
  for(n in 0:(N-1)) {
    for(x in 0:(N-1)) {
      result[n+1] = result[n+1] + f[x+1] * exp(complex(imaginary = -2*pi * x * n / N))
    }
  }
  result
}

Testing

Let’s test these two functions. Do they give the correct result? Is the fast Fourier transform really faster? For the first question, I’m going to test the two new functions against very simple Fourier series compositions, and also against R’s fft function. For the second question I will test the runtimes with several different length transforms and then compute the relationship between log time and log length. The slow Fourier transform should have a slope of 2; the fast Fourier transform should have a slope that is only slightly higher than 1. What are 4 simple series to test?
How about, for $$x=0..15$$:

Function 1: $$f(x) = \sin(\frac{2\pi}{16} x)$$
Function 2: $$f(x) = \sin(\frac{2\pi}{16} x) - \cos(\frac{2\pi}{16}3 x) + \sin(\frac{2\pi}{16} 6 x)$$
Function 3: $$f(x) = \sum_{n = 0}^{15} n \sin(\frac{2 \pi }{16}n x)$$
Function 4: $$f(x) = \sum_{n = 0}^{15} \sin(\frac{2\pi}{16} n )\sin(\frac{2 \pi}{16}n x)$$

library(ggplot2)
library(dplyr)
library(tidyr)

plot_theme <- theme(panel.background = element_blank(),
                    panel.grid.major = element_line(color = "grey90", size = 0.2),
                    panel.grid.minor = element_line(color = "grey95", size = 0.5),
                    strip.text.x = element_text(color = "white"),
                    strip.background = element_rect(fill = "black"),
                    axis.text = element_text(color = "black"),
                    axis.title = element_text(face = "bold"),
                    title = element_text(face = "bold"))

## Here are the functions we want to test, they should have very clear Fourier transforms
function.1 <- function(x) sin(2*pi*x/16)
function.2 <- function(x) sin(2*pi*x/16) - cos(2*pi*3*x/16) + sin(2*pi*6*x/16)
function.3 <- function(x) sapply(x, function(s) sum(sapply(0:15, function(n) n * sin(2*pi*n*s/16))))
function.4 <- function(x) sapply(x, function(s) sum(sapply(0:15, function(n) sin(2*pi*n/16) * sin(2*pi*n*s/16))))

tests <- list("Function 1" = function.1, "Function 2" = function.2,
              "Function 3" = function.3, "Function 4" = function.4)

## for each of these functions, we want to apply the function
test.data <- bind_rows(lapply(1:length(tests), function(i) {
  fun <- tests[[i]]
  name <- names(tests)[[i]]
  x <- 0:15
  bind_rows(
    data.frame(x = x, y = fun(x), func = name, transform = "None"),
    data.frame(x = x, y = abs(fft(fun(x))), func = name, transform = "fft"),
    data.frame(x = x, y = abs(slow.fourier.transform(fun(x))), func = name, transform = "slow.fourier.transform"),
    data.frame(x = x, y = abs(fast.fourier.transform(fun(x))), func = name, transform = "fast.fourier.transform"))
}))

test.data$transform <- factor(test.data$transform, levels = c("None", "slow.fourier.transform",
"fast.fourier.transform", "fft"))

ggplot(data = test.data, aes(x=x, y = y)) + geom_point() +
  facet_grid(func~transform, scale = "free") + theme_bw()

As you can see via the above plot, the code that we’ve written agrees with R’s built-in fft function, which is good. Also good is that it gives the output that we expect. You might be wondering why the first row of plots contains two non-zero dots each - wasn’t there just one frequency included in function 1? Remember that to put in this function, we used $$\sin(x) = (e^{ix} - e^{-ix})/2i$$; the other point that shows up is the “negative” frequency, which appears on the other side due to the periodic symmetry of the complex exponential.

Timing

Now let’s time the functions against several different length inputs and see how they behave as the length of the sequences gets large.

sequence.lengths <- sapply(0:13, function(x) 2^x)
timing.test <- data.frame(
  length = sequence.lengths,
  slow = sapply(sequence.lengths, function(l) {
    dat <- 1:l; system.time({slow.fourier.transform(dat)})[["elapsed"]]}),
  fast = sapply(sequence.lengths, function(l) {
    dat <- 1:l; system.time({fast.fourier.transform(dat)})[["elapsed"]]}),
  fft = sapply(sequence.lengths, function(l) {
    dat <- 1:l; system.time({fft(dat)})[["elapsed"]]}))
timing.test <- gather(timing.test, func, time, slow, fast, fft)

Normal plot:

ggplot(data = timing.test, aes(x = length, y = time, color = func)) + geom_point() + theme_bw()

Log-log plot:

ggplot(data = timing.test, aes(x = log(length), y= log(time), color = func)) +
  geom_point() + theme_bw() + geom_smooth(method = "lm") + coord_fixed(ratio = 1)

## Warning: Removed 22 rows containing non-finite values (stat_smooth).
Linear models on the log-log data

slow <- subset(timing.test, func %in% "slow" & time != 0)
fast <- subset(timing.test, func %in% "fast" & time != 0)
fft <- subset(timing.test, func %in% "fft" & time != 0)

lm.slow <- lm(log(time) ~ log(length), data = slow)
lm.slow

##
## Call:
## lm(formula = log(time) ~ log(length), data = slow)
##
## Coefficients:
## (Intercept)  log(length)
##     -11.889        1.974

lm.fast <- lm(log(time) ~ log(length), data = fast)
lm.fast

##
## Call:
## lm(formula = log(time) ~ log(length), data = fast)
##
## Coefficients:
## (Intercept)  log(length)
##     -10.334        1.132

lm.fft <- lm(log(time) ~ log(length), data = fft)
lm.fft

##
## Call:
## lm(formula = log(time) ~ log(length), data = fft)
##
## Coefficients:
## (Intercept)  log(length)
##      -9.855        1.078

Via the coefficient on log(length) we see that the slow version has slope approximately equal to 2, and the fast version has a slope slightly greater than one. I’m especially surprised that my version in plain R seems to perform nearly as fast as R’s native fft function, which I had assumed was programmed in optimized C. Thus we have derived a way in which the discrete Fourier transform can be computed in $$O(N\log N)$$ time.

Further Reading

Gilbert Strang gives a really good lecture on the FFT, using a matrix factorization approach which I didn’t cover here. (Unfortunately it lost me around 37:25.) link

Citations

[1] Fourier, Joseph (1822). Théorie analytique de la chaleur (in French). Paris: Firmin Didot Père et Fils. OCLC 2688081. link
[2] Cooley, James W.; Tukey, John W. (1965). “An algorithm for the machine calculation of complex Fourier series”. Math. Comput. 19: 297–301. doi:10.2307/2003354. link
[3] Cooley, J. W. (1987). The re-discovery of the fast Fourier transform algorithm. Microchimica Acta, 93(1-6), 33-45.
link

Posted in: Uncategorized

A Tutorial on Principal Component Analysis February 6, 2017

Introduction Machine learning is a buzzword these days. Just dropping it can impress people without requiring that you elaborate, perhaps because it feeds their fantasies of a robot takeover. Investors are particularly vulnerable because they get to act on these fantasies. The following exchange approximates one that actually happened in real life: “We’ve got a couple cutting edge projects coming down the pipeline: cloud computing and machine learning!” “Wow! I’m terribly excited about all this! Hail our new overlords! Take my money!” Notice that money changed hands without inquiry into what these two concepts were being used for. All this means is that, whoever you are, it is especially important to become familiar with the real tools of machine learning and data science, what problems they are meant to solve, and how they solve them. This is a tutorial on “principal component analysis”, or PCA, a technique used to reduce redundancy in your data. In any given dataset, you might have two or more measurements that actually measure the same thing. One example would be a dataset measuring students’ hours spent doing schoolwork and their GPA. These 2 variables could be interpreted as measuring 1 underlying quality of a student: how much of a “good student” they are. We might want to reduce these 2 factors into this “good student” measure because this greatly simplifies statistical analysis. That’s where PCA comes in. PCA rotates your data’s coordinate system to one that aligns with the important questions, i.e.
whether a student is a good student or not (GPA + hours worked), or how naturally smart or efficient they are (GPA - hours worked). It also gives a score on how important these factors are. In this tutorial I am going to set up a situation where there is one underlying phenomenon: the motion of a spring in one direction. I’m going to duplicate and obfuscate this phenomenon by having a fictional group of scientists measure it with several different cameras. Then I’m going to use PCA to isolate the signal.

Setting up redundant data

Let’s say we have a ball attached to a spring. The ball’s motion might be described as $x(t) = \cos(t)$ around the equilibrium point. I can model this in R.

library(ggplot2)
library(tidyr)
library(dplyr)
theme_set(theme_bw()) # set default theme to black-white

t <- seq(0, 6*pi, by = .1)
x <- cos(t)
qplot(x = t, y = x)

The above plot is how the position $$x$$ varies with time. Of course, in the real world the ball has 3 spatial coordinates and every measurement has noise. To account for these I’m going to add $$y$$ and $$z$$ coordinates, which are both 0, and I’m also going to add noise in every direction.

positions <- data.frame(t = seq(0, 6*pi, by = .1),
                        x = x + rnorm(length(x), sd = .1),
                        y = rnorm(length(x), sd = .1),
                        z = rnorm(length(x), sd = .1))

ggplot(data = gather(positions, variable, value, x:z), aes(x = t, y = value)) +
  geom_point() + facet_wrap(~ variable, nrow = 1)

Now you can see 3 relationships: one mostly signal, 2 mostly noise. Now let’s say a team of scientists have set up 3 cameras to observe this process. They do not have privileged access to the coordinate system that we know of, where the $$x$$ coordinate contains all the signal. Rather, for each camera they get two coordinates: the $$x$$ and $$y$$ of the ball’s location in the camera’s image plane. For simplicity’s sake, I’m going to assume each camera is pointed directly at the midpoint of the ball’s motion.
I’m going to assert, and you can prove to yourself, that a camera’s image plane coordinates can be expressed by rotating and scaling the original coordinate system, and then projecting the data onto two of the resulting dimensions. One side note: in the process of writing this post, I discovered that generating a random 3D rotation is actually super annoying. I spent a little bit of time trying to tie the concept down and this is what I’ve come up with: a rotation can be uniquely specified by an axis and an angle to rotate around that axis. You can read about all that here and here, or you can just steal my code.

## random 3d rotation matrix
random.rotation.matrix <- function(seed) {
  set.seed(seed)
  ## angle around vector to rotate
  psi <- runif(1, max = 2*pi)
  ## select a random point on S^2 (cos(theta) uniform on [-1, 1])
  phi <- runif(1, max = 2*pi)
  theta <- acos(2 * runif(1) - 1)
  ## construct axis from random S^2 point
  axis <- c(sin(theta)*cos(phi), sin(theta)*sin(phi), cos(theta))
  ## cross product matrix for formula
  axis.cp <- matrix(c(0, -axis[3], axis[2],
                      axis[3], 0, -axis[1],
                      -axis[2], axis[1], 0),
                    nrow = 3, byrow = TRUE)
  ## create rotation matrix using wikipedia formula
  R <- cos(psi) * diag(c(1,1,1)) + sin(psi) * axis.cp + (1-cos(psi)) * outer(axis, axis)
  R
}

I’ve tested the above function and it does seem to create random rotations. So let’s go ahead and find the new coordinates!
new.camera.data <- function(positions, seed) {
  set.seed(seed)
  ## original data
  original.coordinates <- t(as.matrix(select(positions, x, y, z)))
  ## get scale
  scale <- 1/rexp(1, 1/3)
  ## get rotation
  rotation <- random.rotation.matrix(seed)
  ## new points
  new.points <- t(scale * rotation %*% original.coordinates)
  ## project
  image.plane.projection <- new.points[,1:2]
  list(scale = scale, rotation = rotation, data = image.plane.projection)
}

camera.1 <- new.camera.data(positions, 1)
camera.2 <- new.camera.data(positions, 2)
camera.3 <- new.camera.data(positions, 3)

camera.data = data.frame(t = seq(0, 6*pi, by = .1),
                         x1 = camera.1$data[,1], y1 = camera.1$data[,2],
                         x2 = camera.2$data[,1], y2 = camera.2$data[,2],
                         x3 = camera.3$data[,1], y3 = camera.3$data[,2])

ggplot(data = gather(camera.data, variable, value, x1:y3), aes(x = t, y = value)) +
  geom_point() + facet_wrap(~ variable, nrow = 2)

It now looks like there are several signals here, but the scientists are suspicious because they seem very correlated. How will they isolate the signal?

Finding the Principal Components

PCA is based on the hypothesis that signals are things that cause your data to vary. Therefore, directions that possess large signals will also have data that varies in that direction. This leads us to the fundamental assumption of PCA: Fundamental Assumption of PCA: The most important directions are the directions in which the data varies most. Note that this implicitly assumes that the signal to noise ratio, $\frac{\sigma_{signal}^2}{\sigma_{noise}^2} > 1.$ Only then will the signal directions be expected to be the directions of greatest variation. In many real-world applications, for example stock price forecasting, the noise overwhelms the signal. It would be an interesting exercise to see how this simple example breaks down as you add noise. Thus, PCA becomes an optimization problem. Let the matrix of data with each column translated so that it is mean 0 ($$a \to a - mean(a)$$) be denoted $$\mathbf X$$.
You will soon see why this translation is important. Since translation does not affect variance, we are trying to find some unit vector $$\mathbf u$$ such that the variance of $$\mathbf X \mathbf u$$ is maximized, $\mbox{Maximize Var(}{\mathbf X \mathbf u} \mbox{)} \quad \mbox{ subject to } \quad ||\mathbf u|| = 1,$ repeatedly, such that each $$\mathbf u_i \cdot \mathbf u_j = 0$$ for $$i \neq j$$, until our directions span the space of the data. To solve this, let’s talk about variance for a bit. Consider two sets of measurements $\mathbf a = \{a_1, a_2,..., a_n\} \quad \mbox{ and } \quad \mathbf b = \{b_1, b_2,...,b_n\}$ with means of $$\mu_a$$ and $$\mu_b$$, respectively. The variance and covariance are defined as average products of deviations from the mean: $\mbox{Var}(\mathbf a) = \sigma_{\mathbf a}^2 = \frac{1}{n} \sum_i(a_i - \mu_a)^2 = \frac{1}{n} \sum_i a_i^2 = \frac{1}{n} \mathbf a \cdot \mathbf a \quad \mbox{if } \mu_a = 0,$ $\mbox{Cov}(\mathbf a, \mathbf b) = \sigma_{\mathbf{ab}}^2 = \frac{1}{n} \sum_i(a_i - \mu_a)(b_i - \mu_b) = \frac{1}{n} \sum_i a_ib_i = \frac{1}{n}\mathbf a \cdot \mathbf b \quad \mbox{if } \mu_a =0 \mbox{ and } \mu_b = 0.$ So the variance is equal to the covariance of a variable with itself, $$\mbox{Var}(\mathbf{a}) = \mbox{Cov}(\mathbf a, \mathbf a)$$. Since all of $$\mathbf X$$’s columns $$\mathbf x_i$$ are mean zero, we can define a matrix $$\mathbf C$$ where the $$[i,j]$$th member is the covariance of $$\mathbf x_i$$ and $$\mathbf x_j$$. $\frac{1}{n} \mathbf X^T \mathbf X = \begin{bmatrix} \frac{1}{n} \mathbf x_1 \cdot \mathbf x_1 & \cdots & \frac{1}{n} \mathbf x_1 \cdot \mathbf x_r \\ \vdots & \ddots & \vdots \\ \frac{1}{n} \mathbf x_r \cdot \mathbf x_1 & \cdots & \frac{1}{n} \mathbf x_r \cdot \mathbf x_r \end{bmatrix} = \mathbf C.$ Let’s do this now with our camera measurements in R.

options(width = 120)
measurements <- select(camera.data, -t)
measurements <- mutate_each(measurements, funs(.
- mean(.)))
measurements <- as.matrix(measurements)
covariance.matrix <- t(measurements) %*% measurements / nrow(measurements)
covariance.matrix

##               x1           y1            x2           y2            x3           y3
## x1  0.0021234862 -0.005043854 -0.0002620748 -0.002255248 -0.0005278387 -0.002325504
## y1 -0.0050438535  0.092671234  0.0166668085  0.034679534  0.0217014469  0.034022588
## x2 -0.0002620748  0.016666808  0.0032414307  0.006126479  0.0041287510  0.005953620
## y2 -0.0022552479  0.034679534  0.0061264792  0.013067542  0.0079891875  0.012826169
## x3 -0.0005278387  0.021701447  0.0041287510  0.007989187  0.0053130622  0.007799908
## y3 -0.0023255041  0.034022588  0.0059536201  0.012826169  0.0077999083  0.012612169

Let’s also reframe our optimization problem in terms of this nice, square, symmetric matrix: $\mbox{Var}(\mathbf{X} \mathbf u) = \frac{1}{n}(\mathbf{X} \mathbf u)^T (\mathbf{X} \mathbf u) = \mathbf u^T (\frac{1}{n} \mathbf{X}^T \mathbf{X}) \mathbf u = \mathbf u^T \mathbf{C} \mathbf u.$ Therefore we are trying to find a $$\mathbf u$$ that maximizes the matrix product with the covariance matrix, $$\mathbf u^T \mathbf C \mathbf u$$, such that $$\mathbf u$$ is a unit vector and perpendicular to the other solutions we have found. Remember that $$\mathbf C$$ is a square, symmetric matrix, and its set of unit eigenvectors $$\{\mathbf v_1, \mathbf v_2,\dots, \mathbf v_r\}$$ are orthogonal, i.e. $$\mathbf v_i \cdot \mathbf v_j = 0$$ for $$i \neq j$$, and span its column and row spaces. Therefore we can express $$\mathbf u$$ as some combination of the orthogonal unit eigenvectors of $$\mathbf C$$: $\mathbf u = a_1 \mathbf v_1 + a_2 \mathbf v_2 + \dots + a_r \mathbf v_r,$ where $$\sum a_i^2 = 1$$ to make the result length 1. What happens when we make this substitution into the quantity we are trying to optimize?
\begin{align} \mathbf u^T \mathbf C \mathbf u &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T) \mathbf C (a_1 \mathbf v_1 + a_2 \mathbf v_2 + \dots + a_r \mathbf v_r) \\ &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T) (a_1 \mathbf C \mathbf v_1 + a_2 \mathbf C \mathbf v_2 + \dots + a_r \mathbf C \mathbf v_r) \\ &= (a_1 \mathbf v_1^T + a_2 \mathbf v_2^T + \dots + a_r \mathbf v_r^T) (a_1 \lambda_1 \mathbf v_1 + a_2 \lambda_2 \mathbf v_2 + \dots + a_r \lambda_r \mathbf v_r) \\ &= a_1^2 \lambda_1 + a_2^2 \lambda_2 + \dots + a_r^2 \lambda_r \quad \mbox{because the } \mathbf v \mbox{'s are orthogonal.}\end{align} This is very interesting. Letting $\{\alpha_1, \alpha_2, ..., \alpha_r \} = \{a_1^2, a_2^2, ..., a_r^2\},$ we see that we are trying to solve an optimization of the form \begin{align} \mbox{Maximize} \quad \alpha_1 \lambda_1 + &\alpha_2 \lambda_2 + \dots + \alpha_r \lambda_r \\ \mbox{Subject to} \quad \sum_i \alpha_i &= 1 \\ \mbox{All } \alpha_i & \geq 0 \end{align} This is a linear programming problem, where all solutions will be on the corners of the space. On the corners, the inequalities are tight, so the solutions are where $$r-1$$ $$\alpha$$’s are 0, and exactly one $$\alpha_i = 1$$, so $$a_i = \sqrt{\alpha_i} = 1$$. So $$\mathbf u_i = \mathbf v_i$$, where the $$i$$th most important direction has the $$i$$th largest eigenvalue. The solutions are the eigenvectors of the covariance matrix. Since the eigenvectors are orthogonal, our coordinate system is orthogonal. Pretty cool result, if you ask me. Isolating the signal Some people use the singular value decomposition (SVD) to get the eigenvectors and eigenvalues. If there is a deeper reason to use SVD, feel free to email me! I am curious. Here, I’m just going to use the R function eigen to grab the eigenvectors. First I’m going to look at the eigenvalues. I expect to see 1 large eigenvalue, and 5 much smaller values.
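Before that, a quick sanity check of the derivation above, on toy data I'm making up here (separate from the camera setup): the top eigenvector of a covariance matrix should beat any random unit vector at maximizing $$\mathbf u^T \mathbf C \mathbf u$$.

```r
set.seed(42)
X <- matrix(rnorm(200 * 3), ncol = 3)
X[, 1] <- X[, 1] * 5                         # inflate variance along one axis
X <- scale(X, center = TRUE, scale = FALSE)  # mean-center the columns
C <- t(X) %*% X / nrow(X)
v1 <- eigen(C)$vectors[, 1]                  # direction of the largest eigenvalue
best <- drop(t(v1) %*% C %*% v1)
random.scores <- replicate(1000, {
  u <- rnorm(3); u <- u / sqrt(sum(u^2))     # random unit vector
  drop(t(u) %*% C %*% u)
})
all(best >= random.scores)                   # TRUE: no random direction does better
```

The maximized variance `best` is also exactly the largest eigenvalue, which is the linear-programming result in code form.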
What I’m going to do then is get the eigenvectors, project the data into that basis, and plot the results.

important.directions <- eigen(covariance.matrix)
important.directions$values

## [1] 1.265056e-01 2.491410e-03 3.188713e-05 1.083847e-17 3.857996e-18 -2.696033e-18

The first eigenvalue is about 50x bigger than the second, which is itself nearly 80x bigger than the next, and the rest are effectively zero. This is exactly what I hoped to see.

projected <- as.data.frame(measurements %*% important.directions$vectors)
colnames(projected) <- paste0("component", 1:6)
projected$t <- camera.data$t
plot.data <- gather(projected, component, value, component1:component6)
ggplot(data = plot.data, aes(x = t, y = value)) + geom_point() + facet_wrap(~component)

We’ve not only isolated the signal, we’ve isolated the noise as well! PCA has saved the scientists from the complicated analysis of 6 signals, and given them only one to work with. Problem solved.

Further reading

Jonathon Shlens’ Tutorial on PCA
Stack exchange post on PCA

Posted in: Machine Learning Tagged: PCA, R

Alice, The Alien, and the Illusion of the Twin Paradox January 17, 2017

Alice is traveling through deep space with her alien navigator. They are delivering a shipment of water to Proxima Centauri at a solid fraction of light speed. They had been traveling in deep space at constant speed for so long that it was as if they were standing still. Suddenly, and with a flash of light, they pass another interstellar traveler, one Robert, heading back to Earth. Sensor readings on Alice’s ship have him traveling at 3/5 light speed relative to them. “Bob’s foot is made of solid lead,” the Alien commented.
“It better be when he has to slow down.” Alice took a moment to consider the significance of this event. “Reality is so strange. Bob is speeding past us right now. According to special relativity, time should be passing slower for him. However, special relativity also says that he has every right to say that he is stationary and we are moving, and therefore time is passing slower for us. Who is right?”

“Both.” The Alien blinked.

Alice continued. “One of us could start accelerating towards the other to meet up and find out which one was aging faster, but whoever does the acceleration will find that they had aged as if they were the ones moving the whole time! It’s almost like acceleration breaks the symmetry.”

The Alien squinted his big black eyes. “Well, I guess.”

“Don’t you think that’s weird? That’s how the real world works!”

The Alien shrugged, looking back at his console. “Not really.”

Alice sighed. She wasn’t particularly surprised that the Alien wasn’t playing along. He had a habit of shooting down her conversation topics. She went on. “I guess the core of the paradox is that, while we are both moving at constant speed, we both have an equal right to say that we are the more aged person when we arrive at the station. Special relativity says that we are both right, but how could that be possible?”

The Alien turned. “What do you mean when we arrive at the station?”

“I mean the moment in time when our ship’s coordinates become coincident with the station’s coordinates,” said Alice, frustrated.

“What do you mean moment in time?” asked the Alien, with a grin. “I’m going to go re-hydrate some noodles.” He clapped his hands excitedly and went to go pull out some noodles from the freezer.

Now the Alien was just being obtuse. Alice was determined to get through his philosophical meanderings.
“I mean the slice of all points in space-time where every synced clock traveling parallel to us at the same speed would read the same number as our clock.”

“But-”, the Alien began, but Alice interrupted him, anticipating his question and being a gifted experimentalist. “You could easily sync clocks using a laser. All you need to do is fire a pulse at the clock. When the reflection comes back, you know that the clock is exactly half of the elapsed light-seconds away and that it has advanced by exactly half the number of seconds the total journey took. Add that number to the reflected value on the clock and then you are synced.”

Alice drew a diagram. “Here’s a chart with time on the y-axis and distance in line with the direction to the clock on the x-axis. I’ve scaled the axes so that light moves at 45 degree angles. Here you can see how the mechanism works. The red line is the laser pulse. All the spacetime points on the horizontal dotted line make up the moment in time, the present, when we hit the station, because any laser pulse reflected off a clock at any of those points makes a 45-45-90 triangle, whose altitude from the 90° vertex bisects the base. Do you get it?”

The Alien nodded. Alice continued. “In fact, we don’t even need a clock. We know that when we receive the reflection of the laser, whatever it bounced off of must have been simultaneous with whatever was happening on our ship exactly halfway through the round trip. We could even do this with Bob. I know his speed relative to me is 3/5 light speed. In 30 months, when we arrive at the station, he will be 18 light-months away. I can send out a laser pulse in 12 months that will hit him exactly at (30 months, 18 light-months), which is when I arrive at the station. It will return to me. What will the reflection read out? Well, I remember a special trick for measuring straight-line clock-time from special relativity class: $\tau = \sqrt{t^2 - x^2}$, where $x$ is in light-units.
So his clock reads $\sqrt{30^2 - 18^2} = 24$ months greater than the time when we crossed paths, confirming that he has aged less than the 30 months we will have aged!”

“But how does that make any sense?” asked Alice, still stumped. “Perhaps Bob’s reference frame is in a different universe. Maybe we are now split into multiverses.”

“Hold on a second,” pleaded the Alien. “Bob can do the same to us. He can wait 12 months, which due to time dilation is 15 months for us, and send a laser pulse towards us. When he receives the reflection, he knows he can go back half the round-trip time to the exact moment on his timeline where the laser hit us. And it seems the laser hits us at… 24 months, which is perfectly symmetrical, but…” He edited the diagram. “Bob knows that the laser pulse hit you when your clock had elapsed 24 months, and that must be at the same time as when his clock read 30 months. However, these two events do not happen at the same time in our reference frame. There is no paradox, but Bob’s present is tilted relative to ours. That’s what causes the illusion of contradiction. At the event at which you have accused Bob of being younger than us, to him you are not even close to reaching the station.” The Alien said, translating Bob’s present line down.

The Alien plopped a bowl of noodles in front of Alice. She was shocked. “So we are both right, in our own reference frames, and there isn’t any contradiction because the present is relative?” she asked. She paused for a moment, tired of thinking. “Thanks for the soup.”

“You’re welcome.” said the Alien. They sipped in silence.

All diagrams made using the diagrams library by Brent Yorgey.

Note: This post closely follows the approach of Tim Maudlin in his book Philosophy of Physics: Space and Time (Princeton Foundations of Contemporary Philosophy). It is an excellent book. Maudlin is concrete where many philosophers are abstract and fluffy. He also understands physics better than many physicists.
The great Richard Feynman gave an incorrect, acceleration-symmetry-breaking explanation of this phenomenon.

Posted in: Philosophy Tagged: Philosophy, Physics, Relativity

Simple RNA folding in 130 lines of Haskell January 9, 2017

How do we predict how RNA folds? First of all, we need to define what we mean by folding. RNA is a polymer made up of nucleic acids: adenine, guanine, cytosine, and uracil. There are 3 levels of RNA structure that are traditionally considered:

primary structure: the sequence of bases, ‘A’, ‘G’, ‘C’, and ‘U’, that form the strand of RNA, ‘AAAAGGGGCCCCUUUU’ for example,

secondary structure: which bases are hydrogen bonded together, ‘A’ to ‘U’ and ‘G’ to ‘C’, which can be specified by a sequence of pairs corresponding to which indices are paired together, (1,24), (2,23), …, and

tertiary structure: how the molecule bends on itself on a larger scale.

Usually, primary structure is already known or given as an input, and tertiary structure is hard to specify and compute the energy for. Secondary structure is easy to specify using a sequence of pairs, and its energy is roughly dominated by hydrogen bonding between bases, which is discrete. Since RNA functions mostly by binding to things using exposed (non-paired) bases, the secondary structure is also the most relevant when discerning function. For these reasons, biophysicists usually focus on predicting the secondary structure of RNA.

From the laws of thermal physics, if RNA is modeled as a system in thermal equilibrium, then the probability of any state $s$ with free energy $E(s)$ (measured here in units of $k_B T$) is equal to

$$P(s) = \frac{e^{-E(s)}}{\sum_{s' \in \Omega} e^{-E(s')}}$$

where $\Omega$ is the full set of states. The denominator is called the partition function, usually denoted $Z$; it normalizes the probability distribution, and with it we can compute the probability of any structure. How many terms are in the partition function? How many states are in $\Omega$? It depends on what assumptions you make. In full generality, even only allowing Watson-Crick base pairs, there are exponentially many terms in that sum.
However, if you assume that there are no crossing pairs, i.e. arcs that would intersect in the above diagram, pairs $(i, j)$ and $(k, l)$ such that $i < k < j < l$, there are far fewer terms in the sum. This is still an exponential quantity, but it turns out the sum is actually computable in polynomial time via dynamic programming. This assumption is called the no-pseudoknot assumption, and it means that if I have an $(i, j)$ pair, then for all bases in between them, I can ignore all the bases outside of them when enumerating structures. This raises the question of whether it is a physically valid assumption. The answer is NO. There are many RNAs found in nature that have “crossing” pairs, such as Group I and Group II introns. However, without this assumption, computing the partition function is intractable, so most RNA folding software packages make it.

While it is tempting to do analysis on the structure with the minimum free energy and highest probability, RNA is a thermal system with an exponential multiplicity of states: even the probability of the most probable state is extremely low, so any quantity computed on that single state makes a negligible contribution to the physical expected value of that quantity. However, I will show here that not only can we compute the partition function, we can also sample from the Boltzmann distribution. Once we are able to do that, we can compute arbitrary expected values on the full Boltzmann distribution via Monte Carlo integration. (In fact, Monte Carlo integration is not even needed if you can express your quantity in terms of the partition function values, which can give you full expected values without statistical error! But that is outside the scope of this post.)
Putting this all together, to predict the folding and general properties of RNA we need 4 things:

a specification of a state of RNA, $s$,

an energy function $E(s)$, which takes a state and outputs a physical energy,

the partition function $Z$, the sum of the Boltzmann factors $e^{-E(s)}$ over every possible state $s \in \Omega$, and

an algorithm for sampling structures from the Boltzmann distribution, $P(s) = e^{-E(s)}/Z$.

Quick Code Note

All code going forward will be in Haskell. You should be able to extract all the code here into a file and it should compile. To set up our environment, make sure to have this header at the top of the file:

{-# LANGUAGE NoMonomorphismRestriction #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Array
import Data.Time.Clock.POSIX
import Data.Colour
import System.Random
import Data.List
import Diagrams.Prelude
import Diagrams.Backend.SVG.CmdLine

States of RNA

Because of the no-pseudoknot assumption, we can actually specify the states using a recursive data structure.

data Structure = Paired Char Char (Structure, Structure)
               | Slip Char Structure
               | None
               deriving Show

The thought behind this definition is that we are specifying the structure base by base. If base $i$ is unpaired, then we move on to the next base. If base $i$ is paired to some base $k$, then we fork the structure into the bases underneath the pair and the bases after the pair. We carry along the letter of the base for convenience and to compute the energies of any pairing that might happen.

Energy Model

A fully realistic energy model of RNA would be very difficult to compute. There are exposed ions everywhere on the strand which interact with each other, and there is tension in the backbone. Many of these interactions are non-linear, which breaks the recursion and therefore the computability of the problem.
In practice, there are several levels of complexity that are implemented which try to account for the non-linearities in a computable fashion (this actually is the only difference between professional RNA secondary structure prediction and an Algorithms homework problem). Here I will implement a very simple energy model: adding the free energies of each hydrogen bond. A hydrogen bond contributes about $-2.3\ k_B T$ of free energy. Adenine-uracil bonds have 2 hydrogen bonds, guanine-cytosine bonds have 3. Therefore, in the exponent of the Boltzmann factor, I will put -4.6 for A-U bonds and -6.9 for G-C bonds. Anything else is represented by infinite energy, meaning that it will never happen.

baseEnergy :: Char -> Char -> Double
baseEnergy a b = case (a, b) of
  ('A', 'U') -> -4.6
  ('G', 'C') -> -6.9
  ('U', 'A') -> -4.6
  ('C', 'G') -> -6.9
  otherwise  -> 1.0/0.0 -- non-allowed pairs treated like an infinite hill

energyModel :: Structure -> Double
energyModel (Paired a b (s1,s2)) =
  (baseEnergy a b) + -- energy of pair
  (energyModel s1) + -- energy of substructure under pair
  (energyModel s2)   -- energy of substructure after pair
energyModel (Slip _ s) = energyModel s
energyModel None = 0

Partition Function

The sum initially seems daunting. However, from the recursive definition of our energy model, we can easily see that the partition function can be defined recursively as well. To be precise, if we define $Z_{i,j}$ as the sub-partition-function from base $i$ to base $j$, it must be equal to the sum of the Boltzmann terms of all structures where $i$ is unpaired and all structures where $i$ is paired to some base $k$:

$$Z_{i,j} = Z_{i+1,j} + \sum_{k=i+1}^{j} b_{i,k}\, Z_{i+1,k-1}\, Z_{k+1,j}$$

where $b_{i,k} = e^{-E(i,k)}$ is the Boltzmann factor of the $(i,k)$ pair and $Z_{i,j} = 1$ whenever $i \geq j$. This becomes a standard dynamic programming problem. We can turn $Z$ into a table of values with $n^2$ cells, where each cell $Z_{i,j}$ only depends on the entries for shorter stretches of the strand. We can compute the full partition function by starting from the shortest stretches and working out to $Z_{0,n-1}$ in $O(n^3)$ time. This is even parallelizable along the diagonals (I’m not going to parallelize it here).
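The recursion is compact enough to sketch in Python as a cross-check before the Haskell version (this is my own illustration; names like partition_function are mine, and memoized recursion stands in for the explicit table):

```python
import math
from functools import lru_cache

# Toy pair free energies from the post, in units of kT.
PAIR_ENERGY = {('A', 'U'): -4.6, ('U', 'A'): -4.6,
               ('G', 'C'): -6.9, ('C', 'G'): -6.9}

def partition_function(strand):
    """Z(i,j) = Z(i+1,j) + sum over k of b(i,k) * Z(i+1,k-1) * Z(k+1,j)."""
    @lru_cache(maxsize=None)
    def Z(i, j):
        if i >= j:
            return 1.0  # empty or single-base stretch: one trivial structure
        total = Z(i + 1, j)  # base i unpaired (the "slip" case)
        for k in range(i + 1, j + 1):
            e = PAIR_ENERGY.get((strand[i], strand[k]))
            if e is not None:  # base i paired with base k
                total += math.exp(-e) * Z(i + 1, k - 1) * Z(k + 1, j)
        return total
    return Z(0, len(strand) - 1)

# "AU" has exactly two structures: both bases free, or one A-U pair.
assert abs(partition_function("AU") - (1 + math.exp(4.6))) < 1e-9
```

The tiny "AU" example can be checked by hand: the sum has one term of weight 1 for the fully unpaired structure and one term of weight $e^{4.6}$ for the single A-U pair.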
type PFunArray = Array (Int, Int) Double

boltz :: Double -> Double
boltz x = exp (-x)

partitionFunction :: String -> PFunArray
partitionFunction strand = arr
  where len = length strand
        arr = array ((0,0), (len - 1, len - 1))
                    [((i,j), (pfCell i j strand arr)) | i <- [0..len-1], j <- [i..len-1]]

pfCell :: Int -> Int -> String -> PFunArray -> Double
pfCell i j strand arr = if (i==j) then 1 else (sum pairTerms) + slipTerm
  where pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                       (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                       (if j-k > 1 then arr ! (k+1, j) else 1)
        pairTerms = [pairTerm i k | k <- [(i+1)..j]]
        slipTerm = if i == j then 1 else arr ! (i+1, j)

Sampling RNA Structures

To actually predict the structure of RNA molecules in nature, we can work backward through the partition function table to sample structures from the Boltzmann distribution. Consider the stretch of the strand from base $i$ to base $j$:

the probability that $i$ is unpaired is equal to the sum of the probabilities of every structure along $i$ to $j$ with $i$ unpaired, or $Z_{i+1,j}/Z_{i,j}$,

the probability that $i$ is paired to base $k$ is equal to the sum of the probabilities of every structure along $i$ to $j$ where $i$ and $k$ are paired, or $b_{i,k}\, Z_{i+1,k-1}\, Z_{k+1,j}/Z_{i,j}$, where $b_{i,k}$ is the Boltzmann factor of the $(i,k)$ pair, and

completing the rest of the structure can be achieved by calling the sampling function recursively for every stretch of the strand left general: $i+1$ to $j$ for the unpaired case, $i+1$ to $k-1$ and $k+1$ to $j$ for the second case.

I encode this logic in Haskell below. At each level I roll a random number from 0 to $Z_{i,j}$ and sort through the cases. When I pass a cumulative weight greater than the number I've rolled, I choose that case and recurse.

sampleStructure :: Int -> Int -> PFunArray -> String -> StdGen -> Structure
sampleStructure i j arr strand gen = if (i > j) then None else struct
  where -- setup pair terms
        pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) *
                       (if k-i > 2 then arr ! (i+1, k-1) else 1) *
                       (if j-k > 1 then arr ! (k+1, j) else 1)
        pairTerms = [pairTerm i k | k <- [(i+1)..j]]
        -- structure functions
        slipStructure genr = Slip (strand !! i) (sampleStructure (i+1) j arr strand genr)
        innerPairStructure k genr = if k - i > 1 then sampleStructure (i+1) (k-1) arr strand genr else None
        outerPairStructure k genr = if j - k > 0 then sampleStructure (k+1) (j) arr strand genr else None
        pairStructure k genr = Paired (strand !! i) (strand !! k)
                                      (innerPairStructure k genr, outerPairStructure k genr)
        -- (cumulative weight, structure) pairs
        slipCase = (0, slipStructure) -- starting at offset of 0
        pairCases = [(sum (take (k-i-1) pairTerms) + arr ! (i+1, j), pairStructure k) | k <- [(i+1)..j]]
        allCases = slipCase:pairCases
        -- sample a random number according to the current sub-partition function
        (roll, newgen) = randomR (0, arr ! (i, j)) gen
        -- if roll >= cumulative weight, switch candidates, else not
        checkCase current candidate = if roll >= (fst candidate) then (snd candidate) newgen else current
        struct = foldl' checkCase None allCases

Drawing Structures

One last thing to do is to draw these structures, which I do using Brent Yorgey's cool diagrams library. Every base is a circle with a letter underneath, color coded by base, with backbone and arcs denoting bonds. This depicts RNA with the bonds fanned out and visible, whereas other depictions of RNA will try to keep the length of each bond roughly constant, which is a more physical picture.
colorMap a = case a of
  'G' -> sRGB24read "#107896"
  'C' -> sRGB24read "#829356"
  'U' -> sRGB24read "#F26D21"
  'A' -> sRGB24read "#C02F1D"
  otherwise -> white

base :: Char -> Diagram B
base b = circle 1 # fc (colorMap b) <> text [b] # fontSize (local 2) # (translateY (-2.6))

backbone = hrule 1

isNone :: Structure -> Bool
isNone s = case s of
  None -> True
  otherwise -> False

structLength :: Structure -> Int
structLength None = 0
structLength (Slip _ s) = 1 + structLength s
structLength (Paired _ _ (s1, s2)) = 2 + structLength s1 + structLength s2

renderStructure :: Structure -> Diagram B
renderStructure (Paired a b (s1, s2)) =
    base a <>
    translateX 1.5 backbone <>
    translateX 3 innerElement <>
    translateX (3.0*k') (base b) <>
    translateX (3.0*k'+3) outerElement <>
    arc xDir (0.5 @@ turn) # scale (1.5*k') # translateX (1.5*k')
  where k = structLength s1 + 1
        k' = fromIntegral k
        innerElement = if isNone s1 then mempty else renderStructure s1 ||| backbone
        outerElement = if isNone s2 then mempty else translateX (-1.5) backbone <> renderStructure s2
renderStructure (Slip c s) = if isNone s then base c else base c <> translateX 1.5 backbone <> translateX 3 (renderStructure s)
renderStructure None = mempty

Example Output

Let's try it out:

main = do
  let gen = mkStdGen 42
  let rna = "GAAAAAAGGGGAAACCAAAGCCCAAUUUGCUUUUAAAAGGCCAA"
  let pf = partitionFunction rna
  let struct = sampleStructure 0 (length rna - 1) pf rna gen
  let diagram = renderStructure struct
  mainWith diagram

In shell:

ghc RNAfolding.hs
./RNAfolding -o RNA.svg -w 600 -h 300

Cool!

Further Reading

There is a long history of this problem being solved, going back to the 1980s, even before John McCaskill first formulated a partition function approach to RNA secondary structure prediction. Below are a couple of significant papers in this series, or at least those that are known to me. I'll also put my undergraduate thesis here, where I wrote extensively on a slightly more complicated version of this problem.
Partition Functions:

McCaskill, J.S., 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7), pp.1105-1119. pdf

Stochastic Sampling:

Ding, Y. and Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Research, 31(24), pp.7280-7301. pdf

More complicated energy models:

Mathews, D.H., Sabina, J., Zuker, M. and Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288(5), pp.911-940. pdf

Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19), pp.7287-7292. pdf

Some clear formulations, useful to me:

Dirks, R.M. and Pierce, N.A., 2003. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, 24(13), pp.1664-1677. pdf

RNAbows:

Aalberts, D.P. and Jannen, W.K., 2013. Visualizing RNA base-pairing probabilities with RNAbow diagrams. RNA, 19(4), pp.475-478. pdf

My thesis:

Flynn, M.J. and Aalberts, D.P., 2015. RNA Macrostates and Computational Tools. Williams College. github pdf

Posted in: Computational Physics Tagged: Dynamic Programming, Haskell, Partition Function, Recursion, RNA

My Favorite Algorithm: Metropolis-Hastings June 1, 2015

I'm writing this post because I feel like any computational scientist should have a favorite algorithm, and in my research I have run into this particular algorithm again and again; every time, it impresses me. I am also writing this because I googled "What is your favorite algorithm" and was surprised to find that no one mentioned it in the top results.
Few algorithms in computational physics have the scope and power of the Metropolis-Hastings algorithm. Briefly: given a domain $D$ and a function $f$ with bounded integral over that domain, $\int_D f(x)\,dx < \infty$, the MH-algorithm samples points from the probability distribution proportional to $f$. This subtle technique can solve very difficult problems in the design of engines, rockets, generators, nuclear weapons and reactors; plasma physics; the simulation of thermodynamic systems and the transport of light in computer graphics; the computation of partition functions of RNA macrostates and their transition states; and the computation of posterior distributions in Bayesian statistics (as well as a family of statistical tools that I know relatively less about). In this post I plan to describe the algorithm, Monte Carlo integration, and some examples of applications.

The Algorithm

Let's say we have some function $f$ over some domain $D$ such that

$$\int_D f(x)\,dx = I < \infty.$$

For example, $f$ could be a Gaussian bump and $D$ the real line $\mathbb{R}$, or $f$ the uniform density and $D$ an interval, or some higher dimensional function with a corresponding higher dimensional domain. It turns out we can generate a list of samples from the probability distribution proportional to this function,

$$p(x) = \frac{f(x)}{I},$$

by following a simple procedure: start at some initial state $x_0$ and generate the output, a list of samples $x_1, x_2, x_3, \dots$, by continuously sampling each new state from some transition probability distribution $T(x_i \to x_{i+1})$ (that you choose). This list of samples will converge to $p$ provided some conditions on $T$ are satisfied, namely ergodicity, which means that every state can eventually reach any other state, and detailed balance. If there are $N_a$ samples in state $a$ at some point in the chain, the number of samples in state $b$ created by transitions from these states is expected to be $N_a T(a \to b)$. The quantity $N_a T(a \to b) - N_b T(b \to a)$ is called the net transition rate from $a$ to $b$, because it is the expected flow of states from $a$ to $b$ (or the other way, if it is negative).
Detailed balance is said to hold if the transition rates between all states are zero when the sample has reached the correct probability distribution. This way, the net flow of probability is zero and the distribution is expected to stay the same. Therefore, if our sampled states have reached the target distribution, so that $N_a \propto p(a)$, and we want them to stay there, our transition probabilities should be constrained by:

$$p(a)\, T(a \to b) = p(b)\, T(b \to a)$$

From here we can break $T$ down into the product of 2 different distributions: $g(a \to b)$, the probability of generating a transition from $a$ to $b$, and $A(a \to b)$, the probability of accepting that transition. We do this to let the definition of $g$ be flexible by having $A$ handle the constraint. This is done by solving for $A$ with a general $g$:

$$p(a)\, g(a \to b)\, A(a \to b) = p(b)\, g(b \to a)\, A(b \to a)$$

$$\frac{A(a \to b)}{A(b \to a)} = \frac{p(b)\, g(b \to a)}{p(a)\, g(a \to b)}$$

One choice that satisfies this equation is

$$A(a \to b) = \min\left(1, \frac{p(b)\, g(b \to a)}{p(a)\, g(a \to b)}\right)$$

Provided that the function $g$ satisfies these constraints, we can construct a mutation function that samples from $g$ to give us a new element, create the chain by accepting or rejecting the new element with probability $A$, and the samples' distribution will converge to $p$.

Example

Let's say we want to sample from a difficult distribution on the $x$-$y$ plane, perhaps the one proportional to

$$f(x, y) = \frac{\sin^2\left(\sqrt{x^2 + y^2}\right)}{(x^2 + y^2)^{3/2}}$$

Since this distribution is pretty difficult to sample directly, instead we use Metropolis-Hastings. Let's construct a mutation function that chooses from all directions uniformly and moves in that direction a distance $r$ drawn from an exponential distribution with density $2 e^{-2r}$. This proposal is symmetric, so the $g$'s cancel and our acceptance probability should therefore be:

$$A\big((x_1, y_1) \to (x_2, y_2)\big) = \min\left(1,\ \frac{\sin^2\left(\sqrt{x_2^2 + y_2^2}\right)\, (x_1^2 + y_1^2)^{3/2}}{\sin^2\left(\sqrt{x_1^2 + y_1^2}\right)\, (x_2^2 + y_2^2)^{3/2}}\right)$$

This is all we need to know to implement a sampling algorithm in R:

## Metropolis Hastings demonstration by Mike Flynn
## Warning: This code has a runtime of around 4 hours on my machine
samples = matrix(nrow = 10^6, ncol = 2)
currentSample = c(0.001, 0.001)
set.seed(44)
for(i in 1:(10^8)) {
    ## choose direction and distance for transition
    direction = runif(1, 0, 2*pi)
    distance = rexp(1, 2)
    ## current point
    x1 = currentSample[1]
    y1 = currentSample[2]
    ## compute next point
    x2 = x1 + distance*cos(direction)
    y2 = y1 + distance*sin(direction)
    ## accept?
    accept = min(sin(sqrt(x2^2 + y2^2))^2*(x1^2 + y1^2)^(3/2)/
                 (sin(sqrt(x1^2 + y1^2))^2*(x2^2 + y2^2)^(3/2)), 1)
    if(accept > runif(1, 0, 1)) {
        currentSample = c(x2, y2)
    } else {
        currentSample = c(x1, y1)
    }
    ## only take 1 of every 100 to reduce autocorrelation
    if(i %% 100 == 0) {
        samples[i/100,] = currentSample
    }
}

library(ggplot2)
library(ggthemes)
plotdat = data.frame(x = samples[,1], y = samples[,2])
## display p to plot
p = ggplot(data = plotdat, aes(x = x, y = y)) +
    geom_point(alpha = .05, size = .01) +
    theme_bw() +
    xlim(-6*pi, 6*pi) + ylim(-6*pi, 6*pi)

which outputs this picture:

This plot looks a lot like interference fringes to me, which pleases me as a physicist. Of course, this example is not completely contrived: interference fringes sometimes roughly follow this distribution. More importantly, the samples produced by the Metropolis-Hastings algorithm do seem to match the probability distribution proportional to $f$.

Monte Carlo Integration and Importance Sampling

Monte Carlo integration is the process of evaluating an integral by sampling randomly from the domain and averaging. Often Monte Carlo integration of $\int_D f(x)\,dx$ will make use of the expectation value identity for samples, $x_i$, from a probability distribution $p$ on the same domain:

$$\int_D f(x)\,dx = \mathbb{E}_{x \sim p}\left[\frac{f(x)}{p(x)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}$$

For example, we might evaluate an integral over $[0, 1]$ by randomly sampling from the uniform density on $[0, 1]$, letting $p(x_i) = 1$, and averaging:

$$\int_0^1 f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

The closer $p$ gets to the shape of the integrated function, the faster the convergence will be. For example, if $p(x) = f(x)/I$ with $I = \int_D f(x)\,dx$, then the above derivation will yield

$$\frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{f(x_i)/I} = I$$

No matter what the samples are, it will converge after the first iteration. This is cheating, because it assumes we already know the answer, $I$. In general we won't have access to this knowledge, or else we wouldn't be doing the integral in the first place. However, we do have the Metropolis-Hastings algorithm to let us sample from $p \propto f$ without explicitly knowing $I$. We can use this in many creative ways, stemming from the fact that the expectation value of any statistic $h$ on these samples is

$$\mathbb{E}[h] = \frac{1}{I} \int_D h(x)\, f(x)\, dx$$
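The expectation value identity is easy to try out directly. Here is a small Python sketch of my own (the function names and the $\int_0^1 x^2\,dx = 1/3$ example are mine, chosen only for illustration):

```python
import random

def mc_integrate(f, sample, pdf, n, seed=0):
    """Estimate an integral via (1/N) * sum of f(x_i) / p(x_i), for x_i ~ p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample(rng)
        total += f(x) / pdf(x)
    return total / n

# Integral of x^2 over [0, 1] is exactly 1/3; sample uniformly, so p(x) = 1.
estimate = mc_integrate(lambda x: x * x,
                        lambda rng: rng.random(),
                        lambda x: 1.0,
                        n=100_000)
```

Swapping in a pdf whose shape is closer to the integrand shrinks the variance of the estimate, which is the importance-sampling idea in miniature.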
Integrating the Example

Luckily enough, our example function can be integrated over the entire x-y plane by hand:

$$\int_{\mathbb{R}^2} \frac{\sin^2\left(\sqrt{x^2 + y^2}\right)}{(x^2 + y^2)^{3/2}}\, dx\, dy = 2\pi \int_0^\infty \frac{\sin^2 r}{r^2}\, dr = \pi^2$$

(The radial integral can be done via some complex analysis.) Does our Metropolis-Hastings result corroborate this? Consider a function $f$ and the results of the MH-algorithm on that function: samples from the distribution $p = f/I$. We are looking for $I$. Let's say we come up with some function $g$ and take the following statistic:

$$S = \frac{1}{N} \sum_{i=1}^{N} \frac{g(x_i, y_i)}{f(x_i, y_i)}$$

As long as $g$ integrates to 1 over the domain, the answer comes out to be $1/I$, so we can just take the reciprocal to get our answer. On the plane, a good function that integrates to 1 and matches the shape of our function pretty well is $g(x, y) = \frac{1}{\pi} e^{-x^2 - y^2}$, so our statistic becomes:

$$S = \frac{1}{N \pi} \sum_{i=1}^{N} \frac{e^{-x_i^2 - y_i^2}\, (x_i^2 + y_i^2)^{3/2}}{\sin^2\left(\sqrt{x_i^2 + y_i^2}\right)}$$

We hope the answer comes out close to $1/\pi^2 \approx 0.1013$. I compute the statistic with 1 line of R code:

estimate = sum(apply(samples, 1, function(row)
    exp(-row[1]^2 - row[2]^2) * (row[1]^2 + row[2]^2)^(3/2) /
    (sin(sqrt(row[1]^2 + row[2]^2))^2))) / (nrow(samples) * pi)

When running this line on the data generated by the previous code block, I get the answer 0.1016102, less than 1% away from the true value. Taking the square root of the reciprocal I get 3.137121, not a bad estimate of $\pi$ for a seemingly arbitrary sum. Pretty remarkable.

Applications

The applications are truly where the Metropolis-Hastings algorithm shines. Many integrals are not solvable in “closed form” via the Fundamental Theorem of Calculus, because they have no antiderivative that can be expressed in terms of elementary functions (apparently this is provable), for example the Gaussian integral $\int e^{-x^2}\, dx$. For these problems, if you really need to know the solution, you can use numerical methods of estimating the integral, such as the trapezoidal rule (bringing back fond memories of AP calculus). For some integrals we cannot even use those techniques.
This could be because either there are too many dimensions we are integrating over, which forces us to use exponentially many “trapezoids”, or the domain is really large (perhaps infinitely large) and contributions from different parts are uneven. For these problems, Monte Carlo integration is the best option.

There are many examples of such problems in statistical mechanics. The spread of heat through conduction, convection, and radiation is essential for the design of engines, rockets, and electric generators. The many pathways of heat transfer must be integrated, a large space over which Monte Carlo integration is the only practical option. Neutron transport for the design of nuclear weapons and reactors is another similar problem. A third example would be the transport of light or photons through a scene for the purpose of rendering a physically accurate image in computer graphics, for which the application of the MH-algorithm has its own name: Metropolis Light Transport.

To illustrate how the MH-algorithm makes hard integrals solvable, here is the luminosity equation from computer graphics:

$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega^+} f_s(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\, \cos\theta_i\, d\omega_i$$

A quick description: $L_o(x, \omega_o)$ is the light outgoing from a flat surface at point $x$ in direction $\omega_o$, $L_e(x, \omega_o)$ is the light emitted by that surface in that direction, $L_i(x, \omega_i)$ is the light incoming to that surface from direction $\omega_i$, $f_s$ is the chance that light incoming from $\omega_i$ will scatter in direction $\omega_o$, and the integral runs over the positive hemisphere $\Omega^+$. In English terms, what this equation is saying is that to find the amount of light that is bouncing off a surface towards you, you have to add up the amount of light being bounced towards that surface from all other surfaces, and find all the light that is being bounced towards them, and so on. This integral is defined recursively! It is actually an infinite recursion of infinite integrals. This qualifies as a very hard integral.
Luckily, a surface absorbs some of the light bouncing off of it, so the recursion converges and we don’t go blind. But how do we integrate this? A slick way to do it is to expand the integral into an integral over paths through each pixel instead of a recursive integral. That way we would have that the light received by pixel $j$, $I_j$, could be expressed as:

$$I_j = \int_{\mathcal{P}} w_j(p)\, L(p)\, d\mu(p)$$

where $\mathcal{P}$ is the space of all paths of light through the camera, $w_j(p)$ is what proportion of light of each color that path contributes to that pixel, $L(p)$ is the total amount of light flowing along that path, and $d\mu(p)$ is a measure on that path space. We can then treat this as a Monte Carlo integration problem, sample paths ($p_1, \dots, p_N$) from the distribution proportional to $L$ using the MH-algorithm, and use the expectation value identity:

$$\frac{1}{N} \sum_{i=1}^{N} w_j(p_i) \approx \frac{I_j}{\int_{\mathcal{P}} L(p)\, d\mu(p)}$$

to get the value of the integral, multiplied by a normalization constant (but this can be quickly found and set to whatever looks good at the end of the process). Because Metropolis Light Transport samples the most important points first, it has the fastest general convergence time of any unbiased Monte Carlo method. Additionally, it is trivially parallelized.

The Metropolis-Hastings algorithm makes normally impossible integrals solvable with relative efficiency, and because of the wide range of applications of the algorithm to problems that are very cool, I consider it my favorite.

References

Much of what I have written here is knowledge that I built up over summer research on sampling methods and a computer graphics class I took this fall, including The Graphics Codex by Morgan McGuire, from which I got the luminosity equation, and the Metropolis Light Transport paper and Ph.D. thesis of Eric Veach. In addition, I learned much about Monte Carlo methods from Computer Graphics: Principles and Practice by John F. Hughes, Andries van Dam, Morgan McGuire, David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt Akeley, which could be considered a holy text.
Also important are the papers where the algorithm originates, by Metropolis et al. and W.K. Hastings, and an explanatory article by Chib and Greenberg. Full citations for these are below:

Chib, Siddhartha, and Edward Greenberg. “Understanding the Metropolis-Hastings Algorithm.” The American Statistician 49, no. 4 (1995): 327-335.

Hastings, W.K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika 57 (1): 97–109.

Hughes, John F.; van Dam, Andries; McGuire, Morgan; Sklar, David F.; Foley, James D.; Feiner, Steven K.; and Akeley, Kurt. Computer Graphics: Principles and Practice. Pearson Education, 2013.

McGuire, Morgan. The Graphics Codex. Online book at www.graphicscodex.com.

Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. (1953). “Equation of State Calculations by Fast Computing Machines”. Journal of Chemical Physics 21 (6): 1087–1092.

Veach, Eric. “Robust Monte Carlo Methods for Light Transport Simulation.” PhD diss., Stanford University, 1997.

Veach, Eric, and Leonidas J. Guibas. “Metropolis Light Transport.” In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 65-76. ACM Press/Addison-Wesley Publishing Co., 1997.

Posted in: Computational Physics Tagged: Computational Physics, Metropolis Hastings, MLT, Monte Carlo, Numerical Integration