I’m writing this post because I feel like any computational scientist should have a favorite algorithm, and in my research I have run into this particular algorithm again and again and every time it impresses me. I am also writing this because I googled “What is your favorite algorithm” and was surprised to find that no one mentioned it in the top results. Few algorithms in computational physics have the scope and power of the Metropolis-Hastings algorithm. Briefly, given a domain and a function with bounded integral over that domain , the MH-algorithm samples points from a probability distribution proportional to . This subtle technique can solve very difficult problems in the design of engines, rockets, generators, nuclear weapons and reactors, plasma physics, the simulation of thermodynamic systems and transport of light in Computer Graphics, the computation of partition functions of RNA macrostates and their transition states, and the computation of posterior distributions in Bayesian statistics (as well as a family of statistical tools that I know relatively less about). In this post I plan to describe the algorithm, Monte Carlo integration, and some examples of applications.

**The Algorithm**

Let’s say we have some function over some domain such that

For example could be and could be the real line , or the uniform distribution and the interval :

or some higher dimensional function with a corresponding higher dimensional domain. It turns out we can generate a list of samples from the probability distribution proportional to this function,

(for our examples and , respectively) by following a simple procedure: Start at some initial state and generate the output, a list of samples , by continuously sampling from some probability distribution function (that you choose) set by :

This list of samples will converge to provided some conditions on are satisfied, namely *ergodicity,* which means that every state can eventually reach any other state,* *and *detailed balance*. If there are samples in state at some point in the chain, the number of samples in state created by transitions from these states is expected to be . is called the *transition rate *from to because it is the expected flow of states from to (or the other way, if it is negative). Detailed balance is said to hold if the transition rates between all states are zero when the sample has reached the correct probability distribution. This way the net flow of probability is zero and the distribution is expected to stay the same. Therefore if we have sampled states and we want to reach distribution , if we have that and want , our transition probabilities should be constrained by:

From here we can break down into the product of 2 different distributions, , the probability of generating a transition from to , and , the probability of accepting that transition. We do this to let the definition of be flexible by having handle the constraint. This is done by solving for with a general :

One choice that satisfies this equation is

Provided that the function satisfies these constraints, we can construct a *mutation function *that samples from to give us a new element, create the chain by accepting or rejecting the new element, and the samples’ distribution will converge to .

**Example**

Let’s say we want to sample from a difficult distribution on the -plane, perhaps the one proportional to

Since this distribution is pretty difficult to sample directly, instead we use Metropolis-Hastings. Let’s construct a mutation function that chooses from all directions uniformly and moves in that direction a distance with probability . Our acceptance probability should therefore be:

This is all we need to know to implement a sampling algorithm in R:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
## Metropolis Hastings demonstration by Mike Flynn ## Warning: This code has a runtime of around 4 hours on my machine samples = matrix(nrow = 10^6, ncol = 2) currentSample = c(0.001,0.001) set.seed(44) for(i in 1:(10^8)) { ## choose direction and distance for transition direction = runif(1, 0, 2*pi) distance = rexp(1, 2) ## current point x1 = currentSample[1] y1 = currentSample[2] ## compute next point x2 = x1 + distance*cos(direction) y2 = y1 + distance*sin(direction) ## accept? accept = min(sin(sqrt(x2^2 + y2^2))^2*(x1^2 + y1^2)^(3/2)/ (sin(sqrt(x1^2 + y1^2))^2*(x2^2 + y2^2)^(3/2)),1) if(accept > runif(1, 0, 1)) { currentSample = c(x2, y2) } else { currentSample = c(x1, y1) } ## only take 1 of every 100 to reduce autocorrelation if(i %% 100 == 0) { samples[i/100,] = currentSample } } library(ggplot2) library(ggthemes) plotdat = data.frame(x = samples[,1], y = samples[,2]) ## display p to plot p = ggplot(data = plotdat, aes(x = x, y=y)) + geom_point(alpha = .05, size =.01) + theme_bw() + xlim(-6*pi, 6*pi) + ylim(-6*pi, 6*pi) |

which outputs this picture:

This plot looks a lot like interference fringes to me, which pleases me as a physicist. Of course, this example is completely contrived: interference fringes sometimes roughly follow this distribution. More importantly, the samples produced by the Metropolis-Hastings algorithm do seem to match the probability distribution proportional to .

**Monte Carlo Integration and Importance Sampling**

Monte Carlo integration is the process of evaluating an integral by sampling randomly from the domain and averaging. Often Monte Carlo integration of will make use of the expectation value identity for samples, , from a probability distribution on the same domain

For example we might evaluate the integral by randomly sampling from the uniform density on , letting , , and averaging:

The closer gets to the shape of the integrated function, the faster the convergence will be. For example, if , then the above derivation will yield

No matter what is, it will converge after the first iteration. This is cheating because it assumes we already know the answer, . In general we won’t have access to this knowledge, or else we wouldn’t be doing the integral in the first place. However, we do have the Metropolis-Hastings algorithm to let us sample* *from without *explicitly* knowing . We can use this in many creative ways, stemming from the fact that the expectation value of any statistic on these samples is .

**Integrating the Example**

Luckily enough our example function can be integrated over the entire x-y plane by hand:

(The integral can be done via some complex analysis). Does our Metropolis-Hastings result corroborate this? Consider a function and the results of the MH-algorithm on that function, samples from the distribution . We are looking for . Let’s say we come up with some function and take the following statistic:

As long as integrates to over the domain, the answer comes out to be , so we can just take the reciprocal to get our answer. On the plane, a good function that integrates to 1 and matches the shape of our function pretty well is , so our statistic becomes:

We hope the answer comes out close to . I compute the statistic with 1 line of R code:

1 |
estimate = sum(apply(samples, 1, function(row) exp(- row[1]^2 - row[2]^2) * (row[1]^2 + row[2]^2)^(3/2)/(sin(sqrt(row[1]^2 + row[2]^2))^2) ))/(nrow(samples) * pi) |

When running this line on the data generated by the previous code block, I get the answer 0.1016102, less than 1% away from the true value. Taking the square root of the reciprocal I get 3.137121, not a bad estimate of for a seemingly arbitrary sum. Pretty remarkable.

**Applications**

The applications of Metropolis-Hastings is truly where the algorithm shines. Many integrals are not solvable in “closed form” via the Fundamental Theorem of Calculus because they have no antiderivative that can be expressed in terms of elementary functions (apparently this is provable) , for example the integral

For these problems, if you really need to know the solution you can use numerical methods of estimating the integral, such as the trapezoidal rule (bringing back fond memories of AP calculus). For some* *integrals we cannot even use those techniques. This could be because either there are too many dimensions we are integrating over, which forces us to use exponentially many “trapezoids”, or it could be that the domain is really large (perhaps infinitely large) and contributions from different parts are uneven. For these problems Monte Carlo integration is the best option.

There are many examples of such problems in statistical mechanics. The spread of heat through conduction, convection, and radiation is essential for the design of engines, rockets, and electric generators. The many pathways of heat transfer must be integrated, a large space over which Monte Carlo integration is the only practical option. Neutron transport for the design of nuclear weapons and reactors is another similar problem. A third example would be the transport of light or photons through a scene for the purpose of rendering a physically accurate image in Computer Graphics, for which the application of the MH-algorithm has it’s own name: Metropolis Light Transport.

To illustrate how the MH-algorithm makes *hard *integrals solvable, here is the luminosity equation from Computer Graphics:

A quick description: the is the light outgoing from a flat surface in direction , is the light emitted by that surface in that direction, is the light incoming to that surface from direction , is the chance that light incoming from will scatter in direction , and is integrated over the positive hemisphere . In English terms, what this equation is saying is that to find the amount of light that is bouncing off a surface towards you, you have to add up the amount of light being bounced towards that surface from *all *other surfaces, and find all the light that is being bounced towards *them*, and so on. This integral is defined recursively! It is actually an infinite recursion of infinite integrals. This qualifies as a very hard integral. Luckily a surface absorbs some of the light bouncing off of it so it converges and we don’t go blind, but how do we integrate this?

A slick way to do it is to expand the integral into an integral over paths through each pixel instead of a recursive integral. That way we would have that the light received by pixel , , could be expressed as:

Where is the space of all paths of light through the camera, is what proportion of light of each color that path contributes to that pixel, is the total amount of light flowing along that path, and is a measure on that path. We can then treat this as a Monte Carlo integration problem, sample paths () from the distribution using the MH-algorithm, and use the expectation value identity:

to get the value of the integral, multiplied by a normalization constant (but this can be quickly found and set to whatever looks good at the end of the process). Because Metropolis Light Transport samples the most important points first, it has the fastest general convergence time of any unbiased Monte Carlo method. Additionally, it is trivially parallelized.

The Metropolis-Hastings algorithm makes normally impossible integrals solvable with relative efficiency, and because of the wide range of applications of the algorithm to problems that are very cool I consider it my favorite.

**References**

Much of what I have written here is knowledge that I built up over summer research on sampling methods and a Computer Graphics class I took this fall, including the luminosity equation I got from The Graphics Codex by Morgan McGuire, the Metropolis Light Transport paper and Ph. D thesis of Eric Veach. In addition I learned much about Monte Carlo methods from Computer Graphics: Principles and Practice by by

Chib, Siddhartha, and Edward Greenberg. “Understanding the metropolis-hastings algorithm.” *The american statistician* 49, no. 4 (1995): 327-335.

Hastings, W.K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. *Biometrika* **57** (1): 97–109.

Hughes, John F.; van Dam, Andries; McGuire, Morgan; Sklar, David F.; Foley, James D.; Feiner, Steven K.; and Akeley, Kurt. *Computer graphics: principles and practice*. Pearson Education, 2013.

McGuire, Morgan. *The Graphics Codex. *Online book at www.graphicscodex.com.

Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. (1953). “Equations of State Calculations by Fast Computing Machines”. *Journal of Chemical Physics* **21** (6): 1087–1092.

Veach, Eric. “Robust monte carlo methods for light transport simulation.” PhD diss., Stanford University, 1997.

Veach, Eric, and Leonidas J. Guibas. “Metropolis light transport.” In *Proceedings of the 24th annual conference on Computer graphics and interactive techniques*, pp. 65-76. ACM Press/Addison-Wesley Publishing Co., 1997.

David Rachel

Nice article! I’d not heard of Metropolis-Hastings, but it’s awesome.

Since the limit behaviour is independent of the jump width, does that mean that the transition function T (jump width) can be arbitrarily altered mid-algorithm without undermining the result?

If so, is there any potential benefit in this? Something like annealing, or even fancy and adaptive things – e.g. target desired acceptance/rejection rate, or transition function determined by variance (scalar or matrix) of existing selection?

Patrick Schnell

Changing the transition function mid-algorithm can affect the asymptotic result if you’re not careful. For example, making jumps smaller or when the current position is in a specific region could lead to that region being over-represented.

However, there are benefits to tuning the acceptance rate or reducing the autocorrelation of the sampling process, such as reducing the variance of the integral estimate or making sure that you adequately explore all local maxima. To do this without damaging the convergence of the algorithm, one can specify a “burn-in” segment at the beginning in which parameters of the algorithm are tuned, and then throw out those samples. This has the added advantage of making the final result less dependent on the starting point.

Alex Campbell

Interesting stuff Michael, have you ever encounter stochastic evolutionary game dynamics?

It’s a way you can take these methods and merge them with some pretty interesting frameworks coming out of game theory. Basically a way to connect the world of analyzing dynamic, stochastic systems to the world of human behavior. Where does the ‘system’ end up, when the system is a complex world of players, beliefs, actions, and payoff functions.

Link below is a pretty good paper on the topic from my thesis advisors at Oxford. Think you would enjoy it.

http://www.econ2.jhu.edu/people/Young/Stochastic_Evolutionary_Game_Dynamics.pdf

mflynn

Sounds interesting, I’ll check that paper out!

Dennis Snell

Thanks Michael for the great article. It’s such an amazing thing to see when we can take previously unsolvable problems and transform them into nothing more than a tedious calculation. Probability FTW!

Kshitij Lauria

@David Rachel

If you change transition probabilities mid-algorithm, you must be very careful that you have not made your Markov chain irreversible.

An intuitive way to understand detailed balance is that it means your state transitions are reversible: the likelihood of a path is the same in either direction.

Simulated annealing makes use of the idea of Metropolis-Hastings by constructing a FAMILY of reversible Markov chains (each temperature). http://www.mit.edu/~dbertsim/papers/Optimization/Simulated%20annealing.pdf is a survey paper.