## Simple RNA folding in 130 lines of Haskell

How do we predict how RNA folds? First of all, we need to define what we mean by folding. RNA is a polymer made up of nucleic acids: adenine, guanine, cytosine, and uracil. There are 3 levels of RNA structure that are traditionally considered:

**primary structure:**the sequence of bases, ‘A’, ‘G’, ‘C’, and ‘U’ that form the strand of RNA, ‘AAAAGGGGCCCCUUUU’ for example,**secondary structure:**which bases are hydrogen bonded together, ‘A’ to ‘U’ and ‘G’ to ‘C’, can be specified by a sequence of pairs corresponding to which indices are paired together (1,24), (2,23)…, and**tertiary structure:**how the molecule bends on itself on a larger scale.

Usually, primary structure is already known or given as an input and tertiary structure is hard to specify and compute the energy for. Secondary structure is easy to specify using a sequence of pairs and the energy is roughly dominated by hydrogen bonding between bases which is discrete. Since RNA functions mostly by binding to things using exposed (non-paired) bases, the secondary structure is also the most relevant when discerning function. For these reasons, biophysicists usually focus on predicting the secondary structure of RNA.

From the laws of thermal physics, if RNA is modeled as a system in thermal equilibrium then the probability of any state with free energy is equal to

where is the full set of states. The denominator is called the *partition function*, usually denoted , it normalizes the probability distribution and with it we can compute the probability of any structure. How many terms are in the partition function? How many states are in ? Depends on what assumptions you make. In full generality, even only allowing Watson-Crick base pairs, there are terms in that sum. However, if you assume that there are no *crossing* pairs, i.e. arcs that would intersect in the above diagram, or pairs such that , , and , there are only terms in the sum. This is still an exponential quantity, but it turns out it is actually computable in sub-exponential time via dynamic programming. This assumption is called the *no-pseudoknot assumption,* and it means that if I have an pair, then for all bases in-between them, I can ignore all the bases outside of them when enumerating structures. This begs the question whether it is a physically valid assumption. The answer is NO. There are many RNAs found in nature that have “crossing” pairs such as Group I and Group II introns. However, without this assumption, computing the partition function is intractable, so most RNA folding software packages make it.

While it is tempting to do analysis on a structure with the minimum free energy and highest probability, since RNA is a thermal system with an exponential multiplicity of states even the probability of the most probable state is extremely low* *and any quantity computed on even the most probable state has negligible contribution to the physical expected value of that quantity. However, I will show here that not only can we compute the partition function, but we can also sample from the Boltzmann distribution. Once we are able to do that we can compute arbitrary expected values on the full Boltzmann distribution via Monte Carlo integration. (In fact, Monte Carlo integration is not even needed if you can express your quantity in terms of the partition function values, which can give you full expected values without statistical error! But that is outside of the scope of this post).

Putting this all together, to predict the folding and general properties of RNA we need 4 things:

- a specification of a state of RNA ,
- an energy function , which takes a state and outputs a physical

energy, - the partition function, the sum of the Boltzmann factors over every

possible state , and - an algorithm for sampling structures from the Boltzmann distribution, .

**Quick Code Note**

All code going forward will be in Haskell. You should be able to extract all the code here into a file and it should compile. To set up our environment, make sure to have this header at the top of the file:

{-# LANGUAGE NoMonomorphismRestriction #-} {-# LANGUAGE FlexibleContexts #-} {-# LANGUAGE TypeFamilies #-} import Data.Array import Data.Time.Clock.POSIX import Data.Colour import System.Random import Data.List import Diagrams.Prelude import Diagrams.Backend.SVG.CmdLine

**States of RNA**

Because of the no-pseudoknot assumption, we can actually specify the states using a recursive data structure.

data Structure = Paired Char Char (Structure, Structure) | Slip Char Structure | None deriving Show

The thought behind this definition is that we are specifying the structure base by base. If base is unpaired, then we move on to the next base. If base is paired, then we fork the structure into the bases *underneath *the pair and the bases *after *the pair. We carry along the letter of the base for convenience and to compute the energies of any pairing that might happen.

**Energy Model**

A fully realistic energy model of RNA would be very difficult to compute. There are exposed ions everywhere on the strand which interact with each other and there is tension in the backbone. Many of these interactions are non-linear which breaks the recursion and therefore computability of the problem. In practice, there are several levels of complexity that are implemented which try to account for the non-linearities in a computable fashion (this actually is the only difference between professional RNA secondary structure prediction and an Algorithms homework problem). Here I will implement a very simple energy model: adding the free energies of each hydrogen bond. A hydrogen bond contributes about . Adenine-uracil bonds have 2 hydrogen bonds, guanine-cytosine bonds have 3. Therefore in the exponent of the Boltzmann factor, I will put -4.6 for A-U bonds and -6.9 for G-C bonds. Anything else is represented by infinite energy, meaning that it will never happen.

baseEnergy :: Char -> Char -> Double baseEnergy a b = case (a, b) of ('A', 'U') -> -4.6 ('G', 'C') -> -6.9 ('U', 'A') -> -4.6 ('C', 'G') -> -6.9 otherwise -> 1.0/0.0 -- non-allowed pairs treated like an infinite hill energyModel :: Structure -> Double energyModel (Paired a b (s1,s2)) = (baseEnergy a b) + -- energy of pair (energyModel s1) + -- energy of substructure under pair (energyModel s2) -- energy of substructure after pair energyModel (Slip _ s) = energyModel s energyModel None = 0

**Partition Function**

The sum initially seems daunting. However, from the recursive definition of our energy model, we can easily see that the partition function can be defined recursively as well. To be precise, if we define as the sub-partition-function from to , it must be equal to the sum of the Boltzmann terms of all structures where is paired and all structures where is unpaired:

where . This becomes a standard dynamic programming problem. We can turn Z into a table of values with cells. If cell is on the bottom left, and on the top right, each row of the table only depends on the entries below and to the left. We can compute the full partition function by starting from the bottom left and working out to the top right in time. This is even parallelizable along the diagonals (I’m not going to parallelize it here).

type PFunArray = Array (Int, Int) Double boltz :: Double -> Double boltz x = exp (-x) partitionFunction :: String -> PFunArray partitionFunction strand = arr where len = length strand arr = array ((0,0), (len - 1, len - 1)) [((i,j), (pfCell i j strand arr)) | i <- [0..len-1], j <- [i..len-1]] pfCell :: Int -> Int -> String -> PFunArray -> Double pfCell i j strand arr = if (i==j) then 1 else (sum pairTerms) + slipTerm where pairTerm i k = boltz (baseEnergy (strand !! i) (strand !!k)) * (if k-i > 2 then arr ! (i+1, k-1) else 1) * (if j-k > 1 then arr ! (k+1, j) else 1) pairTerms = [pairTerm i k | k <- [(i+1)..j]] slipTerm = if i == j then 1 else arr ! (i+1, j)

**Sampling RNA Structures**

To actually predict the structure of RNA molecules in nature, we can work backward through the partition function table to sample structures from the Boltzmann distribution. Consider the length of the strand along base to base ,

- the probability that is unpaired is equal to the sum of the probabilities of every structure along to with unpaired, or ,
- the probability that is paired to base is equal to the sum of the probabilities of every structure along to where and are paired, or ,
- completing the rest of the structure can be achieved by calling the sampling function recursively for every length of the strand left general, to for the unpaired case, to and to for the second case.

I encode this logic in Haskell below. At each level I roll a random number from 0 to and sort through the cases. When I pass cumulative probability greater than the number I've rolled I choose that case and recurse.

sampleStructure :: Int -> Int -> PFunArray -> String -> StdGen -> Structure sampleStructure i j arr strand gen = if (i > j) then None else struct where -- setup pair terms pairTerm i k = boltz (baseEnergy (strand !! i) (strand !! k)) * (if k-i > 2 then arr ! (i+1, k-1) else 1) * (if j-k > 1 then arr ! (k+1, j) else 1) pairTerms = [pairTerm i k | k <- [(i+1)..j]] -- structure functions slipStructure genr = Slip (strand !! i) (sampleStructure (i+1) j arr strand genr) innerPairStructure k genr = if k - i > 1 then sampleStructure (i+1) (k-1) arr strand genr else None outerPairStructure k genr = if j - k > 0 then sampleStructure (k+1) (j) arr strand genr else None pairStructure k genr = Paired (strand !! i) (strand !! k) (innerPairStructure k genr, outerPairStructure k genr) -- (cumulative probability, structure) pairs slipCase = (0, slipStructure) -- starting at offset of 0 pairCases = [(sum (take (k-i-1) pairTerms) + arr ! (i+1, j), pairStructure k) | k <- [(i+1)..j]] allCases = slipCase:pairCases -- sample a random number according to the current sub (roll, newgen) = randomR (0, arr ! (i, j)) gen -- if roll >= cumulative probability switch canidates, else not checkCase current candidate = if roll >= (fst candidate) then (snd candidate) newgen else current struct = foldl' checkCase None allCases

**Drawing Structures**

One last thing to do is to draw these structures, which I do using Brent Yorgey's cool diagrams library. Every base is a circle with a letter underneath, color coded by base, with backbone and arcs denoting bonds. This depicts RNA with the bonds fanned out and visible, whereas other depictions of RNA will try to keep the length of each bond pretty constant, which is a more physical picture.

colorMap a = case a of 'G' -> sRGB24read "#107896" 'C' -> sRGB24read "#829356" 'U' -> sRGB24read "#F26D21" 'A' -> sRGB24read "#C02F1D" otherwise -> white base :: Char -> Diagram B base b = circle 1 # fc (colorMap b) <> text [b] # fontSize (local 2) # (translateY (-2.6)) backbone = hrule 1 isNone :: Structure -> Bool isNone s = case s of None -> True otherwise -> False structLength :: Structure -> Int structLength None = 0 structLength (Slip _ s) = 1 + structLength s structLength (Paired _ _ (s1, s2)) = 2 + structLength s1 + structLength s2 renderStructure :: Structure -> Diagram B renderStructure (Paired a b (s1, s2)) = base a <> translateX 1.5 backbone <> translateX 3 innerElement <> translateX (3.0*k') (base b) <> translateX (3.0*k'+3) outerElement <> arc xDir (0.5 @@ turn) # scale (1.5*k') # translateX (1.5*k') where k = structLength s1 + 1 k' = fromIntegral k innerElement = if isNone s1 then mempty else renderStructure s1 ||| backbone outerElement = if isNone s2 then mempty else translateX (-1.5) backbone <> renderStructure s2 renderStructure (Slip c s) = if isNone s then base c else base c <> translateX 1.5 backbone <> translateX 3 (renderStructure s) renderStructure None = mempty

**Example Output**

Let's try it out:

main = do let gen = mkStdGen 42 let rna = "GAAAAAAGGGGAAACCAAAGCCCAAUUUGCUUUUAAAAGGCCAA" let pf = partitionFunction rna let struct = sampleStructure 0 (length rna - 1) pf rna gen let diagram = renderStructure struct mainWith diagram

In shell:

ghc RNAfolding.hs RNAfolding.o -o RNAt.svg -w 600 -h 300

Cool!

**Further Reading**

There is a long history of this problem being solved going back to the 1980's even before John McCaskill first formulated a partition function approach to RNA secondary structure prediction. Below are a couple significant papers in this series, or at least those that are known to me. I'll also put my undergraduate thesis here, where I wrote extensively on a slightly more complicated version of this problem.

Partition Functions:

McCaskill, J.S., 1990. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6‐7), pp.1105-1119. pdf

Stochastic Sampling:

Ding, Y. and Lawrence, C.E., 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic acids research, 31(24), pp.7280-7301. pdf

More complicated energy models:

Mathews, D.H., Sabina, J., Zuker, M. and Turner, D.H., 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of molecular biology, 288(5), pp.911-940. pdf

Mathews, D.H., Disney, M.D., Childs, J.L., Schroeder, S.J., Zuker, M. and Turner, D.H., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19), pp.7287-7292. pdf

Some clear formulations, useful to me:

Dirks, R.M. and Pierce, N.A., 2003. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of computational chemistry, 24(13), pp.1664-1677. pdf

RNAbows:

Aalberts, D.P. and Jannen, W.K., 2013. Visualizing RNA base-pairing probabilities with RNAbow diagrams. RNA, 19(4), pp.475-478. pdf

My thesis:

Flynn, M.J. and Aalberts, D.P., 2015. RNA Macrostates and Computational Tools. Williams College. github pdf