RNA Secondary Structure Prediction

Abstract
Predicting the secondary structure of an RNA molecule from its sequence is one of the most active
research fields in bioinformatics. Minimum Free Energy (MFE) prediction is the most common
method. More recently, it has emerged that a more accurate secondary structure can be predicted by
computing the Maximum Expected Accuracy (MEA) structure. In this article we briefly show how
MFE and MEA work and supply two working protocols that show how to use them.

Introduction
To keep things simple, consider an RNA molecule to be a strand of four types of bases: Adenine (A),
Cytosine (C), Guanine (G), and Uracil (U). We represent an RNA molecule as a string over {A, C,
G, U}, with the left end corresponding to the 5' end of the molecule. In the hybridization process,
pairs of bases in RNA form hydrogen bonds, with the complementary pairs C-G and A-U being
stronger than the others. A folded molecule is largely held together by the resulting set of bonds,
called its secondary structure, as shown in Figure 1. Knowledge of the secondary structure of a
folded RNA molecule gives valuable insight into its function.

Figure 1: RNA Fold structure

Abstractly, we represent the secondary structure of an RNA molecule of length $n$ as a set $S$ of
integer pairs $\{ (i,j) \mid 1 \le i < j \le n \}$, where each $i$ is contained in at most one pair of
$S$. The pair $(i,j)$ indicates a bond between the bases at positions $i$ and $j$ of the
corresponding strand. The secondary structure is called pseudoknot free if and only if, for all
pairs $(i,j)$ and $(i',j')$, it is not the case that $i' < i < j' < j$. The structure prediction
problem becomes much easier if we assume that all pairings are properly nested, i.e. the structure
can be expressed as a string of balanced parentheses:

>Sample RNA
AAAAAAAAAAAAAGGGGGGGUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCC
.............(((((((..............)))))))..........
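
The correspondence between the bracket string and the pair set S can be made concrete with a small sketch. It is only an illustration: the function names are hypothetical, positions are 1-based as in the definition above, and only the symbols '(', ')' and '.' are assumed to occur.

```python
def pairs_from_dot_bracket(structure):
    """Return the set S of 1-based pairs (i, j), i < j, encoded by a bracket string."""
    stack = []      # positions of '(' that are still waiting for a partner
    pairs = set()
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.add((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def is_pseudoknot_free(pairs):
    """True if no two pairs cross, i.e. no (i, j), (k, l) with i < k < j < l."""
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return False
    return True


sample = ".............(((((((..............))))))).........."
sample_pairs = pairs_from_dot_bracket(sample)
print(sorted(sample_pairs))
print(is_pseudoknot_free(sample_pairs))
```

A bracket string can only encode nested pairs, so any structure parsed this way is pseudoknot free by construction; the second function is useful for pair sets that come from elsewhere.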

Models for prediction

MFE
Minimum Free Energy (MFE) prediction has been the most popular method for RNA secondary structure
prediction for decades. It is based on a set of empirical free energy change parameters derived from
experiments using a nearest-neighbor model. The thermodynamic model for RNA structure
formation posits that, out of the exponentially many possibilities, an RNA molecule folds into the
structure with the minimum free energy. Free energy models typically assume that the total free
energy of a given secondary structure for a molecule is the sum of independent contributions of
adjacent, or stacked, base pairs in stems (which tend to stabilize the structure) and of loops (which
tend to destabilize the structure). These contributions depend on temperature, the concentration of
the molecule in solution, and the ionic concentration of the solution. Zuker and Stiegler [1] describe
a dynamic programming algorithm for finding the MFE pseudoknot-free secondary structure of a
given molecule.
Let the input strand be $b_1, b_2, \ldots, b_n$. Suppose that $W(i,j)$ is the energy of the MFE
pseudoknot-free secondary structure for the strand $b_i, \ldots, b_j$, and that $V(i,j)$ is the
energy of the MFE pseudoknot-free secondary structure for the strand $b_i, \ldots, b_j$ among those
structures containing the base pair $(i,j)$. Then $W$ satisfies the following recurrence (base cases
excluded):

$$W(i,j) = \min\left[\, V(i,j),\ \min_{k:\, i<k<j} \{ W(i,k) + W(k+1,j) \} \right]$$


$V(i,j)$ also satisfies a recurrence that is expressed in terms of the different types of loops.
Implementations of this algorithm are available on the World Wide Web as part of the mfold [2] and
the Vienna [3] packages.
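
The shape of the W/V recurrence can be illustrated with a deliberately simplified sketch. It is not the nearest-neighbor energy model used by mfold or Vienna: as a toy assumption, every A-U or C-G pair contributes -1 energy unit, loops cost nothing, and a hairpin must enclose at least three unpaired bases. The base cases, which the text leaves out, are also a choice of this sketch: short segments have energy 0, and the split index k is allowed to start at i so that a single unpaired base can be split off.

```python
import math

# Toy assumptions (not the real thermodynamic parameters):
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
PAIR_ENERGY = -1.0   # every allowed pair contributes the same energy
MIN_LOOP = 3         # minimum number of unpaired bases inside a hairpin


def toy_mfe(seq):
    """Return W(1, n) for the toy energy model using the W/V recurrence."""
    n = len(seq)
    if n == 0:
        return 0.0
    # W[i][j]: best energy on seq[i..j]; V[i][j]: best energy given i pairs with j.
    W = [[0.0] * n for _ in range(n)]
    V = [[math.inf] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            if (seq[i], seq[j]) in PAIRS:
                # Toy version of V: close the pair, fold the interior freely.
                V[i][j] = PAIR_ENERGY + W[i + 1][j - 1]
            best_split = min(W[i][k] + W[k + 1][j] for k in range(i, j))
            W[i][j] = min(V[i][j], best_split)
    return W[0][n - 1]


print(toy_mfe("GGGAAACCC"))
```

The table is filled by increasing segment length, so every W and V value on the right-hand side is already known when it is needed; this gives cubic running time in the sequence length.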

MEA
As an alternative to finding the most probable structure, the structure with Maximum
Expected Accuracy (MEA) can be predicted. Pseudoknot-free structures are predicted by
maximizing the sum of the base-paired and single-stranded nucleotide probabilities, called the
expected accuracy, where pairing probabilities can be weighted by a specific factor. CONTRAfold [4],
the first implementation of MEA, uses probabilistic parameters learned from a set of RNA secondary
structures to predict base-pair probabilities and then predicts structures using the maximum
expected accuracy approach. The base-pair probability of an i-j pair, $P_{bp}(i,j)$, is predicted
from a partition function calculation based on a biophysical model, the nearest-neighbor model. The
probability of nucleotide $i$ being single-stranded is:

$$P_{ss}(i) = 1 - \sum_{j} P_{bp}(i,j)$$

After the base-pair probabilities have been calculated with the partition function, a
dynamic programming algorithm is used to find the structure with maximum expected accuracy.
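
The dynamic programming step can be sketched as follows. This is a minimal illustration rather than the CONTRAfold or MaxExpect implementation: it assumes a symmetric matrix of pair probabilities is already available, and it assumes the objective scores each pair with 2 * gamma * P_bp(i,j) and each unpaired nucleotide with P_ss(i), where gamma is the weighting factor mentioned above. The function names are hypothetical, and only the optimal score is returned, without a traceback.

```python
def single_stranded_probs(p_bp):
    """P_ss(i) = 1 - sum_j P_bp(i, j) for a symmetric pair-probability matrix."""
    return [1.0 - sum(row) for row in p_bp]


def mea_score(p_bp, gamma=1.0):
    """Maximum expected accuracy over nested structures (score only)."""
    n = len(p_bp)
    if n == 0:
        return 0.0
    p_ss = single_stranded_probs(p_bp)
    # M[i][j]: best score obtainable on positions i..j (0-based, inclusive).
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = p_ss[i]          # a single nucleotide can only be unpaired
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            inner = M[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
            best = inner + 2.0 * gamma * p_bp[i][j]   # case: i pairs with j
            for k in range(i, j):                     # case: split into two parts
                best = max(best, M[i][k] + M[k + 1][j])
            M[i][j] = best
    return M[0][n - 1]


# Four nucleotides where only the pair (1, 4) has a non-zero probability.
probs = [
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
]
print(mea_score(probs, gamma=1.0))
print(mea_score(probs, gamma=0.1))
```

With a small gamma the pair is no longer worth forming and the fully unpaired structure wins, which shows how the weighting factor trades pairs against single-stranded positions.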

Computational Tools
The Vienna RNA Package (http://www.tbi.univie.ac.at/RNA/) consists of a C code library and
several stand-alone programs for the prediction and comparison of RNA secondary structures.
Vienna provides three kinds of dynamic programming algorithms for structure prediction: the
minimum free energy algorithm of Zuker & Stiegler, which yields a single optimal structure; the
partition function algorithm of McCaskill (1990), which calculates base pair probabilities in the
thermodynamic ensemble; and the suboptimal folding algorithm of Wuchty et al. (1999), which
generates all suboptimal structures within a given energy range of the optimal energy. In this case
study, we are interested in the RNAfold program.
RNAfold reads RNA sequences from stdin, calculates their minimum free energy (MFE) structure,
and prints to stdout the MFE structure in bracket notation and its free energy. If the -p option is
given, it also computes the partition function (Z) and the base pairing probability matrix, and
prints the free energy of the thermodynamic ensemble, the frequency of the MFE structure in the
ensemble, and the ensemble diversity to stdout.
It also produces PostScript files with plots of the resulting secondary structure graph and a "dot
plot" of the base pairing matrix. The dot plot shows a matrix of squares with area proportional to the
pairing probability in the upper right half, and one square for each pair in the minimum free energy
structure in the lower left half.

Secondary Structure Prediction using RNAfold (Vienna PKG)
Suppose that our RNA sequence is CCGCACAGCGGGCAGUGCC.
1. As a first step, we create a file (test1.fasta) in FASTA format containing our sequence,
like the following:

> RNA sequence test
CCGCACAGCGGGCAGUGCC

2. Then we compute the best (MFE) structure for this sequence with the command
RNAfold < test1.fasta > test1_out, obtaining the following result (written to test1_out):

> RNA sequence test
CCGCACAGCGGGCAGUGCC
((((...))))........ ( -5.40)

The last line of the text output contains the predicted MFE structure in bracket notation and
its free energy in kcal/mol.
A dot in the bracket notation represents an unpaired position, while a base pair (i, j) is
represented by a pair of matching parentheses at positions i and j.
A PostScript file (RNA_ss.ps) containing a graphic representation of the predicted secondary
structure is also created. You can open it with the command gv RNA_ss.ps.
It should look similar to the following:

Figure 2: Plot of RNA_ss.ps
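
The text output of step 2 can also be read programmatically. The sketch below is a hypothetical helper, not part of the Vienna package: it assumes the plain three-line layout shown above (header, sequence, structure followed by the energy in parentheses) and does not handle the additional lines printed when the -p option is used.

```python
import re

# Structure line: bracket string, whitespace, then "( energy)" as in the sample output.
STRUCTURE_LINE = re.compile(r"^([().]+)\s+\(\s*(-?\d+\.\d+)\s*\)$")


def parse_rnafold_output(text):
    """Return a list of records with header, sequence, structure and energy."""
    records = []
    header, sequence = "", ""
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            continue
        match = STRUCTURE_LINE.match(line)
        if match:
            records.append({
                "header": header,
                "sequence": sequence,
                "structure": match.group(1),
                "energy": float(match.group(2)),   # kcal/mol
            })
            header, sequence = "", ""
        else:
            sequence = line   # any other non-empty line is taken as the sequence
    return records


example_output = """> RNA sequence test
CCGCACAGCGGGCAGUGCC
((((...))))........ ( -5.40)
"""
for record in parse_rnafold_output(example_output):
    print(record["header"], record["structure"], record["energy"])
```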

Note that the dot plot gives a better picture of the pair probabilities and of alternative structural
configurations.
You can produce it by adding the option -p to the previous command: RNAfold -p < test1.fasta >
test1_out. You will then also obtain a file named RNA_dp.ps showing the pair
probabilities within the equilibrium ensemble. A square at row i and column j of the matrix indicates
a base pair. The area of a square in the upper right half of the matrix is proportional to the
probability of the base pair (i, j) within the equilibrium ensemble. The lower left half shows all
pairs belonging to the MFE structure. While the MFE structure consists of a single helix, 3 different
helices are clearly visible in the pair probabilities.
You can open it with the command gv RNA_dp.ps. It should look similar to the following:

Figure 3: Plot of RNA_dp.ps

To visualize which parts of a predicted MFE structure are well-defined and thus more reliable
(producing new diagrams), we use the following commands:
1. RNAfold -p < test1.fasta, to generate the RNA_dp.ps and RNA_ss.ps files;
2. /usr/share/ViennaRNA/bin/mountain.pl RNA_dp.ps | xmgrace -pipe, to produce a mountain
plot. This is an xy-diagram plotting the number of base pairs enclosing a sequence position
versus the position. The resulting plot shows three curves: two mountain plots derived from
the MFE structure (red) and the pairing probabilities (black), and a positional entropy curve
(green). Well-defined regions are identified by low entropy. By superimposing
several mountain plots, structures can easily be compared;
3. /usr/share/ViennaRNA/bin/relplot.pl RNA_ss.ps RNA_dp.ps > RNA_rss.ps, to produce a
diagram of the predicted structure that also contains probability information. The Perl
script relplot.pl adds reliability information to an RNA secondary structure plot in the form
of color annotation.
The script computes a well-definedness measure called "positional entropy" and encodes it as
color hue, ranging from red (low entropy, well-defined) via green to blue and violet (high entropy,
ill-defined).
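
The two quantities plotted above can be sketched in a few lines. This is an illustration under stated assumptions, not the code of mountain.pl or relplot.pl: the mountain height of a position is taken as the number of base pairs strictly enclosing it, and the positional entropy is assumed to be the Shannon entropy (in bits) of the pairing distribution of a nucleotide, including its probability of being unpaired. The exact formula and logarithm base used by the Perl scripts are not given in the text.

```python
import math


def mountain_heights(structure):
    """Number of base pairs strictly enclosing each position of a bracket string."""
    heights = []
    depth = 0
    for symbol in structure:
        if symbol == ")":
            depth -= 1            # the pair being closed does not enclose its own end
        heights.append(depth)
        if symbol == "(":
            depth += 1            # the pair being opened encloses what follows
    return heights


def positional_entropy(p_bp):
    """Assumed Shannon entropy per position from a symmetric pair-probability matrix."""
    entropies = []
    for row in p_bp:
        unpaired = max(0.0, 1.0 - sum(row))
        probabilities = [p for p in row if p > 0.0]
        if unpaired > 0.0:
            probabilities.append(unpaired)
        entropies.append(-sum(p * math.log2(p) for p in probabilities))
    return entropies


print(mountain_heights("((((...))))........"))
print(positional_entropy([[0.0, 0.5], [0.5, 0.0]]))
```

A position that is always paired with the same partner, or always unpaired, has entropy 0 and would be drawn as well-defined; a position split evenly between two states has entropy 1 bit.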

Figure 4: Relplot Diagram

Figure 5: Mountain Diagram

Multiple Runs of RNAfold
This section describes how to run multiple executions of RNAfold in one go.
Sometimes, as in our case study, RNAfold has to be run many times to understand how the
predicted secondary structure changes when the RNA sequence is modified. In our case, for example,
we would like to observe how the predicted structure changes when the nucleotides of interest are
changed one at a time.
Suppose that we have 100 RNA files containing the desired sequences, named base1,
base2, ..., base100.
The procedure below works on a Linux distribution. The commands have to be run one at a time:
1. for i in {1..100}; do RNAfold < base$i > mutation$i.out; mv rna.ps mutation$i.ps; done
This step executes RNAfold on all 100 input files and generates the related outputs;
2. for i in {1..100}; do convert -font FreeMono-Medium -fill red -draw "text 30,30 'file n.:$i'" mutation$i.ps mutation$i.jpg; done
This step converts all produced files from PS to JPEG format and inserts a note identifying the
related test.
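
The input files for such a batch can be generated rather than written by hand. The following sketch is a hypothetical helper: it assumes that "changing one nucleotide at a time" means substituting each chosen position with each of the three other bases, and it writes one bare sequence per file using the base1, base2, ... naming from the procedure above.

```python
from pathlib import Path

BASES = "ACGU"


def point_mutants(seq, positions):
    """Return every sequence that differs from seq at exactly one listed position.

    Positions are 0-based indexes into seq.
    """
    mutants = []
    for position in positions:
        for base in BASES:
            if base != seq[position]:
                mutants.append(seq[:position] + base + seq[position + 1:])
    return mutants


def write_inputs(mutants, directory=".", prefix="base"):
    """Write each mutant to <directory>/<prefix><number> and return the paths."""
    paths = []
    for number, mutant in enumerate(mutants, start=1):
        path = Path(directory) / f"{prefix}{number}"
        path.write_text(mutant + "\n")
        paths.append(path)
    return paths


wild_type = "CCGCACAGCGGGCAGUGCC"
mutants = point_mutants(wild_type, range(len(wild_type)))
print(len(mutants), mutants[:3])
```

Calling write_inputs(mutants) would then create the files that the shell loop in step 1 reads; the upper bound of that loop has to match the number of files produced.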

The MaxExpect program (http://rna.urmc.rochester.edu/RNAstructureWeb) is part of the
RNAstructure package by the Mathews Lab, University of Rochester Medical Center, Department of
Biochemistry and Biophysics. Continued development of RNAstructure is made possible by the
support of NIH grant R01GM076485. MaxExpect predicts the maximum expected accuracy (MEA)
structure, a structure that maximizes pair probabilities, with the following usage: MaxExpect
<input file> <ct file> [options], where:

<input file>  The name of a file containing input data. This input data can be in one of two formats:
              1. Partition function save file (holds base pairing probability data for all pairs and
                 can be generated using the partition interface).
              2. Sequence file (holds raw sequence: .seq or .fasta).
              Note that lowercase nucleotides are forced single-stranded in structure prediction.
              Note that in order to use a sequence file, the "sequence" flag must be specified (see
              "--sequence" below).
<ct file>     The name of a CT file to which output will be written.

The simplest input file layout is the FASTA format:

>clusterN1.seq
GCTACATGGAGATTAACTCAATCTAGAGGGTATTAATAA

Supposing that your input file is named clusterN1.seq, you can produce the
secondary structure prediction with MaxExpect by typing the following
command: MaxExpect --sequence clusterN1.seq clusterN1.out --gamma 1
--percent 10 --structures 20 --window 3. This will generate a file named
clusterN1.out (as specified) containing information about the predicted
structure, as shown in Figure 6.

Figure 6: MEA distribution
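
To compare a MaxExpect prediction with RNAfold output, the CT file can be converted to bracket notation. The sketch below assumes the common connectivity table layout, which is not described in the text: a first line starting with the sequence length, followed by one line per nucleotide whose first, second and fifth columns are the index, the base and the index of the pairing partner (0 if unpaired). Only the first structure in the file is read, and the function name is hypothetical.

```python
def ct_to_dot_bracket(ct_text):
    """Return (sequence, bracket string) for the first structure in a CT file."""
    lines = [line for line in ct_text.splitlines() if line.strip()]
    length = int(lines[0].split()[0])      # first token of the header line
    sequence = []
    structure = []
    for line in lines[1:length + 1]:
        fields = line.split()
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner == 0:
            structure.append(".")          # unpaired nucleotide
        elif partner > index:
            structure.append("(")          # partner lies downstream
        else:
            structure.append(")")          # partner lies upstream
    return "".join(sequence), "".join(structure)


# Hand-written example record, not actual MaxExpect output.
example_ct = """5 example
1 G 0 2 5 1
2 A 1 3 0 2
3 A 2 4 0 3
4 A 3 5 0 4
5 C 4 0 1 5
"""
print(ct_to_dot_bracket(example_ct))
```

The conversion relies on the structure being pseudoknot free, since plain parentheses cannot express crossing pairs.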


After that, you can create a picture of the predicted structure with the command draw
clusterN1.out clusterN1.ps, which creates the file clusterN1.ps from the data in clusterN1.out. The
resulting graphic file clusterN1.ps can be opened by typing gv clusterN1.ps. Something like the
following will be shown:

Figure 7: MEA Secondary Structure Prediction

With the following command line you can run MaxExpect many times on different input files, as
we did previously with RNAfold:

for i in {1..10}; do MaxExpect --sequence clusterN$i.seq clusterN$i.out --gamma 1 --percent 10 --structures 20 --window 3; draw clusterN$i.out clusterN$i.ps; done

Bibliography
[1] M. Zuker and P. Stiegler, Optimal computer folding of large RNA sequences using thermodynamics
and auxiliary information, Nucleic Acids Res., 9, 1981, 133-148.

[2] D.H. Mathews, J. Sabina, M. Zuker, and D.H. Turner, Expanded sequence dependence of
thermodynamic parameters improves prediction of RNA secondary structure, J. Mol. Biol., 288,
1999, 911-940.

[3] A.P. Gultyaev, F.H.D. van Batenburg, and C.W.A. Pleij, The computer simulation of RNA folding
pathways using a genetic algorithm, J. Mol. Biol., 250, 1995, 37-51.

[4] C.B. Do, D.A. Woods, and S. Batzoglou, CONTRAfold: RNA secondary structure prediction without
physics-based models, Bioinformatics, 22(14), 2006, e90-e98.
