Abstract
Predicting the secondary structure of an RNA molecule from its sequence is one of the most active
research fields in bioinformatics. The most common prediction method is Minimum Free Energy
(MFE). It has recently emerged that a more accurate secondary structure can be predicted by
computing the Maximum Expected Accuracy (MEA) structure. In this article we briefly show
how MEA and MFE work and supply two working protocols that show how to use them.
Introduction
To keep things simple, consider an RNA molecule to be a strand of four types of bases: Adenine (A),
Cytosine (C), Guanine (G), and Uracil (U). We represent an RNA molecule as a string over {A, C,
G, U}, with the left end corresponding to the 5' end of the molecule. In the hybridization process,
pairs of bases in RNA form hydrogen bonds, with the complementary pairs C-G and A-U being
stronger than the others. A folded molecule is largely held together by the resulting set of bonds, called
its secondary structure, as shown in Figure 1. Knowledge of the secondary structure of a folded
RNA molecule gives valuable insight into its function. Abstractly, we
represent the secondary structure of an RNA molecule
>Sample RNA
AAAAAAAAAAAAAGGGGGGGUUUUUUUUUUUUUUCCCCCCCCCCCCCCCCC
.............(((((((..............)))))))..........
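The bracket line in the sample above encodes the structure position by position: a dot is an unpaired base and each matching pair of parentheses is one base pair. As a minimal sketch (the function name `parse_dot_bracket` is introduced here for illustration and does not belong to any package), such a string can be converted into an explicit list of base pairs with a stack:

```python
from typing import List, Tuple


def parse_dot_bracket(structure: str) -> List[Tuple[int, int]]:
    """Return the base pairs (i, j), 1-based with i < j, encoded by a dot-bracket string."""
    open_positions: List[int] = []   # positions of '(' still waiting for a partner
    pairs: List[Tuple[int, int]] = []
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            # The most recently opened bracket closes first, so pairs are nested.
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # Short hypothetical structure, not taken from the article.
    print(parse_dot_bracket("((...))"))
```

Because a single bracket type can only express nested pairs, this representation covers exactly the pseudoknot-free structures discussed in the rest of the article.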
MFE
Free energy minimization, which seeks the Minimum Free Energy (MFE) structure, has been the most
popular method for RNA secondary structure prediction for decades. It is based on a set of empirical
free energy change parameters derived from experiments using a nearest-neighbor model. The
thermodynamic model for RNA structure formation posits that, out of the exponentially many
possibilities, an RNA molecule folds into the structure with the minimum free energy. Free energy
models typically assume that the total free energy of a given secondary structure for a molecule is the
sum of independent contributions of adjacent, or stacked, base pairs in stems (which tend to stabilize
the structure) and of loops (which tend to destabilize the structure). These contributions depend on
temperature, the concentration of the molecule in solution, and the ionic concentration of the solution.
Zuker and Stiegler [1] describe a dynamic programming algorithm for finding the MFE
pseudoknot-free secondary structure of a given molecule.
Let the input strand be $b_1, b_2, \ldots, b_n$. Suppose that $W(i,j)$ is the energy of the MFE
pseudoknot-free secondary structure for strand $b_i, \ldots, b_j$, and $V(i,j)$ is the energy of the
MFE pseudoknot-free secondary structure for strand $b_i, \ldots, b_j$ among those structures
containing base pair $(i,j)$. Then, $W$ satisfies the following recurrence (base cases excluded):
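A standard statement of this recurrence is given here as an assumption about the intended formula; the split index $k$ is introduced for illustration:

```latex
W(i,j) = \min \Big\{\, W(i+1,j),\;\; W(i,j-1),\;\; V(i,j),\;\;
         \min_{i \le k < j} \big[\, W(i,k) + W(k+1,j) \,\big] \Big\}
```

The four cases correspond to $b_i$ being unpaired, $b_j$ being unpaired, $b_i$ pairing with $b_j$, and the strand splitting into two parts that fold independently.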
MEA
As an alternative to finding the most probable structure, the structure with Maximum
Expected Accuracy (MEA) can be predicted. Pseudoknot-free structures are predicted by
maximizing the sum of the base-paired and single-stranded nucleotide probabilities, called the expected
accuracy, where pairing probabilities can be weighted by a specific factor. CONTRAfold [4], the
first implementation of MEA, uses probabilistic parameters learned from a set of RNA secondary
structures to predict base-pair probabilities and then predicts structures using the maximum
expected accuracy approach. The base-pair probability of an i-j pair, $P_{bp}(i,j)$, is predicted from a
partition function calculation based on a biophysical model, the nearest-neighbor model. The
probability of nucleotide i being single-stranded is:
$P_{ss}(i) = 1 - \sum_{j} P_{bp}(i,j)$
After the base-pair probabilities have been calculated with the partition function, a
dynamic programming algorithm is used to find the structure with maximum expected accuracy.
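The two steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the code of CONTRAfold or of any other tool: the function names, the 0-based indexing, the minimum hairpin size `min_loop`, and the choice to score a pair as `2 * gamma * P_bp(i, j)` (one common convention for the weighting factor) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

PairProbs = Dict[Tuple[int, int], float]  # (i, j) with i < j, 0-based -> P_bp(i, j)


def single_stranded_probs(n: int, pair_probs: PairProbs) -> List[float]:
    """P_ss(i) = 1 - sum_j P_bp(i, j) for every position i."""
    pss = [1.0] * n
    for (i, j), p in pair_probs.items():
        pss[i] -= p
        pss[j] -= p
    return pss


def mea_structure(n: int, pair_probs: PairProbs,
                  gamma: float = 1.0, min_loop: int = 3) -> str:
    """Pseudoknot-free structure maximizing the (assumed) expected-accuracy score."""
    pss = single_stranded_probs(n, pair_probs)
    best = [[0.0] * n for _ in range(n)]  # best[i][j]: optimum for positions i..j

    def score(i: int, j: int) -> float:
        return best[i][j] if i <= j else 0.0  # an empty interval scores 0

    def pair_score(i: int, j: int):
        p = pair_probs.get((i, j))
        if p is None or j - i <= min_loop:
            return None
        return score(i + 1, j - 1) + 2.0 * gamma * p

    for span in range(n):  # fill by increasing interval length
        for i in range(n - span):
            j = i + span
            options = [score(i + 1, j) + pss[i], score(i, j - 1) + pss[j]]
            paired = pair_score(i, j)
            if paired is not None:
                options.append(paired)
            options.extend(best[i][k] + best[k + 1][j] for k in range(i, j))
            best[i][j] = max(options)

    # Traceback: re-derive which case produced each optimum.
    structure = ["."] * n
    stack = [(0, n - 1)]
    eps = 1e-9
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        target = best[i][j]
        paired = pair_score(i, j)
        if paired is not None and abs(target - paired) < eps:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        elif abs(target - (score(i + 1, j) + pss[i])) < eps:
            stack.append((i + 1, j))
        elif abs(target - (score(i, j - 1) + pss[j])) < eps:
            stack.append((i, j - 1))
        else:
            for k in range(i + 1, j):
                if abs(target - (best[i][k] + best[k + 1][j])) < eps:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return "".join(structure)


if __name__ == "__main__":
    # Hypothetical pair probabilities for a 10-nucleotide strand.
    probs = {(0, 9): 0.9, (1, 8): 0.8, (2, 7): 0.1}
    print(mea_structure(10, probs))
```

In this sketch, lowering `gamma` makes pairs less attractive relative to unpaired positions, so fewer but more confident pairs are kept.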
Computational Tools
The Vienna RNA Package (http://www.tbi.univie.ac.at/RNA/) consists of a C code library and
several stand-alone programs for the prediction and comparison of RNA secondary structures.
Vienna provides three kinds of dynamic programming algorithms for structure prediction: the
minimum free energy algorithm of Zuker and Stiegler [1], which yields a single optimal structure; the
partition function algorithm of McCaskill (1990), which calculates base pair probabilities in the
thermodynamic ensemble; and the suboptimal folding algorithm of Wuchty et al. (1999), which
generates all suboptimal structures within a given energy range of the optimal energy. In this case
study, we are interested in the RNAfold program.
RNAfold reads RNA sequences from stdin, calculates their minimum free energy (MFE) structure,
and prints to stdout the MFE structure in bracket notation and its free energy. If the -p option is
given, it also computes the partition function (Z) and the base pairing probability matrix, and prints the
free energy of the thermodynamic ensemble, the frequency of the MFE structure in the ensemble,
and the ensemble diversity to stdout.
It also produces PostScript files with plots of the resulting secondary structure graph and a "dot
plot" of the base pairing matrix. The dot plot shows a matrix of squares with area proportional to the
pairing probability in the upper right half, and one square for each pair in the minimum free energy
structure in the lower left half.
Secondary Structure Prediction using RNAfold (Vienna PKG)
Suppose that our RNA sequence is CCGCACAGCGGGCAGUGCC.
1. As a first step, we need to create a file (test1.fasta) in FASTA format containing our
sequence.
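A FASTA file consists of a header line starting with `>` followed by the sequence. The sketch below writes such a file and then, only if RNAfold is found on the PATH, runs the equivalent of `RNAfold < test1.fasta > test1_out`. The record name `test1` and the helper names are assumptions made for illustration, and the plain RNAfold call is inferred from the `-p` variant quoted later in the text.

```python
import shutil
import subprocess
from pathlib import Path


def write_fasta(path, name: str, sequence: str) -> Path:
    """Write one sequence as a FASTA record: a '>' header line, then the sequence."""
    path = Path(path)
    path.write_text(f">{name}\n{sequence}\n")
    return path


def run_rnafold(fasta_path, out_path) -> bool:
    """Run `RNAfold < fasta_path > out_path`; return False if RNAfold is not installed."""
    if shutil.which("RNAfold") is None:
        return False
    with open(fasta_path) as fin, open(out_path, "w") as fout:
        subprocess.run(["RNAfold"], stdin=fin, stdout=fout, check=True)
    return True


if __name__ == "__main__":
    fasta = write_fasta("test1.fasta", "test1", "CCGCACAGCGGGCAGUGCC")
    if not run_rnafold(fasta, "test1_out"):
        print("RNAfold not found; only test1.fasta was written")
```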
Running RNAfold on this file (RNAfold < test1.fasta > test1_out) produces a text output
whose last line contains the predicted MFE structure in bracket notation and its free
energy in kcal/mol.
A dot in the bracket notation represents an unpaired position, while a base pair (i, j) is
represented by a pair of matching parentheses at positions i and j.
A PostScript file (RNA_ss.ps) is also created, containing a graphical representation
of the predicted secondary structure. You can open it by typing the command gv RNA_ss.ps.
It should be similar to the following:
Figure 2: Plot of RNA_ss.ps
The dot plot gives a better picture of the pair probabilities and of alternative structural
configurations.
You can produce it by adding the option -p to the previous command: RNAfold -p < test1.fasta >
test1_out. This also produces a file named RNA_dp.ps showing the pair
probabilities within the equilibrium ensemble. A square at row i and column j of the matrix indicates a
base pair. The area of a square in the upper right half of the matrix is proportional to the probability
of the base pair (i, j) within the equilibrium ensemble. The lower left half shows all pairs belonging
to the MFE structure. While the MFE structure consists of a single helix, 3 different helices are
clearly visible in the pair probabilities.
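The same information can be read from the dot plot file as numbers. The sketch below assumes the usual layout of a ViennaRNA dot plot, in which each data line has the form `i j value ubox` (upper right half, where `value` is the square root of the pair probability) or `i j value lbox` (lower left half, MFE pairs). That layout is an assumption to check against your own RNA_dp.ps, and the function name is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# One data line of the dot plot: "<i> <j> <value> ubox" or "... lbox".
_BOX_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+([0-9.eE+-]+)\s+(ubox|lbox)\s*$")


def parse_dot_plot(text: str) -> Tuple[Dict[Tuple[int, int], float], List[Tuple[int, int]]]:
    """Return (pair probabilities, MFE pairs) from the text of a dot plot file."""
    probabilities: Dict[Tuple[int, int], float] = {}
    mfe_pairs: List[Tuple[int, int]] = []
    for line in text.splitlines():
        match = _BOX_LINE.match(line)
        if match is None:
            continue  # PostScript header, macro definitions, sequence, etc.
        i, j, value = int(match.group(1)), int(match.group(2)), float(match.group(3))
        if match.group(4) == "ubox":
            probabilities[(i, j)] = value * value  # stored value is sqrt(probability)
        else:
            mfe_pairs.append((i, j))
    return probabilities, mfe_pairs


if __name__ == "__main__":
    # Hypothetical excerpt; real files contain many more lines.
    sample = "1 19 0.9 ubox\n2 18 0.5 ubox\n1 19 0.95 lbox\n"
    probs, mfe = parse_dot_plot(sample)
    print(probs, mfe)
```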
You can open RNA_dp.ps by typing the command gv RNA_dp.ps. It should be similar to the following:
To visualize which parts of a predicted MFE structure are well-defined and thus more reliable
(producing a new diagram), we use the following commands:
RNAfold -p < test1.fasta, to generate the RNA_dp.ps and RNA_ss.ps files;
/usr/share/ViennaRNA/bin/mountain.pl RNA_dp.ps | xmgrace -pipe, to produce a mountain
plot. This is an xy-diagram plotting the number of base pairs enclosing a sequence position
against the position. The resulting plot shows three curves: two mountain plots, derived from
the MFE structure (red) and from the pairing probabilities (black), and a positional entropy
curve (green). Well-defined regions are identified by low entropy. By superimposing
several mountain plots, structures can easily be compared;
/usr/share/ViennaRNA/bin/relplot.pl RNA_ss.ps RNA_dp.ps > RNA_rss.ps, to produce a
diagram of the predicted structure that also contains probability information. The Perl
script relplot.pl adds reliability information to an RNA secondary structure plot in the form
of color annotation.
The script computes a well-definedness measure called "positional entropy" and encodes it as
color hue, ranging from red (low entropy, well-defined) via green to blue and violet (high entropy,
ill-defined).
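The two quantities plotted above can be illustrated with a short sketch. It is not the code of mountain.pl or relplot.pl: the function names are hypothetical, the mountain height counts the pairs strictly enclosing each position, and the positional entropy is taken to be the Shannon entropy (base 2, an assumption) of each position's distribution over its possible partners plus the unpaired state.

```python
import math
from typing import Dict, List, Tuple


def mountain_heights(structure: str) -> List[int]:
    """Number of base pairs enclosing each position of a dot-bracket structure."""
    heights: List[int] = []
    depth = 0
    for symbol in structure:
        if symbol == "(":
            heights.append(depth)  # the opening base is not enclosed by its own pair
            depth += 1
        elif symbol == ")":
            depth -= 1
            heights.append(depth)  # likewise for the closing base
        else:
            heights.append(depth)
    return heights


def positional_entropy(n: int, pair_probs: Dict[Tuple[int, int], float]) -> List[float]:
    """Per-position entropy: low values mean a well-defined pairing state."""
    per_position: List[List[float]] = [[] for _ in range(n)]
    for (i, j), p in pair_probs.items():  # 0-based positions, i < j
        per_position[i].append(p)
        per_position[j].append(p)
    entropies: List[float] = []
    for probs in per_position:
        unpaired = max(0.0, 1.0 - sum(probs))
        terms = probs + [unpaired]
        entropies.append(-sum(p * math.log2(p) for p in terms if p > 0.0))
    return entropies


if __name__ == "__main__":
    print(mountain_heights("((..))"))
    # Hypothetical probabilities: position 0 pairs with position 1 half of the time.
    print(positional_entropy(2, {(0, 1): 0.5}))
```

A position that is always paired with the same partner, or always unpaired, has entropy 0 under this definition, which matches the reading that low entropy marks well-defined regions.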
<input file> The name of a file containing input data. This input data can be in one of two formats:
1. Partition function save file (holds base pairing probability data for all pairs and can be
generated using the partition interface).
2. Sequence file (holds raw sequence: .seq or .fasta).
Note that lowercase nucleotides are forced single-stranded in structure prediction.
Note that in order to use a sequence file, the "sequence" flag must be specified (see
"--sequence" below).
<ct file> The name of a CT file to which output will be written.
GCTACATGGAGATTAACTCAATCTAGAGGGTATTAATAA
Supposing that your input file is named clusterN1.seq, you can produce the
secondary structure prediction with MaxExpect by typing the following
command: MaxExpect --sequence clusterN1.seq clusterN1.out --gamma 1
--percent 10 --structures 20 --window 3. This will generate a file named
clusterN1.out (as specified) containing information about the predicted
structure, as shown in Figure 6.
With the following command line you can run MaxExpect many times on different input files, as
we did previously with RNAfold:
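One way to do such a batch run is sketched below in Python. The sketch reuses exactly the MaxExpect options quoted above and assumes that every input file ends in `.seq` and that each output should sit next to its input with the extension `.out`; the function names and the directory argument are hypothetical. The commands are executed only if MaxExpect is found on the PATH.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def maxexpect_command(seq_file: Path) -> List[str]:
    """Build the MaxExpect call for one sequence file, with the options used in the text."""
    out_file = seq_file.with_suffix(".out")
    return [
        "MaxExpect", "--sequence", str(seq_file), str(out_file),
        "--gamma", "1", "--percent", "10", "--structures", "20", "--window", "3",
    ]


def run_batch(directory) -> List[List[str]]:
    """Run MaxExpect on every *.seq file in a directory; return the commands built."""
    commands = [maxexpect_command(path) for path in sorted(Path(directory).glob("*.seq"))]
    if shutil.which("MaxExpect") is None:
        return commands  # tool not installed: report the commands without running them
    for command in commands:
        subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in run_batch("."):
        print(" ".join(command))
```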
[2] D.H. Mathews, J. Sabina, M. Zuker, and D.H. Turner, Expanded sequence dependence of
thermodynamic parameters improves prediction of RNA secondary structure, J. Mol. Biol., 288,
1999, 911-940.
[3] A.P. Gultyaev, F.H.D. van Batenburg, and C.W.A. Pleij, The computer simulation of RNA folding
pathways using a genetic algorithm, J. Mol. Biol., 250, 1995, 37-51.
[4] C.B. Do, D.A. Woods, and S. Batzoglou, CONTRAfold: RNA secondary structure prediction without
physics-based models, Bioinformatics, 22(14), 2006, e90-e98.