ABSTRACT

Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile–profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.

Proteins 2008; 71:1387–1399. © 2007 Wiley-Liss, Inc.

Key words: membrane protein; homology detection; hidden Markov model; membrane topology.

The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.com/jpages/0887-3585/suppmat/
Grant sponsor: European Commission (BioSapiens and EMBRACE); Grant numbers: LSHG-CT-2003-503265 and LSHG-CT-2004-512092; Grant sponsor: Swedish Research Council.
*Correspondence to: Arne Elofsson, Department of Biochemistry and Biophysics, Stockholm University, SE-10691, Stockholm, Sweden. E-mail: arne@bioinfo.se
Received 3 May 2007; Revised 10 August 2007; Accepted 13 September 2007
Published online 12 December 2007 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/prot.21825

INTRODUCTION

Protein annotation lags behind in the rapidly increasing amount of sequence data resulting from the numerous ongoing genome sequencing projects [1]. In general, the easiest way to obtain information about newly sequenced proteins is to transfer annotations from well-characterized homologous proteins [2,3]. Accordingly, development of existing algorithms for homology detection is of great importance for better characterization of new protein sequences.

Searching for protein homologs by aligning a single sequence to a sequence database (e.g., BLAST [4]) relies on the assumption that amino acid substitution rates are independent of the position in the alignment and is unable to capture variable conservation levels along the sequence. Using sequence profiles instead allows for position specific scoring (e.g., PSI-BLAST [5]) and improves distant homology detection for globular proteins. Profile–profile based methods, which incorporate evolutionary information for both query and database sequences, are often able to detect even more remote relationships [6–9].

Integral membrane proteins constitute about 25% of most proteomes [10] and are present in all different kinds of membranes in all cells. In addition to the biological interest of membrane proteins, their medical importance as drug targets is also well-recognized [11]. As a result of the hydrophobic environment within the membrane, transmembrane proteins are subject to particular structural constraints and, as a consequence, have different amino acid composition and residue exchangeabilities compared with globular proteins [12]. Since algorithms for homology detection are often developed with globular proteins in mind, they may not be optimal to detect distant homology between membrane proteins.

The detection of globular proteins is a well-studied field and it has been established that the detection can be improved by including information from secondary structure predictions [13–16] and by the use of evolutionary information for both query and target proteins [7,9,17–22].
A. Bernsel et al.
In contrast to globular proteins, only a few attempts have already been made to utilize the sequence constraints specific to membrane proteins for homology detection. Ng et al. [23] clustered predicted transmembrane regions to calculate a transmembrane specific substitution matrix that is used in place of the usual BLOSUM or PAM matrices when aligning membrane proteins. In a previous study, we used topology predictions to promote alignments where membrane regions are aligned against each other, in a profile–sequence based approach [24]. Recently, a benchmark study showed that profile–profile based methods were more accurate than profile–sequence alignments for detecting remote membrane protein homologs, and additionally, that secondary structure prediction algorithms developed for globular proteins perform well also for membrane proteins [25].

Here, we are combining the membrane protein specific sequence constraints with the strengths of profile–profile based alignments in a hidden Markov model (HMM) framework, which is able to combine the sequence signal with either hydrophobicity or predictions of transmembrane topology. We expect that, for distant membrane protein homologs, the topology or hydrophobicity pattern should be more closely correlated with the protein structure, and therefore could be better conserved than sequence alone. It is thus possible that introducing this kind of information into the model, and using it in parallel with sequence, could improve alignments between remote membrane protein homologs compared with purely sequence based approaches.

The idea is to construct profile hidden Markov models [26,27] from multiple sequence alignments and extend these with a second alphabet corresponding to predictions of either hydrophobicity or transmembrane topology, in a similar way as has previously been done for globular proteins [16,28,29]. Sequence profiles, which also contain the same type of additional information, are then scored against the model and paths through the model corresponding to alignments where transmembrane regions are matched against each other will have a relatively higher probability. Accordingly, alignments between proteins with similar patterns in hydrophobicity or similar topologies will obtain a higher score and discrimination between protein families with distinct topologies will be improved.

METHODS

Profile construction

Sequence profiles were constructed from PSI-BLAST [5] (version 2.2.10) alignments against a large nonredundant sequence database [30] (UniRef90). Different numbers of iterations and E-value cutoff limits for inclusion in the alignment were explored and evaluated according to predictive performance (E-value cutoff $10^{-5}$ and three iterations were used, see Results section). In addition to the amino acid probabilities, each position in the profile was also assigned one of two hydrophobicity or topology classes, representing the second alphabet as described later.

Hydrophobicity calculations

For each column in the alignment, the average hydrophobicity was calculated according to the included sequences, using the GES hydrophobicity scale [31]. Columns were then assigned a hydrophobicity value based on the weighted average of hydrophobicities for the neighborhood of 21 columns (including itself), using a sliding window with symmetric trapezoidal weights (for details, see the first step of the Toppred algorithm) [32]. To incorporate the hydrophobicity information into the model, it first needs to be discretized, and after exploring several different possibilities, we achieved the best results simply using a two-class representation corresponding to hydrophobic and hydrophilic, respectively. To choose the hydrophobicity cutoff value between these two classes as well as possible, we calculated the hydrophobicities for all columns of the alignments against UniRef90 of a large collection of 31,550 predicted membrane proteins from Swiss-Prot [33], required to contain at least one transmembrane helix according to a TMHMM [10] prediction. After averaging over neighboring columns, the hydrophobicity distribution forms two quite distinct peaks (Supplementary Figure S1), corresponding to transmembrane and nontransmembrane positions, respectively, and it can be seen that the best choice of cutoff value to separate between these peaks is zero, which is also the biologically most reasonable value. In this way, the columns of all alignments could be classified into one of the two classes.

Topology predictions

As an alternative for additional information to be used as a second alphabet in our models, we also predicted transmembrane topologies from the alignments. Topologies were predicted for all sequences in the alignment separately, using the TMHMM (version 2.0 [10]) topology prediction algorithm. Each position in the alignment was then classified as being transmembrane, if more than 50% of the sequences were predicted to be transmembrane at that particular position, and otherwise classified as a loop region. Any sequences with gaps at that position were not taken into account for the classification. We also tried using PRO-TMHMM (version 0.92 [34]) to predict topologies of whole alignments rather than of all sequences separately, which gave slightly inferior although comparable results.

HMM construction and scoring

From the alignments, profile HMMs were constructed using the HMMER package (version 2.3.2 [27]), with each
position in the query sequence represented by one match state in the model, and the whole model was configured for a single local/local alignment (Smith/Waterman style). A match state in the model could either be classified as transmembrane/hydrophobic or loop/hydrophilic, in accordance with the classifications of the corresponding positions in the alignment as described earlier. Emission probabilities for the second alphabet, $e_M$ and $e_L$, corresponding to transmembrane/hydrophobic and loop/hydrophilic respectively, were then set to:

$$e_M = 0.5, \qquad e_L = 0.5 \tag{1}$$

for insert states and match states classified as loop/hydrophilic, and

$$e_M = 0.5 + n, \qquad e_L = 0.5 - n \tag{2}$$

for match states classified as transmembrane/hydrophobic, where $0 \le n \le 0.5$. The parameter $n$ can be seen as a weight that reflects how much confidence should be put to the second alphabet, as compared with the sequence alphabet. A value of $n$ close to 0.5 indicates that we believe hydrophobicity or topology to be highly reliable when judging homologous relationships. We also tried using the actual frequencies of the classes from the alignment as parameters as well as different numbers of classes and continuous distributions, but none of these trials improved the results.

Each emitting state in the model thus contained both an amino acid distribution and a hydrophobicity/topology distribution as given by Eqs (1) and (2). The dependencies between the random variables in the model are illustrated in Figure 1. Two symbol sequences, corresponding to amino acid sequence and hydrophobicity/topology sequence, are generated from the emitting states of the model. The probability for being in state $S_t$ at time step $t$ is only dependent on the state at the previous time step, $S_{t-1}$. Emission probabilities for both alphabets are completely determined by the state in the current time step and are independent of one another.

Distant Homology Detection of Membrane Proteins

Figure 1
Dependencies between random variables in the multitrack HMM. The state at the current time step, $S_t$, is only dependent on the state in the previous time step, $S_{t-1}$. At each time step, one symbol from each of the two alphabets is emitted, $A_t$ and $B_t$. Emission probabilities for the two alphabets are independent of one another and are completely determined by the current state, $S_t$.

The probability for emitting a particular amino acid distribution vector was calculated as the dot product between the profile vector and the HMM state vector (technically, this is not a probability, however it corresponds to the probability of emitting an amino acid in the single sequence case), as has been previously described [7,9]. The total probability for emitting a profile column was calculated as the product between the probabilities for the two alphabets. The Forward algorithm [35] was employed to score the profiles against the model and all calculations were performed using the modhmm package (http://modhmm.org [34]). In comparison with two, out of several, previous multitrack HMM methods using secondary structure as second alphabet, SAM-T02 [29] and ssHMM [16], the main differences from our method are as follows: (1) whereas SAM-T02 and ssHMM only score the models against single sequences, we score against sequence profiles and so include evolutionary information from both proteins being aligned; and (2) in our case, the distribution for the second alphabet is fixed according to equations 1 and 2, whereas SAM-T02 and ssHMM use the actual observed frequencies from the alignment. In addition, the SAM-T02 parameters are optimized in an iterative sequence database search, whereas our method and ssHMM rely on observed frequencies from columns of multiple sequence alignments.

Alignments were ranked according to a score that reflects the overall similarity between the aligned proteins. There are several possible options for what score to use, and here, we examined two different possibilities. The logodds score, defined as

$$\mathrm{score}_{\mathrm{logodds}} = \log \frac{P(\mathrm{sequence} \mid \mathrm{model})}{P(\mathrm{sequence} \mid \mathrm{null\ model})} \tag{3}$$
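As a rough illustration of Eqs (1) to (3), the sketch below computes the second-alphabet emission probabilities for one state, combines them with the amino acid dot product for a single profile column, and forms a log-odds ratio. It is not the modhmm implementation: the function names, the 'M'/'L' class labels, and the use of the natural logarithm are assumptions made here for illustration only.

```python
import math
import numpy as np

def second_track_emissions(state_is_tm, n):
    """Second-alphabet emissions for one state, following Eqs (1) and (2).

    state_is_tm is True only for match states classified as
    transmembrane/hydrophobic; insert states and loop/hydrophilic
    match states use the uninformative 0.5/0.5 distribution.
    """
    if not 0.0 <= n <= 0.5:
        raise ValueError("n must lie in [0, 0.5]")
    if state_is_tm:
        return {"M": 0.5 + n, "L": 0.5 - n}
    return {"M": 0.5, "L": 0.5}

def column_emission(state_aa, profile_aa, state_is_tm, column_class, n):
    """Score for emitting one profile column from one state.

    Amino acid term: dot product of the state's amino acid distribution
    and the profile column's distribution. Total: product of the amino
    acid term and the second-alphabet term.
    """
    aa_term = float(np.dot(state_aa, profile_aa))
    return aa_term * second_track_emissions(state_is_tm, n)[column_class]

def log_odds(p_model, p_null):
    """Log-odds as in Eq (3); the natural log is an assumption here."""
    return math.log(p_model / p_null)
```

With `n = 0` every emitted column receives the same factor 0.5 from the second alphabet, so that alphabet carries no information; as `n` approaches 0.5, a transmembrane match state increasingly penalizes columns classified as loop/hydrophilic.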
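The two ways of assigning a second-alphabet class to each alignment column, described under "Hydrophobicity calculations" and "Topology predictions", can be sketched as follows. Several details are assumptions rather than taken from the text: the core width of the trapezoid (11 is used purely for illustration), the renormalization of weights at the alignment ends, the sign convention that values above the cutoff are hydrophobic, the 'M'/'L'/'-' labels, and the treatment of all-gap columns as loop. The per-column average hydrophobicities are taken as input, so no hydrophobicity scale values are hard-coded.

```python
import numpy as np

def trapezoid_weights(full=21, core=11):
    """Symmetric trapezoidal window: linear ramps around a flat core."""
    if full % 2 == 0 or core % 2 == 0 or core > full:
        raise ValueError("full and core must be odd, with core <= full")
    wedge = (full - core) // 2
    ramp = np.arange(1, wedge + 1) / (wedge + 1)
    return np.concatenate([ramp, np.ones(core), ramp[::-1]])

def smooth(values, weights):
    """Weighted average over the neighborhood of each column."""
    values = np.asarray(values, dtype=float)
    half = len(weights) // 2
    out = np.empty_like(values)
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        w = weights[lo - i + half: hi - i + half]  # truncated at the ends
        out[i] = np.dot(w, values[lo:hi]) / w.sum()
    return out

def classify_columns(column_hydrophobicity, cutoff=0.0):
    """Two-class discretization: 'M' (hydrophobic) or 'L' (hydrophilic)."""
    smoothed = smooth(column_hydrophobicity, trapezoid_weights())
    return "".join("M" if v > cutoff else "L" for v in smoothed)

def consensus_topology(per_sequence_labels):
    """Column is 'M' if more than 50% of the non-gap sequences are 'M'."""
    ncols = len(per_sequence_labels[0])
    out = []
    for c in range(ncols):
        col = [s[c] for s in per_sequence_labels if s[c] != "-"]
        n_tm = sum(1 for x in col if x == "M")
        out.append("M" if col and n_tm > 0.5 * len(col) else "L")
    return "".join(out)
```

For example, `consensus_topology(["MML-", "ML-L", "MLLL"])` ignores the gapped sequences in the last two columns and classifies only the first column as transmembrane.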
do not belong to the same family (to filter out obvious cases), and (iii) that they have a Structal alignment P-value < $10^{-3}$.

Aiming to reveal new putative homologous relationships, we also applied our method to the Pfam domain database [45] version 19.0. Probable transmembrane protein domains, containing at least one TM-helix, were extracted using PRO-TMHMM predictions [34], and profiles were built from the Pfam seed alignments. In addition, profile HMMs were built from the Pfam alignments using the HMMER [27] package and the HMMs were then scored against the profiles using the modhmm program [34] (available at http://modhmm.org).

Transmembrane overlap

RESULTS AND DISCUSSION

GPCRDB

In evaluating methods for homology identification, it is obviously crucial that the true evolutionary relationships between proteins in the test set are well known. In particular, negative relationships, that is, knowledge about two proteins not being related, can be difficult to obtain and therefore high scoring false positives might actually be undiscovered true relationships. To minimize this potential problem, we chose to initially test our method on one of the most well studied and characterized membrane protein superfamilies, namely the G-protein coupled receptors (GPCRs). Our dataset used here includes and extends the one used by Hedman et al. in a previous study [24].
results. These values of the parameters were used for the remaining predictions, and also in the OPM, HOMEP and Pfam test sets (see later). No other parameters were optimized, that is, default values were used for all other parameters in PSI-BLAST and TMHMM.

ROC evaluation of homology detection

Extending the analysis to the complete dataset, we tested homology detection between GPCR classes both with a purely sequence based model containing just the amino acid sequence alphabet (SHRIMP-seq), and using either predicted hydrophobicity (SHRIMP-hphob) or topology (SHRIMP-topo) as additional alphabet [Fig. 2(A)]. As a reference, we also made predictions on the same dataset using three available methods, namely Palign, Pmembr and HHpred (see Methods). Interestingly, adding information about either hydrophobicity or topology (SHRIMP-hphob or SHRIMP-topo) seems to improve the purely sequence based predictions (SHRIMP-seq) over the whole range of FPRs. This indicates that for these remote homologs from different GPCR classes, where the sequences have apparently diverged quite far, the hydrophobicity pattern or topology is better conserved and can improve the predictions. Compared with the profile–sequence based methods, Palign and Pmembr, the profile–profile based methods, SHRIMP and HHpred, perform better especially at low FPRs.

Investigating further the ability of all methods to detect proteins from the individual GPCR classes, as well as some additional classes described in GPCRDB as putative GPCRs and one non-GPCR family (bacteriorhodopsin; Class Z), gives the results as in Table II. Here, our results are also compared to those from GPCRHMM [46], an HMM-based method specifically designed to recognize GPCRs. GPCRHMM gives a Yes/No output to a given query sequence, and obtains a false positive rate of 1.0% on our negative data set; thus, for a fair comparison, we used that same FPR for all methods in Table II.

Among the known GPCR classes (leftmost columns in Table II), the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors seem more difficult to detect than the others, and are only well-detected by the topology/hydrophobicity-improved SHRIMP models and GPCRHMM at this false positive rate. The finding of these two classes by the model also explains the leap from 0.7 TPR to 1.0 TPR of the SHRIMP-topo curve at about 0.01 FPR in Figure 2(A) (see also Supplementary Fig. S4).

Among the seven classes annotated as putative GPCRs in GPCRDB (middle columns in Table II), two classes, the Insect odorant receptors and Plant Mlo receptors (fourth column in Table II), are not found at this FPR by any of the methods. In fact, using either SHRIMP-topo or SHRIMP-hphob, these two putative GPCR classes are found at higher FPR, i.e. with lower confidence, than the non-GPCR class of bacteriorhodopsins (rightmost column in Table II; see Supplementary Fig. S4). On the other hand, the remaining five putative classes (third column in Table II) are found at higher fractions than two known GPCR classes (the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors, second column in Table II).

In conclusion, the main difference between the methods tried here lies primarily in their ability to identify the more distant members of the GPCR superfamily, the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors. SHRIMP-topo is able to find virtually all of the known GPCRs, and in addition classifies five of the seven putative GPCR classes as being true GPCRs, whereas the remaining two are classified as non-GPCRs. SHRIMP-topo and GPCRHMM give comparable results; however, GPCRHMM uses only one model for the whole superfamily of GPCRs whereas SHRIMP uses several, and the potential to discover completely new classes of GPCRs could thus be higher with GPCRHMM. On the other hand, SHRIMP-topo is not restricted to model just GPCRs but, as shown below, can be used to explore evolutionary relationships between any two membrane proteins.

In contrast to the results for SHRIMP, we were not able to improve the HHpred results by including information about predicted secondary structure using PSIPRED. It is likely that, whereas the locations of the TM helices are sufficiently conserved to provide useful information for improving alignments, secondary structure of globular domains is not informative in this respect for aligning membrane proteins.

We also evaluated the ability of the model to find close homologs within, rather than between, GPCR superfamilies (Supplementary Fig. S5). For these close homologs, sequence alone is sufficient to reach high TPR even at very low FPRs, and the improvement with the addition of the second alphabet is not as clear as before. However, the evolutionary information in the profile–profile based methods seems to help at low FPRs.

Analysis of false positives

To investigate the types of errors made by the different methods, we looked further into the highest scoring 1% of the negative examples, that is, the false positives at the FPR used in Table II, with respect to their average length and number of predicted TM helices (Table III). In general, it might be expected that proteins with similar lengths and topologies as the positive examples are more difficult to filter out, and that such proteins were to be overrepresented in the highest scoring false positives. This is also the case for all profile–profile based methods and in particular for the GPCR specialized predictor
GPCRHMM, for which the fraction of proteins with 6–8 predicted TM helices is significantly higher than the corresponding fraction for the negative dataset as a whole (10.9%). Somewhat unexpectedly, both profile–sequence based methods (Palign and Pmembr) are primarily not fooled by such proteins, but instead sometimes give very high scores to large globular domains in proteins with few predicted TM helices (Table III). Since such single-spanning TM proteins are easily dismissed as GPCR candidates, these methods might actually perform better if only proteins with 6–8 TM helices were to be used as negative examples, whereas the other methods should give slightly worse results in this more difficult but also more realistic test. To see this, we also tested the ability of all methods to distinguish true GPCRs from 1,295 proteins with 6–8 predicted TM helices in our negative dataset [Fig. 2(B)]. Clearly, both Pmembr and Palign perform significantly better at low FPRs in this test, now finding more than 60% of the GPCRs immediately. However, the advantage of SHRIMP over the other methods remains in this test, both at low FPRs and at the point where the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors are found, now at a FPR of around 2–3%.

it is clear that, whereas the distribution of Δscore for false hits is quite centered around zero, the true hits are generally being given higher score with the addition of the second alphabet [Fig. 3(A)]. There are also a number of false hits with high Δscore, which probably correspond to cases where the topology is similar by coincidence, rather than for evolutionary reasons.

It is possible that the increase in logodds score for the true hits results from an actual change in the alignment between the proteins, that is, that the regions of the sequences which are aligned against each other differ between the different models. However, there is no requirement that this should be the case, and it would also be perfectly reasonable that the true hits get higher score even though the alignments remain quite intact, simply because aligning transmembrane regions against each other is promoted by the model with two alphabets. To investigate the degree to which the transmembrane regions are aligned against each other in the different models, we measured the percentage of transmembrane

Figure 2
ROC curve of classification between GPCRs and (A) non-GPCRs, (B) non-GPCRs with 6–8 predicted TM regions only. Red curves: the different variants of SHRIMP; green curves: the profile–profile based method HHpred; blue curves: the profile–sequence based methods Palign and Pmembr. Introducing additional information (SHRIMP-topo and SHRIMP-hphob) improves the sequence based prediction (SHRIMP-seq) at all FPRs. The profile–profile based methods perform better especially at low FPRs. Note the logarithmic scale on the x-axis.

Table II
Fraction of Proteins Found Among the Different GPCRDB Classes by all Methods, at 1% False Positive Rate
The Metabotropic glutamate/pheromone and Fungal pheromone receptors (second column) are more difficult to find than the other known GPCR classes, and are only well detected by GPCRHMM and the topology/hydrophobicity-improved SHRIMP models at this FPR. Among the putative GPCR classes, the Insect odorant receptors and Plant Mlo receptors (fourth column) are not detected by any method, whereas the remaining five classes (third column) are detected in various degrees by all methods.

Table III
Analysis of the 1% Highest Scoring False Positives
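The fixed-FPR comparison used for Table II and for the analysis of false positives can be sketched as below: a score threshold is chosen so that at most the requested fraction of negative examples scores above it, and the true positive rate is the fraction of positives above the same threshold. This is a minimal sketch rather than the evaluation code used in the study; the function name and the strict "greater than" tie handling are assumptions.

```python
import numpy as np

def tpr_at_fpr(pos_scores, neg_scores, fpr=0.01):
    """Return (TPR, threshold) at a fixed false positive rate.

    At most floor(fpr * number_of_negatives) negatives are allowed to
    score strictly above the threshold; those are the false positives.
    """
    neg = np.sort(np.asarray(neg_scores, dtype=float))[::-1]  # descending
    pos = np.asarray(pos_scores, dtype=float)
    # Small epsilon guards against floating-point error in fpr * len(neg).
    k = int(np.floor(fpr * len(neg) + 1e-9))
    # Threshold is the score of the (k+1)-th best negative, if there is one.
    threshold = neg[k] if k < len(neg) else -np.inf
    tpr = float(np.mean(pos > threshold))
    return tpr, float(threshold)

# Usage: 100 negatives scoring 0..99, so a 1% FPR admits one false positive.
tpr, thr = tpr_at_fpr([100, 98.5, 50, 99.5], range(100), fpr=0.01)
```

Sweeping `fpr` over a logarithmic grid and plotting the resulting TPR values would give a curve of the kind shown in Figure 2.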
OPM
Figure 3
A: Distributions of Δscore for false and true hits. Whereas alignments between truly homologous proteins tend to have positive Δscore, alignments between nonhomologous proteins usually do not. Dots indicate frequency of alignments with Δscore in the interval of ±0.5 from the value given by the x-axis. B: Percentage of overlapping transmembrane residues in alignments between two GPCRs. The addition of the second alphabet results in an increase of the very good alignments (TM-overlap above 75%), and at the same time, the very poor alignments (TM-overlap below 10%) are decreased. Dots indicate frequency of alignments with a percentage of overlapping TM residues in the interval of ±2.5 from the value given by the x-axis.
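To make the TM-overlap quantity in panel B concrete, the sketch below computes one possible overlap percentage for a pairwise alignment. The exact definition used in the study is not present in this text, so the definition here is an assumption: among aligned residue pairs where at least one residue is transmembrane, the percentage where both are. The function name and the input representation are likewise hypothetical.

```python
def tm_overlap_percent(aligned_pairs, tm_a, tm_b):
    """Percentage of TM-involving aligned pairs in which both residues are TM.

    aligned_pairs: iterable of (i, j) index pairs, residue i of protein A
                   aligned to residue j of protein B (gaps are omitted).
    tm_a, tm_b:    per-residue booleans, True where a residue is
                   predicted or annotated as transmembrane.
    NOTE: this definition is an illustrative assumption, not the source's.
    """
    both = 0
    either = 0
    for i, j in aligned_pairs:
        a, b = tm_a[i], tm_b[j]
        if a or b:
            either += 1
            if a and b:
                both += 1
    return 100.0 * both / either if either else 0.0
```

Under this definition an alignment that shifts two helices fully out of register scores 0%, while one that superimposes them exactly scores 100%.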
structures. Four hundred fifty-four protein chains from 28 superfamilies, with topologies ranging from 1 to 14 TM helices, were aligned against each other, and the ability to detect homology between proteins within the same superfamily, but not from the same family, was tested using the same methods and the same parameter values as above (Fig. 5). In agreement with our previous results, adding information about topology or hydrophobicity (SHRIMP-topo/SHRIMP-hphob) improves the sequence based method (SHRIMP-seq), and the profile–profile based methods (SHRIMP and HHpred) perform better than the profile–sequence based methods (Palign and Pmembr) especially at low FPRs (Fig. 5). Searching for homology within families instead, the sequence information alone is sufficient and all methods tested perform equally well (data not shown).

Figure 5
ROC curve of superfamily classification within the OPM [43] database using different methods. Red curves: the different variants of SHRIMP; green curves: the profile–profile based method HHpred; blue curves: the profile–sequence based methods Palign and Pmembr. Note the logarithmic scale on the x-axis.

HOMEP

In addition to the GPCR and OPM datasets, we also wanted to test our method on a public benchmark data set of homologous membrane proteins, and for this purpose, the HOMEP data set [25] was used. HOMEP contains 94 alignments between 57 polypeptide chains from 36 nonredundant membrane protein structures, divided into 11 families, and we tested the alignment quality between homologous proteins using the same methods as mentioned earlier. For each homologous protein pair, alignments were built using the different methods, and were subsequently used as input to MODELLER [40], to produce full 3D-models of the target proteins (Fig. 6). Among the most distantly related protein pairs in HOMEP, with 10% sequence identity or less, adding secondary structure information seems to help, as seen in the difference both between Pmembr and Palign and between SHRIMP-topo and SHRIMP-seq. This confirms the finding from the GPCR dataset (Fig. 2), that secondary structure is especially useful at high FPRs. At sequence identities between 10% and 30%, the profile–profile based methods on average perform better than the profile–sequence based ones. Although the dataset is small, because of the general lack of membrane protein structures, the general
DA, O'Donovan C, Redaschi N, Yeh L-SL. The Universal Protein Resource (UniProt). Nucleic Acids Res 2005;33(Database issue):154–159.
31. Engelman DM, Steitz TA, Goldman A. Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 1986;15:321–353.
32. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992;225(2):487–494.
33. Boeckmann B, Bairoch A, Apweiler R, Blatter M-C, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31(1):365–370.
34. Viklund H, Elofsson A. Best α-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 2004;13(7):1908–1917.
35. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989;77(2):257–285.
36. Elofsson A. A study on protein sequence alignment quality. Proteins 2002;46(3):330–339.
37. Jones D. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292(2):195–202.
38. Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G. GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 2003;31(1):294–297.
39. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22(13):1658–1659.
40. Sali A, Blundell T. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234(3):779–815.
41. Zemla A. LGA: a method for finding 3d similarities in protein structures. Nucleic Acids Res 2003;31(13):3370–3374.
42. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57(4):702–710.
43. Lomize A, Pogozheva I, Lomize M, Mosberg H. Positioning of proteins in membranes: a computational approach. Protein Sci 2006;15(6):1318–1333.
44. Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci 1998;7(2):445–456.
45. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR. The Pfam protein families database. Nucleic Acids Res 2004;32:138–141.
46. Wistrand M, Kall L, Sonnhammer ELL. A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci 2006;15(3):509–521.
47. Provost F, Fawcett T. Robust classification for imprecise environments. Mach Learn 2001;42(3):203–231.
48. Kall L, Krogh A, Sonnhammer ELL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 2005;21(Suppl 1):251–257.