
proteins
STRUCTURE • FUNCTION • BIOINFORMATICS

Remote homology detection of integral membrane proteins using conserved sequence features
Andreas Bernsel, Hakan Viklund, and Arne Elofsson*
Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University,
SE-106 91 Stockholm, Sweden

ABSTRACT
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.
Proteins 2008; 71:1387–1399.
© 2007 Wiley-Liss, Inc.

Key words: membrane protein; homology detection; hidden Markov model; membrane topology.

The Supplementary Material referred to in this article can be found online at http://www.interscience.wiley.com/jpages/0887-3585/suppmat/
Grant sponsor: European Commission (BioSapiens and EMBRACE); Grant numbers: LSHG-CT-2003-503265 and LSHG-CT-2004-512092; Grant sponsor: Swedish Research Council.
*Correspondence to: Arne Elofsson, Department of Biochemistry and Biophysics, Stockholm University, SE-10691, Stockholm, Sweden. E-mail: arne@bioinfo.se
Received 3 May 2007; Revised 10 August 2007; Accepted 13 September 2007
Published online 12 December 2007 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/prot.21825

INTRODUCTION
Protein annotation lags behind in the rapidly increasing amount of sequence data resulting from the numerous ongoing genome sequencing projects.1 In general, the easiest way to obtain information about newly sequenced proteins is to transfer annotations from well-characterized homologous proteins.2,3 Accordingly, development of existing algorithms for homology detection is of great importance for better characterization of new protein sequences.
Searching for protein homologs by aligning a single sequence to a sequence database (e.g., BLAST4) relies on the assumption that amino acid substitution rates are independent of the position in the alignment and is unable to capture variable conservation levels along the sequence. Using sequence profiles instead allows for position specific scoring (e.g., PSI-BLAST5) and improves distant homology detection for globular proteins. Profile-profile based methods, which incorporate evolutionary information for both query and database sequences, are often able to detect even more remote relationships.6–9
Integral membrane proteins constitute about 25% of most proteomes10 and are present in all different kinds of membranes in all cells. In addition to the biological interest of membrane proteins, their medical importance as drug targets is also well-recognized.11 As a result of the hydrophobic environment within the membrane, transmembrane proteins are subject to particular structural constraints and, as a consequence, have different amino acid composition and residue exchangeabilities compared with globular proteins.12 Since algorithms for homology detection are often developed with globular proteins in mind, they may not be optimal to detect distant homology between membrane proteins.
The detection of globular proteins is a well-studied field and it has been established that the detection can be improved by including information from secondary structure predictions13–16 and by the use of evolutionary information for both query and target proteins.7,9,17–22


In contrast to globular proteins, only a few attempts have been made to utilize the sequence constraints specific to membrane proteins for homology detection. Ng et al.23 clustered predicted transmembrane regions to calculate a transmembrane specific substitution matrix that is used in place of the usual BLOSUM or PAM matrices when aligning membrane proteins. In a previous study, we used topology predictions to promote alignments where membrane regions are aligned against each other, in a profile-sequence based approach.24 Recently, a benchmark study showed that profile-profile based methods were more accurate than profile-sequence alignments for detecting remote membrane protein homologs, and additionally, that secondary structure prediction algorithms developed for globular proteins perform well also for membrane proteins.25
Here, we combine the membrane protein specific sequence constraints with the strengths of profile-profile based alignments in a hidden Markov model (HMM) framework, which is able to combine the sequence signal with either hydrophobicity or predictions of transmembrane topology. We expect that, for distant membrane protein homologs, the topology or hydrophobicity pattern should be more closely correlated with the protein structure, and therefore could be better conserved than sequence alone. It is thus possible that introducing this kind of information into the model, and using it in parallel with sequence, could improve alignments between remote membrane protein homologs compared with purely sequence based approaches.
The idea is to construct profile hidden Markov models26,27 from multiple sequence alignments and extend these with a second alphabet corresponding to predictions of either hydrophobicity or transmembrane topology, in a similar way as has previously been done for globular proteins.16,28,29 Sequence profiles, which also contain the same type of additional information, are then scored against the model, and paths through the model corresponding to alignments where transmembrane regions are matched against each other will have a relatively higher probability. Accordingly, alignments between proteins with similar patterns in hydrophobicity or similar topologies will obtain a higher score and discrimination between protein families with distinct topologies will be improved.

METHODS

Profile construction
Sequence profiles were constructed from PSI-BLAST5 (version 2.2.10) alignments against a large nonredundant sequence database30 (UniRef90). Different numbers of iterations and E-value cutoff limits for inclusion in the alignment were explored and evaluated according to predictive performance (E-value cutoff 10⁻⁵ and three iterations were used, see Results section). In addition to the amino acid probabilities, each position in the profile was also assigned one of two hydrophobicity or topology classes, representing the second alphabet as described later.

Hydrophobicity calculations
For each column in the alignment, the average hydrophobicity was calculated according to the included sequences, using the GES hydrophobicity scale.31 Columns were then assigned a hydrophobicity value based on the weighted average of hydrophobicities for the neighborhood of 21 columns (including itself), using a sliding window with symmetric trapezoidal weights (for details, see the first step of the Toppred algorithm).32 To incorporate the hydrophobicity information into the model, it first needs to be discretized, and after exploring several different possibilities, we achieved the best results simply using a two-class representation corresponding to hydrophobic and hydrophilic, respectively. To choose the hydrophobicity cutoff value between these two classes as well as possible, we calculated the hydrophobicities for all columns of the alignments against UniRef90 of a large collection of 31,550 predicted membrane proteins from Swiss-Prot,33 required to contain at least one transmembrane helix according to a TMHMM10 prediction. After averaging over neighboring columns, the hydrophobicity distribution forms two quite distinct peaks (Supplementary Figure S1), corresponding to transmembrane and nontransmembrane positions, respectively, and it can be seen that the best choice of cutoff value to separate between these peaks is zero, which is also the biologically most reasonable value. In this way, the columns of all alignments could be classified into one of the two classes.

Topology predictions
As an alternative for additional information to be used as a second alphabet in our models, we also predicted transmembrane topologies from the alignments. Topologies were predicted for all sequences in the alignment separately, using the TMHMM (version 2.0)10 topology prediction algorithm. Each position in the alignment was then classified as being transmembrane, if more than 50% of the sequences were predicted to be transmembrane at that particular position, and otherwise classified as a loop region. Any sequences with gaps at that position were not taken into account for the classification. We also tried using PRO-TMHMM (version 0.92)34 to predict topologies of whole alignments rather than of all sequences separately, which gave slightly inferior although comparable results.

HMM construction and scoring
From the alignments, profile HMMs were constructed using the HMMER package (version 2.3.2),27 with each


Figure 1
Dependencies between random variables in the multitrack HMM. The state at the current time step, S_t, is only dependent on the state in the previous time step, S_{t-1}. At each time step, one symbol from each of the two alphabets is emitted, A_t and B_t. Emission probabilities for the two alphabets are independent of one another and are completely determined by the current state, S_t.
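
The second-alphabet symbol B_t in Figure 1 is a two-class label per alignment column, derived either from smoothed hydrophobicity or from per-sequence topology predictions as described under Methods. The following is a minimal sketch of both derivations, not the authors' implementation. Several details are assumptions made for illustration: the flat core of 11 columns in the trapezoid (the text only gives the full window of 21), truncating and renormalizing the window at the alignment ends, the convention that positive values mean hydrophobic, and the labels "M", "L" and "-" together with all function names.

```python
from typing import List


def trapezoid_weights(full: int = 21, core: int = 11) -> List[float]:
    """Symmetric trapezoidal weights: flat over the central `core` positions,
    rising linearly from both ends of the `full` window (core size is assumed)."""
    ramp = (full - core) // 2
    weights = []
    for i in range(full):
        dist_to_edge = min(i, full - 1 - i)
        weights.append(min(1.0, (dist_to_edge + 1) / (ramp + 1)))
    return weights


def smooth(column_hydrophobicity: List[float], full: int = 21, core: int = 11) -> List[float]:
    """Weighted average of each column's neighborhood (the column itself included)."""
    weights = trapezoid_weights(full, core)
    half = full // 2
    smoothed = []
    for center in range(len(column_hydrophobicity)):
        total = 0.0
        norm = 0.0
        for offset, w in zip(range(-half, half + 1), weights):
            j = center + offset
            # Assumption: near the alignment ends the window is truncated
            # and the remaining weights are renormalized.
            if 0 <= j < len(column_hydrophobicity):
                total += w * column_hydrophobicity[j]
                norm += w
        smoothed.append(total / norm)
    return smoothed


def hydrophobicity_classes(column_hydrophobicity: List[float], cutoff: float = 0.0) -> List[str]:
    """Two-class discretization at the cutoff of zero reported in the text.
    Assumes a sign convention in which positive values are hydrophobic."""
    return ["M" if h > cutoff else "L" for h in smooth(column_hydrophobicity)]


def topology_classes(predictions: List[str]) -> List[str]:
    """Majority vote per column over per-sequence topology strings of equal length.
    'M' marks a predicted transmembrane residue, '-' a gap, anything else a loop."""
    classes = []
    for column in zip(*predictions):
        residues = [c for c in column if c != "-"]  # gapped sequences do not vote
        tm = sum(1 for c in residues if c == "M")
        # Transmembrane only if strictly more than 50% of voting sequences agree.
        classes.append("M" if residues and tm > 0.5 * len(residues) else "L")
    return classes
```

The input to `hydrophobicity_classes` is the per-column average hydrophobicity, which would be computed from the alignment with the GES scale; the scale values themselves are deliberately left out here.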

position in the query sequence represented by one match state in the model, and the whole model was configured for a single local/local alignment (Smith/Waterman style). A match state in the model could either be classified as transmembrane/hydrophobic or loop/hydrophilic, in accordance with the classifications of the corresponding positions in the alignment as described earlier. Emission probabilities for the second alphabet, e_M and e_L, corresponding to transmembrane/hydrophobic and loop/hydrophilic respectively, were then set to:

$$e_M = 0.5, \qquad e_L = 0.5 \tag{1}$$

for insert states and match states classified as loop/hydrophilic, and

$$e_M = 0.5 + n, \qquad e_L = 0.5 - n \tag{2}$$

for match states classified as transmembrane/hydrophobic, where 0 ≤ n ≤ 0.5. The parameter n can be seen as a weight that reflects how much confidence should be put to the second alphabet, as compared with the sequence alphabet. A value of n close to 0.5 indicates that we believe hydrophobicity or topology to be highly reliable when judging homologous relationships. We also tried using the actual frequencies of the classes from the alignment as parameters as well as different numbers of classes and continuous distributions, but none of these trials improved the results.
Each emitting state in the model thus contained both an amino acid distribution and a hydrophobicity/topology distribution as given by Eqs. (1) and (2). The dependencies between the random variables in the model are illustrated in Figure 1. Two symbol sequences, corresponding to amino acid sequence and hydrophobicity/topology sequence, are generated from the emitting states of the model. The probability for being in state S_t at time step t is only dependent on the state at the previous time step, S_{t-1}. Emission probabilities for both alphabets are completely determined by the state in the current time step and are independent of one another.
The probability for emitting a particular amino acid distribution vector was calculated as the dot product between the profile vector and the HMM state vector (technically, this is not a probability, however it corresponds to the probability of emitting an amino acid in the single sequence case), as has been previously described.7,9 The total probability for emitting a profile column was calculated as the product between the probabilities for the two alphabets. The Forward algorithm35 was employed to score the profiles against the model and all calculations were performed using the modhmm package (http://modhmm.org).34 In comparison with two, out of several, previous multitrack HMM methods using secondary structure as second alphabet, SAM-T0229 and ssHMM,16 the main differences from our method are as follows: (1) whereas SAM-T02 and ssHMM only score the models against single sequences, we score against sequence profiles and so include evolutionary information from both proteins being aligned; and (2) in our case, the distribution for the second alphabet is fixed according to equations 1 and 2, whereas, SAM-T02 and ssHMM use the actual observed frequencies from the alignment. In addition, the SAM-T02 parameters are optimized in an iterative sequence database search, whereas our method and ssHMM rely on observed frequencies from columns of multiple sequence alignments.
Alignments were ranked according to a score that reflects the overall similarity between the aligned proteins. There are several possible options for what score to use, and here, we examined two different possibilities. The logodds score, defined as

$$\mathrm{score}_{\mathrm{logodds}} = \log\left(\frac{P(\mathrm{sequence} \mid \mathrm{model})}{P(\mathrm{sequence} \mid \mathrm{null\ model})}\right) \tag{3}$$


takes the (natural) logarithm of the fraction between the probability of the sequence (or profile) given the model, and the probability of the sequence (or profile) given a null model. Here, the null model is composed of a single emitting state with background emission probabilities according to the amino acid composition of Swiss-Prot. Alternatively, a reversed sequence score was used, defined as

$$\mathrm{score}_{\mathrm{reversi}} = \log\left(\frac{P(\mathrm{sequence} \mid \mathrm{model})}{P(\mathrm{reversed\ sequence} \mid \mathrm{model})}\right) \tag{4}$$

Here, the probability of the reversed sequence given the model is instead used in the denominator, which has the advantage that no null model is needed. On the other hand, when introducing topology as a second alphabet in the model, it might have a negative effect if the topology looks similar in reverse, which could be the case for the conserved 7TM topology of GPCRs. Hence, we used score_logodds for ranking the GPCRDB alignments, and score_reversi for the more general case of ranking the Pfam alignments (see later).

Table I
GPCRDB Classes Used in the Study

Class                                       No. of seq. in GPCRDB   No. of seq. in test set   GPCR
Rhodopsin like                              4949                    159                       Yes
Secretin like                               231                     52                        Yes
Metabotropic glutamate/pheromone            160                     46                        Yes
Fungal pheromone                            58                      32                        Yes
cAMP receptors                              7                       3                         Yes
Ocular albinism proteins                    8                       1                         Putative
Frizzled/smoothened family                  113                     21                        Yes
Insect odorant receptors                    236                     183                       Putative
Plant Mlo receptors                         52                      16                        Putative
Nematode chemoreceptors                     755                     253                       Putative
Vomeronasal receptors                       286                     16                        Putative
Taste receptors T2R                         237                     18                        Putative
Putative/unclassified                       1061                    526                       Putative
Class Z archaeal/bacterial/fungal opsins    110                     28                        No

Sequences were collected such that the sequence identity between any two proteins from the same class was less than 50%. In addition, sequences were randomly removed from the Rhodopsin like class in order not to bias the true GPCR test set towards this class only. All classes are annotated as either GPCRs, Putative GPCRs, or non-GPCRs in GPCRDB (rightmost column).
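
The per-state emission terms and the two ranking scores from this section can be written down in a few lines. The sketch below is illustrative only and does not reproduce the modhmm code: the function names, the "M"/"L" labels, and the choice to pass log-probabilities (as a Forward algorithm run would typically return) are assumptions. The default n = 0.1 is the value the Results section reports as working well.

```python
import math
from typing import Sequence


def second_alphabet_emission(state_class: str, symbol: str, n: float = 0.1) -> float:
    """Eqs. (1) and (2): probability of emitting class symbol 'M' or 'L'.

    state_class 'M' is a match state classified as transmembrane/hydrophobic;
    any other value (loop/hydrophilic match states, insert states) uses Eq. (1).
    """
    if not 0.0 <= n <= 0.5:
        raise ValueError("n must lie in [0, 0.5]")
    if state_class == "M":
        return 0.5 + n if symbol == "M" else 0.5 - n
    return 0.5


def column_emission(profile_column: Sequence[float], state_distribution: Sequence[float],
                    state_class: str, symbol: str, n: float = 0.1) -> float:
    """Total emission term for one profile column: the dot product of the two
    amino acid vectors multiplied by the second-alphabet probability."""
    aa_term = sum(p * q for p, q in zip(profile_column, state_distribution))
    return aa_term * second_alphabet_emission(state_class, symbol, n)


def logodds_score(log_p_model: float, log_p_null: float) -> float:
    """Eq. (3) in log space: log P(seq | model) - log P(seq | null model)."""
    return log_p_model - log_p_null


def reversed_score(log_p_model: float, log_p_reversed: float) -> float:
    """Eq. (4) in log space: log P(seq | model) - log P(reversed seq | model)."""
    return log_p_model - log_p_reversed
```

With n = 0 the second alphabet contributes a constant factor of 0.5 at every state and therefore cannot favor one alignment path over another, which corresponds to a purely sequence based model.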

Methods used in benchmark
In three benchmark tests (see later), we compared SHRIMP with several available methods for homology detection, namely, searches against PSI-BLAST5 profiles using standard sequence-profile alignments. Here, one sequence is searched against a profile database using the Palign program,36 and this method is referred to as Palign throughout the text; Pmembr,24 which is an adaptation of profile searches for detection of membrane proteins using predicted topologies; and HHpred,28 an HMM-HMM comparison algorithm that has proven successful to detect distant homology between globular proteins. A variant of HHpred, where a PSIPRED37 secondary structure prediction is included in the HMMs, was also tested for comparison. For HHpred, the same E-value and number of iterations as optimized for SHRIMP were used.

Datasets
Sequences for 7674 GPCRs from six classes were downloaded from GPCRDB,38 which is a manually curated and well maintained database of GPCRs. Out of those, 313 proteins were randomly selected such that they were fairly evenly distributed among the classes, and such that the sequence identity between any two proteins was less than 50% (Table I). In addition to these true GPCRs, 1013 protein sequences from classes annotated as putative GPCRs in GPCRDB as well as 28 sequences from a non-GPCR class (bacteriorhodopsin) in GPCRDB were collected, with the same sequence identity threshold.
As a negative set, we extracted 31,237 proteins from Swiss-Prot33 with at least one transmembrane helix according to a TMHMM10 prediction, which were not present in GPCRDB, and whose Swiss-Prot entries were not allowed to contain the phrase G-protein coupled receptor. After homology reduction at 40% sequence identity threshold, using cd-hit,39 11,864 proteins remained.
In addition to the GPCR dataset, we also tested our method on a publicly available dataset of homologous membrane protein structures, called HOMEP, from a recent study.25 SHRIMP was employed to create global alignments between homologous membrane proteins, and structural models were then built using MODELLER (version 6v2).40 For evaluation of alignments, the GDT_TS41 score between the model and the native structure was calculated using the TM-score program.42 For a fair comparison between all methods used in this benchmark (see later), global alignments were produced such that the GDT_TS score is not biased by alignment length.
To test homology detection for membrane proteins of other topologies than GPCRs, we also applied our method to the Orientations of Proteins in Membranes (OPM) database of known membrane protein structures.43 Since superfamilies in OPM are given for full protein molecules, we first performed structural alignment, using Structal,44 between individual subunits to determine the true homology relationships. Thus, to consider two polypeptide chains homologous we require: (i) that they belong to the same superfamily, (ii) that they


do not belong to the same family (to filter out obvious cases), and (iii) that they have a Structal alignment P-value < 10⁻³.
Aiming to reveal new putative homologous relationships, we also applied our method to the Pfam domain database45 version 19.0. Probable transmembrane protein domains, containing at least one TM-helix, were extracted using PRO-TMHMM predictions,34 and profiles were built from the Pfam seed alignments. In addition, profile HMMs were built from the Pfam alignments using the HMMer27 package and the HMMs were then scored against the profiles using the modhmm program34 (available at http://modhmm.org).

Transmembrane overlap
Overlap of transmembrane regions between two aligned proteins was calculated using GPCRHMM46 predictions as the reference topology. Overlapping transmembrane regions were defined as stretches of aligned residues where transmembrane regions were predicted for both proteins in the alignment. For a protein, the percentage of all transmembrane residues that were present in overlapping transmembrane regions was calculated, and the mean value was taken over the two aligned proteins.

Classifier evaluation
Receiver operating characteristic (ROC) analysis47 was used to evaluate the performance of the classifier. In a ROC curve, the sensitivity is plotted against (1 − specificity) as the discrimination threshold of the classifier is varied. More specifically, in our case the True Positives (TP) and False Negatives (FN) together comprise the pairs of truly homologous proteins, whereas, the unrelated protein pairs are either False Positives (FP) or True Negatives (TN) depending on the prediction by the classifier.
Varying the alignment score threshold, the true positive rate (TPR) is then plotted against the false positive rate (FPR), where

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{5}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{6}$$

Generally, it is desirable to keep the FPR as low as possible to minimize the number of mispredictions. When searching for very distant homologs, however, a higher FPR can sometimes be accepted at the benefit of finding more true positives.

RESULTS AND DISCUSSION

GPCRDB
In evaluating methods for homology identification, it is obviously crucial that the true evolutionary relationships between proteins in the test set are well known. In particular, negative relationships, that is, knowledge about two proteins not being related, can be difficult to obtain and therefore high scoring false positives might actually be undiscovered true relationships. To minimize this potential problem, we chose to initially test our method on one of the most well studied and characterized membrane protein superfamilies, namely the G-protein coupled receptors (GPCRs). Our dataset used here includes and extends the one used by Hedman et al. in a previous study.24

Dataset
A representative set of 1353 nonredundant proteins from 14 classes was extracted from GPCRDB (Table I), and 11,864 nonredundant non-GPCRs were taken from Swiss-Prot as a negative set (see Methods section). All sequences from GPCRDB were then aligned against the complete set of proteins, and the logodds score from the alignment was taken as a measure of the degree of homology between the two proteins. In the ROC evaluation of homology detection (see later), only the six classes of true GPCRs (Table I) were used as positive examples, whereas, the putative GPCRs and the bacteriorhodopsin class were left out. To measure primarily the ability to detect distant homologs, only hits to GPCRs from different classes were considered positive, whereas, hits within a GPCR class were ignored and hits to non-GPCRs were considered negative. Thus, the highest logodds score to any GPCR not belonging to the same class as the query protein was used to rank all proteins, and the TPR and FPR [Eqs. (5) and (6)] were calculated for different cutoffs of this score.

Parameter tuning
Different E-values for inclusion in the sequence profile, different numbers of PSI-BLAST iterations, and different values of the n parameter in Eq. 2 were tested and evaluated by manual inspection of ROC curves (Supplementary Fig. S2). Because of computational limitations, this parameter tuning was performed using 39 of the GPCRs and 3151 of the non-GPCRs only, randomly selected from the complete test set. Of the limited range of values we tried, the best results were achieved using three iterations and 10⁻⁵ as the E-value cutoff for inclusion in the sequence profile, although the discrimination seemed to be quite insensitive to these parameters. Performance also drops slightly when n is close to 0 or 0.5 (Supplementary Fig. S3), whereas, n = 0.1 gives good


results. These values of the parameters were used for the remaining predictions, and also in the OPM, HOMEP and Pfam test sets (see later). No other parameters were optimized, that is, default values were used for all other parameters in PSI-BLAST and TMHMM.

ROC evaluation of homology detection
Extending the analysis to the complete dataset, we tested homology detection between GPCR classes both with a purely sequence based model containing just the amino acid sequence alphabet (SHRIMP-seq), and using either predicted hydrophobicity (SHRIMP-hphob) or topology (SHRIMP-topo) as additional alphabet [Fig. 2(A)]. As a reference, we also made predictions on the same dataset using three available methods, namely Palign, Pmembr and HHpred (see Methods). Interestingly, adding information about either hydrophobicity or topology (SHRIMP-hphob or SHRIMP-topo) seems to improve the purely sequence based predictions (SHRIMP-seq) over the whole range of FPRs. This indicates that for these remote homologs from different GPCR classes, where the sequences have apparently diverged quite far, the hydrophobicity pattern or topology is better conserved and can improve the predictions. Compared with the profile-sequence based methods, Palign and Pmembr, the profile-profile based methods, SHRIMP and HHpred, perform better especially at low FPRs.
Investigating further the ability of all methods to detect proteins from the individual GPCR classes, as well as some additional classes described in GPCRDB as putative GPCRs and one non-GPCR family (bacteriorhodopsin; Class Z), gives the results as in Table II. Here, our results are also compared to those from GPCRHMM,46 an HMM-based method specifically designed to recognize GPCRs. GPCRHMM gives a Yes/No output to a given query sequence, and obtains a false positive rate of 1.0% on our negative data set; thus, for a fair comparison, we used that same FPR for all methods in Table II.
Among the known GPCR classes (leftmost columns in Table II), the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors seem more difficult to detect than the others, and are only well-detected by the topology/hydrophobicity-improved SHRIMP models and GPCRHMM at this false positive rate. The finding of these two classes by the model also explains the leap from 0.7 TPR to 1.0 TPR of the SHRIMP-topo curve at about 0.01 FPR in Figure 2(A) (see also Supplementary Fig. S4).
Among the seven classes annotated as putative GPCRs in GPCRDB (middle columns in Table II), two classes, the Insect odorant receptors and Plant Mlo receptors (fourth column in Table II), are not found at this FPR by any of the methods. In fact, using either SHRIMP-topo or SHRIMP-hphob, these two putative GPCR classes are found at higher FPR, i.e. with lower confidence, than the non-GPCR class of bacteriorhodopsins (rightmost column in Table II; see Supplementary Fig. S4). On the other hand, the remaining five putative classes (third column in Table II) are found at higher fractions than two known GPCR classes (the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors, second column in Table II).
In conclusion, the main difference between the methods tried here lies primarily in their ability to identify the more distant members of the GPCR superfamily, the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors. SHRIMP-topo is able to find virtually all of the known GPCRs, and in addition classifies five of the seven putative GPCR classes as being true GPCRs, whereas, the remaining two are classified as non-GPCRs. SHRIMP-topo and GPCRHMM give comparable results, however, GPCRHMM uses only one model for the whole superfamily of GPCRs whereas SHRIMP uses several, and the potential to discover completely new classes of GPCRs could thus be higher with GPCRHMM. On the other hand, SHRIMP-topo is not restricted to model just GPCRs but, as shown below, can be used to explore evolutionary relationships between any two membrane proteins.
In contrast to the results for SHRIMP, we were not able to improve the HHpred results by including information about predicted secondary structure using PSIPRED. It is likely that, whereas, the locations of the TM helices are sufficiently conserved to provide useful information for improving alignments, secondary structure of globular domains is not informative in this respect, for aligning membrane proteins.
We also evaluated the ability of the model to find close homologs within, rather than between, GPCR superfamilies (Supplementary Fig. S5). For these close homologs, sequence alone is sufficient to reach high TPR even at very low FPRs, and the improvement with the addition of the second alphabet is not as clear as before. However, the evolutionary information in the profile-profile based methods seems to help at low FPRs.

Analysis of false positives
To investigate the types of errors made by the different methods, we looked further into the highest scoring 1% of the negative examples, that is, the false positives at the FPR used in Table II, with respect to their average length and number of predicted TM helices (Table III). In general, it might be expected that proteins with similar lengths and topologies as the positive examples are more difficult to filter out, and that such proteins were to be overrepresented in the highest scoring false positives. This is also the case for all profile-profile based methods and in particular for the GPCR specialized predictor


Figure 2
ROC curve of classification between GPCRs and (A) non-GPCRs, (B) non-GPCRs with 6–8 predicted TM regions only. Red curves: the different variants of SHRIMP; green curves: the profile-profile based method HHpred; blue curves: the profile-sequence based methods Palign and Pmembr. Introducing additional information (SHRIMP-topo and SHRIMP-hphob) improves the sequence based prediction (SHRIMP-seq) at all FPRs. The profile-profile based methods perform better especially at low FPRs. Note the logarithmic scale on the x-axis.
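
Curves like those in Figure 2 follow from Eqs. (5) and (6) by sweeping a cutoff over the ranking scores. The sketch below shows one way to compute the points and is not taken from the SHRIMP code. The function names are hypothetical, a hit is assumed to count as predicted positive when its score is at least the cutoff, and `best_cross_class_score` reflects one reading of the ranking rule in the Dataset section (each query is ranked by its highest logodds score to a GPCR from a different class).

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def best_cross_class_score(hits: Iterable[Tuple[str, float]],
                           query_class: Optional[str]) -> float:
    """Highest score among hits to GPCRs whose class differs from the query's.

    hits holds (target_class, logodds_score) pairs; pass query_class=None for
    a non-GPCR query so that every GPCR hit is considered.
    """
    return max(score for target_class, score in hits if target_class != query_class)


def roc_points(positive_scores: Sequence[float],
               negative_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs, one per distinct cutoff, highest cutoff first."""
    cutoffs = sorted(set(positive_scores) | set(negative_scores), reverse=True)
    points = []
    for cutoff in cutoffs:
        tp = sum(s >= cutoff for s in positive_scores)
        fp = sum(s >= cutoff for s in negative_scores)
        fn = len(positive_scores) - tp
        tn = len(negative_scores) - fp
        points.append((fp / (fp + tn), tp / (tp + fn)))  # Eqs. (6) and (5)
    return points
```

This version rescans both lists for every cutoff, which keeps it short; for test sets of the size used here a single pass over the sorted scores would be the more efficient choice.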


Table II
Fraction of Proteins Found Among the Different GPCRDB Classes by All Methods, at 1% False Positive Rate

GPCRDB classes in each column:
  Column 1 (known GPCRs): Rhodopsin like; Secretin like; cAMP receptors; Frizzled/Smoothened family
  Column 2 (known GPCRs): Metabotropic glutamate/pheromone; Fungal pheromone
  Column 3 (putative GPCRs): Ocular albinism proteins; Nematode chemoreceptors; Vomeronasal receptors; Taste receptors T2R; Putative/unclassified
  Column 4 (putative GPCRs): Insect odorant receptors; Plant Mlo receptors
  Column 5 (non-GPCRs): Archaeal/bacterial/fungal opsins

Method            Column 1   Column 2   Column 3   Column 4   Column 5
SHRIMP-topo       99.6%      94.5%      96.8%      0.0%       0.0%
SHRIMP-hphob      99.2%      88.5%      95.8%      0.0%       0.0%
SHRIMP-seq        99.2%      18.0%      94.8%      0.0%       0.0%
GPCRHMM           97.5%      93.6%      87.7%      0.0%       0.0%
HHpred            93.2%      34.6%      94.8%      0.0%       0.0%
HHpred+PSIPRED    93.2%      34.6%      94.8%      0.0%       0.0%
Pmembr            87.2%      10.3%      56.3%      0.0%       0.0%
Palign            69.8%      9.0%       54.7%      0.0%       0.0%

The Metabotropic glutamate/pheromone and Fungal pheromone receptors (second column) are more difficult to find than the other known GPCR classes, and are only well detected by GPCRHMM and the topology/hydrophobicity-improved SHRIMP models at this FPR. Among the putative GPCR classes, the Insect odorant receptors and Plant Mlo receptors (fourth column) are not detected by any method, whereas the remaining five classes (third column) are detected in various degrees by all methods.
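
Entries of the kind shown in Table II can be obtained by fixing a score threshold from the negative set and counting, per class, how many proteins score above it. The following is a minimal sketch under stated assumptions rather than the procedure actually used: the function names are hypothetical, a protein counts as found when its score is strictly greater than the threshold, and the threshold is placed at the highest-ranked negative that is no longer allowed through at the requested FPR.

```python
import math
from typing import Dict, Sequence


def threshold_at_fpr(negative_scores: Sequence[float], fpr: float = 0.01) -> float:
    """Score threshold such that at most a fraction `fpr` of the negatives
    score strictly above it."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = math.floor(fpr * len(ranked))  # negatives permitted above the threshold
    return ranked[min(allowed, len(ranked) - 1)]


def fraction_found(scores_by_class: Dict[str, Sequence[float]],
                   threshold: float) -> Dict[str, float]:
    """Per-class fraction of proteins whose score exceeds the threshold."""
    return {
        cls: sum(s > threshold for s in scores) / len(scores)
        for cls, scores in scores_by_class.items()
    }
```

For example, with the 11,864 negatives used here and `fpr=0.01`, at most 118 negatives would be allowed to score above the returned threshold.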

GPCRHMM, for which the fraction of proteins with 6–8 predicted TM helices is significantly higher than the corresponding fraction for the negative dataset as a whole (10.9%). Somewhat unexpectedly, both profile-sequence based methods (Palign and Pmembr) are primarily not fooled by such proteins, but instead sometimes give very high scores to large globular domains in proteins with few predicted TM helices (Table III). Since such single-spanning TM proteins are easily dismissed as GPCR candidates, these methods might actually perform better if only proteins with 6–8 TM helices were to be used as negative examples, whereas the other methods should give slightly worse results in this more difficult but also more realistic test. To see this, we also tested the ability of all methods to distinguish true GPCRs from 1,295 proteins with 6–8 predicted TM helices in our negative dataset [Fig. 2(B)]. Clearly, both Pmembr and Palign perform significantly better at low FPRs in this test, now finding more than 60% of the GPCRs immediately. However, the advantage of SHRIMP over the other methods remains in this test, both at low FPRs and at the point where the Metabotropic glutamate/pheromone receptors and the Fungal pheromone receptors are found, now at a FPR of around 2–3%.

Table III
Analysis of the 1% Highest Scoring False Positives

Method            Fraction 6–8 TMs    Average length    Average number of TM helices
SHRIMP-topo       25.2%                868              3.6
SHRIMP-hphob      28.6%                871              4.3
SHRIMP-seq        21.0%                873              3.0
GPCRHMM           49.1%                409              8.3
HHpred            26.0%                552              3.4
HHpred+PSIPRED    25.4%                564              3.2
Pmembr             2.5%               1240              1.8
Palign             0.0%               1261              1.5

For all methods except Pmembr and Palign, proteins with 6–8 predicted TM helices are enriched in this set, as compared with the full negative dataset, of which such proteins comprise 10.9%.

Δscore and TM-overlap

To analyze further why the separation is improved with the addition of the second alphabet, we made more detailed investigations into the difference in individual scores and alignments. In principle, the improved discrimination could either be the result of true hits being given higher scores, or conversely, due to false hits being given lower scores. By looking at the distribution of the log-odds score difference for the alignments, defined as

$$\Delta\text{score} = \text{score}_{\text{2 alphabets}} - \text{score}_{\text{1 alphabet}} \tag{7}$$

it is clear that, whereas the distribution of Δscore for false hits is quite centered around zero, the true hits are generally given a higher score with the addition of the second alphabet [Fig. 3(A)]. There are also a number of false hits with high Δscore, which probably correspond to cases where the topology is similar by coincidence, rather than for evolutionary reasons.

It is possible that the increase in log-odds score for the true hits results from an actual change in the alignment between the proteins, that is, that the regions of the sequences which are aligned against each other differ between the different models. However, there is no requirement that this should be the case, and it would also be perfectly reasonable that the true hits get a higher score even though the alignments remain quite intact, simply because aligning transmembrane regions against each other is promoted by the model with two alphabets. To investigate the degree to which the transmembrane regions are aligned against each other in the different models, we measured the percentage of transmembrane
residues that overlap between the two aligned proteins for all alignments between true hits [Fig. 3(B)]. Here, it was necessary to use the Viterbi algorithm rather than the Forward algorithm (see Methods), to find the single best path through the model; the difference in score between these algorithms should be small in most cases. Comparing the two distributions it seems that some of the very poor alignments from the purely sequence based model, with TM-overlap below 10%, are being eliminated in the model with two alphabets. In addition, the number of alignments with TM-overlap of 75% or more is increased.

From these analyses, and by manually viewing the alignments, we conclude that the improvement in discrimination is mainly a result of alignments between true hits being given higher score, and that in some cases this is a consequence of the alignment being altered, but there are also cases where the score is increased even though the alignment remains intact.

For instance, the alignment between the two Swiss-Prot entries BAR1_SCHCO and Q752Q1_ASHGO, both GPCRs, is significantly altered and improved with SHRIMP-topo, compared with SHRIMP-seq (Fig. 4). Using only sequence information (one alphabet), the alignment is shifted by approximately two TM-helices such that, for example, helix 2 in the profile is aligned against helix 4 in the HMM. The resulting alignment has a negative log-odds score. With the introduction of topology information however (two alphabets), the correct alignments are made between corresponding TM-helices in the profile and HMM. This results in a positive log-odds score, indicating a probable although weak homology between the proteins. At the same time, the transmembrane overlap increases from 44% (one alphabet) to 70% (two alphabets).

Figure 3
A: Distributions of Δscore for false and true hits. Whereas alignments between truly homologous proteins tend to have positive Δscore, alignments between nonhomologous proteins usually do not. Dots indicate frequency of alignments with Δscore in the interval of ±0.5 from the value given by the x-axis. B: Percentage of overlapping transmembrane residues in alignments between two GPCRs. The addition of the second alphabet results in an increase of the very good alignments (TM-overlap above 75%), and at the same time, the very poor alignments (TM-overlap below 10%) are decreased. Dots indicate frequency of alignments with a percentage of overlapping TM residues in the interval of ±2.5 from the value given by the x-axis.

Figure 4
Alignment between BAR1_SCHCO and Q752Q1_ASHGO (both GPCRs), using the models with one and two alphabets, respectively. Adding topology information shifts the alignment such that the correct TM-regions are aligned against each other. At the same time, the log-odds score becomes positive and the TM-overlap is increased.

OPM

To see if the results from GPCRDB would generalize to other membrane protein families with different transmembrane topologies, we also tested homology detection within the OPM database (Ref. 43) of known membrane protein
structures. Four hundred fifty-four protein chains from 28 superfamilies, with topologies ranging from 1 to 14 TM helices, were aligned against each other, and the ability to detect homology between proteins within the same superfamily, but not from the same family, was tested using the same methods and the same parameter values as above (Fig. 5). In agreement with our previous results, adding information about topology or hydrophobicity (SHRIMP-topo/SHRIMP-hphob) improves the sequence based method (SHRIMP-seq), and the profile-profile based methods (SHRIMP and HHpred) perform better than the profile-sequence based methods (Palign and Pmembr) especially at low FPRs (Fig. 5). Searching for homology within families instead, the sequence information alone is sufficient and all methods tested perform equally well (data not shown).

Figure 5
ROC curve of superfamily classification within the OPM database (Ref. 43) using different methods. Red curves: the different variants of SHRIMP; green curves: the profile-profile based method HHpred; blue curves: the profile-sequence based methods Palign and Pmembr. Note the logarithmic scale on the x-axis.

Homep

In addition to the GPCR and OPM datasets, we also wanted to test our method on a public benchmark data set of homologous membrane proteins, and for this purpose, the HOMEP data set (Ref. 25) was used. HOMEP contains 94 alignments between 57 polypeptide chains from 36 nonredundant membrane protein structures, divided into 11 families, and we tested the alignment quality between homologous proteins using the same methods as mentioned earlier. For each homologous protein pair, alignments were built using the different methods, and were subsequently used as input to MODELLER (Ref. 40), to produce full 3D-models of the target proteins (Fig. 6). Among the most distantly related protein pairs in HOMEP, with 10% sequence identity or less, adding secondary structure information seems to help, as seen in the difference both between Pmembr and Palign and between SHRIMP-topo and SHRIMP-seq. This confirms the finding from the GPCR dataset (Fig. 2), that secondary structure is especially useful at high FPRs. At sequence identities between 10% and 30%, the profile-profile based methods on average perform better than the profile-sequence based ones. Although the dataset is small, because of the general lack of membrane protein structures, the general
findings here are in good agreement with the results from the homology recognition test.

Figure 6
GDT_TS for homology models constructed using the different alignment methods. At 10% sequence identity or less, secondary structure information helps for both the profile-profile and profile-sequence based methods. Between 10–30% sequence identity, the profile-profile based methods generally perform better than the profile-sequence based ones. Numbers inside parenthesis correspond to the number of alignments within each window.

Pfam

In an attempt to reveal new potential homology, SHRIMP was employed to search for homologs within the Pfam domain database (Ref. 45). Out of 8183 Pfam-A domains, in total 1365 were predicted to contain at least one TM-helix, and all of those were aligned against each other using SHRIMP.

A set of 158 of the transmembrane Pfam domains were already assigned to one of 38 different Pfam clans, which represent evolutionary relationships within the database. Among the top scoring alignments between Pfam domains, the vast majority was not surprisingly alignments between domain families belonging to the same clan. However, from a novelty perspective, the potentially interesting ones are the top scoring alignments that remain when the same clan-alignments have been filtered out, since this is where new putative homology might be detected. The 20 highest scoring alignments between Pfam domains, that are not members of the same clan, are listed in Table IV. In total, there were 126 alignments scoring at least as high as the last alignment in Table IV, meaning that 106 (80%) of these were between domains belonging to the same clan.

The highest scoring alignment in Table IV is between MatE (Multi Antimicrobial Extrusion family) and MVIN (a virulence protein found in Salmonella and other bacteria). In the Pfam version used in this study (19.0), none of these domains were assigned to a clan. During the preparation of this manuscript, however, a new version of Pfam was released (20.0), in which both MatE and MVIN belong to a new clan, called MVIN, MatE-like superfamily. This new clan also includes the Polysacc domain, and hence explains rows 1, 4, 6, 9, and 13 in Table IV.

In the second highest scoring alignment, the 7tm_1 domain, which belongs to the Rhodopsin class of GPCRs, is scored against TAS2R, a mammalian family of taste receptors. In this case, both domains belong to the GPCR superfamily and are clearly homologous, which is also true for alignment number 12, 15, and 17 in Table IV.

The third highest scoring alignment in Table IV is between NrfD (Polysulphide reductase) and DmsC (DMSO reductase anchor subunit), which are both involved in electron transport and might be evolutionarily related. This is also confirmed by the Profile Comparer (PRC) method (Madera, 2005, http://supfam.org/PRC/), as indicated in the Pfam database.

Pfam domains DUF405 and DUF1624 (number 19 in Table IV), suggested to be homologous according to our method, are both domains of unknown function. The complete list of scores for all 1,853,682 Pfam alignments is available on our website (http://www.sbc.su.se/shrimp/). We believe that several new putative homology relationships might be found among the top scoring alignments in this list.

Table IV
Top Scoring Pfam Domain Alignments

No.   HMM                Profile            Score    Supporting evidence
1     MatE               MVIN               18.46    NC
2     7tm_1              TAS2R              18.30    7TMR
3     NrfD               DmsC               18.02    PRC
4     Polysacc_synt      MVIN               17.37    NC
5     VIRB2              TrbC               16.63    PRC
6     MVIN               Polysacc_synt      16.32    NC
7     TrbC               VIRB2              13.36    PRC
8     CTP_transf_1       DUF46              13.08    PRC
9     MatE               Polysacc_synt      12.73    NC
10    BPD_transp_2       PNTB               12.32
11    ATP-synt_B         Mt_ATP-synt_B      11.91    PRC
12    TAS2R              7tm_1              11.85    7TMR
13    MVIN               MatE               11.85    NC
14    MFS_1              Nucleoside_tran    11.82    NIC
15    Serpentine_recp    Srg                11.71    7TMR
16    ABC_membrane_2     ABC_membrane       11.49    NC
17    7tm_1              7tm_2              11.46    7TMR
18    Lipoprotein_14     Lipoprotein_5      11.39
19    DUF405             DUF1624            11.33
20    FAE_3-kCoA_syn1    DUF1632            11.25

Among the 20 highest scoring domain pairs using SHRIMP, other sources supporting the evolutionary relationship can be traced for 16 pairs (rightmost column). NC = New Clan indicates that a new clan has been introduced in Pfam where both domains are now members; 7TMR = 7TM-Receptor: Both domains are 7TM receptors; PRC = Profile Comparer: The evolutionary relationship is supported by the PRC method (Madera, 2005, http://supfam.org/PRC/); NIC = New In Clan: One of the domains has been included in the same (existing) clan as the other domain in a recent version of Pfam.
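
The filtering step used for Table IV, discarding domain pairs that already share a clan and keeping the highest scoring remainder, can be sketched as follows. This is an illustration only: the data layout (score tuples plus a domain-to-clan dictionary), the function name, and the placeholder domains `DomA`/`DomB` with clan label `CL_example` are assumptions, not the format of the published score list.

```python
def top_novel_pairs(alignments, clan_of, n=20):
    """Rank alignments by score, dropping pairs already in the same clan.

    alignments: iterable of (hmm_domain, profile_domain, score) tuples.
    clan_of: dict mapping a domain name to its clan; unassigned domains are absent.
    """
    novel = []
    for hmm, profile, score in alignments:
        clan_a = clan_of.get(hmm)
        clan_b = clan_of.get(profile)
        # A pair is already known only when both domains sit in one shared clan.
        if clan_a is not None and clan_a == clan_b:
            continue
        novel.append((hmm, profile, score))
    novel.sort(key=lambda item: item[2], reverse=True)
    return novel[:n]


if __name__ == "__main__":
    # Two pairs from Table IV plus two hypothetical pairs for illustration.
    alignments = [
        ("MatE", "MVIN", 18.46),
        ("DomA", "DomB", 25.00),   # hypothetical same-clan pair, filtered out
        ("7tm_1", "TAS2R", 18.30),
        ("DomA", "NrfD", 9.10),    # hypothetical pair, only one domain in a clan
    ]
    clan_of = {"DomA": "CL_example", "DomB": "CL_example"}
    for row in top_novel_pairs(alignments, clan_of, n=2):
        print(row)
```

Pairs in which only one domain has a clan assignment are kept, since they are exactly the candidates that could later be added to an existing clan (the NIC case in Table IV).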


CONCLUSION

Aiming to improve membrane protein homology detection, we explored the idea that topology should be better conserved than sequence alone for distant membrane protein homologs. Introducing this kind of information into profile-HMMs in parallel with sequence, we managed to enhance discrimination between membrane protein families in two independent test sets and improve alignment quality for distantly related membrane proteins in a third public benchmark data set. Additionally, we are able to suggest new potential homology for some relatively uncharacterized membrane protein families.

By altering the value of the n parameter in Eq. (2), we could investigate how different weightings between the two alphabets influence discriminative ability at different levels of homology (Supplementary Fig. S3). Briefly, by giving larger weight to the topology alphabet (increasing n), the true positive rate (TPR) increases at high false positive rates (FPR), but when n becomes too large, TPR at low FPRs starts to decrease. Thus, the topology information seems to help primarily to find the distant homologs (high FPRs), confirming the original hypothesis that in those cases where the sequences have probably diverged quite far, topology is better conserved and can help improve homology detection.

Here, we have shown that information about transmembrane topology can help to improve homology detection for integral membrane proteins. Several other studies (e.g., Refs. 34 and 48) have earlier demonstrated the converse: that information about membrane protein homologs can often help to improve prediction of transmembrane topology. This opens up the interesting possibility that by combining these methods in an iterative procedure, the output from one algorithm could be used as input to the other. Both improved homology detection and improved topology prediction could possibly result from such a combined model.

REFERENCES

1. Cane D. Back to basics: assigning biochemical function in the post-genomic era. Chem Biol 2004;11(6):741–743.
2. Whisstock J, Lesk A. Prediction of protein function from protein sequence and structure. Q Rev Biophys 2003;36(3):307–340.
3. Wallner B, Fang H, Ohlson T, Frey-Skott J, Elofsson A. Using evolutionary information for the query and target improves fold recognition. Proteins 2004;54(2):342–350.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215(3):403–410.
5. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389–3402.
6. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998;284(4):1201–1210.
7. Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles, strategies for structural predictions using sequence information. Protein Sci 2000;9(2):232–241.
8. Mittelman D, Sadreyev R, Grishin N. Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics 2003;19(12):1531–1539.
9. Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 2004;57(1):188–197.
10. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001;305(3):567–580.
11. Drews J. Genomic sciences and the medicine of tomorrow. Nat Biotechnol 1996;14(11):1516–1518.
12. Tourasse NJ, Li WH. Selective constraints, amino acid composition, and the rate of protein evolution. Mol Biol Evol 2000;17(4):656–664.
13. Fischer D, Eisenberg D. Protein fold recognition using sequence-derived predictions. Protein Sci 1996;5:947–955.
14. Rice D, Eisenberg D. A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Mol Biol 1997;267:1026–1038.
15. Rost B, Schneider R, Sander C. Protein fold recognition by prediction-based threading. J Mol Biol 1997;270:471–480.
16. Hargbo J, Elofsson A. Hidden Markov models that use predicted secondary structures for fold recognition. Proteins 1999;36(1):68–76.
17. Siew N, Elofsson A, Rychlewski L, Fischer D. Maxsub: an automated measure to assess the quality of protein structure predictions. Bioinformatics 2000;16:776–785.
18. von Ohsen N, Sommer I, Zimmer R. Profile-profile alignment: a powerful tool for protein structure prediction. In: Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE, editors, Pacific Symposium on Biocomputing. 2003:252–263.
19. Yona G, Levitt M. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol 2002;315(5):1257–1275.
20. Sadreyev R, Baker D, Grishin N. Profile-profile comparisons by COMPASS predict intricate homologies between protein families. Protein Sci 2003;12(10):2262–2272.
21. Edgar R, Sjolander K. SATCHMO: sequence alignment and tree construction using hidden Markov models. Bioinformatics 2003;19:1404–1411.
22. Pei J, Sadreyev R, Grishin N. PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 2003;19(3):427–428.
23. Ng PC, Henikoff JG, Henikoff S. PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics 2000;16(9):760–766.
24. Hedman M, Deloof H, Von Heijne G, Elofsson A. Improved detection of homologous membrane proteins by inclusion of information from topology predictions. Protein Sci 2002;11(3):652–658.
25. Forrest L, Tang C, Honig B. On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophys J 2006;91:508–517.
26. Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 1996;12(2):95–107.
27. Eddy SR. Profile hidden Markov models. Bioinformatics 1998;14(9):755–763.
28. Soding J, Biegert A, Lupas AN. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005;33(Web Server issue):244–248.
29. Karplus K, Karchin R, Draper J, Casper J, Mandel-Gutfreund Y, Diekhans M, Hughey R. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. Proteins 2003;53(Suppl 6):491–496.
30. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale
DA, O'Donovan C, Redaschi N, Yeh L-SL. The Universal Protein Resource (UniProt). Nucleic Acids Res 2005;33(Database issue):154–159.
31. Engelman DM, Steitz TA, Goldman A. Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 1986;15:321–353.
32. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992;225(2):487–494.
33. Boeckmann B, Bairoch A, Apweiler R, Blatter M-C, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31(1):365–370.
34. Viklund H, Elofsson A. Best α-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 2004;13(7):1908–1917.
35. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989;77(2):257–285.
36. Elofsson A. A study on protein sequence alignment quality. Proteins 2002;46(3):330–339.
37. Jones D. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292(2):195–202.
38. Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G. GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 2003;31(1):294–297.
39. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22(13):1658–1659.
40. Sali A, Blundell T. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234(3):779–815.
41. Zemla A. LGA: a method for finding 3d similarities in protein structures. Nucleic Acids Res 2003;31(13):3370–3374.
42. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57(4):702–710.
43. Lomize A, Pogozheva I, Lomize M, Mosberg H. Positioning of proteins in membranes: a computational approach. Protein Sci 2006;15(6):1318–1333.
44. Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci 1998;7(2):445–456.
45. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR. The Pfam protein families database. Nucleic Acids Res 2004;32:138–141.
46. Wistrand M, Kall L, Sonnhammer ELL. A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci 2006;15(3):509–521.
47. Provost F, Fawcett T. Robust classification for imprecise environments. Mach Learn 2001;42(3):203–231.
48. Kall L, Krogh A, Sonnhammer ELL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 2005;21(Suppl 1):251–257.

