
Swiss Institute of Bioinformatics

EMBnet course: Introduction to Protein Structure Bioinformatics

Homology Modeling
Lausanne, February 22, 2007

Torsten Schwede
Biozentrum - Universität Basel
Swiss Institute of Bioinformatics
Klingelbergstr 50-70
CH - 4056 Basel, Switzerland
Tel: +41-61 267 15 81

How many structures do we know?

http://www.wwpdb.org/

Growth of the Protein Data Bank (PDB)

[Figure: total and yearly number of PDB entries. Source: PDB, http://www.pdb.org]

How many structures do we know?

[Figure: number of entries in TrEMBL, SwissProt and PDB per year, 1986-2006, on a logarithmic axis from 100 to 10,000,000. Annotation: no experimental structure for most protein sequences. Sources: PDB, EBI, SIB]

In the near future for most of the known protein sequences no experimental structure will be available.

Can we predict protein structures from genome sequences?


MNIFEMLRID HLLTKSPSLN DEAEKLFNQD LDAVRRCALI LQQKRWDEAA TTFRTGTWDA EGLRLKIYKD AAKSELDKAI VDAAVRGILR NMVFQMGETG VNLAKSRWYN YKNL TEGYYTIGIG GRNCNGVITK NAKLKPVYDS VAGFTNSLRM QTPNRAKRVI


Many proteins fold spontaneously to their native structure.
Protein folding is relatively fast (nsec to sec).
Chaperones speed up folding, but do not alter the structure.

The protein sequence contains all information needed to create a correctly folded protein. Can we predict the folding process of a protein from its sequence (ab initio)?

Molecular Dynamics
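$$
V = \sum_{\text{bonds}} \frac{k_i}{2}\,(l_i - l_{i,0})^2
  + \sum_{\text{angles}} \frac{k_i}{2}\,(\theta_i - \theta_{i,0})^2
  + \sum_{\text{torsions}} \frac{V_n}{2}\,\bigl(1 + \cos(n\omega - \gamma)\bigr)
  + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( 4\varepsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \right)
$$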
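The non-bonded part of such a force field (the Lennard-Jones and Coulomb terms summed over all atom pairs i < j) can be sketched in a few lines. This is only an illustration: the function name, the use of one epsilon and sigma for every pair, and the `coulomb_const` parameter (standing in for 1/(4*pi*epsilon_0) in whatever unit system is chosen) are assumptions, not part of any particular MD package.

```python
import math


def nonbonded_energy(coords, charges, epsilon, sigma, coulomb_const=1.0):
    """Sum Lennard-Jones and Coulomb terms over all atom pairs i < j.

    coords  : list of (x, y, z) tuples
    charges : list of partial charges q_i
    epsilon, sigma : one Lennard-Jones parameter pair used for every
        atom pair (a simplification; real force fields use per-pair values)
    coulomb_const : stands in for 1 / (4 * pi * epsilon_0)
    """
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            sr6 = (sigma / r) ** 6
            # Lennard-Jones 12-6 term
            energy += 4.0 * epsilon * (sr6 ** 2 - sr6)
            # Coulomb term
            energy += coulomb_const * charges[i] * charges[j] / r
    return energy
```

For two uncharged atoms placed at the Lennard-Jones minimum distance, r = 2^(1/6) * sigma, the function returns -epsilon, which is a quick sanity check of the 12-6 term.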

Ab initio protein folding simulation

Physical time for simulation                              10^-4 seconds
Typical time-step size                                    10^-15 seconds
Number of MD time steps                                   10^11
Atoms in a typical protein and water simulation           32,000
Approximate number of interactions in force calculation   10^9
Machine instructions per force calculation                1000
Total number of machine instructions                      10^23
Petaflop capacity computer                                1 petaflop (10^15 floating point operations per second)

Blue Gene will need 1-3 years to simulate 100 µsec.
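The estimate can be reproduced with simple arithmetic. The sketch below reads the figures above as 10^-4 s of simulated time, a 10^-15 s time step, 10^9 interactions per step, 1000 instructions per interaction and a machine sustaining 10^15 operations per second; the variable names are illustrative.

```python
# Back-of-the-envelope cost of an ab initio folding simulation.
simulated_time_s = 1e-4              # physical time to simulate (100 microseconds)
time_step_s = 1e-15                  # typical MD time step (1 femtosecond)
interactions_per_step = 1e9          # pairwise interactions per force calculation
instructions_per_interaction = 1e3   # machine instructions per interaction
machine_speed = 1e15                 # 1 petaflop, assumed fully sustained

n_steps = simulated_time_s / time_step_s
total_instructions = n_steps * interactions_per_step * instructions_per_interaction
wall_clock_s = total_instructions / machine_speed
wall_clock_years = wall_clock_s / (365.25 * 24 * 3600)

print(f"MD steps:           {n_steps:.1e}")
print(f"Total instructions: {total_instructions:.1e}")
print(f"Wall-clock time:    {wall_clock_years:.1f} years")
```

Under these assumptions the run takes about 10^8 seconds, roughly three years, which sits at the upper end of the quoted 1-3 year range.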


[ http://www.research.ibm.com/bluegene/ ]

Growth of the Protein Data Bank PDB

[Figure: old folds and new folds per year. Source: PDB, http://www.pdb.org]

CATH - Protein Structure Classification


Class (C): derived from secondary structure content; assigned automatically.
Architecture (A): describes the gross orientation of secondary structures, independent of connectivity.
Topology (T): clusters structures according to their topological connections and numbers of secondary structures.
Homologous Superfamily (H): groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.

[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]

Sequence similarity implies structural similarity?


[Figure: pairwise sequence identity (0 to 100%) versus number of residues aligned. Above the curve, sequence identity implies structural similarity; below it lies the "don't know" region. (B. Rost, Columbia, New York)]

Sequence similarity implies structural similarity?

[Figure: percentage sequence identity / similarity (0 to 100) versus number of residues aligned (50 to 250), with one curve for identity and one for similarity. Above the curves, sequence identity implies structural similarity; below them lies the "don't know" region. (B. Rost, Columbia, New York)]

Fold recognition / Threading


Find a compatible fold for a given sequence ....

>Protein XY MSTLYEKLGGTTAVDLAV DKFYERVLQDDRIKHFFA DVDMAKQRAHQKAFLTYA FGGTDKYDGRYMREAHKE LVENHGLNGEHFDAVAED LLATLKEMGVPEDLIAEV AAVAGAPAHKRDVLNQ

The number of protein folds that occur in nature is limited.
Fold recognition can be used to:
identify templates for comparative modeling
assign protein function

Fold recognition / Threading


The "biological" perspective: Homologous proteins have evolved by molecular evolution from a common ancestor. If we can establish homology, we can predict aspects of structure and function of a new protein by analogy.

The "physical" perspective: The native conformation of a protein corresponds to a global free energy minimum of the protein / solvent system. To identify a compatible fold, the protein sequence is "threaded" through a library of folds, and empirical energy calculations are used to evaluate compatibility.

No single method is perfect. Consensus methods often perform better:
MetaPP: http://cubic.bioc.columbia.edu/predictprotein/
http://bioinfo.pl/meta/

Further reading: Adam Godzik, "Fold Recognition Methods", in: "Structural Bioinformatics", Bourne & Weissig, Eds.

Protein Structure / Fold Databases

PDB:

http://www.pdb.org

EBI-MSD

http://www.ebi.ac.uk/msd/

SCOP

http://scop.mrc-lmb.cam.ac.uk/scop/

CATH

http://www.biochem.ucl.ac.uk/bsm/cath_new/

Fold Recognition Servers


Meta server
http://bioinfo.pl/meta/

3DPSSM / Phyre
http://www.sbg.bio.ic.ac.uk/servers/3dpssm/ http://www.sbg.bio.ic.ac.uk/~phyre/

GenTHREADER
http://bioinf.cs.ucl.ac.uk/psipred/

FUGUE2
http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html

SAM
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99query.html

FOLD
http://fold.doe-mbi.ucla.edu/

FFAS/PDBBLAST
http://bioinformatics.burnham-inst.org/

Evolution of the globin family:

Evolution of protein structure families


[Figure: RMSD of backbone atoms in the core (0.0 to 2.5 Å) versus percent identical residues in the core (100 to 0). Source: Chothia & Lesk (1986)]

Common core = all residues that can be superposed in 3D.
For proteins with > 60% identical residues, the core contains > 90% of all residues, deviating by less than 1.0 Å.
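The deviation quoted here is a root-mean-square deviation (RMSD) over equivalent backbone atoms. A minimal sketch of that calculation follows; it assumes the two coordinate sets have already been superposed and list equivalent atoms in the same order (the superposition step itself is not shown), and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superposed coordinate sets.

    coords_a, coords_b : equally long lists of (x, y, z) tuples, where
        atom k in one list corresponds to atom k in the other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equally long")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

For example, two atoms that are each displaced by exactly 1 Å give an RMSD of 1.0 Å.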

Similar Sequence -> Similar Structure

Homology modeling
= Comparative protein modeling
= Knowledge-based modeling

Idea: Using experimental 3D-structures of related family members (templates) to calculate a model for a new sequence (target).

Comparative Modeling

Target Sequence + Known Structures (Templates)
-> Template Selection
-> Alignment Template - Target
-> Structure modeling
-> Structure Evaluation & Assessment
-> Homology Model(s)

Comparative Modeling: Known Structures (Templates)

Database of templates: Protein Data Bank PDB, http://www.pdb.org
Separate into single chains
Remove bad structures (models)
Create BLASTable database or fold library (profiles, HMMs)
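The first preparation step, separating an entry into single chains, can be sketched as follows. The example relies only on the fixed-column PDB format, in which the chain identifier of ATOM and HETATM records is the 22nd character of the line; the function name is illustrative and the sketch ignores multi-model files and other record types.

```python
from collections import defaultdict


def split_by_chain(pdb_lines):
    """Group ATOM/HETATM records of a PDB file by chain identifier.

    pdb_lines : iterable of text lines from a PDB-format file
    Returns a dict mapping chain ID (one character) to its list of lines.
    """
    chains = defaultdict(list)
    for line in pdb_lines:
        # Chain ID is column 22 of the record, i.e. string index 21.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    return dict(chains)
```

A typical use would be `split_by_chain(open(path))` for some local PDB file `path`, followed by writing each chain to its own file before building the searchable template library.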

Comparative Modeling: Template Selection

Template selection:
1. Sequence Similarity / Fold recognition
2. Structure quality (resolution, experimental method)
3. Experimental conditions (ligands and cofactors)

Comparative Modeling: Alignment Template - Target

Multiple sequence alignment for pairs > 40% identity, or
use structural alignment of templates to guide sequence alignment of target, or
use separate profiles for template and targets
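The 40% threshold on this slide refers to pairwise sequence identity between target and template. A minimal sketch of computing it from an existing alignment is given below; it assumes two gapped strings of equal length and counts identity over positions where neither sequence has a gap. Other denominators (for example the length of the shorter sequence) are also in use, so the choice here is an assumption.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identical residues over aligned (non-gap) positions.

    aligned_target, aligned_template : gapped sequences of equal length
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        aligned += 1
        if a.upper() == b.upper():
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0
```

For the toy alignment `MKT-AL` against `MRTGAL`, five positions are aligned and four are identical, giving 80%.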

Comparative Modeling: Structure Evaluation & Assessment

Errors in template selection or alignment result in bad models.
Use iterative cycles of alignment, modeling and evaluation.
Build many models, choose the best.

Comparative Modeling: Structure modeling

I. Manual Model building
II. Template based fragment assembly
    Composer (Sybyl, Tripos)
    SWISS-MODEL
III. Satisfaction of spatial restraints
    Modeller (Insight II, MSI)
    CPH-Models

I. Manual Modeling

[ http://www.expasy.org/spdbv/ ]

II. Template based fragment assembly


Find structurally conserved core regions

II. Template based fragment assembly


Build model core by averaging core template backbone atoms (weighted by local sequence similarity with the target sequence).
Leave non-conserved regions (loops) for later.
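The weighted averaging of template backbone atoms can be illustrated with a short sketch. It assumes the templates are already superposed and that, for every core residue, one equivalent atom position per template and one local-similarity weight per template are available; the function names and data layout are hypothetical and not taken from SWISS-MODEL.

```python
def average_position(positions, weights):
    """Weighted mean of equivalent atom positions from several templates."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return tuple(
        sum(w * p[k] for p, w in zip(positions, weights)) / total
        for k in range(3)
    )


def build_core(templates, weights):
    """Average superposed template coordinates residue by residue.

    templates : templates[t][r] is the (x, y, z) position of core
        residue r in template t
    weights   : weights[t][r] is the local sequence similarity between
        the target and template t around residue r
    """
    n_residues = len(templates[0])
    return [
        average_position([t[r] for t in templates], [w[r] for w in weights])
        for r in range(n_residues)
    ]
```

With two templates placing an atom at x = 0 and x = 2 and weights 1 and 3, the model atom ends up at x = 1.5, closer to the template that is locally more similar to the target.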

II. Template based fragment assembly


Loop (insertion) modeling
Use the "spare part" algorithm to find compatible fragments in a loop database, or use ab initio rebuilding (e.g. Monte Carlo, MD, GA, etc.) to build missing loops.

II. Template based fragment assembly


Side Chain placement
Find the most probable side chain conformation, using
homologous structure information
backbone-dependent rotamer libraries
energetic and packing criteria

II. Template based fragment assembly


Rotamer Libraries
Only a small fraction of all possible side chain conformations is observed in experimental structures.
Rotamer libraries provide an ensemble of likely conformations.
The propensity of rotamers depends on the backbone geometry:

II. Template based fragment assembly

Energy minimization
Modeling methods will produce unfavorable contacts and bonds.
Energy minimization is used to
regularize local bond and angle geometry
relax close contacts and geometric strain

Extensive energy minimization will move coordinates away from the real structure: keep it to a minimum.
SWISS-MODEL uses the GROMOS 96 force field for a steepest descent minimization.

Homology Modeling
III. Satisfaction of Spatial restraints

Alignment of target sequence with templates
Extraction of spatial restraints from templates
Modeling by satisfaction of spatial restraints

III. Satisfaction of Spatial restraints


Some features of a protein structure:

R          resolution of X-ray experiment
r          amino acid residue type
Φ, Ψ       main chain angles
t          secondary structure class
M          main chain conformation class
χi, ci     side chain dihedral angle class
a          residue solvent accessibility
s          residue neighborhood difference
d          Cα - Cα distance
Δd         difference between two Cα - Cα distances

III. Satisfaction of Spatial restraints


Feature properties can be associated with
a protein (e.g. X-ray resolution)
residues (e.g. solvent accessibility)
pairs of residues (e.g. Cα - Cα distance)
other features (e.g. main chain classes)

How can we derive modeling restraints from this data?


A restraint is defined as a probability density function (pdf) p(x):

$$p(x_1 \le x < x_2) = \int_{x_1}^{x_2} p(x)\,dx$$

with

$$\int p(x)\,dx = 1, \qquad p(x) > 0$$

III. Satisfaction of Spatial restraints


Derive pdfs from frequency tables by smoothing:

a) 11 Cys residues Chi-1 angles
b) smoothed distribution from a)
c) 297 Cys Chi-1 angles as control
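One way to turn a sparse frequency table of angles into a smooth pdf is sketched below. The slide does not say which smoothing procedure was used, so the circular Gaussian kernel, the 10 degree bins and the 20 degree kernel width are assumptions made purely for illustration.

```python
import math


def smooth_pdf(angles_deg, bin_width=10.0, sigma_deg=20.0):
    """Smooth a histogram of angles into a pdf over [0, 360) degrees.

    Returns one density value per bin; the values times bin_width sum to 1.
    """
    if not angles_deg:
        raise ValueError("need at least one observation")
    n_bins = int(round(360.0 / bin_width))
    counts = [0] * n_bins
    for angle in angles_deg:
        counts[int((angle % 360.0) // bin_width) % n_bins] += 1

    smoothed = []
    for i in range(n_bins):
        value = 0.0
        for j, count in enumerate(counts):
            # shortest angular separation between bins i and j (periodic)
            delta = abs(i - j) * bin_width
            delta = min(delta, 360.0 - delta)
            value += count * math.exp(-0.5 * (delta / sigma_deg) ** 2)
        smoothed.append(value)

    area = sum(smoothed) * bin_width
    return [v / area for v in smoothed]
```

The result satisfies the two conditions for a pdf stated earlier: every bin has a strictly positive density, even where no angle was observed, and the densities integrate to 1 over the full circle.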

III. Satisfaction of Spatial restraints


Combine basis pdfs to molecular probability density functions
[Figure: example pdf panels for the ranges 0.2 < s' < 0.4 with 0.2 < s'' < 0.4; 0.2 < s' < 0.4 with 0.4 < s'' < 0.6; and 0.4 < s' < 0.6 with 0.2 < s'' < 0.4]

III. Satisfaction of Spatial restraints


Satisfaction of spatial restraints: find the protein model with the highest probability.

Variable target function:
Start with a linear conformation model or a model close to the template conformation.
At first, use only local restraints.
Minimize for some steps using a conjugate gradient optimization.
Repeat, introducing more and more long range restraints, until all restraints are used.
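The variable target function schedule can be sketched on a toy problem. The example below is not Modeller's implementation: it uses simple harmonic distance restraints instead of pdfs, plain gradient descent instead of conjugate gradients, and a made-up ten-point helix as the reference from which the restraints are derived. All names, step sizes and stage boundaries are illustrative assumptions.

```python
import math
import random


def violation(coords, restraints):
    """Sum of squared deviations from the target distances."""
    return sum((math.dist(coords[i], coords[j]) - target) ** 2
               for i, j, target in restraints)


def gradient(coords, restraints):
    """Gradient of the violation with respect to every coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j, target in restraints:
        d = math.dist(coords[i], coords[j])
        if d == 0.0:
            continue
        scale = 2.0 * (d - target) / d
        for k in range(3):
            g = scale * (coords[i][k] - coords[j][k])
            grad[i][k] += g
            grad[j][k] -= g
    return grad


def variable_target_optimize(start, restraints, max_separations,
                             steps=300, step_size=0.01):
    """Optimize in stages, admitting longer-range restraints at each stage."""
    coords = [list(p) for p in start]
    for max_sep in max_separations:
        # Only restraints between points close in sequence are active.
        active = [r for r in restraints if abs(r[0] - r[1]) <= max_sep]
        for _ in range(steps):
            grad = gradient(coords, active)
            for point, g in zip(coords, grad):
                for k in range(3):
                    point[k] -= step_size * g[k]
    return coords


# Hypothetical reference structure (a short helix) supplying the restraints.
reference = [(math.cos(i), math.sin(i), 0.3 * i) for i in range(10)]
restraints = [(i, j, math.dist(reference[i], reference[j]))
              for i in range(10) for j in range(i + 1, 10)]

# Start from a nearly linear conformation (small jitter avoids exact collinearity).
rng = random.Random(0)
start = [(float(i), rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1))
         for i in range(10)]

model = variable_target_optimize(start, restraints, max_separations=[1, 3, 9])
print(violation(start, restraints), violation(model, restraints))
```

The design point is the staging: local restraints shape the chain first, so that the long-range restraints introduced later act on a conformation that is already locally reasonable.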

III. Satisfaction of Spatial restraints


Optimization schedule and progress

Model Accuracy Evaluation


CASP: Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction
http://predictioncenter.org/casp7/

EVA: Evaluation of Automatic protein structure prediction

[ Burkhard Rost, Andrej Sali, http://maple.bioc.columbia.edu/eva/ ]

[Figure: EVA workflow: new PDB release -> target sequence -> prediction servers -> evaluation of prediction accuracy]

Typical types of errors


Sequence alignment errors.
Loops which cannot be rebuilt.
Inappropriate template selection.
Subunit displacement.
Structural rearrangements.

Structural rearrangements cause problems for template selection and automated evaluation because they are sequence independent:
e.g. flap-region in adenylate kinases (1AKE, 4AKE)
e.g. DNA-binding domains (1AWC, 1ETC)

Protein Structure Evaluation


Problem: How can we identify errors in 3-dimensional protein structures (without knowing the correct answer)?

Bond & Angle Geometry
Molecular Interactions

Empirical Force Fields
Statistical Methods

Empirical Force Fields


e.g. GROMOS, CHARMM, AMBER, ...

Which types of errors in a protein structure can you identify with an empirical force field? Which types of errors are not recognized?

Statistical Methods
Ramachandran Plot of backbone angles (φ, ψ)
favored regions
generously allowed regions
disallowed regions

Amino acids with special properties:
PRO: φ = -60°
GLY ( )

Similar plots for χ-angle distributions

Useful to identify regions with errors in geometry
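The backbone angles plotted in a Ramachandran diagram are dihedral angles over four consecutive backbone atoms: C(i-1), N(i), CA(i), C(i) for phi and N(i), CA(i), C(i), N(i+1) for psi. A minimal sketch of the dihedral calculation follows; the function names are illustrative, and classifying the resulting angles into favored or disallowed regions would need a reference distribution that is not reproduced here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Components of b0 and b2 perpendicular to the central bond b1.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """Backbone angles of one residue from five atom positions."""
    return dihedral_deg(prev_c, n, ca, c), dihedral_deg(n, ca, c, next_n)
```

Four points in a cis arrangement give 0 degrees and a trans arrangement gives 180 degrees (up to sign), which is a convenient check before applying the function to real coordinates.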

1D - 3D Checks
Probability for a feature to occur in a given environment, e.g.
Solvent exposed / buried
Hydrophobic / polar environment
Electrostatic interactions
Secondary structure

See: R. Luthy (1992) Assessment of protein models with three-dimensional profiles, Nature, 356(6364):83-5

Statistical Mean Force Potentials


[Figure: atomic non-local interaction energy, with labelled residues Met80 (*), Ile86 (+), Val13 (I), Phe134 (II) and Ala182 (III). Table: atom type definitions]

Statistical Mean Force Potentials


Use the inverse Boltzmann law to derive an atomic Potential of Mean Force (U) from the observed number of atomic pairs (i,j) within a distance shell r ± Δr in the training database of protein structures:

$$U(i,j,r) = -RT \,\ln \frac{N_{\text{observed}}(i,j,r)}{N_{\text{expected}}(i,j,r)}$$

R: gas constant
T: temperature

N_expected is the expected number of atomic pairs (i,j) in the same distance shell if there were no interactions between atoms (reference state).
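The inverse Boltzmann relation is easy to evaluate for a single distance shell. The sketch below uses the usual sign convention, U = -RT ln(N_observed / N_expected), so that pairs seen more often than expected receive a favorable (negative) energy; the function name and the default temperature of 298.15 K are assumptions for illustration.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal / (mol K)


def mean_force_potential(n_observed, n_expected, temperature=298.15):
    """Potential of mean force (kcal/mol) for one atom-pair type and shell.

    n_observed : number of pairs counted in the training database
    n_expected : number expected in the reference state (no interactions)
    """
    if n_observed <= 0 or n_expected <= 0:
        raise ValueError("counts must be positive")
    return -GAS_CONSTANT * temperature * math.log(n_observed / n_expected)
```

A pair type observed twice as often as expected gets an energy of about -0.41 kcal/mol at this temperature, while one observed half as often gets the same magnitude with a positive sign. Shells with zero observations need separate treatment (for example pseudo-counts), which is not covered here.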

[Figure: mean force potential (MFP, kcal/mol) versus distance for methyl-methyl pairs and for cysteine S-S pairs]

ANOLEA : (Atomic Non-Local Environment Assessment)


http://protein.bio.puc.cl/cardex/servers/anolea/ http://swissmodel.expasy.org/anolea/

ANOLEA
Detects local packing errors
Errors in alignments

[Figure: ANOLEA output for the correct structure (PDB: 1GES) and for a model with a wrong alignment]

PROCHECK
Checks the stereo-chemical quality of a protein structure, producing a number of plots analyzing its overall and residue-by-residue geometry.

Covalent geometry
Planarity
Dihedral angles
Chirality
Non-bonded interactions
Main-chain hydrogen bonds
Disulphide bonds
Stereochemical parameters
Residue-by-residue analysis

Laskowski R A, MacArthur M W, Moss D S & Thornton J M (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283-291.
Morris A L, MacArthur M W, Hutchinson E G & Thornton J M (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345-364.

WhatCheck / WhatIf
WHAT IF I check my structure?
Imagine ...
An everyday situation in a biocomputing lab: "Should they use the structure?"
An everyday situation in a crystallography lab: "Should they deposit the structure already?"

In a WHAT_CHECK report, each reported fact has an assigned severity:
error: severe errors encountered during the analyses. Items marked as errors are considered severe problems requiring immediate attention.
warning: Either less severe problems or uncommon structural features. These still need special attention.
note: Statistical values, plots, or other verbose results of tests and analyses that have been performed.

WHAT IF: A molecular modeling and drug design program. G. Vriend, J. Mol. Graph. (1990) 8, 52-56.
Errors in protein structures. R.W.W. Hooft, G. Vriend, C. Sander, E.E. Abola, Nature (1996) 381, 272-272.

WhatCheck / WhatIf report for a bad model ...


# 49 # Note: Summary report for users of a structure
This is an overall summary of the quality of the structure as compared with current reliable structures. This summary is most useful for biologists seeking a good structure to use for modelling calculations.

The second part of the table mostly gives an impression of how well the model conforms to common refinement constraint values. The first part of the table shows a number of constraint-independent quality indicators.

Structure Z-scores, positive is better than average:
  1st generation packing quality : -2.550
  2nd generation packing quality : -5.472 (bad)
  Ramachandran plot appearance   : -1.898
  chi-1/chi-2 rotamer normality  : -1.433
  Backbone conformation          : -2.173

RMS Z-scores, should be close to 1.0:
  Bond lengths                   : 0.905
  Bond angles                    : 1.476
  Omega angle restraints         : 0.921
  Side chain planarity           : 2.681 (loose)
  Improper dihedral distribution : 1.771 (loose)
  Inside/Outside distribution    : 1.333 (unusual)

whatcheck.txt

All checking tools are happy, so can I believe it now?
Models are not experimental facts!
Models can be partially inaccurate or sometimes completely wrong!
A model is a tool that helps to interpret biochemical data.

Some useful Evaluation Tools


ANOLEA : (Atomic Non-Local Environment Assessment)

http://protein.bio.puc.cl/cardex/servers/anolea/ http://swissmodel.expasy.org/anolea/
ProCheck: http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
WhatCheck: http://www.cmbi.kun.nl/gv/whatcheck/
Verify3D: http://www.doe-mbi.ucla.edu/Services/Verify_3D/
Biotech Validation Suite for Protein Structures: http://biotech.ebi.ac.uk:8400/

What can models be used for ?

A Model must be wrong, in some respects, else it would be the thing itself. The trick is to see where it is right.

(Henry A. Bent)

Model quality vs. sequence identity

Midnight Zone
Twilight Zone
Safe Zone

What can models be used for ?


Annotation by fold assignment
3D-motif searching, active site recognition
Including NMR restraints
Supporting site directed mutagenesis
X-Ray Molecular replacement models
Docking of small molecules
Drug development; comparable to medium resolution NMR or low resolution X-ray structures

Application example: Understanding drug interactions

Knowledge of the 3-dimensional structures of target proteins allows us to understand the interactions of inhibitors and drugs with their target proteins.

Discovery of CK2a Inhibitors by in silico docking

Homology model of the target molecule:

Reference: Discovery of a potent and selective protein kinase CK2 inhibitor by high-throughput docking. Vangrevelinghe E, Zimmermann K, Schoepfer J, Portmann R, Fabbro D, Furet P. Oncology Research, Novartis Pharma, Basle, J Med Chem. 2003 Jun 19;46(13):2656-62.

Medicines are not Effective in all Patients

Inter-individual differences in drug efficacy:

Group             Incomplete/absent efficacy
SSRI              10-25%
ACE-I             10-30%
Beta blockers     15-25%
Statins           30-70%
Beta2 agonists    40-70%

[ Spear BB (2001) Trends Mol Med;7(5):201-204 ]

Structural analysis of human mutations and nsSNPs

[Figure: electrostatic potential colour scale from -8 to +8 kT/e]

E.g. Changes in the electrostatic properties upon mutation

Public database holdings

[Figure: number of entries in TrEMBL, SwissProt and PDB per year, 1986-2004, on a logarithmic axis from 100 to 1,000,000]

Structural Genomics
large scale experimental structure solution projects

Goal:

Most of the sequences in a genome database should match at least one structure with a sufficient sequence identity allowing for reliable modeling.

The modeling error determines selection of targets for structural genomics.

Range of sequence space that can be modeled with acceptable accuracy.

Structural Genomics Target Selection

Protein Modeling Resources


SWISS-MODEL   http://swissmodel.expasy.org
Modeller      http://www.salilab.org
WhatIf        http://www.cmbi.kun.nl/whatif/
3D-JIGSAW     http://www.bmm.icnet.uk/people/paulb/3dj/form.html
CPHmodels     http://www.cbs.dtu.dk/services/CPHmodels/
SDSC1         http://cl.sdsc.edu/hm.html

