3D Homology Modelling

Swiss Institute of Bioinformatics
EMBnet course: Introduction to Protein Structure Bioinformatics
Homology Modeling
Lausanne, February 22, 2007
Torsten Schwede
Biozentrum - Universität Basel
Swiss Institute of Bioinformatics
Klingelbergstr 50-70, CH - 4056 Basel, Switzerland
Tel: +41-61 267 15 81
How many structures do we know?
http://www.wwpdb.org/
Growth of the Protein Data Bank PDB
[Figure: total and yearly number of PDB entries. PDB: http://www.pdb.org ]
[Figure: database holdings of TrEMBL, SwissProt and PDB, 1986 to 2006, logarithmic scale from 100 to 10,000,000 entries. Annotation: no experimental structure for most protein sequences. (Sources: PDB, EBI, SIB)]
In the near future for most of the known protein sequences no experimental structure will be available.
Can we predict protein structures from genome sequences?

MNIFEMLRID HLLTKSPSLN DEAEKLFNQD LDAVRRCALI LQQKRWDEAA TTFRTGTWDA
EGLRLKIYKD AAKSELDKAI VDAAVRGILR NMVFQMGETG VNLAKSRWYN YKNL
TEGYYTIGIG GRNCNGVITK NAKLKPVYDS VAGFTNSLRM QTPNRAKRVI
Many proteins fold spontaneously to their native structure.
Protein folding is relatively fast (nsec to sec).
Chaperones speed up folding, but do not alter the structure.
The protein sequence contains all information needed to create a correctly folded protein.
Can we predict the folding process of a protein structure from its sequence (ab initio)?
Molecular Dynamics
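$$
V = \sum_{\mathrm{bonds}} \frac{k_i}{2}\,(l_i - l_{i,0})^2
  + \sum_{\mathrm{angles}} \frac{k_i}{2}\,(\theta_i - \theta_{i,0})^2
  + \sum_{\mathrm{torsions}} \frac{V_N}{2}\,\bigl(1 + \cos(n\omega - \gamma)\bigr)
  + \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( 4\varepsilon_{ij} \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} \right)
$$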
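As a rough illustration of the non-bonded part of such a force field, the sketch below evaluates a Lennard-Jones 12-6 plus Coulomb energy for a single atom pair. The function name, the parameter values and the unit system (kJ/mol, nm, elementary charges) are assumptions chosen for the example; they are not taken from the slides.

```python
# Assumed unit system: kJ/mol, nm, elementary charges.
# 1/(4*pi*eps0) expressed in kJ mol^-1 nm e^-2.
COULOMB_CONSTANT = 138.935458


def nonbonded_pair_energy(r, epsilon, sigma, qi, qj):
    """Lennard-Jones 12-6 plus Coulomb energy for one atom pair at distance r."""
    sr6 = (sigma / r) ** 6
    lennard_jones = 4.0 * epsilon * (sr6 * sr6 - sr6)
    coulomb = COULOMB_CONSTANT * qi * qj / r
    return lennard_jones + coulomb


# Hypothetical parameters for two uncharged atoms.
SIGMA = 0.34      # nm
EPSILON = 1.0     # kJ/mol

# The Lennard-Jones term has its minimum (-epsilon) at r = 2^(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0) * SIGMA
e_min = nonbonded_pair_energy(r_min, EPSILON, SIGMA, 0.0, 0.0)
print(f"LJ minimum at r = {r_min:.4f} nm, energy = {e_min:.4f} kJ/mol")
```

In a full simulation this term is summed over all atom pairs at every time step, which is why the force calculation dominates the cost.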
Ab initio protein folding simulation
Physical time for simulation                              10^-4 seconds
Typical time-step size                                    10^-15 seconds
Number of MD time steps                                   10^11
Atoms in a typical protein and water simulation           32000
Approximate number of interactions in force calculation   10^9
Machine instructions per force calculation                1000
Total number of machine instructions                      10^23
Petaflop capacity computer                                1 petaflop (10^15 floating point operations per second)
Blue Gene will need 1-3 years to simulate 100 µsec.
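The figures listed above can be multiplied out as a quick plausibility check. The sketch below assumes the exponents read as 10^-4 s of physical time, 10^-15 s per time step, 10^9 interactions per step, 1000 instructions per interaction and 10^15 operations per second; the function and variable names are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def md_cost_estimate(physical_time_s=1e-4,
                     time_step_s=1e-15,
                     interactions_per_step=1e9,
                     instructions_per_interaction=1e3,
                     machine_ops_per_s=1e15):
    """Return (number of MD steps, total instructions, runtime in years)."""
    n_steps = physical_time_s / time_step_s
    total_instructions = (n_steps * interactions_per_step
                          * instructions_per_interaction)
    runtime_years = total_instructions / machine_ops_per_s / SECONDS_PER_YEAR
    return n_steps, total_instructions, runtime_years


n_steps, total_instructions, runtime_years = md_cost_estimate()
print(f"MD steps:           {n_steps:.1e}")
print(f"Total instructions: {total_instructions:.1e}")
print(f"Runtime:            {runtime_years:.1f} years at sustained peak speed")
```

Under these assumptions the estimate lands at the upper end of the quoted 1-3 year range, and it presumes the machine sustains its peak rate for the whole run.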

[ http://www.research.ibm.com/bluegene/ ]
Growth of the Protein Data Bank PDB
[Figure: old folds per year and new folds per year]
CATH - Protein Structure Classification

Class (C): derived from secondary structure content, assigned automatically.
Architecture (A): describes the gross orientation of secondary structures, independent of connectivity.
Topology (T): clusters structures according to their topological connections and numbers of secondary structures.
Homologous Superfamily (H): groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.
[ http://www.biochem.ucl.ac.uk/bsm/cath_new/ ]
Sequence similarity implies structural similarity?

[Figure: pairwise sequence identity (0 to 100%) versus number of residues aligned. Above the threshold curve, sequence identity implies structural similarity; below it lies the "don't know" region.]

(B.Rost, Columbia, NewYork)
Sequence similarity implies structural similarity?

[Figure: percentage sequence identity/similarity (0 to 100%) versus number of residues aligned (50 to 250), with separate threshold curves for identity and for similarity. Above the curves, sequence identity implies structural similarity; below them lies the "don't know" region.]

(B.Rost, Columbia, NewYork)
Fold recognition / Threading

Find a compatible fold for a given sequence ....
>Protein XY MSTLYEKLGGTTAVDLAV DKFYERVLQDDRIKHFFA DVDMAKQRAHQKAFLTYA FGGTDKYDGRYMREAHKE LVENHGLNGEHFDAVAED LLATLKEMGVPEDLIAEV AAVAGAPAHKRDVLNQ
The number of protein folds that occur in nature is limited.
Fold recognition can be used to:
Identify templates for comparative modeling
Assign protein function
Fold recognition / Threading

The "biological" perspective:
Homologous proteins have evolved by molecular evolution from a common ancestor. If we can establish homology, we can predict aspects of structure and function of a new protein by analogy.
The "physical" perspective:
The native conformation of a protein corresponds to a global free energy minimum of the protein / solvent system. To identify a compatible fold, the protein sequence is "threaded" through a library of folds, and empirical energy calculations are used to evaluate compatibility.
No single method is perfect. Consensus methods often perform better:
MetaPP: http://cubic.bioc.columbia.edu/predictprotein/ http://bioinfo.pl/meta/
Further reading: Adam Godzik, "Fold Recognition Methods", in: "Structural Bioinformatics", Bourne & Weissig, Eds.
Protein Structure / Fold Databases
PDB:
http://www.pdb.org
EBI-MSD
http://www.ebi.ac.uk/msd/
SCOP
http://scop.mrc-lmb.cam.ac.uk/scop/
CATH
http://www.biochem.ucl.ac.uk/bsm/cath_new/
Fold Recognition Servers

Meta server
http://bioinfo.pl/meta/
3DPSSM / Phyre
http://www.sbg.bio.ic.ac.uk/servers/3dpssm/ http://www.sbg.bio.ic.ac.uk/~phyre/
GenTHREADER
http://bioinf.cs.ucl.ac.uk/psipred/
FUGUE2
http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
SAM
http://www.cse.ucsc.edu/research/compbio/HMM-apps/T99query.html
FOLD
http://fold.doe-mbi.ucla.edu/
FFAS/PDBBLAST
http://bioinformatics.burnham-inst.org/
Evolution of the globin family:
Evolution of protein structure families

[Figure: RMSD of backbone atoms in core (0.0 to 2.5 Å) versus percent identical residues in core (100 to 0). Chothia & Lesk (1986)]
Common core = all residues that can be superposed in 3D.
For proteins with > 60% identical residues, the core contains > 90% of all residues, deviating by less than 1.0 Å.
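The deviation between two core regions is usually reported as a root mean square deviation (RMSD) over equivalent atoms. The sketch below computes it for two coordinate lists that are assumed to be already superposed and listed in the same atom order; it does not perform the superposition itself, and the function name is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and the atoms are
    listed in equivalent order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(coords_a, coords_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(coords_a))


# Two toy "backbones" shifted by 1 Angstrom along z.
core_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
core_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(core_1, core_2))
```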
Similar sequence → similar structure
Homology modeling = Comparative protein modeling = Knowledge-based modeling
Idea: Use experimental 3D structures of related family members (templates) to calculate a model for a new sequence (target).
Comparative Modeling
Input: Known Structures (Templates), Target Sequence
1. Template Selection
2. Alignment Template - Target
3. Structure modeling
4. Structure Evaluation & Assessment
Output: Homology Model(s)
Database of templates:
Protein Data Bank PDB http://www.pdb.org
Separate into single chains
Remove bad structures (models)
Create a BLASTable database or fold library (profiles, HMMs)
Template selection:
1. Sequence similarity / fold recognition
2. Structure quality (resolution, experimental method)
3. Experimental conditions (ligands and cofactors)
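The selection criteria can be combined into a simple ranking. The sketch below is one possible ordering, assuming each candidate template already carries a sequence identity to the target and an X-ray resolution; the identifiers, the 30% identity cut-off and the tie-breaking rule are hypothetical and do not describe any particular server.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str            # hypothetical identifier
    identity: float      # percent sequence identity to the target
    resolution: float    # X-ray resolution in Angstrom, lower is better


def rank_templates(templates, min_identity=30.0):
    """Drop templates below the identity cut-off (assumed value), then sort
    by identity (descending) and resolution (ascending) as a tie-breaker."""
    usable = [t for t in templates if t.identity >= min_identity]
    return sorted(usable, key=lambda t: (-t.identity, t.resolution))


candidates = [
    Template("tmplA", 45.0, 2.8),
    Template("tmplB", 45.0, 1.9),
    Template("tmplC", 22.0, 1.5),
    Template("tmplD", 62.0, 2.2),
]
ranked = [t.name for t in rank_templates(candidates)]
print(ranked)
```

Experimental conditions such as bound ligands or cofactors are harder to score automatically and are usually judged by inspection.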
Alignment Template - Target:
Multiple sequence alignment for pairs > 40% identity, or
Use structural alignment of templates to guide sequence alignment of target, or
Use separate profiles for template and target
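Whether a target-template pair falls above a given identity level can be read directly from the alignment. The sketch below computes percent identity over the aligned (non-gap) columns of two pre-aligned sequences; the gap character and the choice of denominator (aligned columns rather than full length) are assumptions, and other definitions are in use.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip insertion/deletion columns
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Toy alignment: 7 aligned columns, 6 of them identical.
identity = percent_identity("MSTL-YEK", "MSALQYEK")
print(f"{identity:.1f}% identity")
```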
Errors in template selection or alignment result in bad models.
Use iterative cycles of alignment, modeling and evaluation.
Build many models, choose the best.
Structure modeling:
I. Manual model building
II. Template based fragment assembly: Composer (Sybyl, Tripos), SWISS-MODEL
III. Satisfaction of spatial restraints: Modeller (Insight II, MSI), CPH-Models
I. Manual Modeling
[ http://www.expasy.org/spdbv/ ]
II. Template based fragment assembly

Find structurally conserved core regions

Build model core by averaging core template backbone atoms (weighted by local sequence similarity with the target sequence). Leave non-conserved regions (loops) for later.

Loop (insertion) modeling
Use the spare part algorithm to find compatible fragments in a LoopDatabase, or ab-initio rebuilding (e.g. Monte Carlo, MD, GA, etc.) to build missing loops.

Side Chain placement
Find the most probable side chain conformation, using
homologous structure information
backbone-dependent rotamer libraries
energetic and packing criteria

Rotamer Libraries
Only a small fraction of all possible side chain conformations is observed in experimental structures.
Rotamer libraries provide an ensemble of likely conformations.
The propensity of rotamers depends on the backbone geometry:
Energy minimization
Any modeling method will produce unfavorable contacts and bonds.
Energy minimization is used to
regularize local bond and angle geometry
relax close contacts and geometric strain
Extensive energy minimization will move coordinates away from the real structure, so keep it to a minimum.
SWISS-MODEL uses the GROMOS 96 force field for a steepest descent minimization.
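Steepest descent simply moves the coordinates a small step against the energy gradient and repeats. The sketch below applies a fixed-step variant to a toy one-dimensional harmonic bond; the force constant, reference length, step size and step limit are hypothetical, and no GROMOS parameters are used.

```python
def steepest_descent(x, gradient, step_size=0.005, max_steps=50, tol=1e-6):
    """Fixed-step steepest descent; stops early once the gradient is small."""
    for _ in range(max_steps):
        g = gradient(x)
        if max(abs(gi) for gi in g) < tol:
            break
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


# Toy energy: E(l) = K/2 * (l - L0)^2 for one bond length l.
K_BOND = 100.0   # hypothetical force constant
L0 = 1.53        # hypothetical reference bond length


def bond_gradient(x):
    return [K_BOND * (x[0] - L0)]


relaxed = steepest_descent([1.70], bond_gradient)
print(f"relaxed bond length: {relaxed[0]:.4f}")
```

The small `max_steps` limit reflects the advice to keep minimization short: the aim is to remove strain, not to drive the model far from the template-derived coordinates.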
Homology Modeling
III. Satisfaction of Spatial restraints
Alignment of target sequence with templates
Extraction of spatial restraints from templates
Modeling by satisfaction of spatial restraints

Some features of a protein structure:
R        resolution of X-ray experiment
r        amino acid residue type
Φ, Ψ     main chain angles
t        secondary structure class
M        main chain conformation class
χi, ci   side chain dihedral angle class
a        residue solvent accessibility
s        residue neighborhood difference
d        Cα - Cα distance
Δd       difference between two Cα - Cα distances

Feature properties can be associated with
a protein (e.g. X-ray resolution)
residues (e.g. solvent accessibility)
pairs of residues (e.g. Cα - Cα distance)
other features (e.g. main chain classes)
How can we derive modeling restraints from this data?

A restraint is defined as a probability density function (pdf) p(x):
$$
p(x_1 \le x < x_2) = \int_{x_1}^{x_2} p(x)\,dx
\qquad \text{with} \qquad
\int p(x)\,dx = 1, \quad p(x) > 0
$$

Derive pdfs from frequency tables by smoothing:
a) Chi-1 angles of 11 Cys residues
b) smoothed distribution from a)
c) 297 Cys Chi-1 angles as control
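To make the idea of smoothing a sparse frequency table concrete, the sketch below spreads histogram counts over neighbouring bins with a small circular kernel and normalizes the result so that it integrates to one. The counts, the 30-degree bin width and the kernel are invented for illustration; the smoothing procedure used in the original work is more elaborate.

```python
def smoothed_pdf(counts, bin_width, kernel=(0.25, 0.5, 0.25)):
    """Smooth a circular histogram and normalize it to a probability density."""
    n = len(counts)
    half = len(kernel) // 2
    smoothed = [
        sum(kernel[k] * counts[(i + k - half) % n] for k in range(len(kernel)))
        for i in range(n)
    ]
    total = sum(smoothed)
    if total == 0:
        raise ValueError("histogram is empty")
    return [s / (total * bin_width) for s in smoothed]


# Hypothetical Chi-1 histogram: 11 observations in twelve 30-degree bins.
BIN_WIDTH = 30.0
counts = [0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 2, 0]
pdf = smoothed_pdf(counts, BIN_WIDTH)
print(sum(p * BIN_WIDTH for p in pdf))  # integral of the density
```

Bins next to an observed peak receive a non-zero density after smoothing, which matters when only a handful of observations are available.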

Combine basis pdfs to molecular probability density functions
[Figure: combined pdfs, with panels for the feature ranges (0.2 < s' < 0.4, 0.2 < s'' < 0.4), (0.2 < s' < 0.4, 0.4 < s'' < 0.6) and (0.4 < s' < 0.6, 0.2 < s'' < 0.4)]

Satisfaction of spatial restraints
Find the protein model with the highest probability.
Variable target function:
Start with a linear conformation model or a model close to the template conformation
At first, use only local restraints
Minimize for some steps using a conjugate gradient optimization
Repeat, introducing more and more long range restraints, until all restraints are used

Optimization schedule and progress
Model Accuracy Evaluation

CASP: Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction
http://predictioncenter.org/casp7/
EVA: Evaluation of Automatic protein structure prediction
[ Burkhard Rost, Andrej Sali, http://maple.bioc.columbia.edu/eva/ ]
[Figure: EVA workflow. New PDB release, target sequence, prediction servers, evaluation of prediction accuracy]
Typical types of errors

Sequence alignment errors.
Loops which cannot be rebuilt.
Inappropriate template selection.
Subunit displacement.
Structural rearrangements.
Structural rearrangements cause problems for template selection and automated evaluation, because they are sequence independent:
e.g. flap region in adenylate kinases (1AKE, 4AKE)
e.g. DNA-binding domains (1AWC, 1ETC)
Protein Structure Evaluation

Problem: How can we identify errors in 3-dimensional protein structures (without knowing the correct answer)?
Bond & Angle Geometry Molecular Interactions
Empirical Force Fields Statistical Methods
Empirical Force Fields

e.g. GROMOS, CHARMM, AMBER, ...
Which types of errors in a protein structure can you identify with an empirical force field? Which types of errors are not recognized?
Statistical Methods
Ramachandran plot of backbone angles (φ, ψ)
favored regions
generously allowed regions
disallowed regions
Amino acids with special properties:
PRO: φ = -60°
GLY
Similar plots for χ-angle distributions
Useful to identify regions with errors in geometry
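Each point in such a plot is a pair of backbone dihedral angles: phi is defined by the atoms C(i-1), N, CA, C and psi by N, CA, C, N(i+1). The sketch below computes a dihedral angle from four points with plain vector arithmetic; the helper names and the test coordinates are illustrative.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180 to 180) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Classifying a residue as favored, allowed or disallowed then requires a reference map of observed (phi, psi) densities, which is what tools such as PROCHECK supply.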
1D - 3D Checks
Probability for a feature to occur in a given environment, e.g.
Solvent exposed / buried
Hydrophobic / polar environment
Electrostatic interactions
Secondary structure
See: R. Lüthy et al. (1992) Assessment of protein models with three-dimensional profiles, Nature, 356(6364):83-5
Statistical Mean Force Potentials

[Figure: atomic non-local interaction energy, with labelled residues Val13 (I), Phe134 (II), Ala182 (III), Met80 (*) and Ile86 (+)]
Atom Type Definitions
Statistical Mean Force Potentials

Use the inverse Boltzmann law to derive an atomic potential of mean force U(i, j, r) from the observed number of atomic pairs (i, j) within a distance shell r ± Δr in the training database of protein structures:
$$
U(i, j, r) = -RT \ln \frac{N_{\mathrm{observed}}(i, j, r)}{N_{\mathrm{expected}}(i, j, r)}
$$
R: gas constant
T: temperature
N_expected is the expected number of atomic pairs (i, j) in the same distance shell if there were no interactions between atoms (reference state).
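The inverse Boltzmann relation can be evaluated directly once observed and expected pair counts are available for a distance shell. The sketch below uses the conventional negative sign, so that pairs seen more often than expected receive a favourable (negative) energy; the counts and the temperature are made-up values, and the function name is illustrative.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal mol^-1 K^-1


def mean_force_potential(n_observed, n_expected, temperature=298.0):
    """U = -RT ln(N_observed / N_expected) for one atom pair type and shell."""
    if n_observed <= 0 or n_expected <= 0:
        raise ValueError("counts must be positive")
    return -GAS_CONSTANT * temperature * math.log(n_observed / n_expected)


# Hypothetical counts for one distance shell.
favourable = mean_force_potential(200, 100)    # seen twice as often as expected
unfavourable = mean_force_potential(50, 100)   # seen half as often as expected
print(f"{favourable:.3f} kcal/mol, {unfavourable:.3f} kcal/mol")
```

Shells with zero observations need special handling in practice, for example pseudocounts or a capped energy, which this sketch leaves out.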
[Figure: mean force potentials (MFP, kcal/mol) versus distance for methyl-methyl pairs and cysteine S-S pairs]
ANOLEA (Atomic Non-Local Environment Assessment)
http://protein.bio.puc.cl/cardex/servers/anolea/
http://swissmodel.expasy.org/anolea/
Detects local packing errors and errors in alignments
[Figure: ANOLEA profiles for the correct structure (PDB: 1GES) and for a model with wrong alignment]
PROCHECK
Checks the stereochemical quality of a protein structure, producing a number of plots analyzing its overall and residue-by-residue geometry.
Covalent geometry
Planarity
Dihedral angles
Chirality
Non-bonded interactions
Main-chain hydrogen bonds
Disulphide bonds
Stereochemical parameters
Residue-by-residue analysis
Laskowski R A, MacArthur M W, Moss D S & Thornton J M (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst., 26, 283-291.
Morris A L, MacArthur M W, Hutchinson E G & Thornton J M (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345-364.
WhatCheck / WhatIf
WHAT IF I check my structure? Imagine ...
An everyday situation in a biocomputing lab: "Should they use the structure?"
An everyday situation in a crystallography lab: "Should they deposit the structure already?"
In a WHAT_CHECK report, each reported fact has an assigned severity:
error: severe errors encountered during the analyses. Items marked as errors are considered severe problems requiring immediate attention.
warning: either less severe problems or uncommon structural features. These still need special attention.
note: statistical values, plots, or other verbose results of tests and analyses that have been performed.
WHAT IF: A molecular modeling and drug design program. G. Vriend, J. Mol. Graph. (1990) 8, 52-56.
Errors in protein structures. R.W.W. Hooft, G. Vriend, C. Sander, E.E. Abola, Nature (1996) 381, 272-272.
WhatCheck / WhatIf report for a bad model ...

# 49 # Note: Summary report for users of a structure
This is an overall summary of the quality of the structure as compared with current reliable structures. This summary is most useful for biologists seeking a good structure to use for modelling calculations.
The second part of the table mostly gives an impression of how well the model conforms to common refinement constraint values. The first part of the table shows a number of constraint-independent quality indicators.

Structure Z-scores, positive is better than average:
  1st generation packing quality : -2.550
  2nd generation packing quality : -5.472 (bad)
  Ramachandran plot appearance   : -1.898
  chi-1/chi-2 rotamer normality  : -1.433
  Backbone conformation          : -2.173

RMS Z-scores, should be close to 1.0:
  Bond lengths                   : 0.905
  Bond angles                    : 1.476
  Omega angle restraints         : 0.921
  Side chain planarity           : 2.681 (loose)
  Improper dihedral distribution : 1.771 (loose)
  Inside/Outside distribution    : 1.333 (unusual)
whatcheck.txt
All checking tools are happy, so can I believe it now?
Models are not experimental facts!
Models can be partially inaccurate or sometimes completely wrong!
A model is a tool that helps to interpret biochemical data.
Some useful Evaluation Tools

ProCheck
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
WhatCheck
http://www.cmbi.kun.nl/gv/whatcheck/
Verify3D
http://www.doe-mbi.ucla.edu/Services/Verify_3D/
Biotech Validation Suite for Protein Structures
http://biotech.ebi.ac.uk:8400/
What can models be used for ?
A Model must be wrong, in some respects, else it would be the thing itself. The trick is to see where it is right.
(Henry A. Bent)
Model quality vs. sequence identity
[Figure regions: Midnight Zone, Twilight Zone, Safe Zone]
What can models be used for ?

Annotation by fold assignment
3D-motif searching, active site recognition
Including NMR restraints
Supporting site directed mutagenesis
X-ray molecular replacement models
Docking of small molecules
Drug development; comparable to medium resolution NMR or low resolution X-ray structures
Application example: Understanding drug interactions
Knowledge of the 3-dimensional structures of target proteins makes it possible to understand the interactions of inhibitors and drugs with their target proteins.
Discovery of CK2α inhibitors by in silico docking
Homology model of the target molecule:
Reference: Discovery of a potent and selective protein kinase CK2 inhibitor by high-throughput docking. Vangrevelinghe E, Zimmermann K, Schoepfer J, Portmann R, Fabbro D, Furet P. Oncology Research, Novartis Pharma, Basle, J Med Chem. 2003 Jun 19;46(13):2656-62.
Medicines are not Effective in all Patients
Inter-individual differences in drug efficacy:
Group            Incomplete/absent efficacy
SSRI             10-25%
ACE-I            10-30%
Beta blockers    15-25%
Statins          30-70%
Beta2 agonists   40-70%
[ Spear BB (2001) Trends Mol Med;7(5):201-204 ]
Structural analysis of human mutations and nsSNPs
[Figure: e.g. changes in the electrostatic properties upon mutation; potential scale from -8 to +8 kT/e]
Public database holdings

[Figure: database holdings of TrEMBL, SwissProt and PDB, 1986 to 2004, logarithmic scale from 100 to 1'000'000 entries]
Structural Genomics
large scale experimental structure solution projects
Goal:
Most of the sequences in a genome database should match at least one structure with a sufficient sequence identity allowing for reliable modeling.
The modeling error determines selection of targets for structural genomics.
Range of sequence space that can be modeled with acceptable accuracy.
Structural Genomics Target Selection
Protein Modeling Resources

SWISS-MODEL    http://swissmodel.expasy.org
Modeller       http://www.salilab.org
WhatIf         http://www.cmbi.kun.nl/whatif/
3D-JIGSAW      http://www.bmm.icnet.uk/people/paulb/3dj/form.html
CPHmodels      http://www.cbs.dtu.dk/services/CPHmodels/
SDSC1          http://cl.sdsc.edu/hm.html