
Biomolecular Databases

Roberto Lins EPFL - summer semester 2005

Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton, Bioinformatics, genes, protein & computers; A.M. Lesk, Introduction to Bioinformatics; A.D. Baxevanis & B.F. Ouellette, Bioinformatics, a practical guide to the analysis of genes and proteins; several online materials (George Washington University, University of Houston, Tel-Aviv University) and resources (RCSB, NCBI, SWISS-PROT) as well as personal research data.

Functional Genomics
[Figure: Genome, Expressome, Proteome (tertiary structure, fold) and Metabolome levels, each labelled with "database" and "algorithm".]

The definition
Main Entry: database
Function: noun
Date: circa 1962
: a usually large collection of data organized especially for rapid search and retrieval (as by a computer)
- Webster dictionary

WHAT is a database?
A collection of data that needs to be:
  Structured
  Searchable
  Updated (periodically)
  Cross referenced
Challenge: To change meaningless data into useful information that can be accessed and analysed the best way possible.
For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
You need an appropriate database management system (DBMS)

DBMS

Internal organization
Controls speed and flexibility
A unity of programs that:
  Store
  Extract
  Modify

[Figure: USER(S) reach the Database through the DBMS operations Store, Extract and Modify.]

DBMS organisation types


Flat file databases (flat DBMS): simple, restrictive, table
Hierarchical databases (hierarchical DBMS): simple, restrictive, tables
Relational databases (RDBMS): complex, versatile, tables
Object-oriented databases (ODBMS): complex, versatile, objects

Relational databases
Data is stored in multiple related tables
Data relationships across tables can be either many-to-one or many-to-many
A few rules allow the database to be viewed in many ways
Let's convert the course details to a relational database

Our flat file database


FLAT DATABASE 2: Course details
Columns: Name, Depart., Course, E1, E2, E3, P1, P2, ..
Rows: one per student and course (Student 1, Chemistry: Biology, Maths, English, ..; Student 2, Ecology: Biology, Maths, ..), each holding the letter grades (A to D) for E1-E3 and P1-P2.

Normalize (1NF)
We remove repeating records (rows)
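Student table (primary key sID, foreign key dID)
sID  Name      dID
1    Student1  1
2    Student2  2

Course table (primary key cID)
cID  Course
1    Biology
2    Maths
3    English

Department table (primary key dID)
dID  Department
1    Chemistry
2    Ecology

Grades table (foreign keys sID and cID)
Columns: sID, cID, E1, E2, E3, P1, P2, ..
Rows: (sID, cID) = (1, 1), (1, 2), (1, 3), .., (2, 1), (2, 2), .., each followed by the letter grades.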

Normalize (2NF)
We remove redundant fields (columns)
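Student table
sID  Name      dID
1    Student1  1
2    Student2  2

Course table
cID  Course
1    Biology
2    Maths
3    English

Project table
wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

Grade table
gID  Grade
1    A
2    B
3    C

Department table
dID  Department
1    Chemistry
2    Ecology

Results table
Columns: sID, cID, wID, gID
Rows: one per student, course and project, e.g. sID 1 with cID 1 and wID 1 to 5, then cID 2 and wID 1 to 5, .., each pointing to a gID.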
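The conversion from the flat file to ID-based tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`get_id`, `normalize`) are hypothetical, and the grades in `flat_rows` are made-up placeholders, not the values from the slides.

```python
# Flat records: (name, department, course, project, grade).
# The grades are illustrative placeholders, not data from the course.
flat_rows = [
    ("Student1", "Chemistry", "Biology", "E1", "A"),
    ("Student1", "Chemistry", "Biology", "E2", "B"),
    ("Student1", "Chemistry", "Maths", "E1", "C"),
    ("Student2", "Ecology", "Biology", "E1", "A"),
]


def get_id(table, value):
    """Return the primary key for value, adding it to the lookup table if new."""
    if value not in table:
        table[value] = len(table) + 1
    return table[value]


def normalize(rows):
    """Split flat rows into lookup tables plus one table holding only IDs."""
    students, departments, courses, projects, grades = {}, {}, {}, {}, {}
    student_dept = {}  # sID -> dID (foreign key kept with the student)
    results = []       # (sID, cID, wID, gID) rows made only of foreign keys
    for name, dept, course, project, grade in rows:
        s_id = get_id(students, name)
        student_dept[s_id] = get_id(departments, dept)
        results.append((s_id, get_id(courses, course),
                        get_id(projects, project), get_id(grades, grade)))
    return students, departments, courses, projects, grades, student_dept, results


(students, departments, courses, projects,
 grades, student_dept, results) = normalize(flat_rows)
print(results)
```

Each name now appears once in its own lookup table, and the results table repeats only small integer keys.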

Relational Databases
What have we achieved?
  No repeating information
  Less storage space
  Better reality representation
  Easy modification/management
  Easy usage of any combination of records
Remember the DBMS has programs to access and edit this information, so ignore the human reading limitation of the primary keys

Accessing database information

A request for data from a database is called a query
Queries can be of three forms:
  Choose from a list of parameters
  Query by example (QBE)
  Query language

Query Languages
The standard: SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language)
Developed by IBM in 1974
Introduced commercially in 1979 by Oracle Corp.
RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
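As a minimal sketch of a query-language request, the course-details tables can be loaded into an in-memory SQLite database through Python's built-in `sqlite3` module and queried with SQL. The table names (`work_item`, `result`) and the three result rows are assumptions made for illustration; the grade assignments are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dID INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE student (sID INTEGER PRIMARY KEY, name TEXT,
                      dID INTEGER REFERENCES department(dID));
CREATE TABLE course (cID INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE work_item (wID INTEGER PRIMARY KEY, project TEXT);
CREATE TABLE grade (gID INTEGER PRIMARY KEY, grade TEXT);
CREATE TABLE result (sID INTEGER REFERENCES student(sID),
                     cID INTEGER REFERENCES course(cID),
                     wID INTEGER REFERENCES work_item(wID),
                     gID INTEGER REFERENCES grade(gID));
INSERT INTO department VALUES (1, 'Chemistry'), (2, 'Ecology');
INSERT INTO student VALUES (1, 'Student1', 1), (2, 'Student2', 2);
INSERT INTO course VALUES (1, 'Biology'), (2, 'Maths'), (3, 'English');
INSERT INTO work_item VALUES (1, 'E1'), (2, 'E2'), (3, 'E3'), (4, 'P1'), (5, 'P2');
INSERT INTO grade VALUES (1, 'A'), (2, 'B'), (3, 'C');
-- Illustrative results only; these grade assignments are made up.
INSERT INTO result VALUES (1, 1, 1, 1), (1, 1, 2, 2), (2, 1, 1, 1);
""")

# Join the ID-only result table back to the human-readable names.
query = """
SELECT s.name, c.course, w.project, g.grade
FROM result AS r
JOIN student AS s ON s.sID = r.sID
JOIN course AS c ON c.cID = r.cID
JOIN work_item AS w ON w.wID = r.wID
JOIN grade AS g ON g.gID = r.gID
WHERE c.course = ? AND w.project = ?
ORDER BY s.name
"""
rows = conn.execute(query, ("Biology", "E1")).fetchall()
print(rows)
```

The joins do the key lookups that a human reader would otherwise have to do by hand.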

Distributed databases
From local to global attitude
Data appears to be in one location but is most definitely not
A definition: Two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent
An intricate network for combining and sharing information
Administrators praise fast network technologies!!!
Users praise the internet!!!

So why do biologists care?

Three main reasons


1. Database proliferation: dozens to hundreds at the moment
2. In the next few years biological data analysis will be trifurcated:
   Bio-webs: remote data analysis and mining
   Bio-grids: transparent high-end computing
   Bio-semantic webs: biological knowledge
3. More and more scientific discoveries result from inter-database analysis and mining

Biological databases
Like any other database: data organization for optimal analysis
Data is of different types:
  Raw data (DNA, RNA, protein sequences)
  Curated data (DNA, RNA and protein annotated sequences and structures, expression data)

Raw Biological data Nucleic Acids (DNA)

Raw Biological data Amino acid residues (proteins)

Curated Biological Data
DNA, nucleotide sequences
  Gene boundaries, topology
  Gene structure: introns, exons, ORFs, splicing
  Expression data
Proteins, residue sequences (e.g. MCTUYTCUYFSTYRCCTYFSCD)
  Extended sequence information
  Secondary structure
  Hydrophobicity, motif data

Curated Biological data 3D Structures, folds

A few biological databases


Nucleotide Databases: Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT
Genome Databases: Human, Mouse, Yeast, C. elegans, FLYBASE, Parasites
Protein Databases: Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT
Structure Databases: PDB, MSD, FSSP, DALI
Microarray Database: ArrayExpress
Literature Databases: Web of Science, MEDLINE, Software Biocatalog, Flybase Archives
Alignment Databases: BAliBASE, Homstrad, FSSP

PDB - The most important structural biomolecular database


3D macromolecular structural data
Data originates from either NMR or X-ray crystallography techniques
Total number of structures: 29,326 (Jan 25th, 2005)
If the 3D structure of a protein is solved ... they have it

PDB content

PDB information

http://www.rcsb.org
The PDB files have a standard format
Key features
Informative descriptors

PDB - header
HEADER    ISOMERASE/DNA                           01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DNA TOPOISOMERASE I;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND   5 EC: 5.99.1.2;
COMPND   6 ENGINEERED: YES;
COMPND   7 MUTATION: YES;
COMPND   8 MOL_ID: 2;
COMPND   9 MOLECULE: DNA (5'-
COMPND  10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND  11 TP*TP*TP*T)-3');
COMPND  12 CHAIN: C;
COMPND  13 ENGINEERED: YES;
COMPND  14 MOL_ID: 3;
COMPND  15 MOLECULE: DNA (5'-
COMPND  16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND  17 TP*TP*TP*T)-3');
COMPND  18 CHAIN: D;
COMPND  19 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE   4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE   5 MOL_ID: 2;
SOURCE   6 SYNTHETIC: YES;
SOURCE   7 MOL_ID: 3;
SOURCE   8 SYNTHETIC: YES
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : X-PLOR 3.1
REMARK   3   AUTHORS     : BRUNGER
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290
...

PDB - data
ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N
ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C
ATOM      3  C   TRP A 203      30.369  -3.337  35.766  1.00 49.18           C
ATOM      4  O   TRP A 203      29.315  -3.238  35.147  1.00 49.27           O
ATOM      5  CB  TRP A 203      30.518  -5.863  35.513  1.00 46.77           C
ATOM      6  CG  TRP A 203      30.847  -5.651  34.081  1.00 44.60           C
ATOM      7  CD1 TRP A 203      32.028  -5.234  33.553  1.00 49.72           C
ATOM      8  CD2 TRP A 203      29.980  -5.876  32.984  1.00 43.73           C
ATOM      9  NE1 TRP A 203      31.956  -5.191  32.177  1.00 45.45           N
ATOM     10  CE2 TRP A 203      30.704  -5.582  31.805  1.00 45.23           C
ATOM     11  CE3 TRP A 203      28.657  -6.305  32.877  1.00 46.48           C
ATOM     12  CZ2 TRP A 203      30.149  -5.705  30.539  1.00 46.06           C
ATOM     13  CZ3 TRP A 203      28.101  -6.431  31.622  1.00 43.08           C
ATOM     14  CH2 TRP A 203      28.849  -6.131  30.463  1.00 45.77           C

Columns, left to right: record name, atom number, atom name, residue name, chain ID, residue number, x, y, z coordinates, occupancy, temperature factor, element.
Example (atom 1): number 1, atom name N, residue name TRP, chain ID A, residue number 203, coordinates 30.156 -4.908 37.767, occupancy 1.00, temperature factor 50.81.
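Because the ATOM record is fixed-width, each field can be read by slicing the line at known column positions. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, coordinates in 31-54, occupancy in 55-60, temperature factor in 61-66); `parse_atom_line` is a hypothetical helper name, not part of any library.

```python
def parse_atom_line(line):
    """Slice one fixed-width ATOM record into named fields (0-based slices)."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "alt_loc": line[16].strip(),       # empty when there is no alternate location
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# First atom of TRP A 203 from entry 1EJ9, spaced to the fixed-column layout.
line = "ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N"
atom = parse_atom_line(line)
print(atom["atom_name"], atom["res_name"], atom["res_seq"], atom["temp_factor"])
```

Splitting on whitespace would look simpler, but it breaks on records where neighbouring fields touch (for example an alternate-location letter fused to the residue name), which is why column slicing is used here.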

PDB - B-factors
ATOM    662  CZ  PHE A 139      37.757  15.302  -6.706  1.00 62.54           C
ATOM    663  N   GLY A 140      32.390  13.648 -11.340  1.00 61.26           N
ATOM    664  CA  GLY A 140      31.949  13.328 -12.692  1.00 62.15           C
ATOM    665  C   GLY A 140      30.460  13.047 -12.804  1.00 62.14           C
ATOM    666  O   GLY A 140      30.045  12.027 -12.205  1.00 62.35           O
ATOM    667  N   GLY A 149      22.872  16.582 -19.717  1.00 84.56           N
ATOM    668  CA  GLY A 149      22.264  17.670 -20.537  1.00 84.50           C
ATOM    669  C   GLY A 149      21.290  18.517 -19.728  1.00 84.22           C
ATOM    670  O   GLY A 149      20.821  19.560 -20.180  1.00 83.69           O
ATOM    671  N   VAL A 150      20.985  18.087 -18.512  1.00 84.09           N
ATOM    672  CA  VAL A 150      20.062  18.782 -17.627  1.00 84.11           C
ATOM    673  C   VAL A 150      20.504  20.191 -17.258  1.00 83.82           C

[Figure: B-factors (Å²) and calculated B-factors (Å²), each plotted against residue number.]

$$B_i = \frac{8\pi^2}{3} \langle \Delta r_i^2 \rangle$$

where $\Delta r_i$ = deviation from the average position
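To get a feel for the numbers, the relation can be inverted to turn a B-factor into a mean-square displacement. This is a small sketch that assumes the usual isotropic form B = (8π²/3)⟨Δr²⟩; the function names are hypothetical, and the two B-factors are taken from the ATOM records shown in these slides.

```python
import math


def mean_square_displacement(b_factor):
    """Invert B = (8*pi^2 / 3) * <dr^2>; input and output in square angstroms."""
    return 3.0 * b_factor / (8.0 * math.pi ** 2)


def b_factor_from_msd(msd):
    """Forward relation: B-factor from a mean-square displacement."""
    return 8.0 * math.pi ** 2 / 3.0 * msd


# 50.81 is the B-factor of atom N of TRP A 203; 84.56 is atom N of GLY A 149.
for b in (50.81, 84.56):
    msd = mean_square_displacement(b)
    print(f"B = {b:6.2f} A^2 -> <dr^2> = {msd:.2f} A^2, rms = {math.sqrt(msd):.2f} A")
```

The same forward function could be applied to positional fluctuations from a simulation to obtain calculated B-factors for comparison with the deposited ones.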

PDB - Occupancy
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C
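In these records the side chain of VAL A 25 is modelled in two alternate conformations, A and B, with occupancies 0.28 and 0.72. A quick consistency check is that the occupancies of each atom sum to 1 over its alternate locations. The sketch below hard-codes the (atom name, alternate location, occupancy) triples from the records; `occupancy_totals` is a hypothetical helper name.

```python
from collections import defaultdict

# (atom name, alternate location, occupancy) for VAL A 25
records = [
    ("N", "", 1.00), ("CA", "", 1.00), ("C", "", 1.00), ("O", "", 1.00),
    ("CB", "A", 0.28), ("CB", "B", 0.72),
    ("CG1", "A", 0.28), ("CG1", "B", 0.72),
    ("CG2", "A", 0.28), ("CG2", "B", 0.72),
]


def occupancy_totals(atoms):
    """Sum the occupancy of each atom over all of its alternate locations."""
    totals = defaultdict(float)
    for name, _alt_loc, occupancy in atoms:
        totals[name] += occupancy
    return dict(totals)


totals = occupancy_totals(records)
# Atoms whose summed occupancy differs from 1 (within rounding) are flagged.
incomplete = [name for name, total in totals.items() if abs(total - 1.0) > 1e-6]
print(totals, incomplete)
```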

PDB - From coordinates to molecular models to annotation

α-Helices, β-strands, coils/loops

Creating 3D Domains

[Figure: 3D domains of 1EJ9 chain A: 1EJ9A1, 1EJ9A2, 1EJ9A3, 1EJ9A4, 1EJ9A5.]

PDB nomenclature - more details

http://rcsb-deposit.rutgers.edu/adit/docs/pdb_atom_format.html

Some databases are PDB content based - Why?


Understanding protein structure, function and dynamics ranks among the most challenging and fascinating problems faced by science today. Since the function of a protein is related to its three dimensional structure, manipulation of the latter by means of mutation in the protein sequence generates functional diversity. The keys that will help us understand this mechanism and consequently protein sequence evolution lie in the yet unknown laws that govern protein folding. The knowledge of these laws would also prove useful for engineering protein molecules to optimize their activities as well as to alter their pharmacokinetic properties in the case of therapeutically important molecules.

Patrice Koehl, Stanford University

Fold, domain/motif and topology

The first protein structure in 1960: myoglobin - fold

Protein folds
There is a continuum of similarity!
Fold definition: two folds are similar if they have a similar arrangement of domains (architecture) and connectivity (topology). Sometimes a few domains or supersecondary structures may be missing.
Fold classification: To get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).

Superfolds (Orengo, Jones, Thornton)


Distribution of fold types is highly non-uniform.
There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar.
There are many examples of isolated fold types.
Superfolds are characterized by a wide range of sequence diversity and span a range of non-similar functions.
The evolutionary relationships of the superfolds remain a research question, i.e. do they arise by divergent or convergent evolution?

Superfolds and examples

Globin: 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin
α/β doubly-wound: 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY
α-up-down: 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3
Immunoglobulin: 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin
UB roll: 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G
Jelly roll: 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin
Plaitfold (Split sandwich): 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier
Trefoil: 1i1b interleukin-1; 1aaiB ricin; 1tie erythrina trypsin inhibitor
TIM barrel: 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco
OB fold: 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit

What's a domain?
Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that Nature is a tinkerer and not an inventor (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).

Domains and their importance

Compact, semi-independent unit (Richardson, 1981).
Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
Recurring functional and evolutionary module (Bork, 1992).
"Nature is a tinkerer and not an inventor" (Jacob, 1977)

High resolution structures (e.g. Pfuhl & Pastore, 1995)
Sequence analysis (Russell & Ponting, 1998)
  Multiple alignment methods
  Sequence database searches
Prediction algorithms
Fold recognition

Structural/functional genomics

How big is a Domain?


The size of individual structural domains varies widely from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998), the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995) with an average of about 100 residues (Islam et al., 1995). Small domains (less than 40 residues) are often stabilized by metal ions or disulphide bonds. Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).

Domain/motif examples

G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981

From domain to Topology


α/β fold
Flavodoxin family - TOPS diagrams (Flores et al., 1994)
[Figure: TOPS diagrams of the flavodoxin fold, with secondary structure elements numbered 1 to 5.]

Fold, domain/motif topology


α/β fold

Protein structure comparison


How to compare 3D protein structures?
Analogous computational considerations to sequence comparison, e.g. accuracy, efficiency for database searches, statistical significance of results, etc.
Additional complication: working with atomic coordinates in 3D space!

Some protein structure comparison methods

VAST (Vector Alignment Search Tool, NCBI)
CE (Combinatorial Extension, RCSB/PDB)
DALI (EBI)

VAST outline
1. Parse protein structures into domains/supersecondary structures (helices and strands).
2. Fit vectors to domains.
3. To compare a pair of proteins, attempt to superpose as many vectors as possible, subject to constraints.
4. Evaluate the vector alignment for statistical significance (computation of the E-value).
5. If the vector alignment is significant, then proceed to a more detailed residue-to-residue alignment (refined alignment).

Two proteins with vectors assigned to domains/SSEs

3chy

1ipf A

VAST comparison of 3chy and 1ipfA

Vector superposition

Refined alignment

How many folds ?

A metric matrix distance geometry method was applied to all pair-wise distances (structural dissimilarities) to assign three-dimensional coordinates to a set of 498 SCOP folds, such that the relative distance between two folds is inversely correlated with the DALI alignment score. The results of the mapping are shown in the figure on the left.

How many folds ?


Chothia (1992): approx. 1,000 folds
Estimates vary from 1,000 to 15,000
With 30,000 human genes, ~3 genes per fold (?) on average (but think about alternative splicing)

Chothia, C. (1992). Proteins. One thousand families for the molecular biologist. Nature 357(6379): 543-544.
Zhang, C. and DeLisi, C. (1998). Estimating the number of protein folds. J. Mol. Biol. 284(5): 1301-1305.

SCOP
http://scop.berkeley.edu/
Structural Classification Of Proteins
3D macromolecular structural data grouped based on structural classification
Data originates from the PDB
Current version (v1.67): 20619 PDB entries (March 20th, 2005) and 65122 domains (it does not include nucleic acids and theoretical models)
Levels of the SCOP hierarchy:
  Family: clear evolutionary relationship
  Superfamily: probable common evolutionary origin
  Fold: major structural similarity

SCOP levels bottom-up


1. Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% and greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.

2. Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.

3. Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.

CATH
http://www.biochem.ucl.ac.uk/bsm/cath/

Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. Topology level clusters structures according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.

DSSP
http://swift.cmbi.kun.nl/swift/servers/moddssp-submit.html
Dictionary of Secondary Structure of Proteins
The DSSP database comprises the secondary structures of all PDB entries
DSSP is actually software that translates the PDB structural coordinates into secondary structure elements
A similar example is STRIDE

Some other important biological databases in structural bioinformatics


NCBI - http://www.ncbi.nlm.nih.gov/ , creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.

Dali - http://www.ebi.ac.uk/dali/ , the Dali server is a network service to compare three-dimensional protein structures. Comparison of structures may reveal biologically interesting similarities that are not detectable when comparing sequences.

ExPASy - http://www.expasy.org/ , ExPASy (Expert Protein Analysis System) is a proteomics server of the Swiss Institute of Bioinformatics (SIB). It is mainly dedicated to the analysis of protein sequences and structures.

Why bother?!
Researchers create and use the data
Use of known information for analyzing new data
New data needs to be screened
Structural/functional information extends the knowledge and information to a higher level than DNA or protein sequences

In the end ...

"Computers can figure out all kinds of problems, except the things in the world that just don't add up." - James Magary

We should add: For that we employ the human brain, experts and experience.

Biological databases: a short word on problems


Even today we face some key limitations
There is no standard format
  Every database or program has its own format
There is no standard nomenclature
  Every database has its own names
Data is not fully optimized
  Some datasets have missing information without indications of it
Data errors
  Data is sometimes of poor quality, erroneous, misspelled

What to take home


Databases are a collection of data
Need to access and maintain easily and flexibly
Biological information is vast and sometimes very redundant
Distributed databases bring it all together with quality controls, cross-referencing and standardization
Computers can only create data, they do not give answers
