Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton, Bioinformatics, genes, protein & computers; A.M. Lesk, Introduction to Bioinformatics; A.D. Baxevanis & B.F. Ouellette, Bioinformatics, a practical guide to the analysis of genes and proteins; several online materials (George Washington University, University of Houston, Tel-Aviv University) and resources (RCSB, NCBI, SWISS-PROT) as well as personal research data.
Functional Genomics
[Figure: Genome, Expressome, Proteome (tertiary structure, fold) and Metabolome, each level labeled with "database" and "algorithm".]
The definition
Main Entry: database. Function: noun. Date: circa 1962. A usually large collection of data organized especially for rapid search and retrieval (as by a computer). (Webster dictionary)
WHAT is a database?
A collection of data that needs to be:
Structured
Searchable
Updated (periodically)
Cross-referenced
Challenge: to change meaningless data into useful information that can be accessed and analysed in the best way possible.
For example: HOW would YOU organise all biological sequences so that the biological information is optimally accessible?
You need an appropriate database management system (DBMS).
DBMS
Internal organization: controls speed and flexibility.
A unity of programs that store, extract and modify data.
[Figure: USER(S) store, extract and modify data in the Database through the DBMS.]
Relational databases
Data is stored in multiple related tables.
Data relationships across tables can be either many-to-one or many-to-many.
A few rules allow the database to be viewed in many ways.
Let's convert the course details to a relational database.
[Table: flat course records, one row per student and course (for example: Student 1, Chemistry, Biology), followed by grade columns. The full cell layout is not recoverable.]
Normalize (1NF)
We remove repeating records (rows)
[Tables after 1NF: a student table (sID, Name: 1 Student1, 2 Student2), a department key (dID: 1, 2), and a results table keyed by sID and cID with grade columns E1, E2, E3, P1 and P2. The key columns act as foreign keys. The individual grade cells are not recoverable.]
Normalize (2NF)
We remove redundant fields (columns)
sID  Name
1    Student1
2    Student2

cID  Course
1    Biology
2    Maths
3    English

wID  Project
1    E1
2    E2
3    E3
4    P1
5    P2

gID  Grade
1    A
2    B
3    C

[The remaining tables (dID with keys 1 and 2, and the linking tables that combine sID, dID, cID, wID and gID) hold only keys. Their individual rows are not recoverable.]
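The normalization steps above can be sketched in a few lines of Python. This is only an illustration: the flat records, the grades in them and the helper name `build_lookup` are made up for the example and are not taken from the course tables.

```python
# Hypothetical flat records: (student, department, course, work item, grade).
flat_rows = [
    ("Student1", "Chemistry", "Biology", "E1", "A"),
    ("Student1", "Chemistry", "Maths", "E1", "B"),
    ("Student2", "Ecology", "Biology", "E1", "A"),
    ("Student2", "Ecology", "Maths", "P1", "A"),
]


def build_lookup(values):
    """Give each distinct value an integer key (1, 2, 3, ...) in first-seen order."""
    lookup = {}
    for value in values:
        if value not in lookup:
            lookup[value] = len(lookup) + 1
    return lookup


# One lookup table per repeated field; the keys play the role of sID, dID, cID, wID, gID.
students = build_lookup(row[0] for row in flat_rows)
departments = build_lookup(row[1] for row in flat_rows)
courses = build_lookup(row[2] for row in flat_rows)
work_items = build_lookup(row[3] for row in flat_rows)
grades = build_lookup(row[4] for row in flat_rows)

# Linking tables store keys only, so every name is written down exactly once.
student_department = sorted({(students[s], departments[d]) for s, d, _c, _w, _g in flat_rows})
results = [(students[s], courses[c], work_items[w], grades[g]) for s, _d, c, w, g in flat_rows]
```

Renaming a course now means changing one row in one lookup table instead of every record that mentions it.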
Relational Databases
What have we achieved?
No repeating information
Less storage space
Better reality representation
Easy modification/management
Easy usage of any combination of records
Remember: the DBMS has programs to access and edit this information, so ignore the limited human readability of the primary keys.
A request for data from a database is called a query.
Queries can take three forms:
Choose from a list of parameters
Query by example (QBE)
Query language
Query Languages
The standard is SQL (Structured Query Language), originally called SEQUEL (Structured English QUEry Language).
Developed by IBM in 1974.
Introduced commercially in 1979 by Oracle Corp.
RDBMS (SQL), ODBMS (Java, C++, OQL etc.)
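As a rough illustration of a query language at work, the sketch below runs an SQL query against a small in-memory SQLite database using Python's standard `sqlite3` module. The table names and the rows in the `result` table are assumptions made for the example.

```python
import sqlite3

# In-memory database with hypothetical tables; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (cID INTEGER PRIMARY KEY, course TEXT);
CREATE TABLE grade   (gID INTEGER PRIMARY KEY, grade TEXT);
CREATE TABLE result (
    sID INTEGER REFERENCES student(sID),
    cID INTEGER REFERENCES course(cID),
    gID INTEGER REFERENCES grade(gID)
);
INSERT INTO student VALUES (1, 'Student1'), (2, 'Student2');
INSERT INTO course  VALUES (1, 'Biology'), (2, 'Maths'), (3, 'English');
INSERT INTO grade   VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO result  VALUES (1, 1, 1), (1, 2, 2), (2, 1, 1), (2, 2, 3);
""")

# The query joins the key-only result table back to the readable names.
query = """
SELECT s.name, c.course, g.grade
FROM result AS r
JOIN student AS s ON s.sID = r.sID
JOIN course  AS c ON c.cID = r.cID
JOIN grade   AS g ON g.gID = r.gID
WHERE c.course = ?
ORDER BY s.name
"""
rows = conn.execute(query, ("Biology",)).fetchall()
```

The user never reads the integer keys directly: the joins translate them back into names, which is why their poor human readability does not matter.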
Distributed databases
From local to global attitude.
Data appears to be in one location but is most definitely not.
A definition: two or more data files in different locations, periodically synchronized by the DBMS to keep data in all locations consistent.
An intricate network for combining and sharing information.
Administrators praise fast network technologies!!!
Users praise the internet!!!
Biological databases
Like any other database.
Data organization for optimal analysis.
Data is of different types:
Raw data (DNA, RNA, protein sequences)
Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
[Figure: examples of biological data: a raw sequence (MCTUYTCUYFSTYRCCTYFSCD), gene structure, secondary structure and expression data.]
PDB content
PDB information
http://www.rcsb.org
The PDB files have a standard format.
Key features
Informative descriptors
PDB - header
HEADER    ISOMERASE/DNA                01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DNA TOPOISOMERASE I;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND   5 EC: 5.99.1.2;
COMPND   6 ENGINEERED: YES;
COMPND   7 MUTATION: YES;
COMPND   8 MOL_ID: 2;
COMPND   9 MOLECULE: DNA (5'-
COMPND  10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND  11 TP*TP*TP*T)-3');
COMPND  12 CHAIN: C;
COMPND  13 ENGINEERED: YES;
COMPND  14 MOL_ID: 3;
COMPND  15 MOLECULE: DNA (5'-
COMPND  16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND  17 TP*TP*TP*T)-3');
COMPND  18 CHAIN: D;
COMPND  19 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE   4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE   5 MOL_ID: 2;
SOURCE   6 SYNTHETIC: YES;
SOURCE   7 MOL_ID: 3;
SOURCE   8 SYNTHETIC: YES
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN

REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM : X-PLOR 3.1
REMARK   3   AUTHORS : BRUNGER
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290
...
PDB - data
Columns: record name, atom serial number, atom name, residue name, chain, residue number, x, y, z, occupancy, B-factor, element.

ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N
ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C
ATOM      3  C   TRP A 203      30.369  -3.337  35.766  1.00 49.18           C
ATOM      4  O   TRP A 203      29.315  -3.238  35.147  1.00 49.27           O
ATOM      5  CB  TRP A 203      30.518  -5.863  35.513  1.00 46.77           C
ATOM      6  CG  TRP A 203      30.847  -5.651  34.081  1.00 44.60           C
ATOM      7  CD1 TRP A 203      32.028  -5.234  33.553  1.00 49.72           C
ATOM      8  CD2 TRP A 203      29.980  -5.876  32.984  1.00 43.73           C
ATOM      9  NE1 TRP A 203      31.956  -5.191  32.177  1.00 45.45           N
ATOM     10  CE2 TRP A 203      30.704  -5.582  31.805  1.00 45.23           C
ATOM     11  CE3 TRP A 203      28.657  -6.305  32.877  1.00 46.48           C
ATOM     12  CZ2 TRP A 203      30.149  -5.705  30.539  1.00 46.06           C
ATOM     13  CZ3 TRP A 203      28.101  -6.431  31.622  1.00 43.08           C
ATOM     14  CH2 TRP A 203      28.849  -6.131  30.463  1.00 45.77           C

Example, first record: atom 1 is the backbone N of TRP 203 in chain A, at x = 30.156, y = -4.908, z = 37.767, with occupancy 1.00 and B-factor 50.81.
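Because ATOM records use fixed column positions, they can be read by slicing each line rather than splitting on whitespace. The sketch below assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, coordinates in 31-54, occupancy in 55-60, B-factor in 61-66, element in 77-78). The function name `parse_atom_line` is made up for the example.

```python
def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM record into a dictionary."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "alt_loc": line[16].strip(),      # empty string when there is no alternate location
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# The first ATOM record of the example, assembled field by field so that
# the fixed column widths are explicit.
SAMPLE = "".join([
    "ATOM  ", "    1", " ", " N  ", " ", "TRP", " ", "A", " 203", " ",
    "   ", "  30.156", "  -4.908", "  37.767", "  1.00", " 50.81",
    " " * 10, " N",
])

atom = parse_atom_line(SAMPLE)
```

Slicing matters because neighbouring fields may touch with no space between them, for example an atom name followed directly by an alternate-location letter.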
PDB - B-factors
ATOM    662  CZ  PHE A 139      37.757  15.302  -6.706  1.00 62.54           C
ATOM    663  N   GLY A 140      32.390  13.648 -11.340  1.00 61.26           N
ATOM    664  CA  GLY A 140      31.949  13.328 -12.692  1.00 62.15           C
ATOM    665  C   GLY A 140      30.460  13.047 -12.804  1.00 62.14           C
ATOM    666  O   GLY A 140      30.045  12.027 -12.205  1.00 62.35           O
ATOM    667  N   GLY A 149      22.872  16.582 -19.717  1.00 84.56           N
ATOM    668  CA  GLY A 149      22.264  17.670 -20.537  1.00 84.50           C
ATOM    669  C   GLY A 149      21.290  18.517 -19.728  1.00 84.22           C
ATOM    670  O   GLY A 149      20.821  19.560 -20.180  1.00 83.69           O
ATOM    671  N   VAL A 150      20.985  18.087 -18.512  1.00 84.09           N
ATOM    672  CA  VAL A 150      20.062  18.782 -17.627  1.00 84.11           C
ATOM    673  C   VAL A 150      20.504  20.191 -17.258  1.00 83.82           C
b-factors (2)
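$$B_i = \frac{8\pi^2}{3} \langle r_i^2 \rangle$$
where $\langle r_i^2 \rangle$ is the mean square displacement of atom $i$ from its average position.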
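To get a feel for the numbers, a B-factor can be converted back into a root-mean-square displacement. The sketch below assumes the isotropic relation B = (8π²/3)⟨r²⟩, with B in square angstroms. The function names are made up for the example, and the two B-factors are taken from the ATOM records of the B-factor slide.

```python
import math


def mean_square_displacement(b_factor):
    """Invert B = (8*pi^2/3) * <r^2>; returns <r^2> in square angstroms."""
    return 3.0 * b_factor / (8.0 * math.pi ** 2)


def rms_displacement(b_factor):
    """Root-mean-square displacement in angstroms."""
    return math.sqrt(mean_square_displacement(b_factor))


# B-factors of atom 662 (PHE 139) and atom 667 (GLY 149).
rms_phe139 = rms_displacement(62.54)
rms_gly149 = rms_displacement(84.56)
```

Under this assumption, a higher B-factor simply means that the atom is smeared over a larger volume around its reported position.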
PDB - Occupancy
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C
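In the records above, the side chain of VAL 25 is modelled in two alternate conformations, A and B, and the occupancies of the two copies of each atom add up to one. A small sketch of that consistency check follows. The tuples are copied from the records, while the variable names are made up for the example.

```python
from collections import defaultdict

# (atom name, alternate location, occupancy) for the VAL 25 side-chain atoms.
atoms = [
    ("CB", "A", 0.28), ("CB", "B", 0.72),
    ("CG1", "A", 0.28), ("CG1", "B", 0.72),
    ("CG2", "A", 0.28), ("CG2", "B", 0.72),
]

# Sum the occupancy over all alternate locations of each atom.
totals = defaultdict(float)
for name, alt_loc, occupancy in atoms:
    totals[name] += occupancy

# An atom is fully accounted for when its conformations sum to 1.0.
complete = {name: abs(total - 1.0) < 1e-6 for name, total in totals.items()}
```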
Creating 3D Domains
1EJ9A2
http://rcsb-deposit.rutgers.edu/adit/docs/pdb_atom_format.html
Protein folds
There is a continuum of similarity!
Fold definition: two folds are similar if they have a similar arrangement of domains (architecture) and connectivity (topology). Sometimes a few domains or supersecondary structures may be missing.
Fold classification: to get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).
Immunoglobulin: 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin
UB roll: 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G
Jelly roll: 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin
Plaitfold (Split sandwich): 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier
Trefoil: 1i1b interleukin-1; 1aaiB ricin; 1tie erythrina trypsin inhibitor
TIM barrel: 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco
OB fold: 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit
What's a domain?
Domains are genetically mobile units, and multidomain families are found in all three kingdoms (Archaea, Bacteria and Eukarya) underlining the finding that Nature is a tinkerer and not an inventor (Jacob, 1977). The majority of genomic proteins, 75% in unicellular organisms and more than 80% in metazoa, are multidomain proteins created as a result of gene duplication events (Apic et al., 2001). Domains in multidomain structures are likely to have once existed as independent proteins, and many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes (Davidson et al., 1993).
Compact, semi-independent unit (Richardson, 1981). Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). Recurring functional and evolutionary module (Bork, 1992).
Structural/functional genomics
Domain/motif examples
VAST (Vector Alignment Search Tool, NCBI)
CE (Combinatorial Extension, RCSB/PDB)
DALI (EBI)
VAST outline
1. Parse protein structures into domains/supersecondary structures (helices and strands).
2. Fit vectors to domains.
3. To compare a pair of proteins, attempt to superpose as many vectors as possible, subject to constraints.
4. Evaluate the vector alignment for statistical significance (computation of the E-value).
5. If the vector alignment is significant, proceed to a more detailed residue-to-residue alignment (refined alignment).
[Figure: VAST comparison of 3chy and 1ipf chain A, showing the vector superposition and the refined alignment.]
A metric matrix distance geometry method was applied to all pair-wise distances (structural dissimilarities) to assign three-dimensional coordinates to a set of 498 SCOP folds, such that the relative distance between two folds is inversely correlated with their DALI alignment score. The results of the mapping are shown in the figure on the left.
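The core of a metric matrix method is the conversion of pair-wise distances into a metric (Gram) matrix of dot products. Coordinates then follow from its leading eigenvectors, each scaled by the square root of its eigenvalue. The sketch below shows only that first step, on three made-up points. It is a generic illustration of the technique, not the procedure or data used for the fold map, and the function name is hypothetical.

```python
import math


def metric_matrix(dist):
    """Gram matrix G = -1/2 * J * D^2 * J from a symmetric distance matrix.

    J is the centering matrix, so G[i][j] is the dot product of points i and j
    measured from the centroid of all points.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]   # equals the column mean, since D is symmetric
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]


# Three made-up points in the plane, used only to build a distance matrix.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
distances = [[math.dist(p, q) for q in points] for p in points]
gram = metric_matrix(distances)
```

Only the distances enter `metric_matrix`. The original coordinates are kept here just so the result can be checked against the dot products of the centered points.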
Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4.
Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, 1998. 284(5): p. 1301-5.
SCOP
http://scop.berkeley.edu/
Structural Classification Of Proteins
3D macromolecular structural data grouped based on structural classification
Data originates from the PDB
Current version (v1.67): 20619 PDB entries (March 20th, 2005), 65122 domains (it does not include nucleic acids and theoretical models)
Levels of the SCOP hierarchy:
Family: clear evolutionary relationship
Superfamily: probable common evolutionary origin
Fold: major structural similarity
CATH
http://www.biochem.ucl.ac.uk/bsm/cath/
Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically. Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually. Topology level clusters structures according to their topological connections and numbers of secondary structures. The Homologous superfamilies cluster proteins with highly similar structures and functions. The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
DSSP
http://swift.cmbi.kun.nl/swift/servers/moddssp-submit.html
Dictionary of Secondary Structure of Proteins
The DSSP database comprises the secondary structures of all PDB entries.
DSSP is actually software that translates the PDB structural co-ordinates into secondary structure elements.
A similar example is STRIDE.
Comparison of three-dimensional protein structures may reveal biologically interesting similarities that are not detectable when comparing sequences.
Why bother?!
Researchers create and use the data.
Use of known information for analyzing new data.
New data needs to be screened.
Structural/functional information extends the knowledge and information to a higher level than DNA or protein sequences.
In the end...
"Computers can figure out all kinds of problems, except the things in the world that just don't add up." (James Magary)
We should add: For that we employ the human brain, experts and experience.
Data errors
Data is sometimes of poor quality, erroneous or misspelled.