
PROTEIN STRUCTURE PREDICTION

BY KUMAR PARIJAT TRIPATHI

WHAT ARE PROTEINS




A protein (from the Greek proteios, meaning "of primary importance") is a complex, high-molecular-mass organic compound that consists of amino acids joined by peptide bonds. Proteins are essential to the structure and function of all living cells and viruses. Different proteins perform a wide variety of biological functions. Some proteins are enzymes, which catalyze chemical reactions. Other proteins play structural or mechanical roles, or take part in the immune response and in the storage and transport of various ligands. Proteins are a class of bio-macromolecules, alongside polysaccharides, lipids, and nucleic acids, that make up the primary constituents of biological organisms. Proteins are essentially polymers made up of a specific sequence of amino acids. The details of this sequence are stored in the code of a gene. Through the processes of transcription and translation, a cell reads the genetic information and uses it to construct the protein. In many cases, the resulting protein is then chemically altered (post-translational modification) before becoming functional. It is very common for proteins to work together to achieve a particular function, and they often physically associate with one another to form a complex.

COMPONENT AND SYNTHESIS: Proteins are polymers built from 20 different L-alpha-amino acids. Proteins are assembled from amino acids using information present in genes. Genes are transcribed into RNA; the RNA is then subject to post-transcriptional modification and control, resulting in a mature mRNA that undergoes translation into a protein. The mRNA is translated by ribosomes, which match the three-base codons of the mRNA to the three-base anticodons of the appropriate tRNAs. Aminoacyl-tRNA synthetase enzymes catalyze the attachment of the correct amino acid to each tRNA.
The two ends of the amino acid chain are referred to as the carboxy terminus (C-terminus) and the amino terminus (N-terminus) based on the nature of the functional group on each extremity.
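The codon-by-codon reading described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, starts at the first AUG it finds, and stops at the first stop codon; real translation initiation is more involved, and the function name translate is only illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional U, C, A, G ordering
# (first base varies slowest, third base fastest); '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string into a one-letter protein sequence.

    Simplifying assumption: translation starts at the first AUG and
    ends at the first in-frame stop codon (or at the end of the mRNA).
    The result is written from the N-terminus to the C-terminus.
    """
    mrna = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("GGAUGUUUUCUGAUGGUUAAGG"))
```

The first residue produced corresponds to the amino terminus and the last to the carboxy terminus.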

PROTEIN STRUCTURE:
Proteins are amino acid chains, made up from 20 different L-α-amino acids, also referred to as residues, that fold into unique three-dimensional protein structures. The shape into which a protein naturally folds is known as its native state, which is determined by its sequence of amino acids. Below about 40 residues the term peptide is frequently used. A certain number of residues is necessary to perform a particular biochemical function, and around 40-50 residues appears to be the lower limit for a functional domain size. Protein sizes range from this lower limit to several thousand residues in multi-functional or structural proteins. However, the current estimate for the average protein length is around 300 residues. Very large aggregates can be formed from protein subunits; for example, many thousands of actin molecules assemble into an actin filament. Large protein complexes with RNA are found in the ribosome particles, which are in fact 'ribozymes'.

Protein: Sequence and 3-D Structure

Sequence:


20 amino acids

FSDGLAHLDNLKGTFAT

Primary structure: the amino acid sequence.
Secondary structure: highly patterned sub-structures (alpha helix and beta sheet) or segments of chain that assume no stable shape. Secondary structures are locally defined, meaning that there can be many different secondary motifs present in one single protein molecule.
Tertiary structure: the overall shape of a single protein molecule; the spatial relationship of the secondary structural motifs to one another.
Quaternary structure: the shape or structure that results from the union of more than one protein molecule, usually called protein subunits in this context, which function as part of the larger assembly or protein complex.

The primary structure is held together by covalent peptide bonds, which are made during the process of translation. The secondary structures are held together by hydrogen bonds. The tertiary structure is held together primarily by hydrophobic interactions, but hydrogen bonds, ionic interactions, and disulfide bonds are usually involved too. The two ends of the amino acid chain are referred to as the carboxy terminus (C-terminus) and the amino terminus (N-terminus) based on the nature of the free group on each extremity.

Primary Structure: Sequence




The primary structure of a protein is the amino acid sequence

SECONDARY STRUCTURE: α, β, & LOOPS

α helices and β sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms.

Secondary Structure: α helix

SECONDARY STRUCTURE: β SHEET

β sheet

β bulge

TERTIARY STRUCTURE: A PROTEIN FOLD

FOLDS AND MOTIFS OF PROTEIN STRUCTURE:

Although about 100,000 different proteins are expressed in eukaryotic systems, there are far fewer distinct structural motifs and folds, partly as a consequence of evolved pathways and mechanisms. Motif in this sense refers to a small specific combination of secondary structural elements (such as helix-turn-helix). These elements are often called supersecondary structures. Fold refers to a global type of arrangement, like helix-bundle or beta-barrel. Structural motifs usually consist of just a few elements, e.g. the 'helix-turn-helix' has just three. Protein structural motifs often include loops of variable length and unspecified structure.

STRUCTURAL DOMAIN:

Within a protein, a structural domain ("domain") is an element of overall structure that is self-stabilizing and often folds independently of the rest of the protein chain. Many domains are not unique to the protein products of one gene or one gene family but instead appear in a variety of proteins. Domains are often named and singled out because they figure prominently in the biological function of the protein they belong to; for example, the "calcium-binding domain" of calmodulin. Because they are self-stabilizing, domains can be "swapped" by genetic engineering between one protein and another to make chimeras. A domain may be composed of no structural motifs, one motif, or several.

PROTEIN STRUCTURE PREDICTION:

Protein structure prediction is one of the most significant technologies pursued by computational structural biology and theoretical chemistry. It has the aim of determining the three-dimensional structure of proteins from their amino acid sequences. In more formal terms, this is expressed as the prediction of protein tertiary structure from primary structure. Given the usefulness of known protein structures in such valuable tasks as rational drug design, this is a highly active field of research.

BIOLOGICAL MOTIVATION

PREDICTION OF TRANSMEMBRANE HELIX

Within an integral membrane protein, a transmembrane helix is a segment that is alpha-helical in structure, roughly 20 amino acids in length, and (though it may be presumed to lie within the protein, out of contact with the surrounding lipid bilayer) is said to "span" the membrane.

IDENTIFICATION OF TRANSMEMBRANE HELICES:
Using hydrophobicity analysis to predict transmembrane helices enables a prediction in turn of the "transmembrane topology" of a protein, i.e. prediction of what parts of it protrude into the cell, what parts protrude out, and how many times the protein chain crosses the membrane. This information can be invaluable for developing antibodies, drugs or other reagents that will bind and/or affect the function of the protein.
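As a rough illustration of hydrophobicity analysis, the Python sketch below averages Kyte-Doolittle hydropathy values over a sliding window and reports stretches whose average exceeds a cutoff. The window of 19 residues and the cutoff of 1.6 are commonly quoted heuristics and are assumptions here, as are the function names; this is not the algorithm used by any of the servers listed in this presentation.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge overlapping above-threshold windows into 1-based (start, end) spans.

    Each span is the union of the windows that passed the cutoff, so it
    tends to be somewhat longer than the hydrophobic stretch itself.
    """
    segments = []
    for i, avg in enumerate(window_hydropathy(seq, window)):
        if avg <= threshold:
            continue
        start, end = i + 1, i + window  # 1-based, inclusive
        if segments and start <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], end)  # extend the previous span
        else:
            segments.append((start, end))
    return segments


if __name__ == "__main__":
    toy = "D" * 20 + "L" * 20 + "K" * 20  # artificial test sequence
    print(candidate_tm_segments(toy))
```

A plain hydropathy scan locates candidate helices only; deciding which loops face inside or outside needs extra information, such as the distribution of charged residues.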

TMAP (EMBL)
PredictProtein (EMBL/Columbia)
TMHMM (CBS, Denmark)
TMpred (Baylor College)
DAS (Stockholm)

TMAP: http://www.embl-heidelberg.de/tmap/tmap_sin.html
Accuracy: > 95%. Single sequence-based statistical prediction of the locations of transmembrane helices.

PHDtopology: http://www.embl-heidelberg.de/predictprotein/
Accuracy: > 85% (for more than 85% of all proteins, all helices and the topology are predicted correctly). Refinement of PHDhtm by dynamic programming and prediction of topology (orientation of the N-terminus with respect to the membrane).

TMpred: http://ulrec3.unil.ch/software/TMPRED_form.html
Single sequence-based prediction of location and topology.

TMHMM RESULT

# gi_47825389_ref_NP_001001470.1_ Length: 211
# gi_47825389_ref_NP_001001470.1_ Number of predicted TMHs: 1
# gi_47825389_ref_NP_001001470.1_ Exp number of AAs in TMHs: 20.46703
# gi_47825389_ref_NP_001001470.1_ Exp number, first 60 AAs: 20.18892
# gi_47825389_ref_NP_001001470.1_ Total prob of N-in: 0.45484
# gi_47825389_ref_NP_001001470.1_ POSSIBLE N-term signal sequence
gi_47825389_ref_NP_001001470.1_ TMHMM2.0 outside 1 9
gi_47825389_ref_NP_001001470.1_ TMHMM2.0 TMhelix 10 32
gi_47825389_ref_NP_001001470.1_ TMHMM2.0 inside 33 211

TM PRED RESULT

Inside to outside helices : 1 found
from to score center
35 ( 35) 52 ( 52) 2018 44

Outside to inside helices : 1 found
from to score center
35 ( 35) 53 ( 53) 1693 44

HMMTOP RESULT
Protein: noname
Length: 234
N-terminus: IN
Number of transmembrane helices: 1
Transmembrane helices: 31-47
Total entropy of the model: 17.0147
Entropy of the best path: 17.0155

seq  GIREFNPLYS YMEGALLSGA LLSMLGKNDP MCLVLVLLGL TALLGICQGG 50
pred IIIIIIIIII IIIIIIIIII IIIIIIiiii HHHHHHHHHH HHHHHHHooo
seq  TGCYGSVSRI DTTGASCRTA KPEGLSYCGV RASRTIAERD LGSMNKYKVL 100
pred oooooooooo ooOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
seq  IKRVGEALCI EPAVIAGIIS RESHAGKILK NGWGDRGNGF GLMQVDKRYH 150
pred OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
seq  KIEGTWNGEA HIRQGTRILI DMVKKIQRKF PRWTRDQQLK GGISAYNAGV 200
pred OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
seq  GNVRSYERMD IGTLHDDYSN DVVARAQYFK QHGY 234
pred OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOO

LOCATING DOMAINS

If homology to other sequences occurs only over a portion of the probe sequence and the other sequences are whole (i.e. not partial sequences), then this provides the strongest evidence for domain structure. You can either do database searches yourself or make use of well-curated, pre-defined databases of protein domains. Searches of these databases (listed below) will often assign domains easily.

SMART (Oxford/EMBL)
PFAM (Sanger Centre/Wash-U/Karolinska Institutet)
COGS (NCBI)
PRINTS (UCL/Manchester)
BLOCKS (Fred Hutchinson Cancer Research Centre, Seattle)
SBASE (ICGEB, Trieste)

SECONDARY STRUCTURE PREDICTION METHODS AND LINKS

PSIPRED (PSI-BLAST profiles used for prediction; David Jones, Warwick)
JPRED Consensus prediction (includes many of the methods given below; Cuff & Barton, EBI)
PREDATOR (Frishman & Argos, EMBL)
PHD home page (Rost & Sander, EMBL, Germany)
ZPRED server (Zvelebil et al., Ludwig, U.K.)
nnPredict (Cohen et al., UCSF, USA)
SSP (Nearest-neighbor; Solovyev and Salamov, Baylor College)

The aim of secondary structure prediction is to provide the location of alpha helices and beta strands within a protein or protein family.

Secondary structure prediction has been around for almost a quarter of a century. Probably the most famous early methods are those of Chou & Fasman, Garnier, Osguthorpe & Robson (GOR), and Lim. The availability of large families of homologous sequences revolutionized secondary structure prediction. Traditional methods, when applied to a family of proteins rather than a single sequence, proved much more accurate at identifying core secondary structure elements. The combination of sequence data with sophisticated computing techniques such as neural networks has led to accuracies well in excess of 70%. Though this seems a small percentage increase, these predictions are actually much more useful than those for a single sequence, since they tend to predict the core accurately. Moreover, the limit of 70-80% may be a function of secondary structure variation within homologous proteins. Alpha helices have a periodicity of 3.6 residues per turn, which means that in a helix with one face buried in the protein core and the other exposed to solvent, the residues at positions i, i+3, i+4 and i+7 (where i is a residue in an alpha helix) lie on one face of the helix. Many alpha helices in proteins are amphipathic, meaning that one face points towards the hydrophobic core and the other towards the solvent. Thus patterns of hydrophobic residue conservation showing the i, i+3, i+4, i+7 pattern are highly indicative of an alpha helix.
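The geometry behind the i, i+3, i+4, i+7 pattern can be checked with a short Python sketch. With 3.6 residues per turn, each residue is rotated 100 degrees around the helix axis relative to the previous one, so these four positions fall within a narrow arc. The hydrophobic residue set and the function names below are illustrative assumptions, not part of any published predictor.

```python
# 360 degrees / 3.6 residues per turn = 100 degrees of rotation per residue.
ANGLE_PER_RESIDUE = 100.0

# Assumed (simplified) set of hydrophobic residues, one-letter codes.
HYDROPHOBIC = set("AVILMFWC")


def wheel_angle(offset):
    """Helical-wheel angle of residue i+offset relative to residue i.

    Returned in degrees within (-180, 180], so small absolute values
    mean "same face of the helix as residue i".
    """
    angle = (offset * ANGLE_PER_RESIDUE) % 360.0
    return angle - 360.0 if angle > 180.0 else angle


def amphipathic_positions(seq):
    """0-based positions i where residues i, i+3, i+4 and i+7 are all hydrophobic."""
    seq = seq.upper()
    return [
        i
        for i in range(len(seq) - 7)
        if all(seq[i + k] in HYDROPHOBIC for k in (0, 3, 4, 7))
    ]


if __name__ == "__main__":
    for k in (0, 3, 4, 7):
        print(k, wheel_angle(k))
    print(amphipathic_positions("LKKLLKKL"))
```

All four offsets land within 60 degrees of residue i, whereas offsets such as i+2 or i+5 fall on the opposite side of the wheel.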

Results of nnpredict query


Tertiary structure class: none
Sequence gi|47825389|ref|NP_001001470.1| lysozyme [Gallus gallus]:
MLGKNDPMCLVLVLLGLTALLGICQGGTGCYGSVSRIDTTGASCRTAKPEGLSYCGVRAS
RTIAERDLGSMNKYKVLIKRVGEALCIEPAVIAGIISRESHAGKILKNGWGDRGNGFGLM
QVDKRYHKIEGTWNGEAHIRQGTRILIDMVKKIQRKFPRWTRDQQLKGGISAYNAGVGNV
RSYERMDIGTLHDDYSNDVVARAQYFKQHGY
Secondary structure prediction (H = helix, E = strand, - = no prediction):
--------HHEEEHHHHHHHHEE-------E--E-EEE---------------EE------HHH--HH----HHEHHHHH----------EEEEEEH-----HHE-------------EEHH-----E-------EHHHH-HHHHHHHHHHH-H--------------EEE-------------------------HHHHHHHH-----

strategy for secondary structure prediction

FOLD RECOGNITION METHODS AND LINKS


Methods of protein fold recognition attempt to detect similarities between protein 3D structures that are not accompanied by any significant sequence similarity. There are many approaches, but the unifying theme is to try to find folds that are compatible with a particular sequence. Unlike sequence-only comparison, these methods take advantage of the extra information made available by 3D structure.

3D-PSSM
TOPITS (EMBL)
UCLA-DOE Structure Prediction Server (UCLA)
123D
UCSC HMM (UCSC)
FAS (Burnham Institute)

Methods where an executable or code is available:
THREADER (Warwick)
ProFIT
CAME (Salzburg)

ANALYSIS OF PROTEIN FOLDS AND ALIGNMENT OF SECONDARY STRUCTURE ELEMENTS


If you have predicted that your protein will adopt a particular fold within the database, then it is important to consider which fold class your protein belongs to and which other proteins adopt a similar fold. To find out, look at one of the following databases:

SCOP (MRC Cambridge)
CATH (University College, London)
FSSP (EBI, Cambridge)
3Dee (EBI, Cambridge)
HOMSTRAD (Biochemistry, Cambridge)
VAST (NCBI, USA)

If your predicted fold has many "relatives", then have a look at what they are. Do any of the members show functional similarity to your protein? If there is any functional similarity between your protein and any members of the fold, then you may be able to back up your prediction of fold (possibly by the conservation of active site residues, or the approximate location of active site residues, etc.).


Is this fold a superfold? If so, does this superfold contain a supersite? Certain folds show a tendency to bind ligands in a common location, even in the absence of any functional or clear evolutionary relationships.

Are there core secondary structure elements that should really be present in any member of the fold? Are there non-core secondary structure elements that might not be present in all members of the fold?

Examples of Common Classes

COMPARATIVE PROTEIN MODELLING




Comparative protein modelling uses previously solved structures as starting points, or templates. This is effective because it appears that although the number of actual proteins is vast, there is a limited set of tertiary structural motifs to which most proteins belong. It has been suggested that there are only around 2000 distinct protein folds in nature, though there are many millions of different proteins.

Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a scoring function is used to assess the compatibility of the sequence with the structure, thus yielding possible three-dimensional models. This type of method is also known as 3D-1D fold recognition because it analyses the compatibility between three-dimensional structures and linear protein sequences. It has also given rise to methods that perform an inverse folding search, evaluating the compatibility of a given structure with a large database of sequences and thus predicting which sequences have the potential to produce a given fold.
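The idea of scoring one sequence against many known structures can be sketched with a deliberately simplified Python toy. Each template is reduced to a string of environments, B for buried and E for exposed, and the score rewards hydrophobic residues in buried positions and polar residues in exposed ones. The template names, the environment encoding, and the +1/-1 scoring are all hypothetical; real threading programs use statistical potentials and gapped alignments.

```python
# Assumed (simplified) set of hydrophobic residues, one-letter codes.
HYDROPHOBIC = set("AVILMFWC")


def residue_env_score(aa, env):
    """+1 if the residue suits its environment ('B' buried, 'E' exposed), else -1."""
    buried = env == "B"
    hydrophobic = aa in HYDROPHOBIC
    return 1 if hydrophobic == buried else -1


def thread_score(seq, env_string):
    """Best gapless score of seq over all offsets along a template.

    Returns None when the template is shorter than the sequence.
    """
    n = len(seq)
    if n > len(env_string):
        return None
    return max(
        sum(residue_env_score(a, e) for a, e in zip(seq, env_string[off:off + n]))
        for off in range(len(env_string) - n + 1)
    )


def rank_templates(seq, templates):
    """Return (template name, score) pairs sorted from best to worst."""
    scored = []
    for name, env_string in templates.items():
        score = thread_score(seq, env_string)
        if score is not None:
            scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical mini "fold library" for demonstration only.
    library = {"helix_like": "BEEBBEEB", "all_exposed": "EEEEEEEE"}
    print(rank_templates("LKKLLKKL", library))
```

The top-ranked template is the fold the sequence is most compatible with under this toy scoring; the same loop run over many sequences against one template would be the inverse folding search.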

Figure: Assembly of sub-structural units (labels: known structures, fragment library, protein sequence, predicted structure)

Software for Threading:


PHDthreader: http://www.embl-heidelberg.de/predictprotein/
Accuracy: < 30% (fewer than 30% of the predicted first hits are true remote homologues), evaluated by cross-validation on 89 unique protein structures. Prediction-based threading that detects the fold type and aligns a protein of unknown structure with a protein of known structure at low levels of sequence identity (< 25%).

T3P2: http://www.mbi.ucla.edu/people/frsvr/frsvr.html
Prediction-based threading detecting the fold type and aligning a protein of unknown structure and a protein of known structure for low levels of sequence identity ( < 25%).

Homology modelling is based on the reasonable assumption that two homologous proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through sequence alignment. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment. Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences.

MODELLER is used for homology or comparative modeling of protein three-dimensional structures (1). The user provides an alignment of a sequence to be modeled with known related structures, and MODELLER automatically calculates a model containing all non-hydrogen atoms. MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints (2, 3), and can perform many additional tasks, including de novo modeling of loops in protein structures, optimization of various models of protein structure with respect to a flexibly defined objective function, multiple alignment of protein sequences and/or structures, clustering, searching of sequence databases, and comparison of protein structures.
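Since model accuracy depends on how similar the target and template sequences are, a common first check is the percent identity of their alignment. The Python sketch below is one possible definition, counting identical residues over columns where neither sequence has a gap; other denominators are also in use, and the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns that contain no gap.

    Both inputs must be rows of the same alignment (equal length).
    Returns 0.0 when no column has residues in both sequences.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Toy target/template alignment for demonstration only.
    print(percent_identity("ACDE-F", "ACDQGF"))
```

A low value signals that the alignment itself deserves scrutiny before any model is built on it.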

SWISS MODEL : HOMOLOGY MODELLING SERVER

AB INITIO PROTEIN MODELLING:




Ab initio (or de novo) protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic protein folding or apply some stochastic method to search possible solutions (i.e. global optimization of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins. To attempt to predict protein structure de novo for larger proteins, we will need better algorithms and larger computational resources like those afforded by either powerful supercomputers or distributed computing. Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ab initio structure prediction an active research field.

MOLECULAR DYNAMICS SIMULATION
GENETIC ALGORITHM
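A stochastic search over an energy function can be sketched in Python with a Metropolis Monte Carlo loop. The energy used here is a made-up toy in which every backbone torsion angle simply prefers -60 degrees; it stands in for a real force field only to show the accept/reject logic. The function names, step size, temperature, and number of steps are all assumptions.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: each torsion angle (degrees) prefers -60."""
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def metropolis_search(n_angles=10, steps=5000, temperature=0.5, seed=0):
    """Metropolis Monte Carlo over torsion angles.

    Returns (starting energy, best energy found, best angles).
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    angles = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    energy = toy_energy(angles)
    start_energy = energy
    best_angles, best_energy = list(angles), energy

    for _ in range(steps):
        i = rng.randrange(n_angles)
        old = angles[i]
        # Perturb one angle and wrap the result back into [-180, 180).
        angles[i] = (old + rng.gauss(0.0, 30.0) + 180.0) % 360.0 - 180.0
        new_energy = toy_energy(angles)
        delta = new_energy - energy
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
        else:
            angles[i] = old  # reject: restore the previous angle

    return start_energy, best_energy, best_angles


if __name__ == "__main__":
    start, best, _ = metropolis_search()
    print(f"start energy {start:.3f}, best energy {best:.3f}")
```

Accepting some uphill moves is what lets the search escape local minima. With a realistic energy function each evaluation is far more expensive and the landscape far rougher, which is why the computational cost grows so quickly with protein size.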

GROMACS: AB INITIO MODELLING SOFTWARE

ENERGY MINIMIZATION MODEL

COMPARISON