
Bioinformatics (Overview)

Brijesh Singh Yadav (brijeshbioinfo@gmail.com)

Bioinformatics is the application of information technology to the field of molecular biology.


Bioinformatics entails the creation and advancement of databases, algorithms, computational and
statistical techniques, and theory to solve formal and practical problems arising from the
management and analysis of biological data. Over the past few decades, rapid developments in
genomic and other molecular research technologies, combined with developments in information
technologies, have produced a tremendous amount of information related to molecular
biology. Bioinformatics is the name given to the mathematical and computing approaches used to glean
understanding of biological processes. Common activities in bioinformatics include mapping and
analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare
them and creating and viewing 3-D models of protein structures.

The primary goal of bioinformatics is to increase our understanding of biological processes. What
sets it apart from other approaches, however, is its focus on developing and applying
computationally intensive techniques (e.g., data mining, and machine learning algorithms) to
achieve this goal. Major research efforts in the field include sequence alignment, gene finding,
genome assembly, protein structure alignment, protein structure prediction, prediction of gene
expression and protein-protein interactions, and the modeling of evolution.

Bioinformatics was applied in the creation and maintenance of a database to store biological
information at the beginning of the "genomic revolution", such as nucleotide and amino acid
sequences. Development of this type of database involved not only design issues but also the
development of complex interfaces through which researchers could both access existing data
and submit new or revised data.

In order to study how normal cellular activities are altered in different disease states, the
biological data must be combined to form a comprehensive picture of these activities. Therefore,
the field of bioinformatics has evolved such that the most pressing task now involves the analysis
and interpretation of various types of data, including nucleotide and amino acid sequences,
protein domains, and protein structures. The actual process of analyzing and interpreting data is
referred to as computational biology. Important sub-disciplines within bioinformatics and
computational biology include:
a) the development and implementation of tools that enable efficient access to, and use and
management of, various types of information; and
b) the development of new algorithms (mathematical formulas) and statistics with which to assess
relationships among members of large data sets, such as methods to locate a gene within a
sequence, predict protein structure and/or function, and cluster protein sequences into families
of related sequences.

Major work areas:


Sequence analysis
Genome annotation
Computational evolutionary biology
Measuring biodiversity
Analysis of gene expression
Analysis of regulation
Analysis of protein expression
Analysis of mutations in cancer
Prediction of protein structure
Comparative genomics
Modeling biological systems
High-throughput image analysis
Protein-protein docking

Software and tools:


Software tools for bioinformatics range from simple command-line tools, to more complex
graphical programs and standalone web-services available from various bioinformatics
companies or public institutions. The computational biology tool best-known among biologists is
probably BLAST, an algorithm for determining the similarity of arbitrary sequences against other
sequences, possibly from curated databases of protein or DNA sequences. The NCBI provides a
popular web-based implementation that searches their databases. BLAST is one of a number of
generally available programs for doing sequence alignment.

Web services in bioinformatics:


SOAP and REST-based interfaces have been developed for a wide variety of bioinformatics
applications allowing an application running on one computer in one part of the world to use
algorithms, data and computing resources on servers in other parts of the world. The main
advantages lie in the end user not having to deal with software and database maintenance
overheads. Basic bioinformatics services are classified by the EBI into three categories: SSS
(Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological
Sequence Analysis). The availability of these service-oriented bioinformatics resources
demonstrates the applicability of web-based bioinformatics solutions, which range from a collection
of standalone tools with a common data format under a single, standalone or web-based interface,
to integrative, distributed and extensible bioinformatics workflow management systems.

BIOLOGICAL DATABASES:
A biological database is a large, organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve components of the data stored
within the system. A simple database might be a single file containing many records, each of
which includes the same set of information. For example, a record associated with a nucleotide
sequence database typically contains information such as contact name; the input sequence with a
description of the type of molecule; the scientific name of the source organism from which it was
isolated; and, often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be
met:
1. Easy access to the information; and
2. A method for extracting only that information needed to answer a specific biological question.

Currently, a lot of bioinformatics work is concerned with the technology of databases. These
databases include both "public" repositories of gene data like GenBank or the Protein DataBank
(the PDB), and private databases like those used by research groups involved in gene mapping
projects or those held by biotech companies. Making such databases accessible via open
standards like the Web is very important since consumers of bioinformatics data use a range of
computer platforms: from the more powerful and forbidding UNIX boxes favoured by the
developers and curators to the far friendlier Macs often found populating the labs of computer-
wary biologists. DNA and RNA are the nucleic acids that store the hereditary information about an
organism. These macromolecules have a fixed structure, which can be analyzed by biologists
with the help of bioinformatics tools and databases.

Protein Structure Database:

1. Protein Data Bank (PDB)
2. Structural Classification of Proteins (SCOP)
3. Class, Architecture, Topology and Homologous superfamily (CATH)

Protein Sequence Database

• Primary Database
1. Swiss-Prot
2. Protein Information Resource (PIR)

• Secondary Database

1. Prosite
2. Protein Family (PFAM)

Nucleotide Sequence Database

1. GenBank
2. DNA Database of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL)

Composite Databases

1. National Center for Biotechnology Information (NCBI)
2. Sequence Retrieval System (SRS)
3. Catalogue of Databases (DBCAT)

Other Databases

1. Receptor-Ligand Database (ReliBase)
2. Restriction Enzyme Database (REBASE)
3. G-Protein Coupled Receptor Database (GPCRDB)
4. Nuclear Receptor Database (NucleaRDB)
5. Literature Database - PubMed

GenBank:
GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of
known genetic sequences. It has a flat file structure: each entry is an ASCII text file, readable by
both humans and computers. In addition to sequence data, GenBank files contain
information such as accession numbers and gene names, phylogenetic classification and
references to published literature. As of June 1994, it held approximately 191,400,000 bases in
183,000 sequences.
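
To make the idea of a human- and machine-readable flat file concrete, here is a minimal sketch of reading one record. It assumes a simplified GenBank-style layout (keywords at the start of a line, indented continuation lines, sequence after an ORIGIN line, and a `//` terminator). The function name `parse_flat_record` and the record itself are made up for illustration; real entries contain many more fields and are better handled by an established parser.

```python
def parse_flat_record(text):
    """Parse one simplified GenBank-style flat-file record into a dict."""
    fields = {}
    sequence_parts = []
    current_key = None
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):              # record terminator
            break
        if in_sequence:
            # Sequence lines look like "        1 atgc atgc"; keep letters only.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line[:1].isspace():
            # An indented line continues the previous keyword's value.
            if current_key is not None:
                fields[current_key] += " " + line.strip()
        elif line.strip():
            key, _, value = line.partition(" ")
            current_key = key
            fields[key] = value.strip()
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


# A made-up record, not a real database entry.
EXAMPLE = """\
LOCUS       DEMO0001       24 bp    DNA
DEFINITION  Made-up example record,
            not a real entry.
ACCESSION   DEMO0001
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

record = parse_flat_record(EXAMPLE)
print(record["ACCESSION"], len(record["SEQUENCE"]))
```

Because the format is plain text, the same record can be inspected by eye or split into fields with a few lines of code, which is the property the description above emphasizes.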

EMBL:
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA
sequences collected from the scientific literature and patent applications and directly submitted
from researchers and sequencing groups. Data collection is done in collaboration with GenBank
(USA) and the DNA Database of Japan (DDBJ). The database doubles in size roughly every 18
months and, as of June 1994, contains nearly 2 million bases from 182,615 sequence entries.

SwissProt:
This is a protein sequence database that provides a high level of integration with other databases
and has a very low level of redundancy (that is, few identical sequences are present in the
database).

PROSITE:
PROSITE is a dictionary of sites and patterns in proteins prepared by Amos Bairoch at the
University of Geneva.

EC-ENZYME:
The 'ENZYME' data bank contains the following data for each type of characterized enzyme for
which an EC number has been provided: EC number, recommended name, alternative names,
catalytic activity, cofactors, pointers to the SWISS-PROT entry or entries that correspond to the
enzyme, and pointers to any diseases associated with a deficiency of the enzyme.

RCSB-PDB:
The RCSB PDB contains 3-D biological macromolecular structure data from X-ray
crystallography, NMR, and Cryo-EM. It is operated by Rutgers, The State University of New
Jersey and the San Diego Supercomputer Center at the University of California, San Diego.

GDB:
The GDB Human Genome Data Base supports biomedical research, clinical medicine, and
professional and scientific education by providing for the storage and dissemination of data about
genes and other DNA markers, map location, genetic disease and locus information, and
bibliographic information.

OMIM:
The Mendelian Inheritance in Man data bank (MIM) is prepared by Victor McKusick with the
assistance of Claire A. Francomano and Stylianos E. Antonarakis at Johns Hopkins University.

PIR-PSD:
PIR (Protein Information Resource) produces and distributes the PIR-International Protein
Sequence Database (PSD). It is the most comprehensive and expertly annotated protein sequence
database. The PIR serves the scientific community through on-line access, distributing magnetic
tapes, and performing off-line sequence identification services for researchers. Release 40.00
(March 31, 1994): 67,423 entries; 19,747,297 residues.

Protein sequence databases are classified as primary, secondary and composite depending upon
the content stored in them. PIR and SwissProt are primary databases that contain protein
sequences as 'raw' data. Secondary databases (like Prosite) contain the information derived from
protein sequences. Primary databases are combined and filtered to form non-redundant composite
databases.

Genethon Genome Databases:

PHYSICAL MAP: computation of the human genetic map using DNA fragments in the form of
YAC contigs.
GENETIC MAP: production of micro-satellite probes and the localization of
chromosomes, to create a genetic map to aid in the study of hereditary diseases.
GENEXPRESS (cDNA): catalogue of the transcripts required for protein synthesis obtained from
specific tissues, for example neuromuscular tissues.

MGD: The Mouse Genome Databases:


MGD is a comprehensive database of genetic information on the laboratory mouse. This initial
release contains the following kinds of information: Loci (over 15,000 current and withdrawn
symbols), Homologies (1300 mouse loci, 3500 loci from 40 mammalian species), Probes and
Clones (about 10,000), PCR primers (currently 500 primer pairs), Bibliography (over 18,000
references), Experimental data (from 2400 published articles).

ACeDB (A Caenorhabditis elegans Database):


ACeDB contains data from the Caenorhabditis Genetics Center (funded by the NIH National Center for
Research Resources), the C. elegans genome project (funded by the MRC and NIH), and the
worm community. ACeDB is also the name of the generic genome database software in use by an
increasing number of genome projects. The software, as well as the C. elegans data, can be
obtained via ftp.

ACeDB databases are available for the following species: C. elegans, Human Chromosome 21,
Human Chromosome X, Drosophila melanogaster, mycobacteria, Arabidopsis, soybeans, rice,
maize, grains, forest trees, Solanaceae, Aspergillus nidulans, Bos taurus, Gossypium hirsutum,
Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Sorghum
bicolor.

MEDLINE:
MEDLINE is NLM's premier bibliographic database covering the fields of medicine, nursing,
dentistry, veterinary medicine, and the preclinical sciences. Journal articles are indexed for
MEDLINE, and their citations are searchable, using NLM's controlled vocabulary, MeSH
(Medical Subject Headings). MEDLINE contains all citations published in Index Medicus, and
corresponds in part to the International Nursing Index and the Index to Dental Literature.
Citations include the English abstract when published with the article (approximately 70% of the
current file).

For researchers to benefit from all this information, however, two additional things were required:

1) Ready access to the collected pool of sequence information; and
2) A way to extract from this pool only those sequences of interest to a given researcher.

Simply collecting, by hand, all necessary sequence information of interest to a given project from
published journal articles quickly became a formidable task. After collection, the organization
and analysis of this data still remained. It could take weeks to months for a researcher to search
sequences by hand in order to find related genes or proteins.

Computer technology has provided the obvious solution to this problem. Not only can computers
be used to store and organize sequence information into databases, but they can also be used to
analyze sequence data rapidly. The evolution of computing power and storage capacity has, so
far, been able to outpace the increase in sequence information being created. Theoretical
scientists have derived new and sophisticated algorithms which allow sequences to be readily
compared using probability theories. These comparisons become the basis for determining gene
function, developing phylogenetic relationships and simulating protein models. The physical
linking of a vast array of computers in the 1970s provided a few biologists with ready access to
the expanding pool of sequence information. This web of connections, now known as the
Internet, has evolved and expanded so that nearly everyone has access to this information and the
tools necessary to analyze it. Databases of existing sequencing data can be used to identify
homologues of new molecules that have been amplified and sequenced in the lab. The property of
sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.

Sequence Analysis
In bioinformatics, a sequence alignment is a way of arranging the primary
sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence
of functional, structural, or evolutionary relationships between the sequences. Aligned sequences
of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are
inserted between the residues so that residues with identical or similar characters are aligned in
successive columns.

A sequence alignment, produced by ClustalW between two human zinc finger proteins identified
by GenBank accession number.

If two sequences in an alignment share a common ancestor, mismatches can be interpreted
as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or
both lineages in the time since they diverged from one another. In protein sequence alignment,
the degree of similarity between amino acids occupying a particular position in the sequence can
be interpreted as a rough measure of how conserved a particular region or sequence motif is
among lineages. The absence of substitutions, or the presence of only very conservative
substitutions (that is, the substitution of amino acids whose side chains have similar biochemical
properties) in a particular region of the sequence, suggest that this region has structural or
functional importance. Although DNA and RNA nucleotide bases are more similar to each other
than to amino acids, the conservation of base pairing can indicate a similar functional or
structural role. Sequence alignment can be used for non-biological sequences, such as those
present in natural language or in financial data.
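
As a small illustration of reading an alignment column by column, the sketch below counts matches, mismatches (candidate point mutations) and gap columns (candidate indels) in two already-aligned rows. The function name `alignment_summary` and the toy sequences are assumptions made for this example, and identity is computed over non-gap columns only, which is one of several conventions in use.

```python
def alignment_summary(row_a, row_b, gap="-"):
    """Summarize two aligned rows of equal length, column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1            # column with an insertion or deletion
        elif a == b:
            matches += 1         # identical residues
        else:
            mismatches += 1      # substitution
    aligned = matches + mismatches
    identity = matches / aligned if aligned else 0.0
    return {"matches": matches, "mismatches": mismatches,
            "gaps": gaps, "identity": identity}


# Toy rows: one substitution (A/C) and two gap columns.
summary = alignment_summary("GATTACA-T", "GCT-ACAGT")
print(summary)
```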

Computational approaches to sequence alignment generally fall into two categories:

1. global alignments

2. local alignments.

Calculating a global alignment is a form of global optimization that "forces" the alignment to
span the entire length of all query sequences. By contrast, local alignments identify regions of
similarity within long sequences that are often widely divergent overall. Local alignments are
often preferable, but can be more difficult to calculate because of the additional challenge of
identifying the regions of similarity. A variety of computational algorithms have been applied to
the sequence alignment problem, including slow but formally optimizing methods like dynamic
programming and efficient heuristic or probabilistic methods designed for large-scale database
search.

Sequence alignments can be stored in a wide variety of text-based file formats, many of which
were originally developed in conjunction with a specific alignment program or implementation.
Most web-based tools allow a number of input and output formats, such as FASTA
format and GenBank format; however, the use of specific tools authored by individual research
laboratories can be complicated by limited file format compatibility. General conversion
programs include DNA Baser and Readseq (Readseq requires uploading your files to a
remote server and providing your email address).

Global and local alignments

Illustration of global and local alignments demonstrating the 'gappy' quality of global alignments
that can occur if sequences are insufficiently similar

Global alignments, which attempt to align every residue in every sequence, are most useful when
the sequences in the query set are similar and of roughly equal size. (This does not mean global
alignments cannot end in gaps.) A general global alignment technique is called the Needleman-
Wunsch algorithm and is based on dynamic programming. Local alignments are more useful for
dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs
within their larger sequence context. The Smith-Waterman algorithm is a general local alignment
method also based on dynamic programming. With sufficiently similar sequences, there is no
difference between local and global alignments.
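
The difference between the two dynamic-programming schemes can be sketched in a few lines. The function below is a simplified illustration rather than a reference implementation: it uses a made-up scoring scheme (match +1, mismatch -1, linear gap penalty -1) and returns only the score, not the alignment. With `local=False` it follows the Needleman-Wunsch recurrence, where end gaps are penalized; with `local=True` it follows the Smith-Waterman recurrence, where cell scores are never allowed to drop below zero.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Score of the best global (default) or local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


# Identical sequences: global and local scores agree.
print(align_score("GATTACA", "GATTACA"), align_score("GATTACA", "GATTACA", local=True))

# A shared motif inside otherwise dissimilar sequences.
x, y = "TTTGATTACATTT", "CCCGATTACACCC"
print(align_score(x, y), align_score(x, y, local=True))
```

In the second call the global score is dragged down by the mismatched flanks, while the local score reflects only the shared GATTACA motif.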

Pairwise alignment

Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or
global alignments of two query sequences. Pairwise alignments can only be used between two
sequences at a time, but they are efficient to calculate and are often used for methods that do not
require extreme precision (such as searching a database for sequences with high homology to a
query). The three primary methods of producing pairwise alignments are dot-matrix methods,
dynamic programming, and word methods; however, multiple sequence alignment techniques can
also align pairs of sequences. Although each method has its individual strengths and weaknesses,
all three pairwise methods have difficulty with highly repetitive sequences of low information
content, especially where the number of repetitions differs in the two sequences to be aligned.
One way of quantifying the utility of a given pairwise alignment is the 'maximum unique match'
(MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences
typically reflect closer relatedness.
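
A dot-matrix comparison and the longest exact match it contains can be sketched as follows. This is an illustrative simplification: the names `dot_matrix` and `longest_diagonal_run` are made up, the sequences are toy data, and the sketch finds the longest contiguous shared stretch without checking that it is unique, which a true maximum unique match would require.

```python
def dot_matrix(a, b):
    """dots[i][j] is True where a[i] and b[j] are the same character."""
    return [[x == y for y in b] for x in a]


def longest_diagonal_run(a, b):
    """Longest unbroken diagonal of dots, i.e. the longest shared substring."""
    dots = dot_matrix(a, b)
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            if dots[i][j]:
                # Extend the run that ended one step up the same diagonal.
                run[i + 1][j + 1] = run[i][j] + 1
                if run[i + 1][j + 1] > best_len:
                    best_len, best_end = run[i + 1][j + 1], i + 1
    return a[best_end - best_len:best_end]


seq_a, seq_b = "ACGTGCATT", "TTGTGCAAC"
for row in dot_matrix(seq_a, seq_b):
    print("".join("*" if dot else "." for dot in row))
print(longest_diagonal_run(seq_a, seq_b))
```

Printed as text, the matrix shows the shared region as a diagonal line of asterisks, which is the visual cue dot-matrix methods rely on.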

Multiple sequence alignment:


Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two
sequences at a time. Multiple alignment methods try to align all of the sequences in a given query
set. Multiple alignments are often used in identifying conserved sequence regions across a group
of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be
used in conjunction with structural and mechanistic information to locate the catalytic active
sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by
constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to
produce and most formulations of the problem lead to NP-complete combinatorial optimization
problems. Nevertheless, the utility of these alignments in bioinformatics has led to the
development of a variety of methods suitable for aligning three or more sequences.

Sequence Similarity Tool:

BLAST:
BLAST, the Basic Local Alignment Search Tool, is a statistical pattern matching algorithm. It
was developed and published by Altschul et al. in 1990 and then as an enhanced version in 1997. It
is one of the foundational algorithms for the study of comparative genomics. BLAST's impact on
our understanding of biology is demonstrated by its ubiquity. BLAST is web-based and fast. It is
used worldwide to compare DNA and protein sequences for similarity in structure and function
and to infer evolutionary relationships between sequences. As an example of the volume of
BLAST analyses conducted worldwide, in March 2003, the US National Center for
Biotechnology Information (NCBI) was receiving 100,000 unique BLAST runs from 70,000
unique IP addresses daily, with usage increasing continually (W. Matten, personal
communication, 2003).

BLAST operates by cutting up sequences into smaller "words" and searching for each of the
words in "target" sequences. It looks in both directions along target sequences to find longer
pattern matches. BLAST scores matches according to experimental knowledge of homology. This
accounts for some of the imperfect matches it generates. BLAST also matches and aligns
sequences locally. It does not create global sequence alignments.
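
The word-and-extend idea can be sketched roughly as follows. This is a toy illustration under stated assumptions, not BLAST itself: it uses exact word matches only (no neighborhood words or substitution matrix), a made-up +1/-1 scoring scheme, and a simple drop-off rule (`x_drop`) to stop extending. All function names and sequences are hypothetical.

```python
from collections import defaultdict


def word_index(target, w):
    """Map every word of length w in the target to its start positions."""
    index = defaultdict(list)
    for pos in range(len(target) - w + 1):
        index[target[pos:pos + w]].append(pos)
    return index


def extend(query, target, q, t, w, x_drop=2):
    """Ungapped extension of a word hit query[q:q+w] == target[t:t+w]."""
    score = best = w
    right = i = 0
    while q + w + i < len(query) and t + w + i < len(target):
        score += 1 if query[q + w + i] == target[t + w + i] else -1
        i += 1
        if score > best:
            best, right = score, i
        elif best - score > x_drop:      # score fell too far: stop extending
            break
    score = best
    left = i = 0
    while q - 1 - i >= 0 and t - 1 - i >= 0:
        score += 1 if query[q - 1 - i] == target[t - 1 - i] else -1
        i += 1
        if score > best:
            best, left = score, i
        elif best - score > x_drop:
            break
    return best, q - left, q + w + right


def search(query, target, w=3):
    """Return (score, matched query segment, target start), best first."""
    index = word_index(target, w)
    hits = set()
    for q in range(len(query) - w + 1):
        for t in index.get(query[q:q + w], []):
            score, start, end = extend(query, target, q, t, w)
            hits.add((score, query[start:end], t - (q - start)))
    return sorted(hits, reverse=True)


print(search("TTGATTACAGG", "CCCGATTACACCC"))
```

Several different words seed the same region here, and each extension converges on the same local match, so the set of hits collapses to a single entry.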

FASTA:
FASTA (pronounced FAST-AYE) stands for FAST-ALL, reflecting the fact that it can be used
for a fast protein comparison or a fast nucleotide comparison. This program achieves a high level
of sensitivity for similarity searching at high speed. This is achieved by performing optimised
searches for local alignments using a substitution matrix, in this case a DNA identity matrix.

The high speed of this program is achieved by using the observed pattern of word hits to identify
potential matches before attempting the more time consuming optimised search. The trade-off
between speed and sensitivity is controlled by the ktup parameter, which specifies the size of the
word. Increasing the ktup decreases the number of background hits. Not every word hit is
investigated; instead, the program initially looks for segments containing several nearby hits. This
program is much more sensitive than BLAST programs, which is reflected by the length of time
required to produce results. FASTA produces optimal local alignment scores for the comparison
of the query sequence to every sequence in the database. The majority of these scores involve
unrelated sequences, and therefore can be used to estimate lambda and K values. These are
statistical parameters estimated from the distribution of unrelated sequence similarity scores. This
approach avoids the artificiality of a random sequence model by employing real sequences, with
their natural correlations.

FASTA uses four steps to calculate three scores that characterise sequence similarity. These steps
are outlined below. A representation of these steps is reported in a postscript format figure drawn
from Barton (1994) Protein Sequence Alignment and Database Scanning.

Step 1: Identify regions shared by the two sequences with the highest density of identities
(ktup=1) or pairs of identities (ktup=2).

The first step uses a rapid technique for finding identities shared between two sequences; the
method is similar to an earlier technique described by Wilbur and Lipman. FASTA achieves
much of its speed and selectivity in this step by using a lookup table to locate all identities or
groups of identities between two DNA or amino acid sequences during the first step of the
comparison. The ktup parameter determines how many consecutive identities are required in a
match. A ktup value of 2 is frequently used for protein sequence comparison, which means that
the program examines only those portions of the two sequences being compared that have at least
two adjacent identical residues in both sequences. More sensitive searches can be done using ktup
= 1. For DNA sequence comparisons, the ktup parameter can range from 1 to 6; values between 4
and 6 are recommended. When the query sequence is a short oligonucleotide or oligopeptide,
ktup = 1 should be used.

In conjunction with the lookup table, we use the "diagonal" method to find all regions of
similarity between the two sequences, counting ktup matches and penalizing for intervening
mismatches. This method identifies regions of a diagonal that have the highest density of ktup
matches. The term diagonal refers to the diagonal line that is seen on a dot matrix plot when a
sequence is compared with itself, and it denotes an alignment between two sequences without
gaps. FASTA uses a formula for scoring ktup matches that incorporates the actual PAM250
values for the aligned residues. Thus, groups of identities with high similarity scores contribute
more to the local diagonal score than identities with low similarity scores. This more sensitive
formula is used for protein sequence comparisons; a constant value for ktup matches is used for
DNA sequence comparisons. FASTA saves the 10 best local regions, regardless of whether they
are on the same or different diagonals.
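
The lookup table and the diagonal bookkeeping of this first step can be sketched as below. This is a deliberately reduced illustration: it only counts ktup hits per diagonal and keeps the busiest diagonals, whereas the step described above also penalizes intervening mismatches and, for proteins, weights hits by PAM250 values. The function names and the toy sequences are assumptions made for the example.

```python
from collections import Counter, defaultdict


def ktup_table(seq, ktup):
    """Lookup table: every word of length ktup -> positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, library_seq, ktup=2):
    """Count ktup matches on each diagonal (query position - library position)."""
    table = ktup_table(query, ktup)
    diagonals = Counter()
    for j in range(len(library_seq) - ktup + 1):
        for i in table.get(library_seq[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


def best_diagonals(query, library_seq, ktup=2, keep=10):
    """The `keep` diagonals with the most ktup matches, busiest first."""
    return diagonal_hits(query, library_seq, ktup).most_common(keep)


# Toy protein-like strings: the query sits inside the library sequence,
# shifted by two positions, so every hit falls on diagonal -2.
print(best_diagonals("ACDEFGHIK", "WWACDEFGHIKWW", ktup=2))
```

Because the table is built once for the query, each library sequence is scanned in a single pass, which is where much of the speed of this step comes from.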

Step 2: Rescan the 10 regions with the highest density of identities using the PAM250 matrix.
Trim the ends of the region to include only those residues contributing to the highest score. Each
region is a partial alignment without gaps.

After the 10 best local regions are found in the first step, they are rescored using a scoring matrix
that allows runs of identities shorter than ktup residues and conservative replacements to
contribute to the similarity score. For protein sequences, this score is usually calculated using the
PAM250 matrix, although scoring matrices based on the minimum number of base changes
required for a specific replacement, on identities alone, or on an alternative measure of similarity,
can also be used with FASTA. The PAM250 scoring matrix was derived from the analysis of the
amino acid replacements occurring among related proteins, and it specifies a range of positive
scores for replacements that commonly occur among related proteins and negative scores for
unlikely replacements. FASTA can also be used for DNA sequence comparisons, and matrices
can be constructed that allow separate penalties for transitions and transversions.

For each of the best diagonal regions rescanned with the scoring matrix, a subregion with the
maximal score is identified. Initial scores are used to rank the library sequences. These scores are
referred to as init1 scores.
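
The rescoring and trimming of a gap-free region can be sketched as follows. Since the PAM250 values are not reproduced in this text, the example uses a DNA matrix with separate transition and transversion penalties instead; the numeric scores (5, -1, -4) are arbitrary illustrative choices, and the function names are hypothetical. Trimming is done with a maximum-scoring-subregion scan, which is one straightforward way to keep only the residues contributing to the highest score.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_matrix(match=5, transition=-1, transversion=-4):
    """Scoring matrix with separate transition and transversion penalties."""
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                matrix[x, y] = match
            elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
                matrix[x, y] = transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                matrix[x, y] = transversion    # purine<->pyrimidine
    return matrix


def best_subregion(a, b, matrix):
    """Trim a gap-free region to its highest-scoring stretch.

    Returns (score, start, end) with end exclusive.
    """
    best_score, best_start, best_end = 0, 0, 0
    run_score, run_start = 0, 0
    for k, (x, y) in enumerate(zip(a, b)):
        if run_score <= 0:
            run_score, run_start = 0, k        # restart the stretch here
        run_score += matrix[x, y]
        if run_score > best_score:
            best_score, best_start, best_end = run_score, run_start, k + 1
    return best_score, best_start, best_end


matrix = dna_matrix()
score, start, end = best_subregion("TTACGTACGG", "GGACGTATCC", matrix)
print(score, "TTACGTACGG"[start:end])
```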

Step 3: If there are several initial regions with scores greater than the CUTOFF value, check to
see whether the trimmed initial regions can be joined to form an approximate alignment with
gaps. Calculate a similarity score that is the sum of the joined initial regions minus a penalty
(usually 20) for each gap. This initial similarity score (initn) is used to rank the library sequences.
The score of the single best initial region found in step 2 is reported (init1).

FASTA checks, during a library search, to see whether several initial regions can be joined
together in a single alignment to increase the initial score. FASTA calculates an optimal
alignment of initial regions as a combination of compatible regions with maximal score. This
optimal alignment of initial regions can be rapidly calculated using a dynamic programming
algorithm. FASTA uses the resulting score, referred to as the initn score, to rank the library
sequences. The third "joining" step in the computation of the initial score increases the sensitivity
of the search method because it allows for insertions and deletions as well as conservative
replacements. The modification does, however, decrease selectivity. The degradation in selectivity
is limited by including in the optimization step only those initial regions whose scores are above
an empirically determined threshold: FASTA joins an initial region only if its similarity score is
greater than the cutoff value, a value that is approximately one standard deviation above the
average score expected from unrelated sequences in the library. For a 200-residue query sequence
and ktup = 2, this value is 28.
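
A rough sketch of the joining step is given below. It assumes each trimmed initial region is described by a tuple `(query_start, query_end, library_start, library_end, score)` with exclusive ends, treats two regions as compatible when the second begins after the first ends in both sequences, and charges a fixed penalty of 20 per join. The tuple layout, the compatibility rule and the function name `initn_score` are assumptions made for illustration; the actual program may differ in detail.

```python
def initn_score(regions, cutoff=28, join_penalty=20):
    """Best total score from chaining compatible regions that exceed the cutoff."""
    # Sorting by query start guarantees any valid predecessor comes earlier.
    usable = sorted(r for r in regions if r[4] > cutoff)
    best = []   # best[k]: best chain score ending with usable[k]
    for k, (qs, qe, ls, le, score) in enumerate(usable):
        total = score                           # the region on its own
        for prev in range(k):
            pqs, pqe, pls, ple, _ = usable[prev]
            if pqe <= qs and ple <= ls:         # prev ends before this one starts
                total = max(total, best[prev] + score - join_penalty)
        best.append(total)
    return max(best, default=0)


# Made-up regions: the last one is below the cutoff and is ignored.
regions = [
    (0, 10, 0, 10, 50),
    (15, 30, 18, 33, 60),
    (12, 20, 40, 48, 35),
    (40, 45, 2, 7, 20),
]
init1 = max(r[4] for r in regions)     # single best region
initn = initn_score(regions)           # best joined combination
print(init1, initn)
```

In this toy case the first two regions can be joined (50 + 60 - 20), so initn exceeds init1, which is the situation in which the joining step adds sensitivity.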

Step 4: Construct an NWS (Needleman-Wunsch-Sellers algorithm) optimal alignment of the query
sequence and the library sequence, considering only those residues that lie in a band 32 residues
wide centered on the best initial region found in Step 2. FASTA reports this score as the
optimized (opt) score. After a complete search of the library, FASTA plots the initial scores of
each library sequence in a histogram, calculates the mean similarity score for the query sequence
against each sequence in the library, and determines the standard deviation of the distribution of
initial scores. The initial scores are used to rank the library sequences, and, in the fourth and final
step of the comparison, the highest scoring library sequences are aligned using a modification of
the standard NWS optimization method. The optimization employs the same scoring matrix used
in determining the initial regions; the resulting optimized alignments are calculated for further
analysis of potential relationships, and the optimized similarity score is reported.
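
Restricting the dynamic programming to a band around a chosen diagonal can be sketched as follows. This is an illustration of the banding idea only: for simplicity it uses local (Smith-Waterman style) scoring with made-up parameters rather than the modified NWS method named above, and it returns just the score. The diagonal is expressed as query position minus library position, and the function name is hypothetical.

```python
def banded_local_score(a, b, diagonal=0, band=32, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells near the given diagonal."""
    half = band // 2
    prev = {}                      # scores of the previous row, keyed by column
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - diagonal - half)
        hi = min(len(b), i - diagonal + half)
        for j in range(lo, hi + 1):
            # Cells outside the band are simply absent and count as 0,
            # which is safe for local scoring (an alignment may start anywhere).
            diag = prev.get(j - 1, 0) + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev.get(j, 0) + gap
            left = cur.get(j - 1, 0) + gap
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best


query = "GATTACA"
library_seq = "C" * 10 + "GATTACA"     # the match lies on diagonal -10

# Band centered on the right diagonal finds the full match ...
print(banded_local_score(query, library_seq, diagonal=-10, band=4))
# ... while a band centered elsewhere never sees it.
print(banded_local_score(query, library_seq, diagonal=0, band=4))
```

The work done is proportional to the query length times the band width rather than to the product of both sequence lengths, which is why the band is centered on the best initial region found earlier.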

CLUSTAL W:
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins. It
produces biologically meaningful multiple sequence alignments of divergent sequences. It
calculates the best match for the selected sequences, and lines them up so that the identities,
similarities and differences can be seen. Evolutionary relationships can be seen by viewing
cladograms or phylograms.

Phylogenetic analysis:

A phylogenetic tree, also called an evolutionary tree, is a tree showing the evolutionary
relationships among various biological species or other entities that are believed to have a
common ancestor. In a phylogenetic tree, each node with descendants represents the most recent
common ancestor of the descendants, and the edge lengths in some trees correspond to time
estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical
taxonomic units (HTUs) as they cannot be directly observed.

A speculatively rooted tree for rRNA genes
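
One simple way to build such a tree from pairwise distances is UPGMA clustering, sketched below. It assumes a complete table of distances between the taxa, writes the result in Newick notation with branch lengths, and gives each internal node a height equal to half the distance between the clusters it joins. The function name, the cluster labels built with "+", and the distances are all made up for illustration; UPGMA also assumes a constant rate of change, which real data often violate.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree; returns a Newick string with branch lengths."""
    # Each cluster label maps to (newick text, number of leaves, node height).
    clusters = {name: (name, 1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        tree_a, n_a, h_a = clusters.pop(a)
        tree_b, n_b, h_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        merged = a + "+" + b
        # Distance to every other cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                n_a * d[frozenset((a, other))] + n_b * d[frozenset((b, other))]
            ) / (n_a + n_b)
        newick = f"({tree_a}:{height - h_a:g},{tree_b}:{height - h_b:g})"
        clusters[merged] = (newick, n_a + n_b, height)
    return next(iter(clusters.values()))[0] + ";"


# Made-up distances between three taxa.
toy_distances = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}
tree = upgma(["A", "B", "C"], toy_distances)
print(tree)
```

In the output, the internal node joining A and B plays the role of a hypothetical taxonomic unit: it is inferred from the distances rather than observed.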

Tool for Phylogenetic analysis:

• BIONJ - Server for NJ phylogenetic analysis
• DendroUPGMA
• PHYLIP - Server for phylogenetic analysis using the PHYLIP package
• PhyML - Server for ML phylogenetic analysis
• Phylogeny.fr - Robust Phylogenetic Analysis For The Non-Specialist
• The PhylOgenetic Web Repeater (POWER) - perform phylogenetic analysis
• BlastO - Blast on orthologous groups
• Evolutionary Trace Server (TraceSuite II) - Maps evolutionary traces to structures
• Phylogenetic programs - List of phylogenetic packages and free servers (PHYLIP pages)

Homology modeling
In protein structure prediction, homology modeling, also known as comparative modeling, is a
class of methods for constructing an atomic-resolution model of a protein from its amino acid
sequence (the "query sequence" or "target"). Almost all homology modeling techniques rely on
the identification of one or more known protein structures likely to resemble the structure of the
query sequence, and on the production of an alignment that maps residues in the query sequence
to residues in the template sequence. The sequence alignment and template structure are then
used to produce a structural model of the target. Because protein structures are more conserved
than DNA sequences, detectable levels of sequence similarity usually imply significant structural
similarity.
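As a rough illustration of how an alignment "maps residues in the query sequence to residues in the template sequence", the sketch below walks a pairwise alignment column by column and records which template residue each query residue is paired with. The function names `map_residues` and `percent_identity` and the two toy sequences are hypothetical. Percent identity is computed here over gap-free columns only, which is one of several conventions in use.

```python
def map_residues(query_aln, template_aln):
    """Map 0-based query residue indices to template residue indices.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    Query residues aligned to a gap in the template map to None, meaning
    that the template offers no coordinates for them.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    q_index = t_index = 0
    for q_char, t_char in zip(query_aln, template_aln):
        if q_char != "-":
            mapping[q_index] = t_index if t_char != "-" else None
            q_index += 1
        if t_char != "-":
            t_index += 1
    return mapping


def percent_identity(query_aln, template_aln):
    """Percentage of gap-free alignment columns with identical residues."""
    pairs = [(q, t) for q, t in zip(query_aln, template_aln)
             if q != "-" and t != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)


# Made-up aligned fragments, for illustration only.
query = "MKT-AYIAK"
template = "MKSWAY--K"

print(map_residues(query, template))
# {0: 0, 1: 1, 2: 2, 3: 4, 4: 5, 5: None, 6: None, 7: 6}
print(round(percent_identity(query, template), 1))  # 83.3
```

Query residues that map to `None` are the ones a modeling program would have to build without template coordinates, for example by loop modeling.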

The quality of the homology model is dependent on the quality of the sequence alignment and
template structure. The approach can be complicated by the presence of alignment gaps
(commonly called indels) that indicate a structural region present in the target but not in the
template, and by structure gaps in the template that arise from poor resolution in the experimental
procedure (usually X-ray crystallography) used to solve the structure. Model quality declines with
decreasing sequence identity; a typical model has ~2 Å agreement between the matched Cα atoms
at 70% sequence identity but only 4-5 Å agreement at 25% sequence identity. Regions of the
model that were constructed without a template, usually by loop modeling, are generally much
less accurate than the rest of the model, particularly if the loop is long. Errors in side chain
packing and position also increase with decreasing identity, and variations in these packing
configurations have been suggested as a major reason for poor model quality at low identity.
Taken together, these various atomic-position errors are significant and impede the use of
homology models for purposes that require atomic-resolution data, such as drug design and
protein-protein interaction predictions; even the quaternary structure of a protein may be difficult
to predict from homology models of its subunit(s). Nevertheless, homology models can be useful
in reaching qualitative conclusions about the biochemistry of the query sequence, especially in
formulating hypotheses about why certain residues are conserved, which may in turn lead to
experiments to test those hypotheses. For example, the spatial arrangement of conserved residues
may suggest whether a particular residue is conserved to stabilize the folding, to participate in
binding some small molecule, or to foster association with another protein or nucleic acid.
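The "agreement between the matched Cα atoms" quoted above is commonly reported as a root-mean-square deviation (RMSD); that reading is an assumption here, since the text does not name the measure. For N matched atom pairs with positions $a_i$ and $b_i$, $\mathrm{RMSD} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N} \lVert a_i - b_i \rVert^2}$. The minimal sketch below computes this for two coordinate lists that are assumed to be superposed already; the function name `ca_rmsd` and the coordinates are hypothetical.

```python
import math


def ca_rmsd(model_coords, reference_coords):
    """RMSD, in the units of the input, between two lists of (x, y, z) tuples.

    The structures are assumed to be superposed already, with the i-th
    model atom matched to the i-th reference atom.
    """
    if len(model_coords) != len(reference_coords):
        raise ValueError("coordinate lists must have the same length")
    if not model_coords:
        raise ValueError("at least one atom pair is required")
    squared = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model_coords, reference_coords):
        squared += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(squared / len(model_coords))


# Made-up coordinates in angstroms: every atom is displaced by 2 along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (2.0, 0.0, 2.0)]

print(ca_rmsd(model, reference))  # 2.0
```

In practice the two structures are first optimally superposed, so the reported value does not depend on how each was originally placed in space; that step is omitted here.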

Homology modeling can produce high-quality structural models when the target and template are
closely related, which has inspired the formation of a structural genomics consortium dedicated to
the production of representative experimental structures for all classes of protein folds. The chief
inaccuracies in homology modeling, which worsen with lower sequence identity, derive from
errors in the initial sequence alignment and from improper template selection. Like other methods
of structure prediction, current practice in homology modeling is assessed in a biennial
large-scale experiment known as the Critical Assessment of Techniques for Protein Structure
Prediction, or CASP.

BIOINFORMATICS CENTRE OF EXCELLENCE:

Our Vision

1. To foster high-quality, innovative, and multi-disciplinary research and advanced training in
Bioinformatics and programming languages.
2. To provide a forum for emerging Bioinformatics researchers and a platform for the development
of commercial Bioinformatics capability.

Mission
1. To provide a national level facility and expertise in Bioinformatics and Computational
Biology.
2. To make individuals globally competitive by enhancing their skills in ERP and programming languages.
3. To provide integrated Bioinformatics solutions such as biological and chemical
databases, data analysis, data mining, bio-medical text mining and customized tool
development among others.
4. To build on the strong base in pharmaceutical research and development and IT services
that is driving the offshoring of Bioinformatics services.
5. To work in the drug development, research, clinical trials and contract
manufacturing segments for other industries.

Our Current Work:


1. We own the world's largest infectious disease databases.
2. We are working on a neuroinformatics project.

Future scenario

The study suggests that the focus of Bioinformatics in the drug discovery process has shifted from
target identification to target validation. The market for laboratory information management
system products in India is estimated at $23 million in 2008. This is expected to grow at above 30
per cent over 2008-11. Companies with a presence across various geographies are also expected to
require bioinformatics services on a global scale, and they often seek a single vendor who
can offer a comprehensive range of services on a long-term basis, across the world.

The industry is expected to scale up its business through organic and inorganic growth.

As far as funding for expansion is concerned, rising demand for bioinformatics has improved
venture capitalist interest and presence in this sector.

The bioinformatics outsourcing opportunity is expected to rise from $32 million in 2007 to $62 million in
2010. These opportunities range from enterprise solutions for laboratory information management
systems, improved database utility and data management tools to exportable software that can be
shared.

However, to capture this opportunity, vendors need to establish credibility through the success of
past projects, and by demonstrating strong technical capability, establishing strong overseas
relationships and training end-users.