BASICS OF

BIOINFORMATICS
Biotechnology Division
North-East Institute of Science &
Technology
(Council of Scientific & Industrial
Research)
Jorhat 785 006, Assam
Salam Pradeep
Email:
salampradeep@gmail.co
Bioinformatics
• Use of techniques including
• Applied mathematics
• Informatics
• Statistics
• Computer science
• Artificial intelligence
• Chemistry & Biochemistry
• To solve biological problems at the
molecular level
Major Research Efforts &
Applications
Sequence analysis &
alignment
• Comparison of sequences in order to
find similar sequences.
• Way of arranging the sequences of
DNA / RNA / Amino Acids to identify
regions of similarity that may be a
consequence of functional, structural
or evolutionary relationships.
• Identification of gene structures,
reading frames, distributions of
introns & exons & regulatory
elements.
Genome annotation
• Process of marking the genes and other
biological features in a DNA sequence
• The first genome annotation software
system was designed in 1995 by Dr.
Owen White
• It was applied to the first genome of a
free-living organism to be decoded, that
of the bacterium Haemophilus influenzae.
• White’s software system finds the
genes (places in the DNA sequence that
encode a protein), the transfer RNA,
and other features.
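The idea of finding places in a DNA sequence that may encode a protein can be illustrated with a toy open reading frame (ORF) scan. This is only a minimal sketch and not the 1995 annotation system: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the forward strand is scanned, and an ORF is assumed to run from an ATG start codon to the first in-frame stop codon.

```python
# Toy ORF scan (illustrative only; names and parameters are assumptions).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons from the start codon up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                j = i + 3
                # walk codon by codon until a stop codon or the end
                while j + 3 <= len(dna) and dna[j:j + 3] not in STOPS:
                    j += 3
                found_stop = j + 3 <= len(dna)
                if found_stop and (j - i) // 3 >= min_codons:
                    orfs.append((i, j + 3))
                    i = j + 3                   # resume after this ORF
                    continue
            i += 3
    return orfs

orfs = find_orfs("CCATGAAATTTTAGGG")
print(orfs)
```

Real gene finders also handle the reverse strand, introns, and statistical models of coding sequence, none of which are attempted here.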
Computational evolutionary
biology
• Trace the evolution of a large
number of organisms by measuring
changes in their DNA, rather than
through physical taxonomy or
physiological observations alone.
• Comparing entire genomes permits
the study of more complex
evolutionary events, such as gene
duplication, horizontal gene transfer
and speciation.
• Track and share information on an
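One simple way to measure changes in DNA between two organisms is the proportion of differing sites (the p-distance) between two aligned sequences. The sketch below assumes the sequences are already aligned to equal length and that `-` marks a gap; the function name `p_distance` is hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

print(p_distance("ACGT", "ACGA"))  # one of four sites differs
```

Distances like this are a common starting point for building trees, although more realistic models correct for multiple substitutions at the same site.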
Measuring biodiversity
• Biodiversity Databases are used to
collect the species names, descriptions,
distributions, genetic information,
status & size of populations, habitat
needs, and how each organism
interacts with other species.
• Computer simulations model such
things as population dynamics, or
calculate the cumulative genetic health
of a breeding pool (in agriculture) or
endangered population (in
conservation).
• Entire DNA sequences, or genomes of
endangered species can be preserved,
allowing the results of Nature's genetic
Prediction of protein
structure
• Protein structure prediction is one of the
most important goals pursued by
bioinformatics and theoretical chemistry.
• Its aim is the prediction of the three-
dimensional structure of proteins from their
amino acid sequences.
• In other words, it deals with the prediction
of a protein's tertiary structure from its
primary structure.
• Protein structure prediction is of high
importance in medicine (for example, in
drug design) and biotechnology (for
example, in the design of novel enzymes).
Comparative genomics
• Comparative genomics is the study
of the relationship of genome
structure and function across
different biological species or strains.
• Gene finding is an important
application of comparative genomics,
as is discovery of new, non-coding
functional elements of the genome.
• Computational approaches to
genome comparison have recently
become a common research topic in
Modeling biological
systems
• Systems biology involves the use of
computer simulations of cellular subsystems
(such as the networks of metabolites and
enzymes which comprise metabolism, signal
transduction pathways and gene regulatory
networks) to both analyze and visualize the
complex connections of these cellular
processes.
• Artificial life or virtual evolution attempts to
understand evolutionary processes via the
computer simulation of simple (artificial) life
forms.
Protein-protein interaction
& docking
• Protein-protein interactions involve the
association of protein molecules.
• These associations are studied from the
perspective of biochemistry, signal
transduction and networks.
• Wet Lab Techniques: Co-
immunoprecipitation, FRET, Bimolecular
Fluorescence Complementation
• Protein-protein docking: as of 2006, the
prediction of protein-protein interactions
based on three-dimensional protein
structures alone was not satisfactory.
Biological Sequence
Database
Primary Sequence
Databases
• The International Nucleotide
Sequence Database (INSD) consists
of the following databases.
• DDBJ (DNA Data Bank of Japan)
• EMBL Nucleotide DB (European
Molecular Biology Laboratory)
• GenBank (National Center for
Biotechnology Information)
• They interchange the stored
information and are the source for
many other databases
NCBI
• National Center for Biotechnology
Information is part of the United States
National Library of Medicine (NLM), a
branch of the National Institutes of
Health.
• Founded in 1988 through legislation
sponsored by Senator Claude Pepper.
• NCBI has had responsibility for making
available the GenBank DNA sequence
database since 1992
• In addition to GenBank, NCBI provides
OMIM, MMDB (3D protein structures),
dbSNP, the Unique Human Gene
Sequence Collection, a Gene Map of the
DDBJ
EMBL
Protein Sequence
Database
UniProt - Universal Protein
Resource
Swiss-Prot - Protein
Knowledgebase
Protein Information
Resource
Pfam
Protein Structure
Databases
Protein Data Bank (PDB)
PDB Statistics
NCBI Molecular Modeling
Database
Genome Databases
Corn
ERIC (Enteropathogen
Resource Integration Center)
Flybase
MGI Mouse Genome
Viral Bioinformatics
Resource Center
Saccharomyces Genome
Database
National Microbial Pathogen
Data Resource
Other Databases
• Protein-protein interactions
- BioGrid, STRING, DIP etc
• Metabolic pathway Databases
- KEGG, BioCyc, MANET etc
• Microarray databases
- ArrayExpress, Stanford Microarray
Dbase, GEO
Sequence File Formats
• FASTA – Each record starts with a header
line beginning with > (greater-than symbol)
• GENBANK – Series of header lines
- Locus, Definition, Origin …
• EMBL – 1st line begins the first
sequence entry
- 1st line of each entry starts with the
2-letter code ID
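Because every FASTA record begins with a `>` line, the format is easy to read programmatically. The following is a minimal sketch of a parser; the function name `parse_fasta` and the example records are made up for illustration, and real data would normally be read with an established library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # a new record begins
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical example input, not taken from any database.
example = """>seq1 hypothetical example record
ATGCGT
ACGT
>seq2
GGCC
"""
records = parse_fasta(example)
for name, seq in records:
    print(name, len(seq))
```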
FASTA Format
GenBank Format
EMBL Format
Inside NCBI
Sitemap
Taxonomy Browser
NCBI Taxonomy Browser
Statistics
Genome Projects
Genome Projects
Statistics
Map Viewer
Sequence analysis
&
Sequence alignment
Sequence analysis &
alignment
• Comparison of sequences in order to
find similar sequences
• A way of arranging the sequences of
DNA, RNA or protein to identify regions of
similarity that may be a consequence
of functional, structural or
evolutionary relationships.
• Aligned sequences of nucleotide or
amino acid residues are typically
represented as rows within a matrix
Representations in Sequence
alignment

Conservative
Substitution
Semi Conservative
Substitution
Global and Local
alignments
• Global alignments attempt to align
every residue in every sequence
• Most useful when the sequences in
the query set are similar and of
roughly equal size.
• Local alignments are useful for
dissimilar sequences that are
suspected to contain regions of
similarity or similar sequence motifs
within their larger sequence context.
• With sufficiently similar sequences,
there is no difference between local
and global alignments.
• Needleman-Wunsch algorithm - A
general global alignment technique
based on dynamic programming.
• Smith-Waterman algorithm - A general
local alignment method, also based on
dynamic programming.
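The two dynamic programming methods differ mainly in how the score matrix is initialised and in whether scores may drop below zero. The sketch below computes only the best score, with no traceback, and uses assumed scoring values (match = 1, mismatch = -1, linear gap = -2); the function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: global alignment score (Needleman-Wunsch style).
    local=True:  local alignment score (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap       # gap in b
            left = score[i][j - 1] + gap     # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)          # local: restart instead of going negative
                best = max(best, cell)
            score[i][j] = cell
    # Global score sits in the bottom-right cell; local score is the matrix maximum.
    return best if local else score[-1][-1]

print(alignment_score("ACGT", "ACGT"))
print(alignment_score("TTTACGTTTT", "GGACGGG", local=True))
```

In the second call the sequences are dissimilar overall, but the local score still picks up the shared ACG region, which is the situation local alignment is meant for.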
Pairwise alignment
• Used to find the best-matching
piecewise local or global alignments
of two query sequences.
• It can only be used between 2
sequences at a time
• Pairwise alignments are efficient to
calculate and are often used for methods
such as searching a database for
sequences with high similarity to a query.
• Primary methods of producing
pairwise alignments are dot-matrix
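A dot-matrix comparison places one sequence along the rows and the other along the columns and marks every cell where the residues match. Runs of marks along a diagonal indicate regions of similarity. A minimal text-based sketch follows; the function name `dot_matrix` is hypothetical and no window or threshold filtering is applied.

```python
def dot_matrix(a, b):
    """Return one string per residue of a: '*' where a[i] == b[j], else '.'."""
    return ["".join("*" if x == y else "." for y in b) for x in a]

for row in dot_matrix("GATTACA", "GATACA"):
    print(row)
```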
Multiple sequence
alignment
• MSA incorporates more than two
sequences at a time
• Multiple alignment methods try to align
all of the sequences in a given query set
• Often used in identifying conserved
sequence regions across a group of
sequences
• Aid in establishing evolutionary
relationships by constructing
phylogenetic trees
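Once sequences have been aligned, conserved regions can be read off column by column. The sketch below assumes the alignment is given as equal-length strings with `-` for gaps, and treats a column as conserved only when every sequence has the same non-gap residue; the function name `conserved_columns` is hypothetical.

```python
def conserved_columns(aligned):
    """Return indices of columns where all sequences share one non-gap residue."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [
        i for i in range(length)
        if aligned[0][i] != "-" and all(s[i] == aligned[0][i] for s in aligned)
    ]

# Hypothetical three-sequence alignment.
print(conserved_columns(["ACG-T", "ACGAT", "TCG-T"]))
```

Alignment viewers usually apply softer criteria, for example counting conservative substitutions as partially conserved, which this strict version does not do.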
Sequence Similarity
Search
NCBI BLAST
• An algorithm for comparing primary
biological sequence information, such
as the amino-acid sequences of
different proteins or the nucleotides of
DNA sequences
• A BLAST search enables a researcher to
compare a query sequence with a
library or database of sequences, and
identify library sequences that
resemble the query sequence above a
certain threshold.
• BLAST program was designed by
Eugene Myers, Stephen Altschul,
Warren Gish, David J. Lipman and Webb
Miller at the NIH and was published in J.
BLAST Types
• blastn - Nucleotide-nucleotide
BLAST
• blastp - Protein-protein BLAST
• blastx - Nucleotide 6-frame
translation-protein
• tblastx - Nucleotide 6-frame
translation-nucleotide 6-frame
translation
• tblastn - Protein-nucleotide 6-frame
translation
• megablast - Large numbers of
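The "6-frame translation" used by blastx, tblastx and tblastn means translating a nucleotide sequence in three reading frames on the forward strand and three on the reverse complement. The sketch below illustrates that step only, not BLAST itself. It assumes the standard genetic code, writes stop codons as `*`, and maps any codon with non-ACGT characters to `X`; all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons from position 0; trailing bases are dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Return translations for forward frames 0-2, then reverse frames 0-2."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    return [translate(strand[k:]) for strand in (dna, rc) for k in range(3)]

for frame in six_frames("ATGGCCTAA"):
    print(frame)
```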
BLASTn
BLASTp
BLASTn: Search Set
BLASTp: Search Set
BLASTn: Program
Selection
BLASTp: Program
Selection
BLASTn Result
BLASTn: Graphic
Summary
BLASTn Description
BLASTn Alignment
BLASTn Tree View
PDB BLASTp
BLASTp: Graphic
Summary
PDB BLASTp Description
PDB BLASTp Alignment
BLASTp Tree View
Multiple Sequence
Alignment
EBI ClustalW Server
Preparing Multiple
Sequence
Phylogenetic
Analysis
Cladogram
• A cladogram is a branching diagram (tree)
assumed to be an estimate of a phylogeny
in which the branches are of equal length.
Cladograms therefore show common ancestry
but do not indicate the amount of
evolutionary "time" separating taxa.
Phylogram
• A phylogram is a branching diagram (tree)
assumed to be an estimate of a phylogeny
in which branch lengths are proportional
to the amount of inferred evolutionary change.
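The difference between the two tree types shows up directly in the Newick text format, which tree-building tools commonly write: a phylogram carries a `:length` after each node, while a cladogram carries topology only. The sketch below strips branch lengths from a Newick string; the tree and the function name `to_cladogram` are made up for illustration, and quoted labels containing colons are not handled.

```python
import re

def to_cladogram(newick):
    """Remove ':length' annotations from a Newick string, keeping topology."""
    return re.sub(r":[0-9.eE+-]+", "", newick)

# Hypothetical tree with three taxa and branch lengths.
phylogram = "((A:0.1,B:0.2):0.05,C:0.3);"
cladogram = to_cladogram(phylogram)
print(cladogram)
```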
JalView – Java Applet
Thank You
