
Bioinformatics

Murat Gök
muratgok@gmail.com
REF: Tolga Can, METU
Grading
Final exam - 40%
Midterm exam - 20%
Assignment 1 (Article) - 20%
Assignment 2 (Presentation) - 20%
What is Bioinformatics?
Bioinformatics is conceptualizing biology in terms of molecules (in the
sense of physical-chemistry) and then applying informatics
techniques (derived from disciplines such as applied math, CS, and
statistics) to understand and organize the information associated with
these molecules on a large scale.
What uses and manages Bioinformatics
Skill set
Artificial intelligence
Machine learning
Statistics & probability
Algorithms
Databases
Programming
What is bioinformatics?
Bioinformatics is the use of computers and
computational methods to analyse large sets of
molecular biological data. This analysis supports:

The investigation of living organisms and their evolution.
The discovery of genes, gene regulation, genetic networks
and protein function, which help us understand human
disease, human development (conception to adulthood), etc.
The results can improve our understanding of diseases
like cystic fibrosis, suggest therapies, and support the
development of cures such as new drugs and viral
therapy.

Reading DNA novels: bioinformatics
Analysing large sets of data is comparable to reading and
understanding a book (computational linguistics). The syntax:
Reading involves looking at letters (including spaces and punctuation) to
determine the words. Bioinformatics is the reading of DNA letters (A, T, G
and C) to determine the location of genes; the important elements of
DNA correspond to words.


Reading DNA novels: bioinformatics
The next step in reading involves determining whether the words
are nouns, verbs, adverbs, etc. In general there are rules;
what are they?
Bioinformatics involves determining what the important
elements correspond to, e.g. genes or gene promoters.
However, the rules for identifying genes and other
elements are more complex than those of a natural language and,
more importantly, are constantly being modified and
updated.

Reading DNA novels: bioinformatics
Syntax:
The next step is determining the order of the words,
e.g. should it be "what are the rules of English grammar"
or "are what the rules of grammar English"?
Bioinformatics involves determining the order of
important elements; e.g. promoters are upstream of
genes and not the other way around.

Reading DNA novels: bioinformatics
Semantics:
What does the set of words (sentence) mean, e.g. "What is your
purpose?" What processes do humans use to interpret this
sentence?

Bioinformatics attempts to analyse the function of
DNA/genetic sequences by, e.g.:
Comparing the sequences to sequences whose function is
already known.
Converting the sequence into its equivalent protein
and comparing it to known proteins.
Determining the 3-D structure of proteins and looking for
known structural components.

Reading DNA novels: bioinformatics
Bioinformatics also focuses on the computational
aspects of the discipline such as:
Setting up databases
Writing code to perform analysis
Determining and utilising known computational
techniques to improve analysis of the biological data.
Nucleic acid world
Biology

Nucleotide
The four nucleotides making DNA.
The two complementary strands of a complete DNA
molecule.
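The complementarity of the two strands can be sketched in a few lines of Python. The function name `reverse_complement` and the example sequence are assumptions made for illustration; the only biology used is the A-T and G-C base pairing.

```python
# Watson-Crick base pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

example = "ATGGCC"  # hypothetical sequence, for illustration only
print(reverse_complement(example))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.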
Molecular biology information - DNA
Central dogma
Introns and Exons
"Outside of a dog, a book is man's best friend. Inside of a
dog it's too dark to read."
Let's make this text look more like a nucleic acid sequence.
OUTSIDEOFADOGABOOKISMANSBESTFRIENDINSIDE
OFADOGITSTOODARKTOREAD
Let's add a single intron within this sequence.
OUTSIDEOFADOGABOOKISMANSBEGUITUITGLSAJSAKHDLAYSIOEYASHDKLSALKDN
KLASNDKLASKGDASJKBDNKNSKDKLSANDKNSKANDKNSAKNDKSAKLDHSDJGJASBDKN
SANDLNSMALVJOPDHVANVLNVKLSAKNFADNGKLHKNIUHSAFLSAFLFASFOLSANFLNK
FNKDSNGOIFSGHSKDHGKDHSFIABHFIHEHRFKHLKDHSUNSIUANIDAUBIOAYICMUSF
AVMASSTFRIENDINSIDEOFADOGITSTOODARKTOREAD
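Removing the intron again (splicing) can be sketched as follows. This is a minimal illustration, not a model of the real splicing machinery: the function `splice`, the shortened message and the short intron `GUXXXXXXAG` are all hypothetical, and the intron position is assumed to be known in advance.

```python
def splice(pre_mrna: str, introns: list) -> str:
    """Remove each intron, given as a (start, end) half-open index pair."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron itself
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)

message = "OUTSIDEOFADOGABOOKISMANSBESTFRIEND"
intron = "GUXXXXXXAG"  # hypothetical intron, much shorter than the one above
cut = len("OUTSIDEOFADOGABOOKISMANSBE")
with_intron = message[:cut] + intron + message[cut:]

spliced = splice(with_intron, [(cut, cut + len(intron))])
print(spliced == message)
```

In practice the hard part is finding the intron boundaries; here they are simply handed to the function.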
Genetic code
Degeneracy
*The UGA stop codon can also code for an additional
amino acid, selenocysteine.
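A short Python sketch can illustrate both the genetic code and its degeneracy. It uses the standard codon table written with DNA letters (T in place of U); the names `CODON_TABLE` and `translate` and the example sequence are assumptions for illustration, and the special case of UGA coding for selenocysteine is not modelled.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a coding DNA sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Degeneracy: count how many different codons encode each amino acid.
codons_per_aa = Counter(CODON_TABLE.values())

print(translate("ATGGTGTGA"))  # ATG = methionine, GTG = valine, TGA = stop
print(codons_per_aa["L"], codons_per_aa["M"])
```

Leucine is encoded by several codons while methionine has only one, which is what degeneracy of the code means.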
One gene encodes one protein.
20 letter alphabet
ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of ~300 aa in an average protein (in bacteria)
Amino acids

DNA >> Amino Acids >> Proteins
Methionine
Valine
SEQUENCE → STRUCTURE → FUNCTION
The first amino acid sequence of a protein, insulin, was
determined in 1951. The actual recipe for human insulin, from which
all its biological properties derive, is the following chain of 110
residues:
insulin =
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTP
KTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLE
NYCN
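As a small check, the sequence above can be treated as a plain string in Python. The sketch below counts its residues, confirms that every letter belongs to the 20-letter amino acid alphabet, and lists the most frequent residues; the variable names are assumptions for illustration.

```python
from collections import Counter

AMINO_ACID_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")

insulin = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTP"
    "KTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLE"
    "NYCN"
)

print(len(insulin))                         # number of residues
print(set(insulin) <= AMINO_ACID_ALPHABET)  # only valid one-letter codes?
print(Counter(insulin).most_common(3))      # most frequent residues
```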
Protein structures
Primary Structure
Secondary Structure (α-helix, β-sheet)
Tertiary structure

Quaternary structure

Human hemoglobin
A pyruvate decarboxylase from the
yeast Saccharomyces cerevisiae
Where to get data?
GenBank
http://www.ncbi.nlm.nih.gov
Protein Databases
SWISS-PROT:
http://www.expasy.ch/sprot
PDB:
http://www.pdb.bnl.gov/
And many others
Two kinds of cells
Prokaryotes: no nucleus (bacteria)
*Their genomes are circular
Eukaryotes: have a nucleus (animals, plants)
**Linear genomes with multiple chromosomes in pairs
Animal cell
Sequence alignment
Sequence alignment is a way of arranging the
sequences of DNA, RNA, or protein to identify regions
of similarity that may be a consequence of
functional, structural, or evolutionary relationships
between the sequences.
Comparing DNA/protein sequences for
Similarity
Homology
Prediction of function
Construction of phylogeny
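One classic way to compute such an alignment is the Needleman-Wunsch dynamic programming algorithm for global alignment. The slides do not name a specific method, so the sketch below is only one possible illustration; the function name `global_align` and the simple scores (match +1, mismatch -1, gap -1) are assumptions.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))

print(global_align("ACGT", "AGT"))
```

When several alignments share the best score, the traceback returns only one of them; which one depends on the order in which the three moves are tested.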
(a) Find the amino acid sequence of the human
myoglobin protein.
- Write down the amino acid sequence. What is the
length of the sequence? Which database did you use
to find the information?
(b) Find the amino acid sequence of the mouse
myoglobin protein.
- Write down the amino acid sequence. What is the
length of the sequence? Which database did you use
to find the information?
(c) What is the difference between human and mouse
myoglobins? Describe the differences in your own words.
Sequence alignment
They seem quite similar. However, if we think of the letters
as amino acid residues rather than elements of strings,
alignment (a) is the better one, because isoleucine (I) and
leucine (L) have similar side chains, while tryptophan (W) has a
very different structure. This is a physico-chemical measure;
we might prefer to say that leucine simply substitutes for
isoleucine more frequently.
Taylor venn-diagram
Physico-chemical properties of amino acids
Scoring matrices: PAM and BLOSUM
Instead of having a single match/mismatch score for every pair of
amino acids, consider chemical, physical, evolutionary relationships:
For example:
Alanine vs. valine or alanine vs. lysine? Alanine and valine are both small and
hydrophobic, but lysine is large and charged. Which substitutions occur more in
nature?
Assign scores to each pair of symbols
Higher score means more similarity
PAM 120
BLOSUM 62
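Scoring an alignment with a substitution matrix amounts to looking up each aligned pair and summing the scores. The sketch below shows this for the alanine, valine and lysine example. The scores in the small table are illustrative values chosen for this sketch, in the style of BLOSUM matrices; they are an assumption and not an authoritative copy of any published matrix. The function names are also hypothetical.

```python
# Illustrative substitution scores (assumption: BLOSUM-style values, not an
# authoritative copy of a published matrix). Stored once per unordered pair.
SCORES = {
    ("A", "A"): 4, ("V", "V"): 4, ("K", "K"): 5,
    ("A", "V"): 0, ("A", "K"): -1, ("K", "V"): -2,
}

def pair_score(x: str, y: str) -> int:
    """Look up the score for a pair of residues, in either order."""
    return SCORES[tuple(sorted((x, y)))]

def alignment_score(seq_a: str, seq_b: str) -> int:
    """Sum pair scores over two gap-free sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))

print(alignment_score("AAK", "AVK"))  # alanine aligned with valine
print(alignment_score("AAK", "AKK"))  # alanine aligned with lysine
```

With these values the alanine/valine substitution scores higher than alanine/lysine, reflecting the idea that the more similar pair is penalised less.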
Use of scoring matrices
If you have no prior knowledge of the sequence,
BLOSUM62, the default matrix in BLAST searches,
is probably the best choice.
For distantly related sequences, select low-numbered BLOSUM
matrices (for example BLOSUM45) or high-numbered PAM matrices
such as PAM250.
The BLOSUM matrices with low numbers correspond to
PAM matrices with high numbers.

Scoring Alignments
Dot-plot representations
Scoring function
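A dot-plot representation can be sketched as a simple text grid: one sequence runs down the rows, the other across the columns, and a mark is placed wherever the two residues are identical. The function name `dot_plot` and the example sequences are assumptions for illustration; real dot-plot tools usually add window and threshold filtering, which is omitted here.

```python
def dot_plot(seq_a: str, seq_b: str) -> list:
    """One text row per residue of seq_a; '*' marks identical residues."""
    return [
        "".join("*" if x == y else "." for y in seq_b)
        for x in seq_a
    ]

# Hypothetical example sequences differing at one position.
for row in dot_plot("GATTACA", "GATCACA"):
    print(row)
```

Similar regions show up as diagonal runs of marks, and a break or shift in the diagonal points to a mismatch or a gap.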
