What is?

Introduction to Bioinformatics
Bioinformatics

Applying computational techniques to biological data

Medical informatics
Applying computational techniques to medical data

Chemo-informatics
Applying computational techniques to chemical data

Lots of overlap between the three disciplines


Idea is to enhance and enable scientific discovery

What is bioinformatics?

Bioinformatics Data

an emerging interdisciplinary research area


Applications

deals with the computational management and analysis of biological information: genes, genomes, proteins, cells, ecological systems, medical information, robots, artificial intelligence...

DNA Data

Structural Genomics
Predictive toxicology
Drug trial Data
Medical diagnosis
Patient Record Data

The Core of Bioinformatics to date


Relationships between

We know the DNA sequence in many species

TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV

sequence

Good and Bad News


E.g., human genome (3.2 billion bases)

There is a 1:1 mapping


DNA sequence in genes → protein structure
3D structure

protein functions

Properties and evolution of genes, genomes, proteins, metabolic pathways in cells

But we don't yet know a good algorithm for this


And we cannot yet observe proteins folding

Holy Grail of bioinformatics


Identify protein structure from amino acid sequence

Use of this knowledge for prediction, modelling, and design

The holy grail of bioinformatics


GCTCCTCACTGTCTGTGTTTATTC
TTTTAGCTTCTTCAGATCTTTTAG
TCTGAGGAAGCCTGGCATGTGCA
AATGAAGTTAACCTAA...

Basic concepts
> 500,000 genes sequenced to date

Expected number of unique protein structures: ~700-1,000

Information processing in cells

conceptual foundations of bioinformatics:


evolution
protein folding
protein function
bioinformatics builds mathematical models
of these processes to infer relationships between components
of complex biological systems

Global approaches: Toward a new Systems Biology


Global cell state

nucleic acids

proteins

Genome

How does the spatial and temporal organisation of living matter give rise to biological processes?

regulatory
transcripts

sites

Protein population:
proteomics

Genome activation
patterns: transcriptomics

coding regions

One-to-many mappings!

Organisation:
tissue imaging

EM

Context-dependence!

X-ray, NMR

cells
molecular complexes

Global approaches: Toward a new Systems Biology

Perturbation

Living cell

Dynamic response

We do not know yet whether the information in the genome is sufficient to reconstruct an entire biological system. Information on building blocks is not enough; information on their interactions is essential.

External environment
Internal environment

Biological knowledge
(computerised)
Sequence information

Basic principles
Virtual cell

Practical
applications

Metabolic net
Genetic networks

Structural information

Bioinformatics

DNA hRNA

mRNAs

proteins

Mathematical
modelling
Simulation

Transcription
The Central
Dogma

Bioinformatics in context
Mathematics/
computer
science

Genomics

DNA
transcription


RNA

Molecular
biology

Bioinformatics

Biophysics

translation


Proteins
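The flow from DNA through RNA to protein can be illustrated with a minimal sketch. It assumes the standard genetic code, takes the coding (sense) strand as input, reads in frame from the first base, and ignores splicing and other processing. The names `transcribe`, `translate`, and `CODON_TABLE` are illustrative and do not come from the slides.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA corresponding to a coding-strand DNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon.

    Trailing bases that do not form a full codon are ignored.
    Characters other than A, C, G, U raise a KeyError.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGGCCTAA")
    print(rna, translate(rna))
```

A real pipeline would also need to choose the reading frame and strand, which this sketch deliberately leaves out.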

Current challenges to users


Potential hurdles:
Methods are in flux and not fully developed
Scattered and heterogeneous resources

Ethical, legal,
and social
implications

Molecular
evolution

Sequence homology search of the genome of Plasmodium falciparum

Remedies: Web resources


navigation guides
integration of tools and databanks
http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html

The search for new antimalarial drugs
Malaria is one of the leading causes of morbidity
and mortality in the tropics.
300 to 500 million estimated clinical cases and 1.5
million to 2.7 million deaths per year.
Nearly all fatal cases are caused by Plasmodium
falciparum.
The parasite's resistance to conventional
antimalarial drugs such as chloroquine is growing
at an alarming rate.

Target identification for antimalarial drugs

P. falciparum has a plastid-like organelle, called the apicoplast, acquired by endosymbiosis of an alga.

Jomaa et al. (1999)

Self-replicating, maternally inherited (35 kb, circular DNA).


Comparative genome analysis: Search for orthologs.
Apicoplast contains enzymes found in plant and bacterial, but not animal, metabolic pathways.
Potential target for antimalarial drugs:
DOXP reductoisomerase

Jomaa et al. (1999) Science 285: 1573-1576

Biological databases

The challenge

Searching Databases
We have ways to score how well 2 seqs match
Now want to use this in databases
Given a known gene sequence
Which genes in the database are closely related?
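One way to score how well two sequences match, and then rank database entries against a query, is a global alignment score. The sketch below uses Needleman-Wunsch dynamic programming with match +1, mismatch -1, and a linear gap penalty of -1. These values, and the function names, are assumptions for illustration; the slides do not specify a scoring scheme, and production tools such as BLAST use heuristics and substitution matrices instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b.

    Only the score is computed (no traceback), keeping one row in memory.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def rank_database(query, database):
    """Return (score, name) pairs for every entry, best match first.

    `database` is assumed to be a dict mapping entry names to sequences.
    """
    scored = [(global_alignment_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    toy_db = {"entry_1": "ACGT", "entry_2": "TTTT"}  # hypothetical entries
    print(rank_database("ACGT", toy_db))
```

Scanning every entry like this costs time proportional to query length times total database length, which is why real database search relies on faster heuristics.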

Have to worry about:


(Boguski, 1999)

In 1995, the number of genes in the database started to exceed the number of papers on molecular biology and genetics in the literature!

Repeated subsequences biasing matches


Accuracy and significance of matches
Sensitivity and specificity (false positives and false negatives)
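Sensitivity and specificity of a database search can be made concrete with a short sketch. It assumes we already know which entries are truly related to the query (in practice this comes from a curated benchmark), and the function and argument names are illustrative.

```python
def sensitivity_specificity(predicted_hits, true_relatives, all_entries):
    """Return (sensitivity, specificity) for one database search.

    sensitivity = TP / (TP + FN): fraction of true relatives that were found
    specificity = TN / (TN + FP): fraction of unrelated entries correctly rejected
    """
    predicted = set(predicted_hits)
    truth = set(true_relatives)
    universe = set(all_entries)

    tp = len(predicted & truth)             # related and reported
    fp = len(predicted - truth)             # reported but unrelated (false +)
    fn = len(truth - predicted)             # related but missed (false -)
    tn = len(universe - predicted - truth)  # unrelated and not reported

    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical search: five entries, two true relatives, two reported hits.
    print(sensitivity_specificity({"a", "c"}, {"a", "b"}, {"a", "b", "c", "d", "e"}))
```

Loosening the score threshold typically raises sensitivity at the cost of specificity, which is the trade-off the slide points to.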

Multiple Sequence Alignments


Protein sequences
form families
Learn much more
about a gene by
looking at its
family


Multiple sequence
alignment algorithms

Profiles
PSI-BLAST

Data types
primary data

sequence

AATGCGTATAGGC

DNA

DMPVERILEALAVE

amino acid

secondary data
motifs: regular
expressions, blocks,
profiles, fingerprints

tertiary data
atomic co-ordinates

secondary
protein structure

primary database

secondary db

e.g., alpha-helices, beta-strands

tertiary protein
structure

tertiary db

domains, folding units
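Motifs stored as regular expressions, one of the secondary data types listed above, can be searched with any regex engine. The sketch below converts a PROSITE-style pattern into a Python regular expression and scans a protein sequence. The converter is a simplification that covers only the common syntax (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`), the helper names are illustrative, and the test sequence is made up. The example pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to Python regex syntax."""
    regex = pattern.rstrip(".")                           # optional trailing period
    regex = regex.replace("-", "")                        # element separators
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = any residue but P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) = repeat counts
    regex = regex.replace("<", "^").replace(">", "$")     # sequence termini
    return regex


def find_motif(pattern: str, sequence: str):
    """Return (start, match) for every occurrence, including overlapping ones."""
    # A lookahead with a capture group lets finditer report overlapping hits.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSALNPTQ"))  # hypothetical sequence
```

Short patterns like this one match by chance quite often, so a hit is a hint rather than evidence of function.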

Primary biological databases

International nucleotide data banks


GenBank

EMBL
Europe

Protein

Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)

PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D

GenBank file format

Swiss-Prot

EMBL
EBI

International
Advisory Meeting

USA
NLM
NCBI

Collaborative Meeting

TrEMBL

DDBJ

NRDB

Japan
NIG
CIB

GenBank file format

SWISS-PROT file format

Other primary protein databases


TrEMBL (translated EMBL) in SWISS-PROT format
rapid access to sequence data from genome projects
computer-annotated supplement to SWISS-PROT
translations of all coding sequences (CDS) in EMBL
SP-TrEMBL
REM-TrEMBL: immunoglobulins, T-cell receptors, short
fragments, synthetic and patented sequences

PIR: related databases


NRL-3D Sequence-Structure Database

produced by PIR from sequence and annotation


information extracted from three-dimensional
structures in the Protein Databank (PDB)
allows keyword and similarity searches

OWL composite database


By accession number
By database code

Other primary protein databases

The Protein Information Resource (PIR)


integrated system of protein sequence databases
and derived related databases, e.g., alignment
databases
rapid searching, comparison, and pattern matching of
protein sequences
retrieval of descriptive, bibliographic, feature, and
concurrent cross-reference information
aims to be comprehensive and consistently
annotated

PIR: related databases


PATCHX integrated with PIR
a non-redundant database of protein sequences
produced by MIPS, the European branch of PIR-International
The PIR Protein Sequence Database and PATCHX
together provide the most complete collection of
protein sequence data currently available in the
public domain.

Two other useful sites


INFOBIOGEN-The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/

By text
By sequence
By title
By author
By query language
By regular expression

OWL only released every 6-8 weeks

Direct OWL access:

KEGG-Kyoto Encyclopedia of Genes and Genomes


http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to
computerize current knowledge of molecular and cellular biology in
terms of the information pathways that consist of interacting molecules
or genes and to provide links from the gene catalogs produced by
genome sequencing projects.

OWL Blast server

Sequence Retrieval System (SRS)


Database browser that allows users to
retrieve
link
access
entries from all interconnected
resources.
Users can formulate queries across a
range of different database types.

Machine Learning
Machine learning (inductive reasoning)
Automatic proposing of hypotheses based on data
Has many applications in bioinformatics
Including protein structure prediction

Example: predictive toxicology


Given: set of toxic drugs and a set of non-toxic
drugs
Given: background information (chemistry, etc.)
Produces: hypothesis why drugs are toxic

Overview of machine learning


Aims, techniques, methodologies, representations
Artificial neural networks

Evaluating Learned Hypothesis


How do we know that a rule/hypothesis reflects something interesting, not a coincidence?

Show that a learning algorithm isn't overfitting

Guide to Protein Databases:


http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html

i.e., learning the data, rather than generalising


Use cross-validation techniques (hold back data)

Define errors
Use statistics to define confidence intervals
Show that one learning algorithm

With thanks to all who work on Bioinformatics research.

Outperforms another algorithm
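Holding back data to check for overfitting can be sketched as k-fold cross-validation. The code below assumes `train` is any function that takes examples and labels and returns a predictor; `majority_class_learner` is a hypothetical baseline included only so the sketch runs, and the toy labels are made up.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    if k < 2 or n < k:
        raise ValueError("need at least 2 folds and at least one example per fold")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train, examples, labels, k=5, seed=0):
    """Return the held-out error rate of each fold.

    For each fold, the learner is trained on the other k-1 folds and
    evaluated only on examples it has not seen.
    """
    errors = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(examples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train(train_x, train_y)
        wrong = sum(predict(examples[i]) != labels[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return errors


def majority_class_learner(xs, ys):
    """Baseline: always predict the most common training label."""
    most_common = Counter(ys).most_common(1)[0][0]
    return lambda x: most_common


if __name__ == "__main__":
    toy_labels = ["toxic"] * 6 + ["non-toxic"] * 4  # hypothetical labels
    print(cross_validate(majority_class_learner, list(range(10)), toy_labels))
```

Running two learners over the same folds (same `seed`) gives paired error rates, which is what a statistical comparison of one algorithm against another would start from.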
