
Topic 7

Data-mining in transcriptomics databases

• Genome-wide expression profiling


• The technology
• Organization and classification of data-sets
• Data-mining
ORGANIZATION OF BIOLOGICAL DATA

Gene → Genomics
mRNA → Transcriptomics
Protein → Proteomics
Function (enzyme, hormone, etc.) → Protein sequence / 3-D structural database
The Flow of Genetic Information

DNA      5’ ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG 3’   (sequence same as RNA)
         3’ TGACGTGGTACCCCGAGTCGCTGCCCCTTACCGTGAACCAC 5’   (sequence complementary to RNA)

mRNA     5’ ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG 3’
         (initiation codon signal: AUG)

Protein  Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
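
The flow from DNA to mRNA to protein on this slide can be traced in a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the codon table lists just the codons that occur in the slide's sequence (in the standard genetic code GAA encodes Glu), not the full genetic code.

```python
# Minimal sketch: transcribe the coding strand and translate from the first AUG.
# CODON_TABLE is deliberately partial; it covers only the codons in this example.
CODON_TABLE = {
    "AUG": "Met", "GGG": "Gly", "CUC": "Leu", "AGC": "Ser", "GAC": "Asp",
    "GAA": "Glu", "UGG": "Trp", "CAC": "His", "UUG": "Leu", "GUG": "Val",
}

def transcribe(coding_strand):
    """The mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Read codons in frame from the first AUG until an unknown codon or the end."""
    start = mrna.find("AUG")
    if start < 0:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            break
        peptide.append(CODON_TABLE[codon])
    return peptide

dna = "ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG"
mrna = transcribe(dna)
peptide = translate(mrna)
print(mrna)
print("-".join(peptide))  # Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
```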
DESCRIPTION OF A LIVING CELL / VIRUS

Genome / Genomics: general capability of the cell

Transcriptomics: readiness of the cell

Proteomics / Protein map: physiological state of the cell
Network genomics

Metabolites

DNA RNA Protein

Growth rate
Expression

stem cells
cancer cells
microbes
Some useful signals on Genes
• Upstream activating sequences (UAS)
• TATA box
• mRNA expression start & end (on the DNA)
• Ribosomal binding site (on the mRNA)
• Protein synthesis starts / protein synthesis stops (on the mRNA)
A typical gene in higher organisms

• Transcription start site
• Translation start site
• Exon (coding region)
• Intron (non-coding region)
• Donor model and acceptor model (splice sites)
• Stop codon
Alternative splicing leads to diversity
Transcription
start site
E1 I1 E2 I2 E3

E1 E2 E3

E1 I1 E2 E3
Human RNA-splice junctions sequence matrix
Genetic Regulation of Processes
(Regulation of Transcriptional Activity)
A Typical Genetic Regulatory Circuit

McAdams and Arkin, Proc. Natl. Acad. Sci., 1997, vol 94, 814-819
Newly identified members of Gal4 Regulatory Circuit

Ren et al, Science, 22 Dec 2000, vol 290, 2306-2309


8 cross-checks for regulon quantitation
• In vitro selection (Selex)
• Protein fusions (A-B; A, B)
• In vivo selection (one-hybrid)
• Microarray data: coregulated sets of genes
• Phylogenetic profiles (EC, SC, BS, HI):
  P1 1 0 1
  P2 1 1 0
  P3 0 1 1
  P4 1 0 0
  P5 1 1 1
  P6 0 1 1
  P7 1 1 0
• Metabolic pathways (TCA cycle)
• Conserved operons:
  B. subtilis: purM purN purH purD
  E. coli: purM purN ... purH purD
• Known regulons in other organisms


Data mining in transcriptomics
databases

47 articles on RNA array data


13 databases (3 Sybase, 2 Oracle, 8 Other)
60 articles on RNA array data mining
108 companies, 23 for software
Current Gene Expression Databases

• Axeldb (www.dkfz-heidelberg.de/abt0135/axeldb.htm): gene expression in Xenopus
• BodyMap (bodymap.ims.u-tokyo.ac.jp/): human & mouse gene expression
• FlyView (pbio07.uni-muenster.de/): Drosophila
• Interferon Stimulated Gene Database (www.lerner.ccf.org/labs/williams/xchi-html.cgi): genes induced by treatment with interferon
• Stanford Microarray Database (genome-www.stanford.edu/microarray): raw & normalized data from various sources
RNA quantitation database integration

Microarrays (1): hybridization of experiment and control samples to ~1000 bp ORF probes
• R/G ratios
• R, G values
• quality indicators

Affymetrix (2): hybridization to 25-bp perfect-match (PM) and mismatch (MM) probes for each ORF
• Averaged PM-MM
• “presence”
• feature statistics
• 25-mers

SAGE (3): sequence counting of SAGE tag concatamers
• Counts of SAGE 14-mer sequence tags for each ORF

1 DeRisi et al., Science 278:680-686 (1997)
2 Lockhart et al., Nat Biotech 14:1675-1680 (1996)
3 Velculescu et al., Science 270:484-487 (1995)
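
Two of the stored quantities above are simple to compute from raw intensities. The sketch below is an illustration under stated assumptions: the function names and the intensity values are made up, the red channel is taken to be the experiment and the green channel the control, and the log2 transform of the R/G ratio is a common convention rather than something the slide specifies.

```python
import math

def log2_ratio(red, green):
    """log2 of the R/G intensity ratio for one microarray spot (assumed: R = experiment, G = control)."""
    return math.log2(red / green)

def average_difference(pm, mm):
    """Averaged PM-MM: mean of perfect-match minus mismatch intensity over one gene's probe pairs."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities, for illustration only.
ratio = log2_ratio(800.0, 200.0)
avg_diff = average_difference([120.0, 300.0, 250.0], [100.0, 100.0, 150.0])
print(ratio)     # 2.0, i.e. a four-fold higher signal in the red channel
print(avg_diff)
```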
• GeneChip expression analysis probe array: each probe cell contains millions of copies of a specific oligonucleotide probe
• Biotinylated RNA from experiment
• Streptavidin-phycoerythrin conjugate
• Image of hybridized probe array
Error Model for Microarray Data

Fawcett et al, Proc. Natl. Acad. Sci. USA (2000) 97, 8063-68
Representation of expression data

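[Figure: normalized expression data from microarrays. Each gene (Gene 1 ... Gene N) has a value at time-points T1, T2, T3 and is drawn as a point in the space whose axes are the time-points (Time-point 1 and Time-point 3 are labelled); dij marks the distance between a pair of genes.]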
Cluster analysis of mRNA expression data

By gene (rat spinal cord development, yeast cell cycle):
Wen et al., 1998; Tavazoie et al., 1999; Eisen et al., 1998; Tamayo et al., 1999

By condition or cell-type or by gene & cell-type (human cancer):
Golub et al., 1999; Alon et al., 1999; Perou et al., 1999; Weinstein et al., 1997
Cluster Analysis

• To divide samples into homogeneous groups based on a set of features.
• Clustering of genes based on similarity in expression pattern over a range of conditions.

Protein/protein complex

Genes

DNA regulatory elements


Gene Expression Data Analysis

Gene Expression Data


Pairwise Measures
Distance/Similarity Matrix
Clustering
Gene Clusters
Motif Searching/...
Regulatory Elements / Gene Functions
Clusters of Two-Dimensional Data
Key Terms in Cluster Analysis

• Distance & Similarity measures


• Hierarchical & non-hierarchical
• Single/complete/average linkage
• Dendrograms & ordering
Distance Measures: Minkowski Metric

Suppose two objects x and y both have p features:

$$x = (x_1, x_2, \ldots, x_p), \qquad y = (y_1, y_2, \ldots, y_p)$$

The Minkowski metric is defined by

$$d(x, y) = \sqrt[r]{\sum_{i=1}^{p} |x_i - y_i|^r}$$

Most Common Minkowski Metrics

1. r = 2 (Euclidean distance):

$$d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$$

2. r = 1 (Manhattan distance):

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

3. r → +∞ ("sup" distance):

$$d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$$
An Example
[Figure: points x and y, separated by 4 in one feature and 3 in the other]

1. Euclidean distance: $\sqrt{4^2 + 3^2} = 5$.
2. Manhattan distance: $4 + 3 = 7$.
3. "sup" distance: $\max\{4, 3\} = 4$.
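
The three distances in this example can be checked with a small function. This is a minimal sketch: the function name is hypothetical, and the coordinates (0, 0) and (4, 3) are an assumption chosen only so that the feature differences are 4 and 3, as on the slide.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r; r = float('inf') gives the "sup" distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

x, y = (0, 0), (4, 3)  # assumed coordinates; only the differences 4 and 3 matter
euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
sup = minkowski(x, y, float("inf"))
print(euclidean, manhattan, sup)
```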
Manhattan distance is called Hamming
distance when all features are binary.

Gene Expression Levels Under 17 Conditions (1-High,0-Low)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
GeneA 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
GeneB 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1

Hamming distance: #(01) + #(10) = 4 + 1 = 5.
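
The count can be reproduced directly from the two binary profiles in the table. The variable names below are illustrative.

```python
# Binary expression profiles of GeneA and GeneB under the 17 conditions (1 = high, 0 = low).
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

# Count the two kinds of mismatch separately, then add them.
n01 = sum(1 for a, b in zip(gene_a, gene_b) if (a, b) == (0, 1))
n10 = sum(1 for a, b in zip(gene_a, gene_b) if (a, b) == (1, 0))
hamming = n01 + n10
print(n01, n10, hamming)  # 4 1 5
```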
Similarity Measures: Correlation Coefficient
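$$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

averages:

$$\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i \quad \text{and} \quad \bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$$

$$|s(x, y)| \le 1$$

What kind of x and y give
(1) s(x,y) = 1,
(2) s(x,y) = -1,
(3) s(x,y) = 0 ?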
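
A short numerical check helps with the question above: s = 1 when y is an increasing linear function of x, s = -1 when y is a decreasing linear function of x, and s = 0 when the two profiles have no linear relationship. The sketch below uses made-up profiles and a hypothetical function name.

```python
import math

def pearson(x, y):
    """Correlation coefficient s(x, y) as defined above."""
    p = len(x)
    mean_x = sum(x) / p
    mean_y = sum(y) / p
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

x = [1.0, 2.0, 3.0, 4.0]
s_pos = pearson(x, [2 * v + 1 for v in x])    # y rises linearly with x
s_neg = pearson(x, [-v for v in x])           # y falls linearly with x
s_zero = pearson(x, [1.0, -1.0, -1.0, 1.0])   # no linear trend
print(s_pos, s_neg, s_zero)
```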
Similarity Measures: Correlation Coefficient

[Figure: three panels plotting expression level against time for Gene A and Gene B]
Pattern recognition &
normalization

Singular Value Decomposition (SVD) = Principal-Component Analysis (PCA)

Linear transformation of the genes-by-conditions space to “eigen” space, producing orthonormal superpositions.
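
To make the "eigen" space concrete, the sketch below finds the first principal component of a tiny genes-by-conditions matrix. It is an illustration only: it uses power iteration on the covariance matrix instead of a full SVD routine, the function name and data are hypothetical, and it assumes the data are not constant and that the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix of the columns, by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]
    v = [1.0] * p  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]  # keep the vector at unit length
    return v

# Four hypothetical genes measured under two conditions; condition 2 is twice condition 1.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc1 = first_principal_component(data)
print(pc1)  # a unit vector along the direction (1, 2)
```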
[Figure: hierarchical & non-hierarchical clustering of normalized expression data; leaves labelled a, b, c, d]
Clustering methods

Hierarchical: a series of successive fusions or splittings of the data until a final number of clusters is obtained.
• A definite hierarchy between clusters & sub-clusters.
Non-hierarchical: a number of clusters is assumed at the start. Points are allocated among clusters so that a criterion is minimized, e.g. the within-cluster sum of variances.
• No hierarchy within clusters or between clusters.
• E.g. K-means, self-organizing maps, etc.
Hierarchical Clustering Techniques

At the beginning, each object (gene) is a cluster. In each subsequent step, the two closest clusters merge into one cluster, until only one cluster is left.
The distance between two clusters is defined as the distance between:

• their nearest members (single-link method / nearest neighbor)
• their furthest members (complete-link method / furthest neighbor)
• their centroids
• all cross-cluster pairs, averaged (average linkage)
Single-Link Method
Euclidean distance

Distance matrix:

       b   c   d
a      2   5   6
b          3   5
c              4

(1) Merge a and b (distance 2):

       c   d
a,b    3   5
c          4

(2) Merge a,b and c (distance 3):

         d
a,b,c    4

(3) Merge a,b,c and d (distance 4).
Complete-Link Method
Euclidean distance

Distance matrix:

       b   c   d
a      2   5   6
b          3   5
c              4

(1) Merge a and b (distance 2):

       c   d
a,b    5   6
c          4

(2) Merge c and d (distance 4):

       c,d
a,b    6

(3) Merge a,b and c,d (distance 6).
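
Both worked examples can be reproduced with one short agglomerative routine in which only the linkage rule changes. This is a minimal sketch with hypothetical names, written for clarity rather than speed; it uses the distance matrix for a, b, c, d given on these slides.

```python
def agglomerate(labels, dist, linkage):
    """Agglomerative clustering; linkage=min is single-link, linkage=max is complete-link.
    Returns the merges in order as (cluster_1, cluster_2, distance) tuples."""
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_distance(c1, c2):
        return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        best, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], best))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

pairwise = {("a", "b"): 2, ("a", "c"): 5, ("a", "d"): 6,
            ("b", "c"): 3, ("b", "d"): 5, ("c", "d"): 4}
dist = {frozenset(pair): d for pair, d in pairwise.items()}

single = agglomerate("abcd", dist, min)
complete = agglomerate("abcd", dist, max)
print([d for _, _, d in single])    # [2, 3, 4]
print([d for _, _, d in complete])  # [2, 4, 6]
```

The merge distances are the heights at which branches join in the corresponding dendrograms.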
Compare Dendrograms
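[Figure: single-link and complete-link dendrograms over a, b, c, d on a height axis from 0 to 6. Single-link merges occur at heights 2, 3 and 4; complete-link merges occur at heights 2, 4 and 6.]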
Which clustering methods do you suggest
for the following two-dimensional data?
Problems of Hierarchical
Clustering
• It is concerned more with the complete tree structure than with the optimal number of clusters.
• There is no possibility of correcting for a
poor initial partition.
• Similarity and distance measures rarely
have strict numerical significance.
Non-hierarchical clustering
Normalized Expression Data
Interpreting Patterns of Gene Expression
with Self Organizing Maps

Tamayo et al, Proc. Natl. Acad. Sci. USA, 1999, Vol 96, 2907
SOM algorithm
• The initial mapping $f_0$ of the nodes is random.
• At each iteration $i$, a data point $P$ is selected and the node $N_P$ that maps closest to $P$ is identified.
• The mapping of the nodes is then adjusted by the formula

$$f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i)\,(P - f_i(N))$$

where the learning rate is $\tau(x, i) = 0.02\,T / (T + 100\,i)$ and $T$ is the maximum number of iterations.
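
A single update step of the algorithm above might look like the sketch below. Several details are assumptions, not taken from the slide: the nodes are arranged in a 1-D chain, d(N, Np) is the difference in chain position, and the learning rate is applied only to nodes within a fixed `radius` of the winning node (the slide does not say how the rate depends on that distance). All names are hypothetical.

```python
def learning_rate(i, T):
    """Learning rate 0.02 T / (T + 100 i) from the slide."""
    return 0.02 * T / (T + 100 * i)

def som_step(nodes, point, i, T, radius):
    """One SOM iteration: find the winning node, then move nearby nodes toward the data point."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    winner = min(range(len(nodes)), key=lambda k: sqdist(nodes[k], point))
    rate = learning_rate(i, T)
    updated = []
    for k, node in enumerate(nodes):
        if abs(k - winner) <= radius:  # assumed neighbourhood rule on a 1-D chain
            updated.append([a + rate * (p - a) for a, p in zip(node, point)])
        else:
            updated.append(list(node))
    return updated, winner

# Hypothetical starting positions, fixed here instead of random so the result is reproducible.
nodes = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
new_nodes, winner = som_step(nodes, [1.2, 1.2], i=0, T=100, radius=1)
print(winner)     # 1
print(new_nodes)  # nodes 0, 1 and 2 move slightly toward the point; node 3 stays put
```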
Clustering of genes with Self Organizing Maps
Clustering by K-means
• Given a set S of N p-dimensional vectors and no prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset minimizes some measure of dissimilarity locally. The algorithm converges to a local optimum of the total dissimilarity over all subsets; a global optimum is not guaranteed.
• K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean. Using the Euclidean distance between the coordinates of any two genes in the space reflects ignorance of a more biologically relevant measure of distance.
• The first cluster center is chosen as the centroid of the entire data set, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. The algorithm is run for 200-400 iterations.
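
The initialization and iteration described above can be sketched as follows. This is an illustration under assumptions: "farthest from the centers already chosen" is read as the point whose nearest chosen center is farthest away, the loop stops early once the centers no longer change, and the data and function names are hypothetical.

```python
def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    return [sum(p[j] for p in points) / len(points) for j in range(len(points[0]))]

def init_centers(points, k):
    """First center: centroid of all data. Later centers: the point farthest from those chosen."""
    centers = [mean(points)]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(farthest))
    return centers

def kmeans(points, k, iterations=200):
    centers = init_centers(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sqdist(p, centers[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
    return centers, labels

# Two well-separated groups of hypothetical expression vectors.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(points, k=2)
print(labels)
```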
Representation of expression data
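[Figure: normalized expression data from microarrays. Each gene (Gene 1 ... Gene N) has a value at time-points T1, T2, T3 and is drawn as a point in the space whose axes are the time-points (Time-point 1 and Time-point 3 are labelled); dij marks the distance between a pair of genes.]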
Identifying prevalent expression patterns
(gene clusters)
[Figure: genes plotted in time-point space (axes Time-point 1 and Time-point 3), with three panels showing normalized expression against time-point (1, 2, 3) for each gene cluster]

Evaluate Cluster contents
Genes: MIPS functional category
gpm1: Glycolysis
HTB1: Nuclear Organization
RPL11A, RPL12B, RPL13A, RPL14A, RPL15A, RPL17A, RPL23A: Ribosome
TEF2: Translation
YDL228c, YDR133C, YDR134C, YDR327W, YDR417C, YKL153W, YPL142C: Unknown
Representation and clustering of Gene Expression Data

Eisen et al, Proc. Natl. Acad. Sci. USA, 1998, Vol 95, 14863
Hierarchical Clustering of Genes from Expression Data

Red=up-regulated, green=down-regulated
Gene Disruption Studies in Yeast

[Figure: expression matrix with genes along the horizontal axis and mutants along the vertical axis]

Hughes et al, Cell, 2000, vol 102, 109-126


Molecular Classification of Human Breast Tumors
Biclustering of Gene Expression Data
[Figure: expression matrix with breast tumor samples along the horizontal axis and genes along the vertical axis]

Perou et al, Nature, 2000, vol 406, 747-752


Identification of marker genes in cancer by
expression profiling
Data-Management in Cancer Research

Weinstein et al, Science (1997) 275, 343-349


Obtaining correlation by integrating two data-sets
Database S: Molecular Structure Descriptors
460,000 compounds x 588 descriptors
Database A: Activity patterns (-log GI50)
60,000 compounds x 60 cell lines
Database T: molecular targets (abundance/expression)
100 targets x 60 cell lines
Database A.T’: Correlation between compounds & targets

A (60k compds x 60 cell lines) . T’ (60 cell lines x 100 targets) = A.T’ (60k compds x 100 targets)
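
One way to read the product A.T’ as a table of correlations is sketched below. It rests on an assumption that the slide does not state: each row of A and of T is first standardized across the cell lines, so that the matrix product divided by the number of cell lines equals the Pearson correlation between a compound's activity pattern and a target's expression pattern. The tiny matrices and the function names are made up for illustration.

```python
import math

def standardize(row):
    """Shift a row to zero mean and scale it to unit (population) standard deviation."""
    n = len(row)
    m = sum(row) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in row) / n)
    return [(v - m) / sd for v in row]

def correlation_matrix(A, T):
    """Rows of A: compounds x cell lines. Rows of T: targets x cell lines.
    Returns the compounds x targets matrix (Z_A . Z_T') / n of Pearson correlations."""
    ZA = [standardize(r) for r in A]
    ZT = [standardize(r) for r in T]
    n = len(A[0])
    return [[sum(a * t for a, t in zip(ra, rt)) / n for rt in ZT] for ra in ZA]

# Hypothetical data: 2 compounds and 2 targets measured across 4 cell lines.
A = [[1.0, 2.0, 3.0, 4.0],
     [4.0, 3.0, 2.0, 1.0]]
T = [[2.0, 4.0, 6.0, 8.0],
     [1.0, -1.0, -1.0, 1.0]]
AT = correlation_matrix(A, T)
print(AT)  # rows: compounds, columns: targets
```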
‘‘Clustered correlation’’ map of compounds & molecular targets

compounds

Targets
Gleaning information from the Cancer databases at NCI

• Clustering of cell lines based on A, T, & A.T’ databases
• Prediction of mechanism of action of drugs based on
A.T’ database
• Correlation of targets in terms of expression based on
T.T’ database.
• Correlation of targets in terms of activities based on
(A.T’)’.(A.T’) database.
• Correlation between structure descriptors and
molecular targets based on S’.(A.T’) database.
Target-target correlation using cancer data

In terms of expression: T.T’
In terms of activities: (A.T’)’.(A.T’)
[Figure: two target-by-target correlation maps, each with targets 1 to 113 on both axes]
Correlation between structure descriptors and targets in the S’.(A.T’) database
Scherf et al, Nature Genetics (2000) 24, 236-44
Hierarchical clustering of human cancer cell lines

• Based on gene expression profiles
• Based on sensitivity to 1400 compds tested
Clustered correlation for A.T’ database (drugs by genes)
Distinct Types of Diffuse Large B-Cell Lymphoma
Identified by Gene Expression Profiling

Alizadeh et al, Nature (2000) 403, 503-511


Gene expression signatures for cancer types
DLBCL gene expression subgroups define
prognostic categories
Class Discovery & Class Prediction in Cancer Research
by Gene Expression Monitoring

• General strategy, independent of previous biological knowledge
• Class Discovery: New Cancer Classes
• Class Prediction: Assigning tumors to known classes
• Based solely on gene expression monitoring

Golub et al, Science, 1999, vol 286, 531-537


Class Distinction Between
Acute Myeloid Leukemia (AML) &
Acute Lymphoblastic Leukemia (ALL)

Identify Distinguishing Features in a Dataset


Class Prediction Between AML & ALL
Assigning new tumor to known class
Class Discovery in Cancer with a 2-cluster SOM

Golub et al, Science, 1999, vol 286, 531-537


Class Discovery with a 4-cluster SOM

• Possibly, discovers a New Class of Cancer


• Can be applied to cancer data irrespective of
biological background
Exon Microarrays for Human Genome

Shoemaker, et al, Nature (2001) 409, 922-927


15,511 probes for 8,183 predicted exons
69 experiments
Using Expression Data from multiple experiments to
validate exons & define Gene boundaries.
Characterization of novel transcripts using Tiling Arrays
Verification of predicted exons using tiling microarrays.
Whole genome scan for validating predicted exons.
Determination of Regulatory Network and Motifs
from Microarray Data

Tavazoie et al, Nature genetics (1999) 22, 281-85


Application of Microarray Technology

• Classification of cancers, identification of marker genes.
• Validation of predicted exons / genes for higher organisms.
• Identification of genetic regulatory networks.
