SVM in Bioinformatics

Computer Science
BISLab. KAIST
Lee, Ki-Young
2003-12-10
Understanding

Application
2003-12-17 Lee, Ki-Young 2
Contents
Introduction
What is SVM
Basic idea, Why OHP
Non-linear Separable Case
Soft Margin Hyperplane
Non-Linear SVM
Multi-class Classification
Success in using SVM
Applications in Bioinformatics
Areas, features, kernel function and parameters,
validation measures, current results, case by
case, current Issues
Conclusion
Reference
Introduction
SVM
Invented by Vapnik as a by-product of statistical learning theory (SLT)
Simple, and training always finds the global optimum
Used for pattern recognition, regression, and linear operator inversion
Considered too slow at first, but for most applications this problem was overcome in the late 1990s
Few parameters to choose, so easy to use
Basic Idea
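Optimal Hyperplane (OHP): the simplest kind of SVM (called an LSVM)
[Figure: swellfish and mackerel plotted by length and weight; the OHP separates the two classes with the maximum margin, and the support vectors lie on the margin boundaries]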
Why maximum Margin Hyperplane
Intuitively this feels safest
There is some theory (using the VC dimension) supporting the proposition that this is a good thing
Empirically it works very well
The hyperplane is really simple
Robust to outliers, since the model is immune to change or removal of any non-support-vector data point
If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification
[Figure: OHP vs. not OHP]
Soft Margin Hyperplanes
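Idea 1:
Find the minimum $\|W\|^2$ while minimizing the training set errors using slack variables $\xi_j$
$W \cdot X_j + b \ge +1 - \xi_j$ for $y_j = +1$
$W \cdot X_j + b \le -1 + \xi_j$ for $y_j = -1$
$\xi_j \ge 0$
[Figure: example points labeled with slack values $\xi = 0$, $0 < \xi < 1$, $1 < \xi < 2$, $\xi > 2$]
Objective function: minimize $\frac{\|W\|^2}{2} + C \sum_i \xi_i^p$, $C \in \mathbb{R}$
Weight term: $\frac{\|W\|^2}{2}$ (the whole objective in the linearly separable case)
Penalty term: $C \sum_i \xi_i^p$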
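The soft-margin objective can be evaluated for a fixed hyperplane with a few lines of Python. This is only a sketch: the function name, the toy data, and the values of `C` and `p` are assumptions for illustration, and it evaluates the objective rather than optimizing it.

```python
def soft_margin_objective(W, b, X, y, C=1.0, p=1):
    """Return (objective, slacks) for a fixed hyperplane (W, b)."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    # Smallest slack that satisfies y_j (W . X_j + b) >= 1 - xi_j with xi_j >= 0
    slacks = [max(0.0, 1.0 - yj * (dot(W, xj) + b)) for xj, yj in zip(X, y)]
    weight_term = dot(W, W) / 2.0                      # ||W||^2 / 2
    penalty_term = C * sum(xi ** p for xi in slacks)   # C * sum(xi^p)
    return weight_term + penalty_term, slacks


# Toy 2-D data (hypothetical): the last point lies on the wrong side of the boundary
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [+1, +1, -1, -1]
objective, slacks = soft_margin_objective(W=(1.0, 0.0), b=0.0, X=X, y=y, C=2.0)
print(slacks)     # [0.0, 0.5, 0.0, 1.5]
print(objective)  # 4.5
```

A slack of 0 means the point is outside the margin, a slack between 0 and 1 means it is inside the margin but correctly classified, and a slack above 1 means it is misclassified.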
Higher Dimensional Space
Idea 2:
Map the data into a higher-dimensional feature space, then find the minimum $\|W\|^2$ in that space
Only the inner product is needed to calculate the dual problem and the decision function (kernelization)
[Figure: original data plotted by length and weight, separated by a hypersurface; after mapping to the feature space (weight^2, length^2, weight * length), the classes are separated by a hyperplane]
Kernel Functions
Gaussian Kernel
$K(X, Y) = \exp(-\|X - Y\|^2 / 2\sigma^2)$
$\sigma$ controls the size of the window
A smaller $\sigma$ results in a sharper but more complicated model
Polynomial Kernel
$K(X, Y) = (X \cdot Y + 1)^d$
$d$: degree
Looks at correlations up to degree $d$
A bigger $d$ results in a more complicated model
Linear Kernel
$K(X, Y) = X \cdot Y$
Sigmoid Kernel, Fisher Kernel, String Kernel
There are so many kernel functions!!
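The three kernels written out on this slide translate directly into code. A minimal sketch in plain Python follows; the function names and default parameter values are assumptions chosen for illustration.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(X, Y) = X . Y
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    # K(X, Y) = (X . Y + 1)^d
    return (dot(x, y) + 1) ** d


def gaussian_kernel(x, y, sigma=1.0):
    # K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


print(linear_kernel((1, 2), (3, 4)))           # 11
print(polynomial_kernel((1, 2), (3, 4), d=2))  # 144
print(gaussian_kernel((1, 2), (1, 2)))         # 1.0
```

For two distinct points, shrinking `sigma` drives the Gaussian kernel value toward 0, which is the "sharper window" effect noted on the slide.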
Multi-class Classification
Using Multi-class SVM
Using two-class SVM
One-against-others
One-against-one
Other variations
One-against-others method
Using N classifiers
Each classifier separates one class (+) versus all other classes (-)
Unseen data $X_u$ is given to Classifier 1, Classifier 2, and Classifier 3: sign +, - ?
More than one classifier can generate +
All classifiers can generate -
Easy and simple
Variations exist
One-against-one method
Using $\binom{N}{2} = N(N-1)/2$ classifiers
Each classifier separates one class (+) versus one other class (-)
Unseen data $X_u$ is given to Classifier 1, Classifier 2, and Classifier 3: sign +, - ?
Too many classifiers
Complicated, and more time is required
More accurate than one-against-others
Variations exist
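One common way to combine the pairwise classifiers is majority voting, sketched below. The voting rule, the class names, and the toy threshold classifiers are all assumptions for illustration; the slides do not say which combination rule is used.

```python
from collections import Counter
from itertools import combinations


def one_against_one_predict(x, classes, pairwise):
    """Majority vote over the N(N-1)/2 pairwise classifiers.

    pairwise[(a, b)] is a function returning a positive value for class a
    and a non-positive value for class b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]


# Hypothetical 1-D example: classes centered at 0, 5 and 10 on a line
classes = ["A", "B", "C"]
pairwise = {
    ("A", "B"): lambda x: 2.5 - x,
    ("A", "C"): lambda x: 5.0 - x,
    ("B", "C"): lambda x: 7.5 - x,
}
print(one_against_one_predict(6.0, classes, pairwise))  # B
```

With N = 3 classes there are 3 pairwise classifiers; with N = 10 there would already be 45, which is the cost noted on the slide.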
Multi-class SVM
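Decision function
$f(X_u) = \arg\max_m (W_m \cdot X_u + b_m)$
Minimize
$\frac{1}{2} \sum_{m=1}^{k} W_m \cdot W_m + C \sum_{i=1}^{n} \sum_{m \ne y_i} \xi_{i,m}$
Constraint
$W_{y_i} \cdot X_i + b_{y_i} \ge W_m \cdot X_i + b_m + 2 - \xi_{i,m}$
$\xi_{i,m} \ge 0$, $i = 1, \dots, n$ and $m \in \{1, \dots, k\}$, $m \ne y_i$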
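Once the per-class weights are trained, the decision function is a plain arg max over the class scores. The sketch below assumes hypothetical, already-trained values for $W_m$ and $b_m$, and returns a 0-based class index rather than the 1-based $m$ used on the slide.

```python
def multiclass_decision(x, W, b):
    """Return the index m that maximizes W_m . x + b_m."""
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w_m, x)) + b_m
        for w_m, b_m in zip(W, b)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical weights for k = 3 classes in 2-D
W = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
b = [0.0, 0.0, 0.5]
print(multiclass_decision((2.0, 1.0), W, b))  # 0
```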
Success in using SVM
What kind of data?
Which feature(s)?
What kind of kernel function?
What parameter values?
Which method to solve multi-class classification?
The same questions arise in all fields of ML
SVMs in Bioinformatics
Feature (s)
Kernel Function and Parameter values
Areas
Current results
Case study
Validation Measures
Introduction
Big Picture of Protein Synthesis
DNA
Protein
Transcription
DNA
RNA
RNA polymerase
preRNA
preRNA → mRNA (Splicing)
[Figure: DNA (one strand) → preRNA (exon, intron, exon, intron, exon) → mRNA, with the two introns snipped out]
Where is the splicing site?
[Figure: pre-mRNA with Exon 1, GU, branching point A, AG, Exon 2; the intron is removed and Exon 1 and Exon 2 are joined in the mRNA]
5′ splice site: donor site
3′ splice site: acceptor site
GU-AG motifs are parts of longer consensus sequences that span the 5′ and 3′ splice sites
In vertebrates
5′ splice site: 5′-AG|GUAAGU-3′
3′ splice site: 5′-PyPyPyPyPyNCAG|-3′ (Py: U or C, N: any nucleotide)
SVM
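The vertebrate consensus sequences can be written as regular expressions to locate candidate splice boundaries. This is a sketch only: exact consensus matching is an assumption made here for illustration (real sites deviate from the consensus, which is why a learned classifier such as an SVM is used), and the test sequence is made up.

```python
import re

# 5' splice site: AG|GUAAGU, boundary right after "AG"
DONOR = re.compile(r"AG(?=GUAAGU)")
# 3' splice site: PyPyPyPyPyNCAG|, boundary right after "CAG"
ACCEPTOR = re.compile(r"[UC]{5}[ACGU]CAG")


def donor_sites(rna):
    """Indices of the first intron base for each exact donor consensus match."""
    return [m.end() for m in DONOR.finditer(rna)]


def acceptor_sites(rna):
    """Indices of the first exon base after each exact acceptor consensus match."""
    return [m.end() for m in ACCEPTOR.finditer(rna)]


rna = "CCAGGUAAGUAAAAUUCUCACAGGG"  # hypothetical pre-mRNA fragment
print(donor_sites(rna))     # [4]
print(acceptor_sites(rna))  # [23]
```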
Translation
DNA
mRNA
tRNA
RNA polymerase
Ribosome
Polypeptide
Protein Structure Hierarchy
Primary Structure: amino acids
Secondary Structure: α helix
Tertiary Structure: polypeptide chain
Quaternary Structure: assembled subunits
Protein Primary Structure
a polypeptide chain
a residue
Protein Secondary Structure
[Figure: α-helix, β-sheet, turn or loop]
Protein Tertiary Structure
[Figure: α-helix, β-sheet, turn or loop]
Protein Structure
Protein structure is so important
Protein structure can give clues to the function of a protein
Traditional biological experiments are time-consuming and expensive
A computational mechanism is needed!!
SVM
Areas in Bioinfo.
Microarray data analysis
Gene functional classification:
Brown et al. (2000)
Pavlidis et al. (2001)
Tissue classification:
Mukherjee et al. (1999),
Furey et al. (2000),
Guyon et al. (2001)
Protein Synthesis
Splicing site prediction
Ying-Fei Sun et al. (2003)
Proteins
Family prediction: Jaakkola et al. (1998)
Fold recognition: Ding et al. (2001)
Protein-protein interaction prediction: Bock et al. (2001)
Structure prediction:
Hua et al. (2001), Sujun Hua (2001),
Yu-Dong Cai (2002, 2003), Chris H. Q. Ding (2001),
Florian Markowetz (2003)
Function prediction: Jaakkola et al. (1998),
C. Z. Cai (2003), Yu-Dong Cai (2003)
Features
Physicochemical feature of residue / base
Hydrophobicity
Normalized volume
Polarity
Charge
Surface tension

20 Amino acids
Lower level structure
Amino acid composition
the number/frequency of each amino acid
MFWYTSNKKHRTGPILMVATSNQ.
Input vector: number/frequency of I L V A P G C T S N D E Q K H R Y F W M
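The amino acid composition feature is simply a 20-dimensional count (or frequency) vector in the residue order listed on the slide. A minimal sketch, with the function name and the `normalize` flag as assumptions:

```python
AMINO_ACIDS = "ILVAPGCTSNDEQKHRYFWM"  # residue order taken from the slide


def amino_acid_composition(sequence, normalize=True):
    """Return the 20-D composition vector of a protein sequence."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    if not normalize:
        return counts
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(AMINO_ACIDS)


seq = "MFWYTSNKKHRTGPILMVATSNQ"
print(amino_acid_composition(seq, normalize=False))
# [1, 1, 1, 1, 1, 1, 0, 3, 2, 2, 0, 0, 1, 2, 1, 1, 1, 1, 1, 2]
```

Using frequencies instead of raw counts makes proteins of different lengths comparable.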
Pseudo-amino acid composition
Composition (AAC, 20-D) + Sequence (Correlation)
A protein chain of $L$ amino acid residues: $R_1 R_2 R_3 R_4 \dots R_L$
$k$-th rank sequence-order-coupling:
$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}$, where $J_{i,j} = D^2(R_i, R_j)$
$D^2(R_i, R_j) = \frac{1}{3} \{ [C(R_j) - C(R_i)]^2 + [H(R_j) - H(R_i)]^2 + [A(R_j) - A(R_i)]^2 \}$
Each property value is normalized over the 20 amino acids, for example
$C(i) = \frac{C^0(i) - \overline{C^0}}{\sqrt{\frac{1}{20} \sum_{j=1}^{20} [C^0(j) - \overline{C^0}]^2}}$, where $\overline{C^0} = \frac{1}{20} \sum_{j=1}^{20} C^0(j)$
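The sequence-order-coupling numbers can be computed with a short loop. In the sketch below the property table is entirely made up (three placeholder values per residue for a three-letter alphabet) and is assumed to be already normalized; real use would need the actual normalized property values for all 20 amino acids.

```python
# Hypothetical, already-normalized property values (C, H, A) per residue
TOY_PROPERTIES = {
    "A": (0.0, 1.0, 2.0),
    "C": (1.0, 1.0, 0.0),
    "D": (2.0, 0.0, 1.0),
}


def coupling(r_i, r_j, props):
    """J_{i,j} = D^2(R_i, R_j): mean squared difference of the three properties."""
    return sum((pj - pi) ** 2 for pi, pj in zip(props[r_i], props[r_j])) / 3.0


def sequence_order_coupling(seq, k, props):
    """tau_k = 1/(L-k) * sum_{i=1}^{L-k} J_{i,i+k}."""
    length = len(seq)
    if not 0 < k < length:
        raise ValueError("k must satisfy 0 < k < len(seq)")
    total = sum(coupling(seq[i], seq[i + k], props) for i in range(length - k))
    return total / (length - k)


tau_1 = sequence_order_coupling("ACDA", 1, TOY_PROPERTIES)  # 14/9
tau_2 = sequence_order_coupling("ACDA", 2, TOY_PROPERTIES)  # 11/6
```

Appending a few $\tau_k$ values to the 20-D composition vector gives a feature that carries some sequence-order information.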
Kernel Functions and parameters
Kernel functions
Gaussian Kernel
Polynomial Kernel
Fisher Kernel
String Kernel
Spectrum Kernel
Interpolated Kernel
Most !!
Parameters (for kernel, C)
Cross validation
Minimize VC dimension
Validation Measures
True Positive: TP
True Negative: TN
False Positive: FP
False Negative: FN
Sensitivity ($S_n$) = TP / (TP + FN)
Specificity ($S_p$) = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is a weighted average of $S_n$ and $S_p$
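The three measures follow directly from the four confusion-matrix counts. The sketch below uses made-up counts and also checks the weighted-average relation: accuracy equals $S_n$ weighted by the fraction of positives plus $S_p$ weighted by the fraction of negatives.

```python
def validation_measures(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts: 50 positives and 50 negatives
tp, fn, tn, fp = 40, 10, 45, 5
sn, sp, acc = validation_measures(tp, tn, fp, fn)

total = tp + tn + fp + fn
weighted = sn * (tp + fn) / total + sp * (tn + fp) / total
print(sn, sp, acc)  # 0.8 0.9 0.85
```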
Current Results in Bioinfo.
at least as good as other tools
Splicing sites: 85 ~ 92%
Protein secondary structure : 55 ~ 93.2%
Protein fold prediction : 28~56%
Protein domain prediction : 57~94.5%
Protein function prediction : 88~94%
Protein-protein interaction : 80~83%
SVM is Really a Good tool
Case Study 1
Identifying splicing sites in eukaryotic
RNA: SVM approach
Ying-Fei Sun, Xiao-Dan Fan, Yan-Da Li
Computers in Biology and Medicine, 2003
Data Source
Primate and rodent subsets of the Genetic Sequence Data Bank
Data acquisition
For genuine splice sites
GT-AG rule, 70 bases on both donor and acceptor sites
For false splice sites
GT-AG rule, 70 bases on both donor and acceptor sites
No recognizable exon-intron-exon junctions using some tools
Objective
Find splicing sites (donor and acceptor sites)
[Figure: Exon 1, GU (donor site), A, AG (acceptor site), Exon 2]
Pre-treatment of data
Remove the redundant sequences from the two collections using a sequence alignment comparison program (less than 80%)
Two sequences, ACDDDEFGR vs. ACDEFHR
Rules
Match: +1
Mismatch (substitution): -1
Gap or indel (insertion/deletion): opening -2, extension -1
ACDDDEFGR
ACD--EFHR
6*1 + 1*(-2) + 2*(-1) = 2
or
ACDDDEFGR
AC-D-EFHR
6*1 + 1*(-1) + 2*(-2) = 1
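The scoring rules on this slide can be applied mechanically to any gapped alignment. The sketch below scores an alignment that is already given (it does not search for the best one); the function name and parameter names are assumptions.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score a gapped pairwise alignment with the slide's rules.

    Simplification: consecutive gap columns count as one gap, regardless of
    which of the two sequences carries the '-' character.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(alignment_score("ACDDDEFGR", "ACD--EFHR"))  # 2
print(alignment_score("ACDDDEFGR", "AC-D-EFHR"))  # 1
```

The single two-base gap scores higher than two separate one-base gaps because only one opening penalty is paid.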
Final data sets
Method of Coding
4-bit string code: A = 0001, T = 0010, G = 0100, C = 1000
Vicinity of splice sites (20+20, 30+30)
[Figure: donor site (GT) and acceptor site (AG)]
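The 4-bit coding turns a window of bases around a candidate site into a binary input vector. A minimal sketch, with the function name as an assumption; a 20+20 window of 40 bases yields 160 inputs.

```python
# 4-bit codes from the slide
CODE = {
    "A": (0, 0, 0, 1),
    "T": (0, 0, 1, 0),
    "G": (0, 1, 0, 0),
    "C": (1, 0, 0, 0),
}


def encode(sequence):
    """Concatenate the 4-bit code of every base into one flat feature vector."""
    bits = []
    for base in sequence.upper():
        bits.extend(CODE[base])  # raises KeyError for characters other than A/T/G/C
    return bits


print(encode("ATGC"))
# [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
```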
SVM
With and without secondary structural information
Polynomial kernel (s = 1, r = 1, d = 4), Gaussian kernel (std = 20)
Three-fold cross validation
Sensitivity Sn = TP / (TP + FN), Specificity Sp = TN / (TN + FP)
Accuracy (average of Sn and Sp)
Result and conclusion
Better results than previous approaches (neural network, hidden Markov model, dynamic programming)
Case Study 2
A Novel Method of Protein Secondary
Structure Prediction with High SOV:
SVM Approach
Sujun Hua and Zhirong Sun, J. Mol. Biol., 2001
[Figure: α-helix, β-sheet, turn or loop]
Objective
Predicting the secondary structure of proteins
SVM
Kernel: Gaussian RBF
Parameter values: $\gamma$ = 0.10, C = 1.50
Binary and tertiary classifiers (3 cases), one-against-others method
Data Set
RS126 set (126 protein chains proposed by Rost & Sander)
CB513 set (513 protein chains constructed by Cuff & Barton)
One of 3 classes (α-helix (H), β-sheet (E), coil (C))
Input vector: protein segment of size $l$
[Figure: tertiary classifiers]
Segment Overlap Measure (SOV)
Evaluates the type and position of secondary structure segments rather than a per-residue assignment of conformational state
Allows for the natural variation of segment boundaries (ambiguity in the position of segment ends)
$\mathrm{Sov}(i) = 100 \times \frac{1}{N(i)} \sum_{S(i)} \left[ \frac{\mathrm{minov}(s_1, s_2) + \delta(s_1, s_2)}{\mathrm{maxov}(s_1, s_2)} \times \mathrm{len}(s_1) \right]$
$\delta(s_1, s_2) = \min \{ \mathrm{maxov}(s_1, s_2) - \mathrm{minov}(s_1, s_2),\ \mathrm{minov}(s_1, s_2),\ \mathrm{int}(\mathrm{len}(s_1)/2),\ \mathrm{int}(\mathrm{len}(s_2)/2) \}$
Example:
CCEEECCCCCCEEEEEECCC
CCCCCCCEEEEECCCEECCC
$\mathrm{Sov}(E) = 100 \times \frac{1}{6 + 6 + 3} \times \left( \frac{1 + 1}{10} \times 6 + \frac{2 + 1}{6} \times 6 \right) = 28.0$
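The Sov(E) example on this slide can be reproduced with a short script. This is a sketch under the assumption that the first string is the observed structure ($s_1$) and the second the predicted one ($s_2$), and that $N(i)$ sums len($s_1$) over every overlapping pair plus every observed segment with no overlap; the function names are made up.

```python
def segments(structure, state):
    """Return inclusive (start, end) index pairs of maximal runs of `state`."""
    found, start = [], None
    for i, ch in enumerate(structure):
        if ch == state and start is None:
            start = i
        elif ch != state and start is not None:
            found.append((start, i - 1))
            start = None
    if start is not None:
        found.append((start, len(structure) - 1))
    return found


def sov_state(observed, predicted, state):
    """Segment overlap measure for a single conformational state."""
    total, norm = 0.0, 0
    predicted_segments = segments(predicted, state)
    for s1 in segments(observed, state):
        len1 = s1[1] - s1[0] + 1
        overlapping = [s2 for s2 in predicted_segments
                       if s2[0] <= s1[1] and s1[0] <= s2[1]]
        if not overlapping:
            norm += len1  # observed segment with no predicted counterpart
            continue
        for s2 in overlapping:
            len2 = s2[1] - s2[0] + 1
            minov = min(s1[1], s2[1]) - max(s1[0], s2[0]) + 1  # shared residues
            maxov = max(s1[1], s2[1]) - min(s1[0], s2[0]) + 1  # total extent
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
            total += (minov + delta) / maxov * len1
            norm += len1
    return 100.0 * total / norm if norm else 0.0


observed = "CCEEECCCCCCEEEEEECCC"
predicted = "CCCCCCCEEEEECCCEECCC"
sov_e = sov_state(observed, predicted, "E")  # approximately 28.0
```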
Conclusion
SVM will become a very useful tool for predicting the structural classes of proteins
Conclusion
More research is needed
Feature selection, negative data generation, kernel functions and parameters
Currently, many papers are being published
SVM is a useful tool in Bioinformatics
Microarray data analysis
Splicing Site Prediction
Protein Structure Prediction
Protein Function Prediction
Protein-Protein Interaction Prediction

Reference

An Introduction to Lagrange Multipliers, Steuard Jensen, http://home.uchicago.edu/~sbjensen/Tutorials/Lagrange.html
Linear Algebra and Its Applications, David C. Lay, second edition, 1999
Some Mathematical Tools for Machine Learning, Chris Burges, August 2003
Statistical Learning and VC Theory, Peter Bartlett, ISCAS, May 2001
A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J.C. Burges, Data Mining and Knowledge Discovery, 1998
Support Vector Learning, B. Schölkopf, Ph.D. Thesis, 1997
Kernel methods: a survey of current techniques, Colin Campbell, 2002

Predicting protein-protein interactions from primary structure, Joel R. Bock and David A. Gough, Bioinformatics, 2001
Support Vector Machine Classification of Microarray Gene Expression Data, Michael P. S. Brown, 1999
Protein function classification via support vector machine approach, C.Z. Cai, Mathematical Biosciences, 2003
Prediction of Secondary Protein Structure with Binary Coding Patterns of Amino Acid and Nucleotide Physicochemical Properties, NIKOLA, 2002
Identifying splicing sites in eukaryotic RNA: support vector machine approach, Ying-Fei Sun, Computers in Biology and Medicine, 2003
A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach, Sujun Hua, J. Mol. Biol., 2001
Gene Expression data analysis of human lymphoma using support vector machines and output coding ensembles, Giorgio Valentini, Artificial Intelligence in Medicine, 2002
Support Vector Machines for Prediction of Protein Domain Structural Class, Yu-Dong Cai, Xiao-Jun Liu, Xue-Biao Xu and Kuo-Chen Chou, J. theor. Biol., 2003
Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence, Yu-Dong Cai, Shuo Liang Lin, BBA, 2003
Support Vector Machines for Predicting HIV Protease Cleavage Sites in Protein, Yu-Dong Cai, Xiao-Jun Liu, Xue-Biao Xu, Kuo-Chen Chou, J. Comput. Chem., 2002
Transductive Support Vector Machines for Classification of Microarray Gene Expression Data, R. Semolini, 2003
Questions or Comments
