Академический Документы
Профессиональный Документы
Культура Документы
Classification I
SIMCA
Jens C. Frisvad
What is classification?
• Finding discrete clouds of points in
multidimensional feature space with a
certain empty space in between to other
such clouds of points
• Putting into boxes
• ”The goal of classification identification is
to assign new objects to the class to which
they show the largest similarity” (p. 335)
1
03-03-2010
SIMCA:
Soft Independent Modelling of Class Analogy
2
03-03-2010
SIMCA
• The classes in the training set should consist of
many objects representing the diversity of the
class (suggestion: use at least 25 objects for
each class)
• Variable selection should be made, preferably
the same number of variables should be present
in each set
• The SIMCA model can be tested with a test set
• Tabulate the number of outliers and aliens for
each class
SIMCA identification
• Any new object can be allocated to one of
the SIMCA classes, or may be an outlier to
all the classes (identification)
• If the classes are overlapping, an object
may be member of more than one class*
• The distance of any object to the different
classes can be calculated
• The distance between classes can be
calculated
• *Here fuzzy clustering may be used
3
03-03-2010
Pretreatment of data
• Use log (x+1) if the highest values in a
variable are 8 times greater than the
lowest values
• Chromatographic data should often be
logarithmated
• Autoscaling (standardization) of often a
very good idea
• Do not autoscale variables with a very low
variance
4
03-03-2010
Training set
Variables Class
1 2 .. k p
O 1 1
B 2
.
J .
E 2
C Xik
i .
T
S .
n
G
5
03-03-2010
eik has the standard deviation s0, which is the typical distance
for any object in the class to the class itself
p n
2
s0 = ∑∑e
k =1 i =1
ik /( p − A )( n − A − 1)
6
03-03-2010
Confidence cylinder
(box in higher dimensions)
• Confidence radius:
2
F95 % ⋅ s g 0 = F95 % ⋅ ∑ s i / n g
i
2 2 ng
si = ∑vk
k ⋅ eik
n g − Ag − 1
/( p − Ag )
7
03-03-2010
2 2
st ,g ( a ) = ∑k =1
Θ ak , g / n g
SIMCA cylinders
8
03-03-2010
Ag
p
2
s f ,g = ∑v e
k =1
k ik Ψ /( p − Ag )
9
03-03-2010
2 2
F = si , g / s0
2 2
F = Ψ 2 ⋅ si , g / s0
10
03-03-2010
Modelling power
ng
2
∑eik /(ng − Ag −1)
i =1
mpowk = 1 −
ng
∑(x − µ )
i =1
k
2
/ ng −1
Relevance of variable k in class g, less than 0.1 is low, and the variable
may be irrelevant
11
03-03-2010
Discrimination power
(of variable)
2 2
sk ,r ( g) + sk , g (r)
dk (r, g) = 2 2
sk ,r + sk ,g
nr
2 2
s k ,r = ∑
i =1
e ik /( n r − A r − 1 )
nr
2 2
s k ,r ( g ) = ∑e i =1
kf ( g ) / nr
12
03-03-2010
Coomans Plot
Class
overlap
13
03-03-2010
14