

CHAPTER 3

PATTERN CLASSIFICATION OF EMG DATA

Pattern recognition has been accepted as a promising approach to the myoelectric control problem. All multifunction myoelectric control systems using pattern recognition rest on the assumption that, at a given electrode location, the set of parameters describing the myoelectric signal is repeatable for a given state of muscle activation and differs from one state of activation to another [36]. Numerous pattern recognition schemes have been developed and tested. Initial work at SDSU used a statistical method in which an EMG pattern is classified by choosing the cluster closest to the pattern; an encouraging classification success rate was achieved [48]. Later, the simplified fuzzy ARTMAP network (SFAM) proved even more powerful, obtaining better classification rates [49]. Furthermore, its ability to deal with the plasticity-stability dilemma [51] makes this type of network a good choice for EMG pattern recognition.

3.1 Pattern Recognition (Introduction)

Pattern recognition is the process of mapping observed patterns to a set of categories. Generally speaking, it combines the following steps: signal and data collection, data pre-processing, feature extraction, feature classification, and appropriate classifier training. Pattern recognition, also called pattern classification, is a scientific discipline dealing with methods for the description and classification of objects [10]. It is viewed as a branch of Artificial Intelligence (AI); applications include speech recognition, character recognition,

handwriting recognition, facial recognition, and so on. It is also an important tool in Artificial Intelligence and robotics. The following terms are commonly defined [10]:

Pattern: Something that serves as a model for a collection of observed objects; a conceptual representation of what is being observed.

Class: A state of nature or a category of objects associated with concepts or prototypes.

Feature: A measurement, attribute, or primitive derived from the pattern that may be useful for its characterization.

Designers always want their pattern classifiers to cope with varied observations, to be robust to noise, and to be adaptive enough to keep functioning with little human interference and supervision. This intention leads toward systems that perform intelligent tasks and mimic human learning and classification; that is, systems that adapt to variation in the input samples. However, the environment of the observed target is usually very complicated and its changes can be unpredictable, so the applicability of pattern recognition is established by limiting the variability of the observed target's environment. Feature values are measured, preprocessed, and extracted. A sample-space model is used to describe the distribution of samples: each dimension of the sample space is associated with a feature, so a sample is represented by a point in the sample space whose coordinates are its feature values. The task of pattern recognition then becomes how to describe the distribution of samples in the sample space and how to relate an input sample to that distribution.

The statistical approach and the neural network approach are the two basic approaches to pattern recognition.

3.1.1 Statistical Approach

The statistical approach treats the mapping of a pattern to a class as a random process. The distribution of samples belonging to a category is characterized by means, variances, and covariances, and each class is represented by the parameters of its probability density function (pdf). If there are M classes C_1, C_2, C_3, ..., C_M, the classification task is to find the class to which an input sample I most probably belongs. This is governed by the posterior probability P(C_j | I), the conditional probability that the input feature vector I belongs to class C_j. The likelihood of I given class C_j is P(I | C_j). The prior probability, or prevalence, of class C_j is P(C_j), the proportion of feature vectors belonging to class C_j out of the total number of feature vectors. The posterior probability can be computed if the distributions of the feature vectors in each of the existing classes are given:

P(C_j | I) = P(I | C_j) P(C_j) / P(I) ,    (3.1)

where P(I) = Σ_{j=1}^{M} P(I | C_j) P(C_j).

This equation is called Bayes' rule. Classification can also be based on P(I | C_j) P(C_j) alone, since P(I) is the same for all classes. The Bayesian classifier is the optimal solution to the learning classification problem in the sense that it minimizes the probability of classification error [11][24].

We can state the Bayes classification rule using the sample-space model introduced in the previous section. The sample space is divided into M regions R_1, R_2, R_3, ..., R_M, one per class. These regions are separated by decision surfaces in the sample space. For the minimum error probability case, the decision surfaces are described by the equations

P(C_j | I) - P(C_k | I) = 0 .    (3.2)

We define the discriminant function as

g_j(I) = f(P(C_j | I)) ,    (3.3)

where f(x) is any monotonically increasing function. The Bayes classification rule can then be stated as: I belongs to class C_i if g_i(I) > g_j(I) for all j ≠ i.
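As a concrete illustration of the Bayes discriminant rule, the following sketch classifies a sample by maximizing g_i(I) = log P(I | C_i) + log P(C_i), using Gaussian class-conditional densities. The Gaussian likelihood model and all variable names are illustrative assumptions, not part of the thesis.

    import numpy as np

    def gaussian_log_likelihood(x, mean, cov):
        """log P(x | C_j) under an assumed multivariate Gaussian class model."""
        d = len(mean)
        diff = x - mean
        logdet = np.linalg.slogdet(cov)[1]
        return -0.5 * (d * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.inv(cov) @ diff)

    def bayes_classify(x, means, covs, priors):
        """Return the class index i maximizing g_i(x) = log P(x|C_i) + log P(C_i)."""
        scores = [gaussian_log_likelihood(x, m, S) + np.log(p)
                  for m, S, p in zip(means, covs, priors)]
        return int(np.argmax(scores))

Here the logarithm plays the role of the monotonically increasing function f, so maximizing the score is equivalent to maximizing the posterior probability.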

3.1.2 Neural Network Approach

How the human brain generates thought and intelligence is one of the most interesting scientific research problems. The neuron, or nerve cell, is the fundamental functional unit of all nervous system tissue, including the brain. Neurobiology shows that signals are propagated from neuron to neuron by complex electrochemical reactions, and that a collection of simple cells can lead to thought, action, and consciousness [13]. According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60): "a neural network is a system composed of many simple processing elements operating in parallel, whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes." A typical neural network is composed of nodes connected by links; each link is assigned a weight.

Interconnected in this way, the nodes form a structure in which information relevant to the mapping from input features (attributes) to output activations can be represented, updated, and stored for long-term use. The learning process of a network can be supervised, unsupervised, or sometimes a mix of both. Supervised learning is concept-driven or inductive: the network finds, in the representation space, a hypothesis corresponding to the structure of the interpretation space. Unsupervised learning is data-driven or deductive: the network finds a structure in the interpretation space corresponding to the structure in the representation space. Two major categories of network structure are feed-forward and feedback networks. A feed-forward network contains no cycles. Some feed-forward networks, called multi-layer networks, have more than one layer. The layer that accepts the input is composed of input nodes and is called the input layer; the layer that produces the output is composed of output nodes and is called the output layer. The layers between them are called hidden layers, and their nodes are called hidden nodes. The connections in a feed-forward network extend uniformly from the input layer toward the output layer. In a feedback or recurrent network, the connections contain cycles. Feedback networks are usually more difficult to train than feed-forward networks; recurrent networks can become unstable, oscillate, or exhibit chaotic behavior. One important learning method for multi-layer feed-forward networks is back-propagation. Back-propagation refers to the method of computing the gradient of the case-wise error function with respect to the weights of a feed-forward network; it is a straightforward but elegant application of the chain rule of elementary calculus [9]. A back-propagation network adapts to the training data and minimizes the error between the desired output and the actual output computed by the network. The adaptability is achieved

by updating the weights of the links and, sometimes, by modifying the internal structure of the network, such as changing the number of nodes in the hidden layers.
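As a minimal sketch of the chain-rule computation described above, the code below performs one gradient step for a one-hidden-layer feed-forward network with sigmoid units and a squared error. The layer sizes, activation choice, and learning rate are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, t, W1, W2, lr=0.1):
        """One gradient-descent step for a 1-hidden-layer network (squared error)."""
        # Forward pass.
        h = sigmoid(W1 @ x)      # hidden activations
        y = sigmoid(W2 @ h)      # network output
        # Backward pass: chain rule from the error back to each weight matrix.
        delta_out = (y - t) * y * (1 - y)            # error at output net inputs
        delta_hid = (W2.T @ delta_out) * h * (1 - h) # error propagated to hidden layer
        W2 -= lr * np.outer(delta_out, h)
        W1 -= lr * np.outer(delta_hid, x)
        return W1, W2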
3.2 Adaptive Resonance Theory Based Networks

Adaptive Resonance Theory (ART) was proposed by Stephen Grossberg in 1976 [1][3][4][8]. The term "resonance" refers to a state in which the similarity between an input vector and an existing prototype, measured by quantifying the extent to which they match, is above a predefined threshold. The network keeps updating the existing prototypes as new inputs arrive within the resonance threshold; a new neuron is created if no prototype in the existing network resonates. There are several versions of the ART architecture [5]. ART-1 is a binary version that can cluster binary input vectors. ART-2 is an analogue version that can cluster real-valued input vectors. ARTMAP is a supervised version of ART that can learn arbitrary mappings of binary patterns; it includes two ART modules and creates stable recognition categories [3]. Fuzzy ARTMAP is a synthesis of ART and fuzzy logic that uses fuzzy set-theory operations to describe the system dynamics. It offers a new possibility for designing a system that is adaptive, capable of incremental learning to handle new clusters, and stable [3][4].
3.2.1 Simplified Fuzzy ARTMAP Network (SFAM)

The SFAM is a version of the fuzzy ARTMAP neural network model, designed to improve the computational efficiency of fuzzy ARTMAP with minimal loss of learning effectiveness [2]. The "fuzzy" component in the name refers to the fact that the learning process implements fuzzy logic operations in order to achieve a number of key pattern matching and adaptation functions [8]. Marko Vuskovic and Sijiang Du [49] experimented with SFAM architectures based on the Euclidean and Mahalanobis distances, and compared their performance with that of the classic SFAM proposed by T. Kasuba. The architecture of SFAM is illustrated in the following figure:

Figure 3-1: Architecture of SFAM

Each node in the output layer is associated with exactly one node in the category layer. The jth output node is connected to the ith input node by a weight W_{ij}. The vector (W_{1j}, W_{2j}, W_{3j}, ..., W_{2d,j}) is the actual long-term storage for the prototype corresponding to the jth output node.

The SFAM network accepts a feature vector fv that represents an instance in the sample space during the learning process; the category of this instance is also presented to the network. The feature vector is complement encoded, which doubles its length, and then sent to the input layer. Complement coding is an input normalization process that represents both the presence of a particular feature in the input pattern and its absence. For instance, a normalized d-dimensional input feature vector fv = (f_1, f_2, ..., f_d) becomes the vector X after complement coding:

X = (f_1, f_2, ..., f_d, 1 - f_1, 1 - f_2, ..., 1 - f_d) .    (3.4)

An interesting property of complement coding is that the norm of the complemented input equals the number of dimensions of the sample space:

|X| = Σ_{i=1}^{d} f_i + Σ_{i=1}^{d} (1 - f_i) ,    (3.5)

which simplifies to

|X| = Σ_{i=1}^{d} f_i + d - Σ_{i=1}^{d} f_i = d .    (3.6)

The activation function is defined as a measure of similarity between an input and a prototype. The activation function AF(j) is used in training to select the activated node in the output layer that best matches an input. It is defined as

AF(j) = |X ∧ W_j| / (α + |W_j|) ,    (3.7)

where W_j is the vector of weights on the links to the jth node in the output layer, and α is a small positive scalar that we arbitrarily set to 0.001. The vector of weights defines the weighting function for each node.
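A small sketch of complement coding and the activation function of equations (3.4) and (3.7) follows, with the fuzzy AND taken as the elementwise minimum and |·| as the L1 norm; the variable names are illustrative.

    import numpy as np

    def complement_code(fv):
        """Complement-code a normalized feature vector: X = (fv, 1 - fv)."""
        fv = np.asarray(fv, dtype=float)
        return np.concatenate([fv, 1.0 - fv])   # |X| always equals d

    def activation(X, W, alpha=0.001):
        """SFAM activation AF(j) = |X ^ W_j| / (alpha + |W_j|), fuzzy AND = min."""
        return np.minimum(X, W).sum() / (alpha + W.sum())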

The matching function MF(j) is used in classification for measuring similarity and assigning a category to an input feature vector. It is defined as

MF(j) = |X ∧ W_j| / |X| .    (3.8)

The system is said to make a category choice when at most one output node is activated during a single run. The activated node j is the node with the largest activation, such that AF(j) = MAX(AF(k)), k = 1, 2, 3, ...; the smallest j is chosen if more than one node attains the maximal value. The granularity of the system is controlled by the vigilance parameter ρ. A small ρ makes the system fuzzier, which may produce fewer output nodes, while a large ρ makes the system more vigilant to subtle differences between patterns. There is always a trade-off between fewer or more output nodes, with their time-space efficiencies, and a good classification rate. Resonance occurs if the match function of the chosen node meets the vigilance parameter [3][4]:

MF(j) = |X ∧ W_j| / |X| ≥ ρ .    (3.9)

Resonance means that the jth output node is good enough to encode the input [2]. During learning, the weight vector is updated if the category associated with the output node is the same as the actual category of the input training sample. The update makes the system more adaptive to the distribution of the cluster associated with the node. The update of the weight vector of the activated node is

W_j = (1 - β) W_j + β (X ∧ W_j) ,    (3.10)

where β is the learning rate, which must be greater than 0 and less than or equal to 1, and X ∧ W_j = (MIN(f_i, W_{ij}))_i is the fuzzy AND operation used in [4], which assumes positive normalized input. A new node is created in the output layer if no resonant node is found, that is, if the category associated with the activated node differs from the category of the input sample. A mismatch occurs if the match function does not satisfy the vigilance criterion:

MF(j) = |X ∧ W_j| / |X| < ρ .    (3.11)

The system does not update the weight vector when a mismatch occurs. The activated node is not considered able to encode the input sample properly, because the value of the matching function is less than the vigilance; the prototype represented by the activated node does not match the input sample well. Therefore a new node is created to represent a new prototype described by the input sample.
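The pieces above can be combined into a single SFAM learning step, sketched below. The node bookkeeping (parallel lists of weight vectors and category labels) and the initialization of a new node as W = X are illustrative assumptions; in particular, the full match-tracking mechanism of fuzzy ARTMAP is simplified here to trying nodes in order of activation.

    import numpy as np

    def sfam_train_step(X, label, weights, labels, rho=0.8, beta=1.0, alpha=0.001):
        """One SFAM learning step on a complement-coded input X with known label."""
        # Rank nodes by activation AF(j) = |X ^ W_j| / (alpha + |W_j|).
        order = sorted(range(len(weights)),
                       key=lambda j: -np.minimum(X, weights[j]).sum()
                                     / (alpha + weights[j].sum()))
        for j in order:
            match = np.minimum(X, weights[j]).sum() / X.sum()
            if match >= rho and labels[j] == label:
                # Resonance: move the winner toward X with the fuzzy AND.
                weights[j] = (1 - beta) * weights[j] + beta * np.minimum(X, weights[j])
                return j
        # No resonant node of the right category: create a new prototype.
        weights.append(X.copy())
        labels.append(label)
        return len(weights) - 1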

The algorithm of SFAM is illustrated in the following flowchart:

Figure 3-2: Program flowchart for SFAM.

3.2.2 Benchmark Problems

Many kinds of pattern recognition methods and neural network architectures have been invented and implemented; some perform well on specific problems while others do not. Benchmark problems are introduced to compare and evaluate these various classifiers: their performance is easily compared after applying them to the same set of benchmark problems. Benchmark problems are usually representative, easy to state, and good for testing the general capabilities of a classifier. We used the following two benchmark problems to demonstrate the capability and performance of SFAM in the present work:

Circle-in-square
Two-spiral

The circle-in-square problem requires a system to identify which points of a square lie inside and which lie outside a circle whose area equals half that of the square. G. A. Carpenter tested fuzzy ARTMAP on the circle-in-square problem and obtained an 88.6% successful classification rate for a 100-point input sample [4]. The two-spiral problem is the task of distinguishing between points on two intertwined spirals [7]. The training set consists of 194 samples evenly distributed along the two spirals; the test set includes four times that number of points on the same spirals. G. A. Carpenter tested fuzzy ARTMAP on the two-spiral problem and demonstrated that it solves the problem very well [4]. The training set and test set are illustrated in the following figure, where the samples of the two categories are represented by "+" and "*" respectively; the test set is four times denser than the training set along the spiral tracks.


Figure 3-3: The two-spiral benchmark problem.
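For reference, two-spiral data such as those shown above are commonly generated from a parametric curve like the one sketched below; the radii, angular range, and noise level vary between implementations, so the constants here are illustrative assumptions rather than the exact dataset used in the thesis.

    import numpy as np

    def two_spirals(n_per_spiral=97, turns=3.0, noise=0.0, seed=0):
        """Generate two intertwined spirals labeled 0 and 1 (2*n points total)."""
        rng = np.random.default_rng(seed)
        t = np.linspace(0.0, turns * np.pi, n_per_spiral)   # angle along the spiral
        r = t / (turns * np.pi)                             # radius grows with angle
        spiral1 = np.column_stack([r * np.cos(t), r * np.sin(t)])
        spiral2 = -spiral1                                  # second spiral, rotated 180 degrees
        pts = np.vstack([spiral1, spiral2])
        pts += noise * rng.standard_normal(pts.shape)
        labels = np.array([0] * n_per_spiral + [1] * n_per_spiral)
        return pts, labels

With 97 points per spiral this yields the 194-sample training set size mentioned above.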


3.3 Distance Based ART Network

Distance-based ARTMAP is a class of networks that uses distance in the sample space to measure the similarity between patterns. Its architecture and algorithm are almost identical to those of the SFAM; however, the fuzzy AND operation is not used in the activation function, and therefore no complement coding is required. An output node represents a pattern in the sample space, which is an abstraction of a sample cluster. The data associated with an output node in a distance-based network include a weight vector that describes the center of the cluster, parameters that describe the shape of the cluster, and other parameters for statistical information. The learning and classification mechanisms of distance-based ARTMAP are also based on Adaptive Resonance Theory; the matching and activation functions are implemented using distances in the sample space. Two types of distance are used in the networks proposed in this thesis: the Euclidean distance and the Mahalanobis distance. We discuss them in the next sections.
3.3.1 Euclidean Distance-based Simplified ARTMAP Network

The Euclidean distance between two samples I and I' in an M-dimensional sample space is defined as

dist = |I - I'| = √( (I_1 - I_1')² + (I_2 - I_2')² + ... + (I_M - I_M')² ) .    (3.12)

The Euclidean distance-based simplified ARTMAP network (ESAM) is a network of the SFAM architecture that discriminates patterns by Euclidean distance in the sample space. The activation and match functions are redefined on this criterion, but activation, resonance, matching, and mismatch still form the underlying mechanism of learning and classification, the same as in SFAM. A cluster can be represented by the mean of all samples belonging to it, denoted μ_j. We can estimate the membership of an input I in the jth cluster by measuring the norm |I - μ_j|, under the assumption that every cluster region has the same size and a symmetric shape. However, the distribution of samples usually has varying size; sometimes one cluster is much bigger than its neighbors. Therefore we normalize |I - μ_j| by the size of the cluster, R_j. For normalized inputs, R_j is initialized as

R_j = 1 - ρ .    (3.13)

The parameter μ_j is represented in the network by W_j. The activation function becomes

AF(j) = |I - W_j| / R_j ,    (3.14)

where R_j is the radius of the cluster. Unlike in SFAM, the maximally activated node is the one with the minimum AF(j). A cluster can be represented by a pattern, and an input sample has a higher degree of similarity to the pattern if it is closer to the center of the cluster; the pattern represented by the activated node has the smallest distance to the input sample in the sample space. The activated node is the jth node such that AF(j) = MIN(AF(k)), k = 1, 2, 3, ...; the smallest j is chosen if two or more nodes tie. The value of the match function MF(j) quantifies the degree to which an input sample matches the encoded pattern; its maximal value is one, reached when a sample matches the pattern completely. The matching function is defined as:

If |I - W_j| ≥ R_j, then MF(j) = R_j / |I - W_j| ;
if |I - W_j| < R_j, then MF(j) = 1 .

Each output node has a weight vector defined by the weighting function and a radius parameter R_j that describes the size of the cluster. Learning is achieved by updating the weighting function of the activated node, as in equation (3.10), with the learning rate of the system. The radius is also updated after the weights:

R_j = max(R_j, |I - W_j|) .    (3.15)
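A compact sketch of the ESAM activation, match, and update rules of equations (3.13)-(3.15) follows; the function signatures and the learning-rate handling are illustrative.

    import numpy as np

    def esam_activation(I, W, R):
        """ESAM activation AF(j) = |I - W_j| / R_j; smaller means better match."""
        return np.linalg.norm(I - W) / R

    def esam_match(I, W, R):
        """MF(j) = 1 inside the cluster radius, R_j / |I - W_j| outside."""
        d = np.linalg.norm(I - W)
        return 1.0 if d < R else R / d

    def esam_update(I, W, R, beta=1.0):
        """Move the winner toward the input, then grow its radius to cover it."""
        W = (1 - beta) * W + beta * I
        R = max(R, np.linalg.norm(I - W))
        return W, R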

3.3.2 Testing ESAM on Two Benchmark Problems

The ESAM network was tested on the circle-in-square benchmark problem; the experiment is demonstrated in the following figure.


Figure 3-4: Learning of ESAM for the circle-in-square problem

The vigilance value is 0.8 in this experiment. A total of 100 points in the square region were randomly generated as the training set and another 100 points as the test set; the training and test sets are displayed in (a) and (b). We can observe that about half of the samples of each set lie inside the circle region while the rest lie outside it. The network was trained on the training set in (a) for two epochs. The successful classification rate for the test set in (b) is 87.0%, and a total of thirteen output nodes were generated, as shown in (d). Then 10,000 evenly distributed points in the square region were presented to the trained network, each classified as inside or outside the circle; the classification result is displayed in (c). The picture in (c) is also called the pattern response because it demonstrates how the classifier encodes the result of learning: it describes how the points belonging to the two categories are distributed according to the training. The parameters of each output node are displayed in (d). The centers of the clusters are represented by stars "*" and plus signs "+" for the two categories respectively, and the region of a cluster is described by the circle surrounding its center. The cluster's central point is encoded by the weighting function of the output node, and its size is encoded by the parameter R. We can observe a number of output nodes whose clusters cover regions of different sizes. Networks of different performance can be obtained by changing the vigilance value; the results are demonstrated in the following figure.

Figure 3-5: The trained ESAM for different vigilance values

The left part of the figure displays the output nodes for vigilance = 0.7; the successful classification rate is 78.0% and the number of output nodes is six. The right part displays the output nodes for vigilance = 0.9; the successful classification rate is 90.0% and the number of output nodes is 35. The network was trained on 100 training samples in 2 epochs. Each cluster has the same size if the vigilance is initialized to a positive value close to one, with the initial size defined by the initial value of R_j; we can observe in the right part of the figure that all clusters have the same size because the vigilance is high. A higher vigilance usually introduces more output nodes and therefore obtains a higher hit rate, since the network becomes more sensitive to subtle differences between input samples. The ESAM is also able to solve the two-spiral benchmark problem; it finds a good solution if we set a high vigilance (> 0.9). The result is displayed in the following figure:

Figure 3-6: ESAM on the two-spiral benchmark problem

The network obtains an 84.15% successful classification rate after being trained for one epoch with vigilance = 0.9. As before, the training set is 194 points of the two categories and the test set is 1944 points. The pattern response of the network is displayed in the left part of the figure and the output nodes in the right part. A total of 52 output nodes were generated; the round shapes of the clusters are not displayed since they all have the same size. A higher successful rate can be obtained by increasing the vigilance value to 0.98: the successful rate is 100% after one epoch of training, with a total of 170 output nodes generated in this case. This implies that the network can classify all the test samples correctly if the vigilance is high enough. The result is demonstrated in the following figure.

Figure 3-7: ESAM on the two-spiral benchmark problem with high vigilance
3.3.3 Mahalanobis Distance-based Simplified ARTMAP Network

The Mahalanobis distance-based simplified ARTMAP network (DSAM) has an architecture identical to that of ESAM and does not use complement coding. The parameters of each output node include a weight vector W that describes the center of the cluster, a covariance matrix S, and a parameter N that gives the number of samples in the cluster. The threshold value is a confidence value for the probability that an input belongs to the cluster. We introduce the mathematical background of the DSAM and discuss its performance in this section. The mean or expected value of a discrete random variable X is

E(X) = Σ_{k=1}^{n} x_k p_X(x_k) ,    (3.16)

where p_X(·) is the probability density function (pdf) of X and x_k, k = 1, 2, ..., n, are instances of the random variable X. The variance of the discrete random variable X is defined as

VAR[X] = E[(X - E[X])²] .    (3.17)

Taking the square root of the variance gives a quantity with the same units as X; the standard deviation of the random variable is defined as

STD[X] = VAR[X]^{1/2} .    (3.18)

The mean value usually represents the center of a distribution, while the variance or standard deviation measures the width of its spread. The covariance measures the interdependency of multiple random variables; the covariance of random variables X and Y is

COV(X, Y) = E[(X - E[X])(Y - E[Y])] .    (3.19)

If we assign equal probability to x_k, k = 1, 2, ..., n, that is, p(x_1) = p(x_2) = ... = p(x_n), then the mean, variance, and covariance become

m_X = E(X) = (1/n) Σ_{k=1}^{n} x_k ,    (3.20)

σ_X² = VAR[X] = (1/(n-1)) Σ_{k=1}^{n} (x_k - m_X)² ,    (3.21)

COV(X, Y) = (1/(n-1)) Σ_{k=1}^{n} (x_k - m_X)(y_k - m_Y) .    (3.22)

The covariance matrix S of the random variables X_1, X_2, ..., X_M is

        | VAR(X_1)       COV(X_1, X_2)  ...  COV(X_1, X_M) |
    S = | COV(X_2, X_1)  VAR(X_2)       ...  COV(X_2, X_M) | .    (3.23)
        | ...            ...            ...  ...           |
        | COV(X_M, X_1)  COV(X_M, X_2)  ...  VAR(X_M)      |
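As a quick illustration of equations (3.20)-(3.23), the sketch below estimates the mean vector and covariance matrix from an (n × M) array of samples; numpy's built-in np.cov produces the same (n - 1)-normalized estimate.

    import numpy as np

    def sample_mean_cov(samples):
        """Estimate mean vector and covariance matrix from an (n, M) sample array."""
        n = samples.shape[0]
        m = samples.mean(axis=0)                 # equation (3.20), per dimension
        centered = samples - m
        S = centered.T @ centered / (n - 1)      # equations (3.21)-(3.23)
        return m, S

    # Equivalent check against numpy's estimator:
    # m, S = sample_mean_cov(x); assert np.allclose(S, np.cov(x, rowvar=False))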

A vector random variable X in an M-dimensional sample space is composed of the random variables X_1, X_2, ..., X_M; it is denoted by

X = (X_1, X_2, ..., X_M) .    (3.24)

Each instance of X is a feature vector x = (x_1, x_2, ..., x_M), which represents a point in the sample space. The mean vector of X is denoted by

m = (m_1, m_2, ..., m_M) .    (3.25)

The Mahalanobis distance from a feature vector to the mean vector is

Δ² = (x - m)' S⁻¹ (x - m)
   = [ (x_1 - m_1)  (x_2 - m_2)  ...  (x_M - m_M) ] S⁻¹ [ (x_1 - m_1)  (x_2 - m_2)  ...  (x_M - m_M) ]' .    (3.26)

We represent a feature vector x by a point in the sample space, and the center of a cluster by the mean vector m. A cluster has an ellipsoidal shape in a three-dimensional space, or a hyper-ellipsoidal shape in a sample space of higher dimensionality; the surface of the ellipsoid or hyper-ellipsoid is formed by the points that have the same Mahalanobis distance to the center point m. The covariance matrix S plays an essential role in the computation of Δ²: its diagonal elements describe the variance of each feature and its off-diagonal elements describe the correlation of each pair of features. Thus the ellipsoid can be of different size, shape, and orientation in the sample space.

An important property of the Mahalanobis distance is that it is invariant to the unit scaling of input features. The distance Δ² expresses a confidence value for a sample belonging to a cluster: a smaller Δ² to the center implies a higher probability that the sample belongs to the cluster, while a larger Δ² gives a lower probability or confidence value. The activation function for DSAM is defined as

AF(j) = (x - w_j)^T S⁻¹ (x - w_j) .    (3.27)

The match function is identical to the activation function:

MF(j) = (x - w_j)^T S⁻¹ (x - w_j) ,    (3.28)

or

t_j = (x - w_j)^T S⁻¹ (x - w_j) .    (3.29)

The highest activation is determined by minimizing the activation value: min(t_j).
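A short sketch of the DSAM activation over a set of output nodes follows. Maintaining S⁻¹ directly, as described in the next section, avoids a matrix inversion per input; the plain form is shown here for clarity, and the data structures are illustrative.

    import numpy as np

    def dsam_activation(x, w, S_inv):
        """Squared Mahalanobis distance t_j = (x - w_j)^T S^{-1} (x - w_j)."""
        diff = x - w
        return diff @ S_inv @ diff

    def dsam_choose(x, centers, inv_covs):
        """Pick the output node with the smallest activation (highest similarity)."""
        scores = [dsam_activation(x, w, Si) for w, Si in zip(centers, inv_covs)]
        return int(np.argmin(scores))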

3.3.4 Recurrent Computation for Covariance Matrix

The covariance matrix plays a key role in the computation of Δ². In order to reduce the complexity of computing S, a recurrent computation of the new covariance matrix S' based on the input vector x is introduced [49]:

S_{k+1} = α₁ S_k + α₂ (x - m)(x - m)^T , k = 1, 2, ..., n ,    (3.30)

where

α₁ = (1 - β)(n - 1)/n ,  α₂ = (1 - β²)(n - 1)/n .    (3.31)

The parameter β is the learning rate.

If we suppose large clusters, i.e., n >> 1, then

α₁ ≈ (1 - β) ,  α₂ ≈ (1 - β²) .    (3.32)

The inverse of a matrix is not available when the matrix is badly scaled or nearly singular; in such a case we reuse the previous covariance matrix, which has a valid S⁻¹, in the computation of Δ². Observing that only S⁻¹ is needed to compute Δ², the recurrent method can be stated more effectively by updating the inverse directly [50]:

S⁻¹ := (1/α₁) [ S⁻¹ - (g g^T) / (α₁/α₂ + t) ] ,    (3.33)

where g = S⁻¹(x - m) and t is the value of the activation function for the current input x. The equation can also be stated as

S⁻¹ := γ₁ ( S⁻¹ - (g g^T) / (γ₂ + t) ) ,    (3.34)

where γ₁ and γ₂ are

γ₁ = 1/(1 - β) ,  γ₂ = (1 - β)/(1 - β²) .    (3.35)
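The inverse update of equation (3.34) is an application of the Sherman-Morrison formula to the rank-one covariance update of equation (3.30). The sketch below uses the coefficients as reconstructed in equation (3.35) and falls back to the previous inverse when the update becomes numerically invalid, as the text suggests; the names and the validity test are illustrative.

    import numpy as np

    def update_inverse_cov(S_inv, x, m, beta):
        """Rank-one update of S^{-1} per eq. (3.34), Sherman-Morrison style."""
        g = S_inv @ (x - m)              # g = S^{-1}(x - m)
        t = (x - m) @ g                  # activation value for the current input
        gamma1 = 1.0 / (1.0 - beta)
        gamma2 = (1.0 - beta) / (1.0 - beta ** 2)
        new_inv = gamma1 * (S_inv - np.outer(g, g) / (gamma2 + t))
        # Keep the previous valid inverse if the update is numerically invalid.
        if not np.all(np.isfinite(new_inv)):
            return S_inv
        return new_inv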

3.3.5 Testing DSAM on Two Benchmark Problems

We get the result shown in the following figure for the circle-in-square problem. The figure represents a single case with a better-than-average result. We can observe that the ellipsoid of a cluster can be tilted at an angle; this representation of a cluster is more efficient than those in the SFAM and ESAM, and a smaller number of output nodes is achieved than in the ESAM.


Figure 3-8: DSAM for the circle-in-square problem

The right part of the figure shows the output nodes surrounded by 80% confidence ellipses. The training and test sets each had 100 random samples. The network generated 9 output nodes in two epochs, yielding a 91% hit rate on the test set. The left part of the figure represents verification with 10,000 samples, which yielded 90.2% successful classifications [50]. The results for 100 and 1,000 samples are listed in the following table.

Table 3-1: Performance of DSAM for the circle-in-square problem [50]

DSAM is also able to solve the two-spiral benchmark problem; it solves the problem with a 100% successful rate after one epoch of training with a high vigilance. The result is displayed in the following figure, with the pattern response in the left part and the output nodes in the right part.


Figure 3-9: DSAM for the two-spiral problem


3.3.6 Comparison of DSAM with Gaussian ARTMAP Networks

J. R. Williamson [6] introduced an architecture named Gaussian ARTMAP (GA), a synthesis of a Gaussian classifier and an ART neural network, "achieved by defining the ART choice function as the discrimination function of a Gaussian classifier with separable distributions, and the ART match function as the same, but with the distributions normalized to a unit height" [11]. The Mahalanobis classifier and the Gaussian classifier are similar in the sense that both combine a statistical method with ART theory and both assume a normal distribution of the samples in the sample space. A Gaussian random variable [12] has the probability density function (pdf)

f_X(x) = (1 / (√(2π) σ)) exp( -(x - m)² / (2σ²) ) ,  -∞ < x < ∞ ,    (3.36)

where m is the mean and σ > 0 is the standard deviation. The following figures show examples:


Figure 3-10: Examples of the Gaussian distribution

The Gaussian distribution was chosen because it increases monotonically toward the mean value, which represents the center of a cluster; more importantly, it has useful generalization properties in high-dimensional spaces [11]. Based on Bayes' rule, the a posteriori probability of category j given an input I = (I_1, I_2, I_3, ..., I_M) in the M-dimensional sample space is

p(j | I) = p(I | j) P(j) / p(I) .    (3.37)

The distribution of each cluster represented by an output node is Gaussian:

p(I | j) = (1 / ((2π)^{M/2} Π_{i=1}^{M} σ_{ji})) exp( -(1/2) Σ_{i=1}^{M} ((μ_{ji} - I_i) / σ_{ji})² ) ,    (3.38)

where μ_{ji} is the mean and σ_{ji} is the standard deviation in the ith dimension.

For an input I, the activation function AF(j) of GA is the logarithm of p(j | I); p(I) is ignored because it is the same for all clusters, and P(j) is n_j / N, where n_j is the number of samples in the jth cluster and N is the total number of samples in the sample space:

AF(j) = log( (2π)^{M/2} p(I | j) P(j) )
      = -(1/2) Σ_{i=1}^{M} ((μ_{ji} - I_i) / σ_{ji})² - log( Π_{i=1}^{M} σ_{ji} ) + log P(j) .    (3.39)

The match function is

MF(j) = log( (2π)^{M/2} Π_{i=1}^{M} σ_{ji} p(I | j) ) = -(1/2) Σ_{i=1}^{M} ((μ_{ji} - I_i) / σ_{ji})² .    (3.40)

If we introduce the diagonal matrix

S'_j = diag( σ_{j1}², σ_{j2}², ..., σ_{jM}² ) ,    (3.41)

then the match function becomes

MF(j) = -(1/2) (I - μ_j)^T S'_j⁻¹ (I - μ_j) .    (3.42)

It can be concluded that GA can be viewed as a special case of DSAM under the assumption that the data in different dimensions are uncorrelated, that is, COV(X_i, X_j) = 0 for i ≠ j. The effect of this is that the sample clusters are geometrically covered by a series of ellipsoids whose principal axes are parallel to the coordinate axes of the feature space, while DSAM can adjust the angles of the ellipsoids for better coverage of the clusters [50].
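The relationship is easy to see in code: GA's match function is the DSAM quadratic form restricted to a diagonal covariance. The sketch below is illustrative and uses the -1/2 scaling of equation (3.42).

    import numpy as np

    def ga_match(I, mu, sigma):
        """GA match with a diagonal covariance, eqs. (3.40)/(3.42)."""
        return -0.5 * np.sum(((mu - I) / sigma) ** 2)

    def dsam_match(I, mu, S_inv):
        """DSAM form with a full inverse covariance: allows tilted ellipsoids."""
        diff = I - mu
        return -0.5 * diff @ S_inv @ diff

    # For a diagonal covariance the two coincide:
    # ga_match(I, mu, s) == dsam_match(I, mu, np.diag(1.0 / s**2))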

Figure 3-11: Output nodes and confidence ellipsoids of GA and DSAM

The ellipsoids in the left part of the figure, whose axes are parallel to the coordinate axes, were generated by a GA network; the ellipsoids generated by the DSAM are displayed in the right part, and we can observe that their axes can be tilted at angles. Both types of network were trained with 100 randomly generated samples of the circle-in-square problem. The GA network achieved a successful classification rate of 90% and the DSAM a rate of 94% [50]. The DSAM has an advantage over SFAM and ESAM because the covariance matrix describes the ellipsoid in an efficient manner. GA should have performance close to that of DSAM because they are based on a similar idea; SFAM and ESAM generally create more output nodes [49], although their implementations are simpler.
3.3.7 Merging of Output Nodes

The number of output nodes is an important factor in evaluating a classifier's performance. The classifier must find the output node with the largest activation, which requires calculating the activation function for each output node and sorting the results; the system cost is reduced if fewer output nodes are involved in the computation. The merging method is designed to reduce the number of output nodes. The basic idea of merging is that two output nodes can merge into one if the change does not significantly harm the hit rate. An output node represents a cluster in the sample space. A large number of output nodes can represent a complicated distribution, as demonstrated in the two-spiral benchmark problem; on the other hand, it is not economical to use many output nodes to represent a single section of the sample space that belongs to one category. In a distance-based ART network it is possible to represent a large section of the sample space with a few output nodes. The distance used in ESAM is normalized by the radius parameter R_j, which describes the size of a cluster, so a sphere or hyper-sphere in the sample space can be represented by its center point and a corresponding radius. The merging method is interesting because it provides a way to decrease the number of output nodes. Distance-based ART networks steadily add output nodes during training: new nodes are created whenever some input samples cannot be represented well by the existing nodes, so the total number of output nodes grows without bound over a long period of training. The classifier therefore needs methods like merging to reorganize itself periodically. A merging strategy was tested for ESAM in the present work. The idea is to break the peaceful coexistence of neighboring nodes that belong to the same category: a node invades its neighbors if they have smaller magnitude, which can be measured by the number of training samples a node represents or by the size of its cluster. A large cluster becomes larger and larger through continued training, producing winner nodes that occupy a large section of the sample space. The territories of some other nodes lie inside this large section; ownership of these parts is transferred to a winner node, and the previous owners, the nodes of smaller magnitude, are deleted. The merging method was tested on the circle-in-square benchmark problem. The number of input samples for each node is recorded and used in the activation function to produce a winner node; a node that represents more input samples is more likely to be activated. The result of the merging experiment is demonstrated in the following figure.

Figure 3-12: Merging of output nodes for ESAM

Part (a) of the figure shows that the circle region is represented by only one node after eight epochs of training. Forty-seven nodes had been created after the first training epoch; the nodes began to merge after the third epoch, with some nodes deleted and some new nodes created thereafter, and the total number decreased steadily to twenty-nine nodes after eight epochs of training. Part (b) of the figure reveals a problem with the merging method: the performance dropped significantly after twenty-eight epochs of training, with the hit rate falling by about ten percent.

The system becomes less stable after introducing the merging method; there is a trade-off between adaptability and stability. The system has better adaptability with merging: the initial size of a node becomes less critical because merging lets a node keep growing even if it was created with a small initial size. However, merging also causes nodes to grow without limit, which leads to a deterioration of the learning system. Considering these facts, the merging method remains questionable despite its interesting results, and further improvements are expected in order to make it more reliable and practical. Ms. Hongyu Xu [51] proposed a merging technique for the Mahalanobis distance-based simplified ARTMAP network in her Master's thesis.
