
Proceedings of SPS-DARTS 2005 (The 2005 First Annual IEEE BENELUX/DSP Valley Signal Processing Symposium)

RECOGNITION OF ISOLATED DIGITS USING A LIQUID STATE MACHINE

David Verstraeten and Benjamin Schrauwen and Jan Van Campenhout

{dvrstrae,bschrauw,jvc}@elis.ugent.be
UGent, Department of Electronics and Information Systems (ELIS)
Sint-Pietersnieuwstraat 41, 9000 Gent

ABSTRACT

The Liquid State Machine (LSM) is a recently developed computational model [1] with interesting properties. It can be used for pattern classification, function approximation and other complex tasks. Contrary to most common computational models, the LSM does not require information to be stored in some stable state of the system: the inherent dynamics of the system are used by a memoryless readout function to compute the output. We apply this framework to the practical task of isolated word speech recognition. We investigate two different speech front ends and different ways of coding the inputs into spike trains. The robustness against noise added to the speech is also briefly investigated. It turns out that a biologically realistic configuration of the LSM gives the best result, and that its performance rivals that of a state-of-the-art speech recognition system.

1. INTRODUCTION

Neural networks are known to offer robust solutions for many classification problems. However, classical neural network models suffer from two fundamental drawbacks. First, problems that are inherently temporal in nature (such as speech recognition, object tracking or robot control) need to be reduced to a series of static inputs which can only be evaluated separately and sequentially by the network, which results in a loss of important time-dependent information. Second, neural networks can take a long time to learn a certain task when compared to more conventional statistical methods.

The Liquid State Machine (LSM) does not have these disadvantages. The LSM is a recent computational model whose structure is depicted in Figure 1. The liquid consists of a recurrent network of non-linear interacting computational nodes with an internal state. The set of all the internal states of these nodes forms the liquid state. In this case the liquid is built from spiking neurons [2], but other possibilities exist: see [3], [4] or [5].

Figure 1: The liquid state machine. The input u(t) drives the liquid; the resulting liquid state x(t) is mapped by a memoryless readout function f^M onto the output y(t).

The recurrent connections between the nodes in the liquid cause inputs to the network to remain detectable for a long time after they have been presented. This property is called temporal integration, and indicates the origin of the term liquid in this context: in a real liquid, disturbances from the environment can also remain detectable for a certain time (for instance the ripples in a pond caused by a pebble thrown in). This means that the liquid state at a certain time gives a spatial snapshot of the temporal information contained within inputs from the past. The liquid state is used by a memoryless readout function to compute the actual output of the LSM. For this publication we used a simple linear discriminant as classifier, but any pattern analysis or classification algorithm is possible. The advantage of this approach lies in the effort needed to train the LSM: the network itself is not changed, only the readout function is trained, which can generally be done much faster.

The LSM combines the advantages of a conventional neural network (i.e. robustness and the ability to generalize from examples) with two significant improvements: an inherent ability to incorporate temporal information into its computations, and a drastically reduced effort needed to train the classifier (compared to training the liquid, which is a recurrent network). A more complete description of this computational framework can be found in [1].

We tested this LSM on the rather practical task of recognizing isolated spoken digits. For our experiments, we used two different speech front ends: Mel Frequency Cepstral Coefficients (MFCC), a technique often used in traditional speech processing systems, and the Lyon Passive Ear model, a biologically realistic model of the human inner ear. We also investigated three different ways of coding the output of these speech processing steps into spike trains: the well-known Poisson spike coding, the BSA filter coding scheme, and a Leaky Integrate & Fire (LIF) neuron [2]. Finally, we briefly investigate the robustness of our word recognizer against noise added to the speech inputs.


2. EXPERIMENTAL SETUP

The experiments for this publication were done using CSIM, a neural simulator written in Matlab. The parameters of the neurons were inspired by biological data and are taken from [6] (due to page limitations those parameter settings are not reproduced here).

The connections between the nodes of the liquid are formed in a stochastic manner: the probability of a connection being placed between nodes a and b is given by P_conn(a, b) = C · e^(−D²(a, b)/λ²), with D(a, b) the Euclidean distance (in 3D space) between the two nodes. Both the average number of connections and the average distance between connected neurons are controlled by the parameter λ. In our case λ was chosen (as in [6]) so that mainly local connections are formed.
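As an illustration of this stochastic construction, the following Python sketch draws connections according to P_conn(a, b) = C · e^(−D²(a, b)/λ²). The values of C and λ and the example grid are placeholders rather than the parameters taken from [6].

```python
import numpy as np

def build_liquid_connections(positions, C=0.3, lam=2.0, rng=None):
    """Stochastically connect liquid nodes: the probability of a connection
    between nodes a and b is C * exp(-D(a, b)^2 / lambda^2), with D the
    Euclidean distance between the two node positions in 3D space."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(positions)
    # Pairwise squared Euclidean distances between all node positions.
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    # Connection probabilities; no self-connections.
    p = C * np.exp(-dist2 / lam ** 2)
    np.fill_diagonal(p, 0.0)
    return rng.random((n, n)) < p

# Example: neurons placed on an (arbitrary) 3 x 3 x 15 integer grid.
grid = np.array([[x, y, z] for x in range(3) for y in range(3) for z in range(15)],
                dtype=float)
conn = build_liquid_connections(grid)
print("average number of outgoing connections:", conn.sum(axis=1).mean())
```

Because the probability decays with the squared distance, a small λ yields mostly local connections, which is the regime used in our experiments.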
Because of the probabilistic construction of the liquids and their connections, there exists some variation between the performance of different liquids of the same size. We therefore simulated different liquids of the same size and calculated the average performance for a given size, but we were sometimes hindered by memory limitations. Since a liquid of a given size can still have a varying number of internal connections, we were not always able to evaluate every liquid size the same number of times.

For our experiments, we used a subset of the TI46 speech corpus, consisting of ten different utterances of the digits ‘zero’ to ‘nine’, spoken by five different speakers. The training set consisted of a random selection of 300 samples, and the test set of the remaining 200 samples.

We used a linear discriminant as readout function, i.e. a simple linear projection of the liquid space into the output space using a weight matrix w:

    C[x(t)] = Θ[w · x(t) + w0].    (1)

The weight matrix is found using pseudo matrix inversion. Previous research [7] has shown that this simple readout function performs better in this case than more advanced classifiers such as a Fisher discriminant or a pool of parallel perceptrons. The readout function computes an output based on the liquid state (a filtered version of the output spikes of all neurons) every 20 ms, and the final output of the LSM based on a given input is determined using the winner-take-all principle after the full input sample has been presented.

The performance of the LSM is expressed as the Word Error Rate (WER): the fraction of incorrectly classified words as a percentage of the total number of presented words, WER = 100 · N_nc/N_tot, with N_nc the number of incorrectly classified samples and N_tot the total number of samples presented.
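The sketch below makes this procedure concrete: the liquid spikes are low-pass filtered into analog states, the readout weights are found with the Moore-Penrose pseudo-inverse, and the per-utterance decision and the WER are computed. It is only an illustration; the exponential filter time constant, the replacement of the hard threshold Θ by an argmax over the class outputs, and the reading of winner-take-all as a majority vote over the 20 ms decisions are assumptions, not the exact CSIM implementation.

```python
import numpy as np

def liquid_states(spike_times, n_neurons, duration, dt=0.02, tau=0.03):
    """Turn the liquid response (a list of spike-time arrays, one per neuron)
    into analog liquid states by exponential filtering, sampled every dt seconds.
    The time constant tau is an assumption."""
    steps = np.arange(dt, duration + 1e-9, dt)
    x = np.zeros((len(steps), n_neurons))
    for i, t in enumerate(steps):
        for n in range(n_neurons):
            past = np.asarray([s for s in spike_times[n] if s <= t])
            x[i, n] = np.exp(-(t - past) / tau).sum() if past.size else 0.0
    return x

def train_readout(states, frame_labels, n_classes):
    """Least-squares readout: one weight column per class, found with the
    Moore-Penrose pseudo-inverse. A constant column provides the bias w0.
    frame_labels holds the digit class of the utterance, repeated per state."""
    X = np.hstack([states, np.ones((len(states), 1))])
    T = np.eye(n_classes)[frame_labels]        # one-hot target per 20 ms state
    return np.linalg.pinv(X) @ T               # shape (n_neurons + 1, n_classes)

def classify(states, W):
    """Per-frame decision every 20 ms, combined by a majority vote
    (one possible reading of the winner-take-all rule)."""
    X = np.hstack([states, np.ones((len(states), 1))])
    votes = np.argmax(X @ W, axis=1)
    return np.bincount(votes).argmax()

def word_error_rate(predictions, truth):
    """WER = 100 * N_nc / N_tot."""
    predictions, truth = np.asarray(predictions), np.asarray(truth)
    return 100.0 * np.mean(predictions != truth)
```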
3. SPEECH CODING

It is common practice for speech recognition applications to preprocess the sound in order to enhance the speech-specific features and thus facilitate the recognition process. Many different preprocessing techniques exist, and in this publication we compare the performance of two of these algorithms: the commonly used MFCC technique, and the biologically realistic Lyon Passive Ear model.

3.1. Mel-Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients (MFCC) form a very popular - if not the de facto standard - preprocessing technique in the field of speech recognition. The MFCC are calculated as follows: (1) the sample data is windowed using a Hamming window and an FFT is computed, (2) the magnitude spectrum is run through a so-called mel-scale filter bank (a mel scale is a non-linear transformation of the frequency domain that models the human selectivity to certain frequency bands) and the log10 of these values is computed, (3) a cosine transform is applied to reduce the dimensionality and to enhance the speech-specific features of the input. The result is the so-called cepstrum.
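The following numpy sketch walks through these three steps for a single frame. The frame length, the number of mel filters and the number of retained coefficients are common textbook choices, not necessarily the settings of the front end used here.

```python
import numpy as np

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=13):
    """The three MFCC steps for one speech frame: (1) Hamming window + FFT,
    (2) mel-scale filter bank + log10, (3) cosine transform (DCT-II)."""
    # (1) Window the samples and compute the magnitude spectrum.
    windowed = frame * np.hamming(len(frame))
    mag = np.abs(np.fft.rfft(windowed))

    # (2) Triangular filters spaced linearly on the mel scale:
    #     mel = 2595 * log10(1 + f / 700).
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(mag)))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    log_energies = np.log10(fbank @ mag + 1e-10)

    # (3) DCT-II of the log filter-bank energies; keep the first n_ceps values.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_energies                  # the cepstral coefficients
```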
3.2. Lyon Passive Ear

The Lyon Passive Ear model [8] is a model of the human inner ear or cochlea, which describes the way the cochlea transforms a sound into a series of neural spikes generated by the hair cells inside the inner ear. The model consists of a filter bank which closely resembles the frequency selectivity of the human ear, followed by a series of half-wave rectifiers and adaptive gain controllers, both modeling the hair cell response.

This form of preprocessing is computationally more intensive than the MFCC front end: it takes about three to five times as long to compute on a conventional processor.
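The full Lyon model is considerably more elaborate (a cascade of coupled filter and gain-control stages), so the sketch below only illustrates the general chain described above: a band-pass filter bank followed by half-wave rectification and a simple per-channel automatic gain control. The channel spacing, AGC rule and all parameters are made-up placeholders, not Lyon's actual design.

```python
import numpy as np
from scipy.signal import butter, lfilter

def toy_cochleagram(sound, sample_rate, n_channels=20, agc_target=0.05, agc_tau=0.02):
    """Rough stand-in for a passive-ear front end: band-pass filter bank,
    half-wave rectifier and a first-order AGC per channel."""
    edges = np.geomspace(100.0, 0.45 * sample_rate, n_channels + 1)
    alpha = 1.0 - np.exp(-1.0 / (agc_tau * sample_rate))    # AGC smoothing factor
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo, hi], btype="bandpass", fs=sample_rate)
        band = lfilter(b, a, sound)
        rectified = np.maximum(band, 0.0)                   # half-wave rectification
        gain, envelope, out = 1.0, 0.0, np.empty_like(rectified)
        for i, v in enumerate(rectified):
            out[i] = gain * v
            envelope += alpha * (out[i] - envelope)         # track the output level
            gain = max(gain * (1.0 + alpha * (agc_target - envelope)), 0.0)
        channels.append(out)
    return np.array(channels)   # (n_channels, n_samples): analog "hair cell" outputs
```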


4. SPIKE CODING

The neurons in the liquid communicate with spikes. This means that the output from the speech processing front ends needs to be converted from analog values into a series of spike trains. Many different methods exist to encode analog values into spike trains (see e.g. [9]), such as time to first spike coding, population coding and correlation coding. In this publication we compare three different methods for extracting the spike trains from the analog outputs of the speech front ends.

First, we use a standard Poisson coding, whereby the spike trains are generated by interpreting the analog outputs of the speech processing as instantaneous firing rates. The spikes are generated by a Poisson process based on these firing rates. Thus, the instantaneous firing frequency is directly proportional to the analog value that is being encoded.
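A minimal sketch of this Poisson coding is given below; the time step and the maximum firing rate are illustrative assumptions.

```python
import numpy as np

def poisson_spike_train(values, dt=0.001, max_rate=200.0, rng=None):
    """Interpret an analog signal (assumed normalized to [0, 1]) as an
    instantaneous firing rate and draw spikes from a Poisson process:
    in each time step a spike occurs with probability rate * dt."""
    rng = np.random.default_rng() if rng is None else rng
    rates = np.clip(values, 0.0, 1.0) * max_rate    # Hz, proportional to the input
    return rng.random(len(rates)) < rates * dt
```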
Secondly, we use a LIF neuron as a way to code the analog values into spike trains: this biologically realistic model of a real neuron takes an input current (an analog value) as input, and produces spikes in response to this current. This coding scheme is computationally the most intensive due to the complexity of the model.
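The sketch below shows the idea of this LIF coding; the membrane time constant, threshold and input gain are placeholders, not the biologically inspired parameters taken from [6].

```python
import numpy as np

def lif_encode(current, dt=0.001, tau_m=0.03, v_thresh=1.0, v_reset=0.0, gain=5.0):
    """Encode an analog signal as the spike train of a leaky integrate-and-fire
    neuron driven by that signal as its input current."""
    v = 0.0
    spikes = np.zeros(len(current), dtype=bool)
    for i, I in enumerate(current):
        v += dt * (-v + gain * I) / tau_m     # leaky integration of the input
        if v >= v_thresh:                     # threshold crossing: spike and reset
            spikes[i] = True
            v = v_reset
    return spikes
```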
Finally, we use BSA [10] to code the speech processing output into spike trains. BSA is a heuristic encoding algorithm that assumes the decoding of the spike trains will take place using a linear filter. This is the case for our implementation of the LSM: the liquid response (i.e. the set of all spike times of the neurons in the liquid) is decoded into the liquid state (analog values) using an exponential filter, before being fed into the readout function.
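The details of BSA are given in [10]; the sketch below follows one common formulation of the heuristic: a spike is emitted whenever subtracting the decoding filter from the remaining signal lowers the reconstruction error by more than a threshold. The filter shape, the threshold and the signal scaling are placeholders.

```python
import numpy as np

def bsa_encode(signal, fir, threshold=0.05):
    """Heuristic BSA-style encoding: fire a spike at time t if replacing the
    signal segment by the FIR decoding filter reduces the absolute error,
    then subtract the filter so a linear decoder can reconstruct the signal."""
    s = np.array(signal, dtype=float)
    n, m = len(s), len(fir)
    spikes = np.zeros(n, dtype=bool)
    for t in range(n - m):
        window = s[t:t + m]
        err_with_spike = np.abs(window - fir).sum()
        err_without = np.abs(window).sum()
        if err_with_spike <= err_without - threshold:
            spikes[t] = True
            s[t:t + m] -= fir          # remove the decoded contribution
    return spikes

# Example decoding filter: an exponentially decaying kernel (an assumption),
# which mirrors the exponential filter used to obtain the liquid state.
fir = np.exp(-np.arange(24) / 8.0)
fir /= fir.sum()
```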
5. RESULTS

The results of our experiments are given in Figure 2. It appears that the standard MFCC technique is not well suited to be used as a speech front end for the LSM. Performance is poor and for two coding schemes (LIF and Poisson) does not even increase for larger liquids. We note however that the performance for the BSA coding is worse on the training set than for the other schemes, but is better on the test set. This suggests that the LSM generalizes better using BSA in the case of an MFCC front end. The Lyon Passive Ear front end performs far better: the larger liquids achieve an average WER of 3% using the LIF coding scheme, and the best liquid attained a WER of 0.5%. Note that the BSA and LIF coding schemes have comparable performance on the training and test set, but that the BSA coding scheme is computationally much faster.

Figure 2: Performance results (WER in % versus liquid size) for the MFCC front end (left) and the Lyon front end (right) using three different spike coding schemes. Note the difference in scale on the WER axis between both figures.

For comparison, we present two other speech recognition systems tested with a comparable dataset. Sphinx4 is a recent speech recognition system developed by Sun Microsystems [11], using Hidden Markov Models (HMMs, a widespread technique for doing speech recognition) and an MFCC front end. When it is applied to the full TI46 database, a word error rate (WER) of 0.168% is achieved. The best LSM from our experiments achieved a WER of 0.5%. While slightly worse than the state of the art, we point out that the LSM offers a number of advantages over HMMs: HMMs tend to be sensitive to noisy inputs and are usually biased towards a certain speech database. An additional comparison can be made by looking at the results described in [12], where a recurrent spiking neural network with so-called Long Short-Term Memory is used. It is trained on the same subset of TI46, and a WER of 2% was achieved.

We can conclude that the LSM passes the test of isolated word recognition very well: it rivals the performance of standard HMM based techniques and outperforms other kinds of SNN solutions.
6. NOISY INPUTS

We also briefly investigated the noise robustness of the LSM by adding three different types of noise commonly found in day-to-day environments: speech babble (B), white noise (W) and car interior noise (C) from the NOISEX noise database. We trained a random liquid of 1232 neurons on clean data, added noise to the test set at sound levels of 30, 20 and 10 dB, and tested the performance on this corrupted test set. The data was preprocessed using the Lyon Passive Ear model and coded into spike trains using BSA.
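The exact mixing procedure is not spelled out here; the sketch below shows one plausible interpretation of adding noise at sound levels of 30, 20 and 10 dB, namely scaling the noise so that the speech-to-noise power ratio matches the target SNR.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng=None):
    """Mix a (longer) noise recording into a speech signal at a target SNR in dB.
    This is an assumption about the procedure, not a description of it."""
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(segment ** 2)
    # Scale so that 10 * log10(p_speech / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * segment
```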
We also cite the best results from [13], which describes the Log Auditory Model (LAM), a speech front end specifically designed to be noise-robust, tested using an HMM based speech recognition system. However, the dataset in this case consists of isolated digits from the TIDIGITS database, which is not identical but still comparable to our dataset. Therefore, the results in Table 1 are indicative rather than quantitative. Also, note that here the performance is expressed as a recognition score to allow easy comparison.

Noise  System   Clean    30 dB    20 dB    10 dB
C      LSM      95.5%    89.5%    87%      84.5%
       LAM      98.8%    98.6%    98.8%    98.6%
B      LSM      -        91.5%    89.5%    82%
       LAM      -        98.4%    93.2%    72.5%
W      LSM      -        88%      82.5%    79.5%
       LAM      -        98.4%    95.7%    72.7%

Table 1: The robustness of the LSM against different types of noise (recognition scores; C = car interior noise, B = speech babble, W = white noise).

It turns out that the LSM has good noise robustness. The HMM performs better for low noise levels, but the performance of the LSM decays more gradually with increasing noise levels. Results from [1] show that this noise robustness is also observed for other types of input than speech.



7. CONCLUSION

In this paper we used an LSM based on a spiking neural network to perform the task of isolated word recognition with a limited vocabulary. We investigated two speech front ends: the standard MFCC technique, and the biologically realistic Lyon Passive Ear model. We used three different ways of generating the spike trains: the simple Poisson coding, the BSA filter coding scheme and the biological LIF model. We also investigated the sensitivity to different types of noise commonly found in real-world applications.

We have shown that the LSM performs well for this task. We have found that the performance is far better when the speech is preprocessed using a biologically realistic model of the human cochlea than when using the classic MFCC speech front end, especially when the spikes are generated using a realistic neural model. Note that the BSA heuristic encoding scheme performs almost as well as the LIF encoder, but is much faster. Furthermore, our results show an excellent robustness of the LSM to different types of noise.

8. REFERENCES

[1] W. Maass, T. Natschläger, and H. Markram, "Real-time computing without stable states: A new framework for neural computation based on perturbations," Neural Computation, vol. 14, no. 11, pp. 2531-2560, 2002.

[2] W. Gerstner and W. M. Kistler, Spiking Neuron Models, Cambridge University Press, 2002.

[3] H. Jaeger, "The “echo state” approach to analysing and training recurrent neural networks," Tech. Rep. GMD Report 148, German National Research Center for Information Technology, 2001.

[4] T. Natschläger, N. Bertschinger, and R. Legenstein, "At the edge of chaos: Real-time computations and self-organized criticality in recurrent neural networks," in Proc. of NIPS 2004, Advances in Neural Information Processing Systems, 2004.

[5] C. Fernando and S. Sojakka, "Pattern recognition in a bucket," in Proc. 7th European Conference on Artificial Life, 2003, pp. 588-597.

[6] W. Maass, T. Natschläger, and H. Markram, "A model for real-time computation in generic neural microcircuits," Proc. of NIPS 2002, vol. 15, pp. 229-236, 2003.

[7] D. Verstraeten, "Een studie van de Liquid State Machine: een woordherkenner," M.S. thesis, Ghent University, ELIS department, 2004.

[8] R. F. Lyon, "A computational model of filtering, detection and compression in the cochlea," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, May 1982.

[9] P. Dayan and L. F. Abbott, Theoretical Neuroscience, MIT Press, Cambridge, MA, 2001.

[10] B. Schrauwen and J. Van Campenhout, "BSA, a fast and accurate spike train encoding scheme," in Proceedings of the International Joint Conference on Neural Networks, 2003.

[11] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, "Sphinx-4: A flexible open source framework for speech recognition," Tech. Rep., Sun Microsystems Inc., 2004.

[12] A. Graves, D. Eck, N. Beringer, and J. Schmidhuber, "Biologically plausible speech recognition with LSTM neural nets," in Proc. of Bio-ADIT, 2004.

[13] Y. Deng, S. Chakrabartty, and G. Cauwenberghs, "Analog auditory perception model for robust speech recognition," in Proc. IEEE Int. Joint Conf. on Neural Networks, 2004.
