
SPEECH SPECTROGRAM BASED MODEL ADAPTATION FOR SPEAKER IDENTIFICATION

Sabri Gurbuz, John N. Gowdy, and Zekeriya Tufekci

Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA
sabrig@eng.clemson.edu, jgowdy@eng.clemson.edu, ztufekc@eng.clemson.edu
ABSTRACT
Speech signal feature extraction is a challenging research area of great significance to the speaker identification and speech recognition communities. We propose a novel speech spectrogram based spectral model adaptation algorithm. The system is based on dynamic thresholding of speech spectrograms for text-dependent speaker identification. For a given utterance from a target speaker, we aim to find the target speaker among a number of speakers enrolled in the system. Conceptually, the algorithm attempts to increase the spectral similarity for the target speaker while increasing the spectral dissimilarity for each non-target speaker in the enrollment set. It therefore removes aging and intersession-dependent spectral variation in the utterance while preserving the speaker's inherent spectral features. The Hidden Markov Model (HMM) parameters representing each enrolled speaker are adapted for each identification event. The results obtained using speech signals from both the Noisex database and from recordings made in our laboratory environment are promising and demonstrate the robustness of the algorithm for aged and session-dependent utterances. Additionally, we have evaluated the adapted and non-adapted models with data recorded two months after the initial enrollment. The adaptation improves the performance of the system on the aged data from 84% to 91%.

Key Words: Speaker identification, spectral adaptation, speech feature extraction, HMM model.

1. INTRODUCTION
Speaker identification consists of determining whether or not a voice sample provides a sufficient match to a speaker known by the system. The human speech recognition (HSR) system works with partial recognition of key features across frequency and time, in the form of speech signal features.

One important aspect of a spectral adaptation based speaker identification system is robustness to aging and intersession variability. Here, intersession variability refers to the situation where a person's voice can change slightly from one use of the identification system to the next, and aging refers to natural variations over time. A user obtains the best performance from a speaker identification system when performing an identification immediately after training. Over time, however, the user may experience difficulty when using the system because of aging. This can range from several months to years, and the effects of aging may degrade system performance significantly. The spectral variation of a speaker may be small when measured over a short period of time, but this variance increases as time passes. Because of the effects of intersession variability and aging, models that are trained with data from a single utterance have a limited chance of success [1][2]. Therefore, the models need to be trained with data from several training sessions for better performance. One approach to accommodating intersession variability is to hold several initial training sessions for each user in the system. However, the resulting spectral distributions represent fuzzy speaker characteristics, which may reduce the identification capability of the speaker models as well as burden the user with an increased number of training sessions. Another approach, which is more convenient for the user, is to adapt the spectral features of the training utterances of all users to the spectral features of the test utterance spoken by the target user. This allows the spectral similarity between test and training utterances spoken by the same speaker to increase, while the spectral dissimilarity between the test utterance and the training utterances of non-target speakers also increases. Our work considers the latter approach, and we use an existing HMM algorithm for the speaker identification problem.

While the focus of this work is the spectral adaptation of training utterances to the test utterance using HMM adaptation, the technique can also be used to reduce the effects of recording environment distortions such as ambient noise and microphone distortion.

1. Correspondence should be addressed to John N. Gowdy.



Various adaptive training methods have been proposed for speech and speaker identification systems. For example, speaker verification systems have used a neural tree network discriminant model and a Gaussian mixture model (GMM) based statistical model to represent each user [2]. Both HMM and GMM based methods have been proposed that use data from a target speaker to create a model for speaker identification [3][4][1]. These two modeling approaches are based on discriminant and statistical measures, respectively. Far less attention has been devoted to spectral feature based model adaptation. We propose a spectral feature based adaptive model for a text-dependent speaker identification system. The speaker identification system uses ten MFCCs, ten delta coefficients, and one delta energy coefficient. Given a test utterance spoken by a target speaker, the training utterances of all users are spectrally adapted to the test utterance. The resulting performance after spectral adaptation is superior to that obtained by training the models with the original training utterances. The spectral model adaptation process can conveniently be performed just before identification of the user. Before describing the details of the identification process, it is useful to give the steps in the process, as sketched below. These steps are: adapt the training utterances of each speaker to the target utterance (adaptation process); find the observation vectors of the adapted training utterances for each speaker and for the target utterance (data preparation); calculate the HMM parameters for each speaker (training algorithm); and find the model that produces the highest score for the observation sequence belonging to the target utterance (identification process).
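As a concrete illustration of these four steps, the sketch below chains them together in Python. The helper names (adapt_spectrogram, extract_observations, train_hmm, log_likelihood) are hypothetical placeholders for the stages described in the text, not functions defined in the paper.

```python
def identify_speaker(test_utterance, enrollment, adapt_spectrogram,
                     extract_observations, train_hmm, log_likelihood):
    """Sketch of the identification pipeline described above.

    `enrollment` maps each speaker id to a list of training utterances.
    The four callables are hypothetical placeholders for the steps in the
    text: adaptation, data preparation, HMM training, and scoring.
    """
    test_obs = extract_observations(test_utterance)
    scores = {}
    for speaker, train_utts in enrollment.items():
        # 1) adaptation process: adapt each training utterance toward the test utterance
        adapted = [adapt_spectrogram(u, test_utterance) for u in train_utts]
        # 2) data preparation: observation vectors (MFCCs, deltas, delta energy)
        obs = [extract_observations(u) for u in adapted]
        # 3) training algorithm: one HMM per speaker on the adapted observations
        model = train_hmm(obs)
        # 4) identification: score the test observations against this speaker's model
        scores[speaker] = log_likelihood(model, test_obs)
    # the speaker whose model yields the highest score is selected
    return max(scores, key=scores.get)
```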

An overview of this paper is as follows. The proposed algorithm is introduced in Section 2. In Section 3, for completeness, we give a brief overview of the data preparation, HMM training, and identification process. In Section 4, we present experimental results and compare the identification rate with that of the non-adaptive algorithm. Finally, a summary of the results and contributions of this paper is provided in Section 5.

2. SPEECH SPECTROGRAM BASED MODEL ADAPTATION

The speech spectrogram based model adaptation emphasizes the similarity between test utterances and training utterances spoken by a target speaker, while at the same time de-emphasizing the similarity between the test utterance and training utterances spoken by non-target speakers. From this point on, we will call the spectrogram of the test utterance spoken by a target speaker the test spectrogram, and the spectrogram of a training utterance the model spectrogram. Upon receiving a test utterance from a target speaker, we form the test spectrogram and correlate it with each model spectrogram in order to find the index of maximum correlation and the corresponding correlation value. This aligns the starting indices of the test and training utterances. Fig. 1 shows the correlation based alignment of the utterances. The recursive dynamic adaptation process then operates at this correlation index on the model spectrogram. We adapt a threshold value at every iteration and threshold the model spectrogram with the adapted threshold level until the relative correlation coefficient drops below one. Updating the threshold level at each iteration can be represented mathematically as

$T = m_s + \alpha \, (M_s - m_s)$,   (1)

where T is the threshold level, m_s and M_s are the minimum and maximum energy values in the spectrogram, respectively, and alpha is a relative threshold level between zero and one. We repeat this process for each training utterance belonging to every user enrolled in the system. After each iteration the alpha value is updated. The new alpha value is

$\alpha(n) = \alpha(n-1) + \epsilon \, \alpha(n-1)$,   (2)

where epsilon is a small positive number. From the resulting spectral features of the model spectrogram, the speech vectors (observations) are calculated. Then, for each speaker enrolled in the system, each spoken utterance is represented by a sequence of observations O, defined as

$O = o_1, o_2, \ldots, o_T$,   (3)

where o_t is the feature vector observed at time t from the speech frame.
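The following is a minimal Python sketch of the alignment and thresholding loop described by Eqs. (1) and (2). The frame-energy cross-correlation, the normalized-correlation stopping test, and the starting values alpha0 and epsilon are our own simplifying assumptions; the paper does not specify them.

```python
import numpy as np

def adapt_model_spectrogram(test_spec, model_spec, alpha0=0.1, epsilon=0.05):
    """Sketch of the correlation alignment and dynamic thresholding.

    test_spec, model_spec: 2-D arrays (frequency bins x time frames).
    alpha0, epsilon: illustrative values; the paper only states that
    alpha lies in (0, 1) and that epsilon is a small positive number.
    """
    # Align starting indices by cross-correlating the per-frame energies.
    test_energy = test_spec.sum(axis=0)
    model_energy = model_spec.sum(axis=0)
    corr = np.correlate(model_energy, test_energy, mode="full")
    shift = corr.argmax() - (len(test_energy) - 1)
    start = max(shift, 0)
    length = max(min(test_spec.shape[1], model_spec.shape[1] - start), 1)
    aligned = model_spec[:, start:start + length].copy()
    ref = test_spec[:, :length]

    def similarity(a, b):
        # normalized correlation between the two (flattened) spectrograms
        a, b = a.ravel(), b.ravel()
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    alpha, prev = alpha0, similarity(aligned, ref)
    while True:
        # Eq. (1): threshold between the min and max spectrogram energy.
        T = aligned.min() + alpha * (aligned.max() - aligned.min())
        candidate = np.where(aligned >= T, aligned, 0.0)
        current = similarity(candidate, ref)
        if current / (prev + 1e-12) < 1.0:   # relative correlation drops below one
            break
        aligned, prev = candidate, current
        alpha = alpha + epsilon * alpha      # Eq. (2): grow alpha each iteration
    return aligned
```

The adapted model spectrogram returned by this sketch is what the observation vectors would then be computed from, once per identification event.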


Fig. 2 shows the adaptation of the model parameters. The speaker identification problem can then be regarded as that of computing

$\hat{\imath} = \arg\max_{i} \, P(w_i \mid O)$,   (4)

where w_i is the ith speaker enrolled in the system. This probability can be rewritten using Bayes' rule:

$P(w_i \mid O) = \dfrac{P(O \mid w_i) \, P(w_i)}{P(O)}$.   (5)

Thus, for a given set of prior probabilities P(w_i), the most probable speaker depends only on the likelihood P(O | w_i). Given the high dimensionality of the observation sequence O, direct estimation of the joint conditional probability P(o_1, o_2, ... | w_i) from the spoken utterance is not practical. However, using a parametric model such as an HMM, this likelihood can be estimated from a limited amount of training data.


Figure 1: Correlation based alignment of the test and model spectrograms.

Figure 2: Dynamic adaptation of training utterances to the test utterance.

3. DEFINING HMM MODELS AND SYSTEM SETUP


In HMM based speaker recognizers, an HMM represents each speaker uniquely, so that for a given observation sequence O we can find the underlying model that generates the maximum output probability defined by Eq. (4). Fig. 3 shows an HMM representation of a speaker. The following three subsections outline the basic steps for building a speaker identification system.

3.1. Data Preparation


Spectral properties of the speech signal vary with time, since the vocal tract shape changes as a function of time to produce the desired speech sounds. The human ear resolves frequencies non-linearly across the audio spectrum [5]. Therefore, designing a front-end that operates in a similar non-linear manner improves identification performance. The performance of an identification system can also be greatly enhanced by adding time derivatives to the basic static parameters [6][7]. An MFCC, delta coefficient, and delta energy based parameterization method utilizing Hamming windows was selected for our speaker identification system.

Figure 3: A text-dependent HMM representation of a speaker.


These coefficients are referred to as the observations O in this paper. The MFCCs are calculated by taking the DCT of the Mel-scaled log filter bank energies as shown below:

$\mathrm{MFCC}_k = \sqrt{\dfrac{2}{N}} \sum_{n=1}^{N} X(n) \cos\!\left(\dfrac{\pi k (n - 0.5)}{N}\right)$,   (6)

where k = 1, 2, ..., M, M < N, N is the number of filter banks, MFCC_k represents the kth MFCC, and X(n) represents the log-energy output of the nth filter. Fig. 4 shows the Mel-scaled filter banks. The filters used are triangular, and they are equally spaced along the Mel scale, which is defined by

$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \dfrac{f}{700}\right)$.   (7)

Figure 4: Mel-scaled filter banks for a 512-point FFT.

The observations obtained from the training utterances are then used in the Viterbi training algorithm for estimation of the model parameters.
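As an illustration of Eqs. (6) and (7), the sketch below builds triangular Mel-spaced filters, computes log filter bank energies from a single power-spectrum frame, and applies the DCT. The sampling rate, number of filters, and filter construction details are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def mel(f):
    # Eq. (7): Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_spectrum(power_spectrum, sample_rate=8000, n_filters=20, n_ceps=10):
    """Sketch: MFCCs as the DCT of Mel-scaled log filter bank energies (Eq. 6)."""
    n_bins = len(power_spectrum)                       # e.g. 257 bins for a 512-point FFT
    freqs = np.linspace(0, sample_rate / 2, n_bins)
    # N triangular filters equally spaced on the Mel scale
    mel_points = np.linspace(mel(0.0), mel(sample_rate / 2), n_filters + 2)
    edges = inv_mel(mel_points)
    energies = np.zeros(n_filters)
    for n in range(n_filters):
        lo, center, hi = edges[n], edges[n + 1], edges[n + 2]
        rising = np.clip((freqs - lo) / (center - lo), 0, None)
        falling = np.clip((hi - freqs) / (hi - center), 0, None)
        weights = np.minimum(rising, falling)          # triangular filter weights
        energies[n] = np.log(power_spectrum @ weights + 1e-12)   # X(n): log filter bank energy
    # Eq. (6): MFCC_k = sqrt(2/N) * sum_n X(n) * cos(pi * k * (n - 0.5) / N)
    n_idx = np.arange(1, n_filters + 1)
    return np.array([np.sqrt(2.0 / n_filters) *
                     np.sum(energies * np.cos(np.pi * k * (n_idx - 0.5) / n_filters))
                     for k in range(1, n_ceps + 1)])
```

In the system described here, ten such cepstral coefficients per frame would be combined with ten delta coefficients and one delta energy coefficient to form each observation vector o_t.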

3.2. Viterbi Training

For the initialization of a new HMM, a uniform segmentation is used. That is, each training observation sequence is divided into N equal segments for the first iteration, where N represents the number of emitting states in the HMM. In this style of model training, a set of training observation sequences O^r, where 1 <= r <= R and R represents the number of training utterances for a user, is used to estimate the parameters of a single HMM by iteratively computing Viterbi alignments. After the first iteration, each training observation sequence O is segmented by maximizing the likelihood P(O | M_i), where M_i represents the HMM for the ith user, using a state alignment procedure which maximizes

$\phi_N(T) = \max_i \left[ \phi_i(T) \, a_{iN} \right]$,   (8)

for 1 < i < N and 1 < j < N, where

$\phi_j(t) = \left[ \max_i \left( \phi_i(t-1) \, a_{ij} \right) \right] b_j(o_t)$,

with initial conditions given by

$\phi_1(1) = 1, \qquad \phi_j(1) = a_{1j} \, b_j(o_1)$.

If A_ij represents the total number of transitions from state i to state j in performing the above maximization, then the transition probabilities can be estimated from the relative frequencies as

$a_{ij} = \dfrac{A_{ij}}{\sum_{k} A_{ik}}$.   (9)

The sequence of states which maximizes phi_N(T) implies an alignment of the training data observations with states [8][6]. Within each state, a further alignment of observations can be made over mixture components; we have selected a single mixture component in this work. For the jth state, the mean vectors and variances are defined by

$\mu_{jk} = \dfrac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi_{jt}^{r} \, o_{tk}^{r}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi_{jt}^{r}}$,   (10)

$\sigma_{jk}^{2} = \dfrac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi_{jt}^{r} \left( o_{tk}^{r} - \mu_{jk} \right)^{2}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi_{jt}^{r}}$,   (11)

where T_r is the number of observation vectors in the rth training utterance, psi_jt^r is an indicator function which is 1 if o_t^r is assigned to state j and zero otherwise, and k = 1, 2, ..., K, with K the number of coefficients in each vector o_t (in our work o_t contains 10 MFCCs, 10 delta coefficients, and 1 delta energy coefficient). Since we assume the coefficients are independent, the output distribution b_j(o_t) for state j is defined as

$b_j(o_t) = \prod_{k=1}^{K} \mathcal{N}\!\left( o_{tk}; \, \mu_{jk}, \, \sigma_{jk}^{2} \right)$.   (12)
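Given a Viterbi state alignment, the statistics of Eqs. (10)-(12) reduce to per-state sample means and variances. The sketch below assumes the alignment is already available as an array of state indices per frame, which is a simplification of the iterative procedure described above.

```python
import numpy as np

def estimate_state_gaussians(observations, alignments, n_states):
    """Sketch of Eqs. (10)-(11): per-state means/variances from Viterbi alignments.

    observations: list of arrays, one per training utterance, shape (T_r, K).
    alignments:   list of arrays of state indices in [0, n_states), shape (T_r,).
    """
    K = observations[0].shape[1]
    means = np.zeros((n_states, K))
    variances = np.ones((n_states, K))
    for j in range(n_states):
        # gather every frame assigned to state j across all R utterances
        frames = np.vstack([o[a == j] for o, a in zip(observations, alignments)])
        if len(frames) == 0:
            continue  # state never visited; keep default parameters
        means[j] = frames.mean(axis=0)
        variances[j] = frames.var(axis=0) + 1e-6   # floor to keep densities finite
    return means, variances

def log_output_prob(o_t, mean, var):
    # Eq. (12) in the log domain: diagonal Gaussian, coefficients assumed independent
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (o_t - mean) ** 2 / var))
```

In the full training loop, these estimates and the transition counts of Eq. (9) would be recomputed after each new Viterbi alignment until the alignments stop changing.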



Figure 5: A sample structure of Viterbi decoding.

3.3. Viterbi Decoding: Identification

At this point we assume that a trained set of models is ready to be used to recognize the utterance of a target speaker. The algorithm searches for the model which yields the maximum likelihood P(O | M), and the identity of the speaker is assigned to the model that produces the highest score. P(O | M) is computed using the recursion of Eq. (8) in Section 3.2. Direct computation of these likelihoods has disadvantages because of the product terms in Eq. (8), which may lead to precision loss [6]; therefore, log likelihoods are used instead, and the recursion of Eq. (8) becomes

$\phi_j(t) = \max_i \left[ \phi_i(t-1) + \log a_{ij} \right] + \log b_j(o_t)$.   (13)

This recursion is the Viterbi algorithm [8][5]. Fig. 5 shows an example of an HMM and its state transition paths. The algorithm can be visualized as finding the best path through a matrix in which the vertical dimension represents the states of the HMM and the horizontal dimension represents the time frames of speech. Each large dot in the figure represents the log probability of observing that frame at that time, and each arc between dots corresponds to a log transition probability. The log probability of any path is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left to right, column by column. At time t, each partial path score phi_i(t-1) is known for all states i; hence Eq. (13) can be used to compute phi_j(t), thereby extending the partial paths by one time frame.
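A minimal sketch of this log-domain Viterbi scoring is shown below. The uniform entry into any state at t = 1 and the dense transition matrix are simplifications; a left-to-right topology with dedicated entry and exit states, as in Fig. 3, would constrain which transitions are allowed.

```python
import numpy as np

def viterbi_log_likelihood(obs, log_A, log_b):
    """Log-Viterbi score of an observation sequence against one speaker model.

    obs:   array of shape (T, K), the observation vectors o_1..o_T.
    log_A: array of shape (N, N), log transition probabilities log a_ij.
    log_b: callable (j, o_t) -> log output probability log b_j(o_t).
    """
    T, N = len(obs), log_A.shape[0]
    phi = np.full((T, N), -np.inf)
    # initialization: allow entry into any state at the first frame
    for j in range(N):
        phi[0, j] = log_b(j, obs[0])
    # recursion (Eq. 13): grow the best partial paths left to right, column by column
    for t in range(1, T):
        for j in range(N):
            phi[t, j] = np.max(phi[t - 1] + log_A[:, j]) + log_b(j, obs[t])
    # termination: best complete path score for this model
    return float(np.max(phi[-1]))
```

For identification, this score is computed once per enrolled speaker's model, and the speaker whose model yields the highest score is selected.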

4. EXPERIMENTAL RESULTS

All results discussed in this paper were produced using recordings made in a laboratory environment. In our identification experiments, we employed 6 enrolled speakers for training and testing. To evaluate the algorithm, we tested both the adapted and the non-adapted models with data recorded two months after the initial enrollment. The identification database contains two data sets whose collections are separated by a two-month period. The first set contains ten repetitions of each person speaking the word "eight". The first five recordings from the initial enrollment were used as the training utterances for each speaker; the remaining five recordings from the initial enrollment form the recent set. The second set contains ten more repetitions by each speaker recorded two months after the initial enrollment, and is called the aged set. Four testing scenarios were used for this paper. In each case, all training utterances were the same and were taken from the initial enrollment: test the system on the recent set using the non-adaptive system; test the system on the recent set using the adaptive system; test the system on the aged set using the non-adaptive system; and test the system on the aged set using the adaptive system. Table 1 displays the performance for these experiments. Several observations can be made when inspecting the table. First, for the recent set, both the adaptive and non-adaptive systems achieve an identification accuracy of 100%. Second, for the aged set, because of spectral variation, the non-adaptive system has an accuracy of 84%, while the adaptive system improves the identification rate to 91%.

5. CONCLUSION

This paper examines spectral adaptation methods for HMM training in a text-dependent speaker identification system. Table 1 shows that performance for both systems degrades on the aged data set. This is to be expected, since the identification models are trained on data collected at the initial enrollment. It was shown that the adaptive model improves the identification performance on the aged set from 84% to 91%, while the performance on the recent set using spectral adaptation is comparable to that achieved by training the models without spectral adaptation.


Table 1: Identification performance for the recent and aged sets using the non-adaptive and adaptive systems.

Test set     # of speakers   Test set size   Non-adapted   Adapted
Recent set   6               30              100%          100%
Aged set     6               60              84%           91%

6. REFERENCES
[1] S. Furui, "Comparison of Speaker Recognition Methods Using Statistical Features and Dynamic Features," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 3, 1981.

[2] W. Mistretta and K. Farrell, "Model Adaptation Methods for Speaker Verification," in Proceedings of ICASSP, vol. 2, 1998.

[3] J. He and L. Liu, "Speaker Verification Performance and the Length of the Test Sentence," in Proceedings of ICASSP, vol. 1, 1999.

[4] T. Matsui and K. Aikawa, "Robust Model for Speaker Verification Against Session-Dependent Utterance Variation," in Proceedings of ICASSP, vol. 2, 1998.

[5] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. Macmillan, 1993.

[6] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (Version 2.1). Entropic Cambridge Research Laboratory Ltd., 1997.

[7] H. Gish, M. Schmidt, and A. Mielke, "A Robust, Segmental Method for Text-Independent Speaker Identification," in Proceedings of ICASSP, vol. 1, 1994.

[8] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, 1989.

