
SPEAKER DETECTION AND TRACKING FOR TELEPHONE TRANSACTIONS*

Jack McLaughlin
Applied Physics Laboratory, Univ. of Washington, Seattle, WA 98105
jackm@apl.washington.edu

Douglas A. Reynolds
MIT Lincoln Laboratory, Lexington, MA 02420-9185
dar@sst.ll.mit.edu

ABSTRACT
As ever greater numbers of telephone transactions are conducted solely between a caller and an automated answering system, the need increases for software which can automatically identify and authenticate these callers without the need for an onerous speaker enrollment process. In this paper we introduce and investigate a novel speaker detection and tracking (SDT) technique, which dynamically merges the traditional enrollment and recognition phases of the static speaker recognition task. In this speaker recognition application, no prior speaker models exist, and the goal is to detect and model new speakers as they call into the system while also recognizing utterances from previously modeled callers. New speakers are added to the enrolled set of speakers, and speech from speakers in the currently enrolled set is used to update their models. We describe a system based on a GMM speaker identification (SID) system and develop a new measure to evaluate the performance of the system on the SDT task. Results for both static, open-set detection and the SDT task are presented using a portion of the Switchboard corpus of telephone speech. Static open-set detection produces an equal error rate of about 5%. As expected, performance for SDT is quite varied, depending greatly on the speaker set and the ordering of the test sequence. These initial results, however, are quite promising and point to potential areas in which to improve system performance.

1. INTRODUCTION
The general task of speaker recognition traditionally consists of two phases: enrollment and recognition. During enrollment, training speech collected from a speaker is used to train his/her model. The static collection of speaker models is then used during the recognition phase to either identify (closed-set) or verify (open-set) the speaker in an input speech utterance. In this paper we introduce and investigate a detection and tracking task in which utterances presented to our system must be either enrolled or recognized in a new and dynamic fashion.

*This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government. J. McLaughlin was a member of the MIT LL staff when this work was performed.

In this task, no prior speaker models exist and the goal is to detect and model new speakers from a stream of single-speaker input utterances while also recognizing utterances from previously modeled speakers. The system must operate in an open-set identification mode, where it is first determined whether a new utterance matches a previously modeled speaker or is from a new speaker. If the utterance matches a speaker in the currently enrolled group, then the identified speaker's model is updated using the utterance. If the utterance is from a new speaker, then a new model is created and the enrolled group is updated. This task is more challenging than the static speaker recognition task in several respects. First, decisions at any point in time have a direct effect on future decisions and performance, since speaker models are evolving and the enrolled group of alternative models is changing. Second, performance is difficult to characterize, since we now have several different types of errors to catalog and performance can be highly dependent on the ordering of test sequences.

Detection and tracking has been used heavily in the transcription of broadcast news corpora, where the goal is to automatically produce a clean text transcript of the audio portion of a broadcast [1]. The accuracy of word recognizers can be increased substantially if the acoustic characteristics of the incoming audio are detected and tracked so as to steer like audio segments to recognizers specially tuned for those acoustic conditions. Similar segments may have a common speaker as well as background noise [2][3], or may be alike in their speaker attribute alone [4] or possibly in both speaker and language attributes [5]. Another application of this detection and tracking technology is automatic message routing based on caller identification. For message routing, an incoming call could be identified as coming from a previous or a new customer. Previous customers would be handled in a personalized manner, whereas new customers' information would be entered into the system and new speaker models automatically generated. The importance of updating customer models over time has been illustrated [6], and thus adapting such models as new data becomes available is an important part of any SDT system.

As a development corpus for this task, we have chosen to use a portion of the NIST 1999 Speaker Identification Evaluation corpus. This corpus is derived from the larger Switchboard-II phase 3 conversational telephone corpus collected by the Linguistic Data Consortium (LDC) [7].


The remainder of this paper is organized as follows. In Section 2 we describe in more detail the operation of our detection and tracking system. The system is built upon a Gaussian mixture model (GMM) based speaker verification system that is also described. The following section discusses the evaluation of our system, first describing the characteristics of the evaluation corpus and then detailing a novel evaluation metric for measuring the performance of the SDT system. Section 4 provides performance results for a static detection task and shows the results of our SDT system using the new metric. Section 5 wraps up with some conclusions and a discussion of future directions.

2. DETECTION AND TRACKING SYSTEM


In designing our SDT system, our goal is to process a series of single-speaker telephone transactions such that, at any point in time, we can associate each past message with a label indicating the speaker that originated it. As a new message enters the system, it is treated very much as a test message in open-set SID would be. An initial decision is made concerning whether the message came from a speaker within the enrolled set. If so, then we must decide which speaker the message came from, as in open-set SID. In SDT, we take the additional step of updating the model of the identified speaker by recalculating that model using all the messages previously associated with that speaker and the new message. In the event that the new message is determined to have come from an unenrolled speaker, we cannot merely reject the message as we would in the open-set task. Because we need to be able to identify that speaker in the future should he call again, it is necessary to create a model for that speaker using the new message. This speaker then becomes a member of the enrolled set.

To determine if a new message comes from a speaker in the enrolled set, the message is scored against each existing model. These scores are then normalized using the score for that message against a universal background model [8] before comparison with a preset threshold θ:

M_i(x) − B(x) ≥ θ  →  enrolled speaker
M_i(x) − B(x) < θ  →  unenrolled speaker

Here, M_i is the score for model i for the set of feature vectors x extracted from the message, and B is the score for the universal background model. Note that if any one of the normalized scores exceeds the threshold, then this is sufficient to accept the message as coming from a speaker who is already enrolled. In this regard, the existing models can be viewed as a set of independently operating detectors, any one of which can alarm to indicate that the current speech file belongs to an enrollee. If an enrollee is detected, he is identified by choosing the model with the highest score.
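To make the decision rule concrete, the sketch below walks through one SDT step: score the incoming message against every enrolled model, normalize by the UBM score, and either identify-and-update an enrolled speaker or enroll a new one. This is an illustrative reconstruction rather than the authors' code; in particular it retrains a small scikit-learn GMM per speaker instead of MAP-adapting 2048-mixture models from the UBM as in the paper, and all function and variable names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def sdt_step(feats, ubm, speaker_models, speaker_feats, theta, n_mix=32):
    """One SDT step for a new single-speaker message (illustrative sketch only).

    feats          : (T, D) feature vectors of the incoming message
    ubm            : GaussianMixture fit on background (impostor) speech
    speaker_models : dict label -> GaussianMixture for each enrolled speaker
    speaker_feats  : dict label -> list of feature arrays already assigned
    theta          : preset detection threshold
    """
    b = ubm.score(feats)  # average per-frame log-likelihood B(x)

    # UBM-normalized score M_i(x) - B(x); each enrolled model is an
    # independent detector.
    norm = {spk: m.score(feats) - b for spk, m in speaker_models.items()}

    if norm and max(norm.values()) >= theta:
        # Some detector fired: identify the speaker by the best score and
        # update that model with all speech now associated with that speaker.
        spk = max(norm, key=norm.get)
        speaker_feats[spk].append(feats)
        data = np.vstack(speaker_feats[spk])
    else:
        # No detector fired: enroll the caller as a new speaker.
        spk = f"spk{len(speaker_models):03d}"
        speaker_feats[spk] = [feats]
        data = feats

    # Simplification: retrain a small diagonal-covariance GMM instead of
    # MAP-adapting from the UBM as described in the paper.
    speaker_models[spk] = GaussianMixture(n_mix, covariance_type="diag").fit(data)
    return spk
```

Only the model-training shortcut differs from the text above; the UBM normalization and the any-detector-above-threshold rule follow the decision logic as described.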

At the heart of our detection and tracking system is an open-set SID system. Our speech data, originally sampled at 48 kHz, is downsampled to an 8 kHz sampling rate. The sampled speech is broken up into frames using a 20 msec window which slides by 10 msec, and from each frame a 20-element mel-frequency cepstral coefficient vector is extracted after discarding cepstra outside the telephone band. Vectors composed of these cepstra and delta cepstra are used to build a Gaussian Mixture Model (GMM) for each enrolled speaker. To speed processing when running the SDT system, feature vectors were decimated by a factor of 10; this procedure is documented in detail in [9] and has surprisingly little effect on accuracy. The 2048-mixture background model is trained using Switchboard data from the 1997 NIST evaluation, and the speaker models, also having 2048 mixtures, are adapted from this universal model as discussed in [8]. Several different values of the threshold θ were tried, all spaced around the point that yielded equal false alarm and miss probabilities in our open-set tests with the evaluation corpus (see below).
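The following is a minimal front-end sketch of the framing, cepstral analysis, and decimation described above, assuming the python_speech_features toolkit. The filterbank size and the 300-3300 Hz band edges are illustrative stand-ins for the paper's step of discarding cepstra outside the telephone band, not values taken from the paper.

```python
import numpy as np
from python_speech_features import mfcc, delta


def front_end(signal, sample_rate=8000, decimation=10):
    """Illustrative front end: 20-element MFCCs from 20 ms frames advanced by
    10 ms, plus delta cepstra, with the frame stream decimated by a factor of
    10. Toolkit choice and band edges are assumptions, not the authors' code.
    """
    cepstra = mfcc(signal, samplerate=sample_rate, winlen=0.020, winstep=0.010,
                   numcep=20, nfilt=24, lowfreq=300, highfreq=3300)
    deltas = delta(cepstra, 2)            # first-order delta cepstra
    feats = np.hstack([cepstra, deltas])  # (T, 40) feature vectors
    return feats[::decimation]            # keep every 10th frame to cut scoring cost
```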

3. SYSTEM EVALUATION

1999 Evaluation Corpus


System development was done using the NIST 1999 Speaker Identification Evaluation corpus. Culled from the Switchboard-II phase 3 telephone speech collection, this corpus contains both single-speaker and multi-speaker utterances; we used only the single-speaker, male utterances. In addition, though these utterances were recorded from both electret and carbon button handsets, we chose to use only the electret utterances in order to eliminate the known problem of mismatch between training and test data. From what remained, we worked with the 403 utterances from the 60 speakers with the greatest number of utterances. These ranged in duration from nearly zero length to a minute, with an average duration of 30 seconds.

Evaluation Metric
In order to assess performance as changes are made to the system, it is essential to have some metric which considers the gravity of each type of error. We propose a scheme which assesses penalty points for each of three different errors:

A) Missed model: occurs when a new speaker arrives and we fail to detect him, so we miss creating a model for this new speaker.

B) Duplicate model: occurs when we create a model for a speaker who is already enrolled and has a primary model.

C) Misidentification: occurs when a message is correctly assessed as belonging to an enrolled speaker, but the system fails to assign it to the correct speaker's primary or duplicate model(s).

Another possible result of classification is that an utterance is associated with a duplicate model. Since the system, in this case, is associating speech from a particular speaker with a model for that speaker, we count this as correct and assess no penalty points. For each message entering our system, then, we must either make one of the above three errors or be correct. Points are assigned on a message-by-message basis as shown in Table 1. The missed model and duplicate model errors receive the largest number of penalty points because both result in the system having an incorrect perception of the number of speakers present in the run, and in our application we would like, at the very least, to maintain an accurate count of the number of callers the system has seen. Of the two, the missed model error seems more serious because it is more difficult to recover from. The duplicate model error could largely be recovered from by recognition, on the part of a human operator or the software, that a particular speaker has more than one model representing him.

Table 1: Penalty Point Assignments


Error type        Penalty points
Missed model      3
Duplicate model   2
Mis-ID            1
Correct           0
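As an illustration of how these penalties might be tallied, the sketch below classifies a single system decision against ground truth and returns its penalty from Table 1. The bookkeeping structures (which true speaker each model represents, which true speakers already have a model) are assumptions about one possible scoring implementation, not the authors' evaluation code.

```python
# Penalty points from Table 1.
PENALTY = {"missed_model": 3, "duplicate_model": 2, "mis_id": 1, "correct": 0}


def score_decision(true_spk, hyp_spk, enrolled_true_spks, model_owner):
    """Classify one SDT decision and return its penalty (illustrative sketch).

    true_spk           : ground-truth speaker of the incoming message
    hyp_spk            : model label the system assigned, or None if the
                         system enrolled the message as a new speaker
    enrolled_true_spks : set of true speakers already represented by a model
    model_owner        : dict mapping model label -> true speaker it models
    """
    if hyp_spk is None:                        # system created a new model
        if true_spk in enrolled_true_spks:
            return PENALTY["duplicate_model"]  # speaker already had a model
        return PENALTY["correct"]              # genuinely new speaker
    # System assigned the message to an existing model.
    if true_spk not in enrolled_true_spks:
        return PENALTY["missed_model"]         # new speaker went undetected
    if model_owner[hyp_spk] == true_spk:
        return PENALTY["correct"]              # primary or duplicate model of the speaker
    return PENALTY["mis_id"]                   # wrong enrolled speaker
```

Summing these penalties over a run and dividing by the number of messages seen so far gives the normalized score reported in Section 4.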

4. RESULTS
As a means of assessing the performance of our SDT system's underlying open-set SID system on the 60-speaker subset of our corpus, models were trained for each of the speakers using approximately one minute of training data per model, taken from the training portion of the Eval 99 dataset. Testing was performed using close to 1500 utterances of varying durations but averaging 30 seconds in length. The accuracy of an open-set SID system is dependent upon the number of speakers enrolled, but we would like to have some measure of the goodness of our system which is independent of this factor. A detection curve is one such measure. Figure 1 shows the detection curve resulting from the training and testing described above. The false alarm probability describes the percentage of test messages whose normalized score exceeds the threshold when scored against a model for a speaker who did not actually produce the message. This is plotted against the miss probability, which is the percentage of messages that fell below the threshold when they were scored against the model for the speaker that actually did produce the message. Sweeping over a range of thresholds produces the detection curve. In essence, this is a plot of the aggregate open-set result for an enrolled group of size one, with each speaker in the corpus in turn serving as the enrolled speaker.

Also given in Figure 1 are the detection curves for all the single-speaker, male utterances regardless of handset type and for all males using electret handsets. We see that, as expected, results are considerably worse when two handset types are involved and no compensation is performed. It was to avoid the complications of such compensation that we excluded the carbon button handset utterances from our development data. Note also that the 60-speaker subset we have chosen performs slightly better than the full 231 male talkers, but is more or less reflective of the entire set.

To evaluate our SDT system using our telephone speech corpus, the 403 utterances of our 60-speaker subset were used to create 52 sets of messages (or 52 scenarios, in our terminology) by making 52 random draws of 300 messages. Because the ordering of messages has a profound effect on performance with an SDT system, utterances within each scenario were randomly sorted to assure a variety of evolutions. Each 300-message scenario contained an average of 50 speakers, with a minimum of 41 and a maximum of 60. Each set was then processed, in its random order, through our system, and penalty points were totaled with the introduction of each new utterance. This penalty point score was then normalized by dividing by the total number of utterances seen by the system up to that point. Figure 2 shows the score for the best, worst, and average of these scenarios for each of the preset thresholds used, after processing of the first 100 utterances of each scenario. Analysis of the results points out two important aspects of the problem of detection and tracking on a sequential set of utterances.
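The detection curve itself can be traced by sweeping a threshold over the normalized scores of target trials (a message scored against its true speaker's model) and nontarget trials (a message scored against other speakers' models), as in the hypothetical sketch below; the score arrays are assumed inputs rather than data from the paper.

```python
import numpy as np


def det_points(target_scores, nontarget_scores, thresholds):
    """Miss and false-alarm probabilities for each threshold (illustrative sketch).

    target_scores    : normalized scores of messages against their true speaker's model
    nontarget_scores : normalized scores of messages against other speakers' models
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])   # P(miss)
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # P(false alarm)
    return p_miss, p_fa


# Example usage with hypothetical score arrays:
# thresholds = np.linspace(-1.0, 1.0, 201)
# p_miss, p_fa = det_points(target_scores, nontarget_scores, thresholds)
# eer_idx = np.argmin(np.abs(p_miss - p_fa))   # equal error rate operating point
```

The equal error rate quoted in the abstract (about 5% for the static open-set task) corresponds to the point on this curve where the two probabilities coincide.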
Figure 1: Detection curves for all males (using both carbon and electret handsets), all males using electret handsets only, and the 60-speaker electret subset used for development.

Figure 2: Normalized score evaluated after 100 utterances for several different thresholds, for the best of the 52 scenarios, the worst and the average. A threshold setting of 0 yields the minimum score, on average.


First is the wide variability in performance between the best and worst scenarios. This may be due not only to differences in the number of speakers in the scenarios, but also to the length and sound quality of the utterances. Short, noisy utterances will tend to hurt system performance, particularly if such messages occur early in the scenario, as this will lead to poor speaker models. On the other hand, long, clean messages, if correctly classified, will result in ever-improving speaker models as time goes on for our dynamic system. The second thing we observe is that a threshold seems to exist which yields the lowest score. This threshold (somewhere around 0) strongly discourages false alarms. This is important because, as noted earlier, each model functions independently as a detector, and if any one of them false alarms, then we will commit a missed model error. By biasing against this most expensive of errors, we reduce our score at the expense of committing less costly duplicate model errors. At threshold values much beyond 0, however, the large number of duplicate model errors being committed more than offsets the decrease in missed model errors, and the score rises.


5. CONCLUSIONS, FUTURE DIRECTIONS


Using our GMM SID system, we have demonstrated reasonable performance on an open-set speaker recognition task using a portion of the Switchboard telephone speech corpus. Building upon this system, we have been able to construct a speaker detection and tracking system which is capable of creating models on the fly as data comes in. To evaluate the performance of this new, dynamic system, we have proposed a metric which takes into account considerations not relevant in static, open-set evaluations.

Our results indicate a number of directions for future work. A critical problem with a sequential classifier such as ours is the tendency for errors to compound as time goes on. Figure 3 illustrates this effect. The top half of the figure shows score over time while the bottom half shows error (all misclassifications being counted equally). Both of these measures increase dramatically as additional utterances are processed. This highlights the importance of high accuracy when adapting models with new data. We have observed that as models are adapted, the optimal detection threshold changes. Thus, a dynamic threshold and/or a model score normalization to stabilize the threshold should offer improvements over our current, fixed threshold. At present, each new message is immediately incorporated into a model upon identification of the speaker (or used to create a new model in the event a new speaker is detected). Such a decision need not be made immediately, but can be deferred if a message is deemed unreliable. Alternatively, a message can be made to contribute to an existing model only in proportion to its goodness for SID. Methods for such utterance assessment have been proposed [10][11] and could be incorporated into our system.
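As one speculative illustration of the "contribute in proportion to its goodness" idea, the sketch below scales a message's sufficient statistics by a confidence weight before updating mixture means. This is a hypothetical example of confidence-weighted adaptation, not a method evaluated in the paper; the confidence value itself would come from an utterance-assessment measure such as those in [10][11].

```python
import numpy as np


def weighted_mean_update(means, counts, posteriors, feats, confidence):
    """Speculative sketch of confidence-weighted mean adaptation for one message.

    means      : (M, D) current mixture means
    counts     : (M,) effective data counts accumulated per mixture
    posteriors : (T, M) frame-to-mixture posterior probabilities for the message
    feats      : (T, D) feature vectors of the message
    confidence : scalar in [0, 1]; 0 ignores the message, 1 uses it fully
    """
    # Scale the message's zeroth- and first-order statistics by the confidence.
    n_m = confidence * posteriors.sum(axis=0)        # (M,)
    f_m = confidence * (posteriors.T @ feats)        # (M, D)

    # Blend the weighted new statistics into the running mean estimates.
    new_counts = counts + n_m
    new_means = (means * counts[:, None] + f_m) / np.maximum(new_counts[:, None], 1e-8)
    return new_means, new_counts
```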

Figure 3: Normalized score (top) and percent error (bottom), averaged over 52 scenarios and evaluated every 20 utterances.

REFERENCES

[1] C.L. Wayne, "Topic Detection & Tracking (TDT): Overview & Perspective," Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[2] S.S. Chen and P.S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[3] T. Hain, S.E. Johnson, A. Tuerk, P.C. Woodland and S.J. Young, "Segment Generation and Clustering in the HTK Broadcast News Transcription System," Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[4] M. Nishida and Y. Ariki, "Real Time Speaker Indexing Based on Subspace Method - Application to TV News Articles and Debate," Int. Conf. on Spoken Language Processing, vol. 4, pp. 1347, 1998.
[5] D.A. Reynolds et al., "Blind Clustering of Speech Utterances Based on Speaker and Language Characteristics," Int. Conf. on Spoken Language Processing, 1998.
[6] W. Mistretta and K.R. Farrell, "Model Adaptation Methods for Speaker Verification," Int. Conf. on Acoustics, Speech and Signal Processing, vol. 1, pp. 113-116, 1998.
[7] Linguistic Data Consortium, http://www.ldc.upenn.edu.
[8] D. Reynolds, "Comparison of Background Normalization Methods for Text-Independent Speaker Verification," Eurospeech 97, pp. 963-967, 1997.
[9] J. McLaughlin, D.A. Reynolds and T. Gleason, "A Study of Computation Speed-Ups of the GMM-UBM Speaker Recognition System," Eurospeech 99, pp. 1215-1218, 1999.
[10] J. Thompson and J.S. Mason, "The Pre-detection of Error-prone Class Members at the Enrollment Stage of Speaker Recognition Systems," Proc. ESCA-94 Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 127-130, 1994.
[11] S. Ong and M.P. Moody, "Confidence Analysis for Text-Independent Speaker Identification Using Statistical Feature Averaging," Applied Signal Processing, vol. 1, pp. 166-175, 1994.

