
Speech Endpoint Detection Based on Sub-band Energy and Harmonic Structure of Voice

Yanmeng Guo, Qiang Fu, and Yonghong Yan


ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080 {yguo,qfu,yyan}@hccl.ioa.ac.cn

Abstract. This paper presents an algorithm for speech endpoint detection in noisy environments, especially those with non-stationary noise. The input signal is first decomposed into several sub-bands. In each sub-band, an energy sequence is tracked and analyzed separately to decide whether a temporal segment is stationary or not. A voiced-speech detection algorithm based on the harmonic structure of voice is proposed and applied to each non-stationary segment to check whether it contains speech. The endpoints of speech are finally determined by combining the energy detection and the voice detection. Experiments in real noise environments show that the proposed approach is more reliable than some standard methods.

1 Introduction

Speech endpoint detection (EPD) is the task of detecting the beginning and ending boundaries of speech in an input signal, and it is important in many areas of speech processing. Accurate endpoint detection is crucial for speech recognition, both for improving recognition accuracy and for reducing computational complexity. Endpoint detection discriminates speech from noise using features of the signal, such as energy [1][2], entropy [3][4], LSPE [5], and statistical properties [6][7][8]. Some methods treat speech and noise as separate classes and detect speech with models of each. These methods perform well in specific environments, but degrade rapidly when the models mismatch the environment. If the discrimination is instead based on heuristically derived rules over signal features, its performance depends directly on those properties, and it adapts more easily to unknown environments. For practical speech recognition, it is critical to detect speech reliably under diverse circumstances. This paper develops a robust endpoint detection method that combines the advantages of several features through rules. Short-time energy is the most widely used parameter in endpoint detection [1][2][9][10], but it is not sufficient when the noise level is high.

This work is partly supported by the Chinese 973 program (2004CB318106), the National Natural Science Foundation of China (10574140, 60535030), and the Beijing Municipal Science and Technology Commission (Z0005189040391).

Fortunately, the spectral energy distributions of speech and noise often differ, so speech is not corrupted equally at different frequencies. This fact is exploited here by tracking and analyzing the energy in 4 sub-bands, and giving more weight to the sub-bands with drastic energy variation. Another shortcoming of the energy parameter is that it misclassifies high-level noise as speech when the noise is time-varying. In this paper, this problem is solved by introducing voice detection: if a non-stationary segment contains voiced speech, speech is detected; otherwise, the segment is classified as noise. Detecting voiced speech is also an important strategy for distinguishing speech from noise. Generally, voiced speech can be detected by tracking pitch [11], measuring periodicity [5][12][13], etc., but those methods are often disturbed by low-frequency noise or abrupt changes of noise. However, voice has an obvious harmonic structure in the frequency domain even in very noisy cases, and this paper proposes a robust algorithm that detects voice by adaptively checking for such structure. This paper is organized as follows. The proposed algorithm is described in Section 2. Section 3 evaluates and analyzes its performance. The conclusion is given in Section 4.

2 Algorithm

Assuming that the speech and the additive noise are independent, the short-time energy of the input signal is given by E_x = E_s + E_n, where E_s and E_n represent the energy of speech and noise, respectively. Thus the position of the speech signal can be determined by searching for the segments where E_x > E_n. However, the noise may be non-stationary, and its energy E_n can hardly be estimated precisely. To solve this problem, we classify the input signal into two categories: the stationary component, which is assumed to exist all the time, and the non-stationary component, which may contain speech, noise, or both. The stationary component can be tracked using the model described in Sect. 2.2. The voice detection method proposed in Sect. 2.3 is then applied to the non-stationary segments to detect speech. The structure of the algorithm is shown in Fig. 1.

2.1 Preprocessing

The 8 kHz sampled noisy speech is divided into L frames, each 20 ms long with 50% overlap. After applying a window function and an N-point short-time Fourier transform (STFT), with N = 256, the energy of the kth frequency bin in the ith frame is derived from the spectrum and denoted P_i(k), where 0 ≤ k < N/2. Setting borders at {0, 500, 1000, 2000, 4000} Hz, the signal is divided into 4 non-overlapping sub-bands. The energy of sub-band m in frame i is then obtained by summing the energy of its frequency components, and denoted E_{x,m}(i), where m = 0, 1, 2, 3 and i = 1, 2, ..., L.
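The preprocessing above can be sketched as follows. This is a minimal illustration, not the authors' code; the Hamming window is an assumption, since the paper does not name its window function.

```python
import numpy as np

def subband_energies(x, fs=8000, frame_ms=20, nfft=256,
                     borders=(0, 500, 1000, 2000, 4000)):
    """Per-frame energy E_{x,m}(i) in the 4 sub-bands of Sect. 2.1.

    Frames are 20 ms long with 50% overlap; each frame is windowed
    (Hamming assumed) and analysed with an N-point STFT, N = 256.
    Returns an array of shape (L, 4), one row per frame.
    """
    flen = fs * frame_ms // 1000           # 160 samples per frame
    hop = flen // 2                        # 50% overlap
    win = np.hamming(flen)
    n_frames = max(0, (len(x) - flen) // hop + 1)
    # frequency of FFT bin k is k * fs / nfft; map each bin to a sub-band
    bin_freq = np.arange(nfft // 2) * fs / nfft
    band = np.searchsorted(borders, bin_freq, side="right") - 1
    band = np.clip(band, 0, len(borders) - 2)
    E = np.zeros((n_frames, len(borders) - 1))
    for i in range(n_frames):
        frame = x[i * hop: i * hop + flen] * win
        P = np.abs(np.fft.rfft(frame, nfft)[: nfft // 2]) ** 2   # P_i(k)
        for m in range(len(borders) - 1):
            E[i, m] = P[band == m].sum()                         # E_{x,m}(i)
    return E
```

For a pure 250 Hz tone, nearly all the per-frame energy lands in sub-band 0 (0-500 Hz), as expected.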

[Figure 1: flowchart with nodes: Input Signal, Update Noise Model, Stationary Noise? (Y/N), Detect Voice, Contain Voice? (Y/N), Search Endpoints, Output Endpoints]

Fig. 1. Flowchart of the proposed algorithm

2.2 Energy Detection

Based on the character of its energy sequence, additive noise is classified here into 5 classes: stable noise, slow-varying noise, impulse noise, fluctuant noise, and step noise. All kinds of noise are assumed independent and additive, and their sum is the input noise. Stable noise, such as thermal noise or the noise of a running machine, has a basically stable energy distribution, and its energy sequence follows an ergodic Gaussian distribution. Slow-varying noise denotes noise whose energy distribution changes slowly; the noise of blowing wind or of an approaching car belongs to this kind, and over a short interval it can be approximated as stable noise. Impulse noise covers noise whose energy rises and falls rapidly, so that its energy is nonzero only over a short period; typical examples are smacks and clicks. Fluctuant noise has varying energy all the time; it includes babble noise, continual bumps in a car, and the noise of several passing vehicles. Step noise is noise whose energy distribution changes abruptly, like a step; it includes the noise of turning on a machine as well as abrupt changes in a telecommunication channel, and it can be classified as stable, slow-varying, or fluctuant noise before and after the step. Accordingly, the noise energy of sub-band m in frame i is expressed as E_{n,m}(i) = E_{p,m}(i) + E_{q,m}(i) over a duration of 100-300 ms, which is about the length of a syllable. E_{p,m}(i) is an ergodic stationary Gaussian random sequence composed of the stable noise, the slow-varying noise, and the stationary sections of step noise, and E_{q,m}(i) is a non-stationary sequence made up of the other noise. Hence, in the total energy E_{x,m}(i) = E_{s,m}(i) + E_{p,m}(i) + E_{q,m}(i), both {E_{s,m}(i)} and {E_{q,m}(i)} are non-stationary sequences that are difficult to discriminate by energy alone, and that is the reason for applying the voice detection.

Stationary noise modeling. An adaptive model is set up to track the stationary noise in each sub-band.
For clarity, we omit the sub-band index m hereafter in the description of model initialization and update. Let {E_p(i)} denote the energy sequence of the stationary noise in sub-band m; over a short period its probability density function is

f(E_p) = (1/(√(2π) σ)) exp(−(E_p − μ)² / (2σ²)),

where μ and σ are the mean and standard deviation, respectively. Define the normalized variance λ = σ/μ; then

f(E_p) = (1/(√(2π) λμ)) exp(−(E_p/μ − 1)² / (2λ²)),

where λ represents the relative dynamic range. {E_p(i)} is the only stationary component in {E_x(i)}, so its distribution can be estimated in the segments where {E_x(i)} is stationary. However, {E_p(i)} remains stationary and ergodic only over a short period, and it dominates the signal for even less time. Therefore, its distribution is assumed stable over 200-300 ms (l frames), and μ and λ can be estimated from the beginning 80-120 ms of that signal (r frames).
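A minimal sketch of this parameter estimation follows. The threshold form T = μ + σ/α is an assumed reading of the paper's (garbled) formula; the variable names are illustrative only.

```python
import math

def init_noise_model(E, r, alpha=0.5):
    """Estimate the stationary-noise model from the first r frames of a
    sub-band energy sequence E.

    mu is the mean energy, sigma the standard deviation, and
    lam = sigma / mu the normalized variance.  The detection threshold
    T = mu + sigma / alpha (0 < alpha < 1, the sensitivity coefficient)
    is an assumed reconstruction of the paper's threshold formula.
    """
    head = E[:r]
    mu = sum(head) / r
    sigma = math.sqrt(sum((e - mu) ** 2 for e in head) / r)
    lam = sigma / mu if mu > 0 else 0.0
    T = mu + sigma / alpha
    return mu, lam, T
```

A constant energy sequence yields λ = 0 and T = μ, while a fluctuating one raises both λ and the threshold.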

[Figure 2: the jth analysis window consists of l frames; the jth model is updated from its first r frames, and energy detection is applied to the remaining l − r frames; when a new frame is input, one frame is deleted, the window shifts by one frame, and the (j+1)th model is computed]

Fig. 2. Strategy of updating the analysis window

Accordingly, set the analysis window to l frames and calculate the model parameters from its beginning r frames. Then set the energy threshold to T = μ + σ/α, and apply it to test the latter l − r frames, where α is the sensitivity coefficient and 0 < α < 1. When a new frame is input, the analysis window shifts by one frame, and the model is updated to calculate a new T, as shown in Fig. 2.

Model initialization and update. The model of each sub-band is initialized in the first analysis window from the energy of its beginning r frames. Set μ to their mean, μ_1 = (1/r) Σ_{i=1..r} E_x(i), and set λ to the normalized variance, λ_1 = (1/μ_1) [ (1/r) Σ_{i=1..r} (E_x(i) − μ_1)² ]^{1/2}. The initial signal may contain non-stationary components, and the distribution of {E_p(i)} is also time-varying, so the model is adjusted in all following analysis windows to track the distribution of {E_p(i)}. Take the jth analysis window as an example, as shown in Fig. 2: compute the mean and normalized variance of its beginning r frames, denoted μ_j and λ_j, and then update μ and λ according to the following 5 cases.

1. The input signal occasionally contains short silences, or only a constant component because of hardware errors. Hence, if μ_j < μ_sil, set μ = μ_sil, where μ_sil is the experimental minimum of μ.
2. For the same reason, if λ_j < λ_sil and μ_j < μ_{j−1}, set λ = λ_sil, where λ_sil is the experimental minimum of λ.

3. If μ_j < cμ and λ_j < cλ, then set μ = μ_j and λ = λ_j, where c is a constant and 1 < c < 1.5. This tracks decreasing or slowly varying noise.
4. If λ_j < cλ and μ_j < μ_{j−1} < μ_{j−2}, the noise is becoming stationary and its level is lower, so set μ = μ_j and λ = λ_j.
5. If μ_j(1 + λ_j) < μ(1 + λ), the noise is decreasing too, so set μ = μ_j and λ = λ_j as well.

The above cases are checked one by one; once a condition is met, the parameters are updated by it and all following cases are neglected. If no condition is met, the current μ and λ are kept.

Band selection and threshold setting. The presence of speech raises the energy level in every sub-band, and in most cases this is obvious in the sub-bands dominated by speech. Hence, non-stationary signal is detected in the latter l − r frames of the current analysis window. If the mean energy of r consecutive frames in a sub-band is higher than T, that sub-band detects non-stationary signal. For the same r consecutive frames, if 2 of the 4 sub-bands detect non-stationary signal and the mean energy in the other 2 sub-bands is higher than μ, the non-stationary signal is confirmed.

2.3 Voice Detection Based on Harmonic Structure

The voice detection is carried out in the current analysis window after non-stationary signal is detected. In general, voice is modeled as the output of the vocal tract excited by a periodic glottal flow, so the short-term spectrum of voice has energy peaks at the pitch and harmonic frequencies. Because pitch varies slowly, this is reflected in narrow-band spectrograms as parallel bright lines. Most noise does not have this character, so checking harmonics is an effective method for voice detection. The harmonic components dominate the energy of voice, so the harmonic character remains prominent even with background noise. However, the spectral energy envelope of speech varies with pitch and formants, and the energy distribution of noise is also time-varying. Hence, the speech spectrum is not corrupted equally at different frequencies, and the bands with clear harmonic character are also time-varying. In this paper, voice is detected by an adaptive method that searches for clear harmonic character in a wide band, and the information of neighboring frames is considered as well. This strategy remains robust against distortions, low-frequency noise, and pitch-tracking errors.

Peak picking in a frame. The spectral energy of voiced speech usually has peaks at the harmonics, which are multiples of the fundamental frequency. However, some peaks will be submerged by noise, while many spurious peaks are introduced. Fortunately, under most circumstances at least 3-4 consecutive harmonics remain clear; that is, a frame of corrupted voice has 3-4 spectral energy peaks spaced by the fundamental frequency (60-450 Hz). To detect the harmonics in frame i, peaks are picked from P_i(k) as follows.

1. Extract all the local peaks in the spectrum.
2. Eliminate the peaks lower than an experimental threshold, to delete some peaks caused by noise.
3. Merge the trivial (low and narrow) peaks into the dominant ones nearby.
4. Eliminate the remaining peaks with relatively small height or width.

Matching peaks with harmonics. If frame i contains voice, there will be spectral energy peaks at the multiples of the fundamental frequency. We try various values of F0 to see whether the picked peaks {Q_i(n)} match its multiples, incrementing F0 in steps of ΔF = 1.5 Hz within the range [60 Hz, 450 Hz]. If {Q_i(n)} contains peaks at the positions of at least 4 consecutive harmonics, or 3 peaks matching the 1st, 2nd, and 3rd multiples of F0, then the peaks are recorded as potential harmonics for F0. It is assumed that every frame contains the voice of at most one speaker. To eliminate spurious harmonics, the continuity of F0 and its harmonics is checked. For consecutive frames numbered i_b to i_e, if F0 fluctuates within a limited extent and its harmonics, n·F0 to (n+3)·F0, are all matched in those frames, then frames i_b to i_e are detected as voice. The case n = 0 means that the harmonics F0, 2F0, and 3F0 are concerned.

2.4 Speech Endpoint Determination
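Before turning to endpoint determination, the per-frame harmonic matching of Sect. 2.3 can be sketched as follows. This is an illustrative simplification, not the paper's exact procedure: the matching tolerance `tol` and the number of harmonics tested are assumptions, and the continuity check across frames is omitted.

```python
def find_f0_by_harmonics(peak_freqs, f0_lo=60.0, f0_hi=450.0,
                         step=1.5, tol=20.0):
    """Scan candidate fundamentals F0 in [60, 450] Hz with a 1.5 Hz step
    and test whether the picked spectral peaks (in Hz) match either
    peaks at 1x, 2x and 3x F0, or any 4 consecutive harmonics.

    Returns the first matching F0 in Hz, or None if no candidate fits.
    """
    def has_peak_near(f):
        return any(abs(p - f) <= tol for p in peak_freqs)

    f0 = f0_lo
    while f0 <= f0_hi:
        hits = [has_peak_near(n * f0) for n in range(1, 9)]
        # peaks at the 1st, 2nd and 3rd harmonics ...
        if hits[0] and hits[1] and hits[2]:
            return f0
        # ... or any run of 4 consecutive matched harmonics
        for s in range(len(hits) - 3):
            if all(hits[s:s + 4]):
                return f0
        f0 += step
    return None
```

Peaks at 200, 400, and 600 Hz match a fundamental near 200 Hz, while a single isolated peak matches no harmonic grid.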

Start point determination. As shown in Fig. 1, if voice exists in the analysis window, the non-stationary signal is confirmed as speech, and its start point is searched in every sub-band based on energy. Take sub-band m as an example (the index m again omitted): after voice is found in the analysis window, the start point of speech is searched, using μ_i and λ_i, from the first voiced frame backward. If a frame satisfies μ_i > μ_{i+1} and μ_i > T, it is detected as the start point for this sub-band. The earliest start point over all sub-bands is taken as the temporary start frame b_s. The onset of speech usually increases the signal energy abruptly, so the noise model stays stable at the beginning of speech, and the energy threshold stays low. Moreover, voiced speech usually follows unvoiced speech and has much higher energy, so the temporary start point probably lies in the unvoiced section or before it. To refine the start point, the second step is to find the local maximum of λ_i near frame b_s and set it as the beginning point. If a sub-band has no λ_i maximum within a predefined boundary, it keeps b_s. Finally, the earliest start frame over the sub-bands is taken as the start of speech.

End point determination. The end point is searched after the start point is found. The end threshold is initialized as T = μ + σ. The parameters μ and λ are updated from μ_i and λ_i based only on criteria 1, 2, and 5 whenever a new frame shifts into the analysis window; thus the noise model can be updated from the weaker or more stationary noise in speech pauses. For every sub-band, if none of the successive l frames in the analysis window satisfies the rule that every frame has energy higher than T, where l is an experimental threshold in the range 8 < l < 20, then the end frame is detected as the first frame of that analysis window. If 3 sub-bands detect an end point and no voice exists in the analysis window, the endpoint is determined.
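A heavily simplified, single-band version of the endpoint search can be sketched as follows. This is not the paper's multi-band, voice-gated rule: it only illustrates the underlying idea of taking the first and last sufficiently long runs of above-threshold frames as the start and end points; `min_run` is an assumed parameter.

```python
def find_endpoints(E, T, min_run=3):
    """Single-band endpoint sketch: speech starts at the first frame of
    the first run of min_run consecutive frames with energy above T, and
    ends at the last frame of the last such run.

    Returns (start, end) frame indices, or None if no run is found.
    """
    runs, run_start = [], None
    for i, e in enumerate(E):
        if e > T:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                runs.append((run_start, i - 1))
            run_start = None
    # close a run that reaches the end of the sequence
    if run_start is not None and len(E) - run_start >= min_run:
        runs.append((run_start, len(E) - 1))
    if not runs:
        return None
    return runs[0][0], runs[-1][1]
```

Short energy dips between runs (speech pauses) are bridged because the end point is taken from the last qualifying run.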

3 Evaluation and Analysis

The accuracy of endpoint detection strongly influences the performance of speech recognition, since it triggers the recognizer on and off at the speech boundaries. In this paper, the performance of the proposed algorithm is tested with a grammar-based speaker-independent speech recognition system, and the reference endpoint detection approaches are the ETSI AFE [14] and the VAD of G.729B [9]. To cover the wide range of speech recognition equipment and circumstances, the test data were recorded in several environments with PDAs, telephones, and mobile phones, and each test file contains a segment of noisy speech with 2-6 syllables of Chinese words.
Table 1. Recognition Performance Results

Data  Total   Correct Rate (%)         Error Rate (%)           Rejection Rate (%)
              ETSI   G729   Proposed   ETSI   G729   Proposed   ETSI   G729   Proposed
1     6787    88.5   83.7   86.7       11.3   13.6   12.4       0.19   2.68   0.80
2     4498    62.3   62.0   67.3       37.3   35.0   30.2       0.39   2.96   2.38
3     1591    78.8   73.0   81.5       21.1   21.4   18.7       0      5.53   0.13
4     1592    62.3   61.9   63.4       37.5   37.8   35.5       0.13   0.25   1.07
5      796    39.2   40.9   42.6       60.8   59.0   56.5       0      0      0.88
6     1992    46.3   46.6   46.8       50.3   51.0   49.4       3.26   2.36   3.77
7      795    63.3   63.3   65.7       36.5   36.6   33.6       0.12   0      0.75
8      499    75.9   77.7   82.2       13.8   15.2   9.42       10.2   7.01   8.42
9       48    72.9   68.8   75         27.0   31.2   25         0      0      0
10     500    74.6   72.6   84.4       19.2   21.4   11.2       6.2    6      4.4
11     498    77.9   69.5   78.7       14.6   19.0   14.5       7.43   12.0   6.83

Table 1 shows the comparative results for the different EPDs. Data 1 and 2 were both recorded in a quiet office by telephone, but data 2 was in hands-free mode, so it contains much more noise from clicks, an electric fan, and so on. Data 3 was recorded by PDA in an office with the window open, so it contains much more impulse noise than data 1 and 2. Data 4 was recorded by PDA in a supermarket when there were not many customers; the most important noise was their stepping and clicking, occasionally with some of their voices. Data 5 was recorded in an airport lounge by PDA while the broadcast was not playing; the main interference came from people talking, stepping, and baggage moving nearby. Data 6 was recorded near a noisy roadside, with people talking and moving and vehicles running. Data 7 was recorded by PDA mobile outside the gate of a park, with people talking and vehicles moving around. Data 8 was recorded by GSM mobile at the side of a highway, so the noise from wind and vehicles was really serious. Data 9 was also recorded by GSM mobile near a highway, but music was playing and the mobile telephone was in adaptive noise-canceling mode, so whenever the speech begins, the volume of the noise is suppressed automatically. Data 10 was recorded by CDMA mobile in an office with the window open, and the speech volume was very low because of the telecom channel. Data 11 was recorded in an office with the window open, but the equipment was a PAS mobile, so the volume was a little higher than in data 10. As can be seen in Table 1, the proposed algorithm has performance comparable to G.729B and ETSI AFE in quiet environments, and outperforms them in most noisy environments, especially under time-varying noise. For the databases with noise from vehicles, steps, clicks, and other environmental noise, such as data 2, 3, and 8, the proposed algorithm is much superior to the standards. Even if the energy of the non-stationary noise is high, it is still rejected by the voice detection and cannot enter the recognizer, because there is no harmonic structure in its spectrum. However, the voice detection is not as effective against noise from other human voices, as can be seen in data 4, 5, 6, and 7. Such noise has harmonic character as well, so some of it can be detected as voice. Fortunately, the energy detection serves as the first step of speech detection, and noise with low energy is rejected first. The voice detection also discriminates some interfering voices from the user's voice, because the harmonics in far-away speech are usually not as continuous and clear as the user's. For interfering talkers nearby, more effective approaches are still needed. In the quiet environment of data 1, the proposed algorithm performs better than G.729B but not as well as ETSI AFE; this is because of misses in voice detection when the user's speech is short or hoarse.
Nevertheless, it is still acceptable for most practical purposes. The advantage of adaptive energy tracking is clear in the cases of data 9, 10, and 11, in which the volume is low or time-varying. By tracking the stationary noise in 4 sub-bands, the non-stationary signal is detected to start the voice detection, so the final detection is not affected by the level or variance of the signal.

4 Conclusion

This paper puts forward a speech endpoint detection algorithm for real noisy environments. It performs reliably in most noisy environments, especially those with abrupt changes of noise energy, which are typical in mobile and portable circumstances. Future research will focus on environments with interfering speech and music.

Acknowledgement

The authors would like to thank Heng Zhang for his helpful suggestions.

References
1. Lynch, J.F., Josenhans, J.G., Crochiere, R.E.: Speech/silence segmentation for real-time coding via rule based adaptive endpoint detection. Proc. ICASSP, Dallas (1987) 1348-1351
2. Li, Q.: Robust endpoint detection and energy normalization for real-time speech and speaker recognition. IEEE Transactions on Speech and Audio Processing 10(3) (2002) 146-157
3. Huang, L.S., Yung, C.H.: A novel approach to robust speech endpoint detection in car environments. Proc. ICASSP (2000) 1751-1754
4. Weaver, K., Waheed, K., Salem, F.M.: An entropy based robust speech boundary detection algorithm for realistic noisy environments. IEEE Conference on Neural Networks (2003) 680-685
5. Tucker, R.: Voice activity detection using a periodicity measure. IEE Proceedings 139(4) (1992) 377-380
6. Othman, H., Aboulnasr, T.: A semi-continuous state transition probability HMM-based voice activity detection. Proc. ICASSP V (2004) 821-824
7. Gazor, S., Zhang, W.: A soft voice activity detector based on a Laplacian-Gaussian model. IEEE Transactions on Speech and Audio Processing 11(5) (2003) 498-505
8. Li, K., Swamy, M.N.S., Ahmad, M.O.: Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold. IEEE Transactions on Speech and Audio Processing 13(5) (2005) 965-974
9. Benyassine, A., Shlomot, E., Su, H.Y.: ITU Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications. IEEE Communication Magazine (1997) 64-73
10. Marzinzik, M., Kollmeier, B.: Speech pause detection for noise spectrum estimation by tracking power envelope dynamics. IEEE Transactions on Speech and Audio Processing 10(2) (2002) 109-118
11. Zhang, T., Kuo, C.C.J.: Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing 9(4) (2001) 441-457
12. Tanyer, S.G., Ozer, H.: Voice activity detection in nonstationary noise. IEEE Transactions on Speech and Audio Processing 8(4) (2000) 478-482
13. Seneff, S.: Real-time harmonic pitch detector. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-26(4) (1978) 360-365
14. ETSI: ES 201 108 recommendation: Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms. (2002)
