% Recognition Accuracy

Method        Input             Sunna  Okkati  Rendu  Moodu  Nalugu  Idu  Aru  Edu  Enamidi  Tommidi
LPC+DTW       stored speech      100    100     100    100    100    100  100  100    100      100
LPC+DTW       real-time speech    -      -       -      -      -      -    -    -      -        -
MFCC+DTW      stored speech      100    100     100    100    100    100  100  100    100      100
MFCC+DTW      real-time speech    -      -       -      -      -      -    -    -      -        -
MFCC+HMM      stored speech      100    100     100    100    100    100  100  100    100      100
MFCC+HMM      real-time speech    -      -       -      -      -      -    -    -      -        -
WAVELET+DTW   stored speech      100    100     100    100    100    100  100  100    100      100
WAVELET+DTW   real-time speech    -      -       -      -      -      -    -    -      -        -
MFCCs have been the dominant features used for speech recognition for some time. Their success is due to their ability to represent the speech amplitude spectrum in a compact form. Each step in the process of creating MFCC features is motivated by perceptual or computational considerations. The following paragraphs examine these steps, and the assumptions behind them, in more detail.
The figure shows the process of creating MFCC features. The first step is to divide the speech signal into frames, usually by applying a windowing function at fixed intervals. The aim is to model small sections of the signal that are statistically stationary. The window function, typically a Hamming window, removes edge effects. A cepstral vector is generated for each frame.
[Figure: block diagram of the MFCC feature-creation process, starting from the input waveform]
The voice signal, sampled directly from the microphone, is processed to extract features. The method used for feature extraction is Linear Predictive Coding (LPC), using an LPC processor. The basic steps of the LPC processor include the following:

1. Preemphasis: The digitized speech signal, s(n), is put through a low-order digital filter to spectrally flatten the signal and to make it less susceptible to finite-precision effects later in the signal processing. The output of the preemphasis network, ~s(n), is related to the input s(n) by the difference equation

   ~s(n) = s(n) - a*s(n-1),   with typically 0.9 <= a <= 1.0.
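As a sketch, the preemphasis step amounts to a one-line first-order filter. The coefficient a = 0.95 and the toy signal below are illustrative choices, not values fixed by the text:

```python
def preemphasize(s, a=0.95):
    """First-order preemphasis: ~s(n) = s(n) - a*s(n-1); the first sample passes through."""
    return [s[n] - a * s[n - 1] if n > 0 else s[0] for n in range(len(s))]

signal = [1.0, 2.0, 3.0, 2.0, 1.0]                       # toy input samples
print([round(v, 2) for v in preemphasize(signal)])       # [1.0, 1.05, 1.1, -0.85, -0.9]
```

The filter boosts high frequencies and flattens the spectral tilt of voiced speech before the later analysis stages.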
2. Frame Blocking: The output of the preemphasis step, ~s(n), is blocked into frames of N samples, with adjacent frames separated by M samples. If x_l(n) is the l-th frame of speech and there are L frames within the entire speech signal, then

   x_l(n) = ~s(M*l + n),   where n = 0, 1, ..., N-1 and l = 0, 1, ..., L-1.

3. Windowing: After frame blocking, the next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1, then the result of windowing is the signal

   ~x_l(n) = x_l(n) * w(n),   where 0 <= n <= N-1,

a typical choice being the Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)).
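Frame blocking and Hamming windowing (steps 2 and 3) can be sketched as follows; the frame length N = 32, shift M = 16, and the toy signal are illustrative values only:

```python
import math

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_block(s, N, M):
    """x_l(n) = ~s(M*l + n): frames of N samples, shifted by M samples."""
    L = 1 + (len(s) - N) // M                  # number of complete frames
    return [s[M * l : M * l + N] for l in range(L)]

def window_frames(frames):
    """Multiply each frame pointwise by the Hamming window."""
    w = hamming(len(frames[0]))
    return [[x * wn for x, wn in zip(frame, w)] for frame in frames]

s = [float(n % 7) for n in range(100)]         # toy "speech" signal
frames = frame_block(s, N=32, M=16)            # 50% overlap between frames
windowed = window_frames(frames)
print(len(frames), len(windowed[0]))           # 5 32
```

With M < N the frames overlap, so no sample near a frame boundary is seen only through the tapered window edges.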
4. Autocorrelation Analysis: The next step is to autocorrelate each frame of the windowed signal in order to give

   r_l(m) = sum_{n=0}^{N-1-m} ~x_l(n) * ~x_l(n+m),   m = 0, 1, ..., p,

where the highest autocorrelation index, p, is the order of the LPC analysis.

5. LPC Analysis: The next processing step is the LPC analysis, which converts each frame of p+1 autocorrelations into an LPC parameter set by using Durbin's method. This can formally be given as the following algorithm:
   E^(0) = r(0)
   k_i = ( r(i) - sum_{j=1}^{i-1} a_j^(i-1) * r(i-j) ) / E^(i-1),   1 <= i <= p
   a_i^(i) = k_i
   a_j^(i) = a_j^(i-1) - k_i * a_{i-j}^(i-1),   1 <= j <= i-1
   E^(i) = (1 - k_i^2) * E^(i-1)

Solving the above recursively for i = 1, 2, ..., p, the LPC coefficients, a_m, are given as

   a_m = a_m^(p),   1 <= m <= p.
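Steps 4 and 5 can be sketched together in plain Python; the helper names and the decaying-exponential toy frame are illustrative:

```python
def autocorr(x, p):
    """r(m) = sum_n x(n)*x(n+m) for m = 0..p, over one windowed frame."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) for m in range(p + 1)]

def durbin(r):
    """Durbin's recursion: autocorrelations r(0..p) -> LPC coefficients a_1..a_p."""
    p = len(r) - 1
    a = [0.0] * (p + 1)          # a[1..p] hold the current coefficient set
    E = r[0]                     # prediction error energy E^(0)
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E
        a_new = a[:]
        a_new[i] = k             # a_i^(i) = k_i
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        E *= (1.0 - k * k)       # E^(i) = (1 - k_i^2) * E^(i-1)
    return a[1:], E

# Toy frame: a decaying exponential whose best first-order predictor is ~0.9.
x = [0.9 ** n for n in range(64)]
coeffs, err = durbin(autocorr(x, p=2))
print(round(coeffs[0], 3))       # close to 0.9
```

The recursion needs only O(p^2) operations per frame, which is why Durbin's method is the standard way to solve the LPC normal equations.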
LPC advantages:
- LPC provides a good model of the speech signal.
- It is a production-based method.
- LPC represents the spectral envelope with low-dimensional feature vectors.
- It provides linear characteristics.
- LPC leads to a reasonable source-vocal tract separation.
- LPC is an analytically tractable model; the method is mathematically precise and straightforward to implement in either software or hardware.

LPC disadvantages:
- LP models the input signal with constant weighting over the whole frequency range, whereas human perception does not have constant frequency resolution across that range.
- A serious problem with LPC coefficients is that they are highly correlated, while less correlated features are desirable for acoustic modeling.
- An inherent drawback of conventional LP is its inability to include speech-specific a priori information in the modeling process.
versa. Essential time-axis structures of speech patterns are continuity, monotonicity (a restriction on relative timing within an utterance), a limit on how fast acoustic parameters can change, and so on. These conditions can be realized as restrictions on the warping function F, or on the points c(k) = (i(k), j(k)).
1) Monotonic conditions: i(k-1) <= i(k) and j(k-1) <= j(k).
2) Continuity conditions: i(k) - i(k-1) <= 1 and j(k) - j(k-1) <= 1.
As a result of these two restrictions, the following relation holds between two consecutive points:

   c(k-1) = (i(k), j(k)-1), (i(k)-1, j(k)-1), or (i(k)-1, j(k)).
3) Boundary conditions: i(1) = 1, j(1) = 1, and i(K) = I, j(K) = J.
4) Adjustment window condition:

   |i(k) - j(k)| <= r,
where r is an appropriate positive integer, called the window length. This condition reflects the fact that time-axis fluctuations, in ordinary cases, never cause an excessive timing difference.
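The monotonicity, continuity, boundary, and adjustment-window conditions can all be folded into the usual DTW dynamic program. The sketch below uses the symmetric form (diagonal steps weighted 2, total normalized by I + J); the distance function and toy sequences are illustrative, and r must be at least |I - J| for the endpoint to remain reachable:

```python
def dtw(A, B, r=2, dist=lambda a, b: abs(a - b)):
    """Symmetric DTW with monotonicity, continuity, boundary, and
    |i - j| <= r adjustment-window conditions (r >= |len(A)-len(B)| required)."""
    I, J = len(A), len(B)
    INF = float("inf")
    g = [[INF] * J for _ in range(I)]
    g[0][0] = 2 * dist(A[0], B[0])          # boundary condition at c(1) = (1, 1)
    for i in range(I):
        for j in range(J):
            if i == j == 0 or abs(i - j) > r:
                continue                     # outside the adjustment window
            d = dist(A[i], B[j])
            best = INF
            if i > 0 and j > 0:
                best = min(best, g[i-1][j-1] + 2 * d)   # diagonal step, weight 2
            if i > 0:
                best = min(best, g[i-1][j] + d)          # step along the i-axis
            if j > 0:
                best = min(best, g[i][j-1] + d)          # step along the j-axis
            g[i][j] = best
    return g[I-1][J-1] / (I + J)             # time-normalized distance

A = [1.0, 2.0, 3.0, 2.0, 1.0]
B = [1.0, 1.0, 2.0, 3.0, 2.0, 1.0]
print(dtw(A, B, r=2))                        # 0.0: B is just a time-warped A
```

Because every candidate predecessor is one of the three points allowed by the monotonicity and continuity relations, the recursion searches exactly the permissible warping functions.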
Neither too steep nor too gentle a gradient should be allowed for the warping function F, because such deviations may cause undesirable time-axis warping. Too steep a gradient, for example, causes an unrealistic correspondence between a very short segment of pattern A and a relatively long segment of pattern B. In such a case, a short segment in a consonant or phoneme-transition part may happen to coincide well with an entire steady vowel part. Therefore, a restriction called the slope constraint condition is set upon the warping function F, as a restriction on its first derivative in discrete form. The slope constraint condition is realized as a restriction on the possible relation among (or the possible
configuration of) several consecutive points on the warping function, as shown in Fig. 6.2(a) and (b). Concretely, if point c(k) moves m consecutive times in the direction of the i (or j) axis, then c(k) is not allowed to step further in the same direction before stepping at least n times in the diagonal direction. The effective intensity of the slope constraint can be evaluated by the measure P = n/m.
The larger the P measure, the more rigidly the warping function slope is restricted. When P = 0, there are no restrictions on the warping function slope. When P = ∞ (that is, m = 0), the warping function is restricted to the diagonal line j = i, and nothing more than conventional pattern matching with no time normalization takes place. Generally speaking, if the slope constraint is too severe, time normalization does not work effectively; if it is too lax, discrimination between speech patterns in different categories is degraded. Thus, neither too large nor too small a value of P is desirable. Section IV reports the results of an investigation of the optimum compromise on the P value through several experiments.
In Fig. 6.2(c) and (d), two examples of permissible point c(k) paths under the slope constraint condition P = 1 are shown. The Fig. 6.2(c) type is directly derived from the above definition, while Fig. 6.2(d) is an approximated type with one additional constraint: the second derivative of the warping function F is restricted, so that the point c(k) path does not change direction orthogonally. This new constraint reduces the number of paths to be searched. Therefore, the simpler Fig. 6.2(d) type is adopted hereafter, except for the P = 0 case.
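For the P = 1 case, the slope constraint can be built directly into the recurrence as a three-way step pattern: an axis step must be sandwiched by diagonal steps. A minimal sketch, assuming symmetric weighting (diagonal moves weighted 2); the function name dtw_p1 and the toy inputs are illustrative:

```python
def dtw_p1(A, B, dist=lambda a, b: abs(a - b)):
    """Symmetric DTW with the P = 1 slope constraint: after one step along
    an axis, the path must immediately take a diagonal step."""
    I, J = len(A), len(B)
    INF = float("inf")
    d = [[dist(a, b) for b in B] for a in A]     # local distance table
    g = [[INF] * J for _ in range(I)]
    g[0][0] = 2 * d[0][0]
    for i in range(I):
        for j in range(J):
            if i == j == 0:
                continue
            cands = []
            if i >= 1 and j >= 2:                # j-step then diagonal
                cands.append(g[i-1][j-2] + 2 * d[i][j-1] + d[i][j])
            if i >= 1 and j >= 1:                # pure diagonal
                cands.append(g[i-1][j-1] + 2 * d[i][j])
            if i >= 2 and j >= 1:                # i-step then diagonal
                cands.append(g[i-2][j-1] + 2 * d[i-1][j] + d[i][j])
            if cands:
                g[i][j] = min(cands)
    return g[I-1][J-1] / (I + J)

print(dtw_p1([1.0, 2.0, 3.0, 2.0], [1.0, 2.0, 3.0, 2.0]))   # 0.0
```

Under this pattern a grid point directly beside the start, such as (1, 2) in 1-based terms, is unreachable, which is exactly the effect of forbidding two consecutive axis steps.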