
Master of Science Thesis Data Communication Systems

Vector Quantization and Speech Encoding Methods, Part 1


(Continued from the previous part)

Brunel University Electronic and Computer Engineering London, UK

Ippokratis Karakotsoglou System Engineer Telecommunications

2.1 Speech signal coding and transmission


2.1.1 Brief analysis of the speech signal
Speech signals are nonstationary; at best they can be considered quasi-stationary over short segments of 5-20 ms. I mention this because it is important to realise that the properties of speech are defined over such short segments. Speech can be classified as voiced, unvoiced or mixed. Voiced speech can be considered periodic in the time domain and harmonically structured in the frequency domain, whereas unvoiced speech is random and broadband. The energy of voiced segments is higher than that of unvoiced segments.
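As a rough illustration of this short-segment view, the sketch below splits a signal into 20 ms frames and computes the energy of each frame; voiced frames would typically show higher energy than unvoiced ones. This is a minimal sketch only, assuming Python with NumPy; the 20 ms frame length is taken from the range quoted above and the random test signal is a placeholder for real speech.

```python
import numpy as np

def short_time_energy(speech, fs=8000, frame_ms=20):
    """Split the signal into non-overlapping frames and return the
    average squared amplitude (energy) of each frame."""
    frame_len = int(fs * frame_ms / 1000)            # 160 samples at 8 kHz
    n_frames = len(speech) // frame_len
    frames = speech[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)              # one energy value per frame

# Placeholder signal; voiced frames (e.g. /a/) would give larger values
# than unvoiced frames (e.g. /s/) in the returned energy array.
energies = short_time_energy(np.random.randn(8000))
```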

2.1.2 Speech coding and transmission


Voiced speech is produced by exciting the vocal tract with air pulses, which are generated by the vibration of the vocal cords; unvoiced speech is produced by forcing air through the vocal tract. To produce a specific fragment of speech we need to generate a sequence of sound inputs, or excitation signals, together with a corresponding sequence of vocal tract approximations. At the transmitter the speech is divided into segments and analysed to determine the parameters of the vocal tract filter and a model of the excitation signal, both of which are transmitted to the receiver. The excitation signal is synthesized at the receiver and then used to drive the vocal tract filter.
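A minimal sketch of the excitation side of this model is given below, assuming Python with NumPy; the pitch period and signal length are illustrative values, not taken from any standard. A periodic impulse train stands in for voiced excitation and white noise for unvoiced excitation; either one would then be fed to a vocal tract filter such as the one sketched in the LPC subsection below.

```python
import numpy as np

def excitation(n_samples, voiced, pitch_period=80):
    """Excitation signal for the source-filter model: an impulse train
    (one pulse per pitch period) for voiced speech, white noise for
    unvoiced speech."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
        return e
    return np.random.randn(n_samples)                # random and broadband

voiced_exc = excitation(160, voiced=True)            # 20 ms at 8 kHz, 100 Hz pitch
unvoiced_exc = excitation(160, voiced=False)
```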

2.1.3 Channel Vocoder


In the channel vocoder each input segment of speech is analysed using a bank of analysis filters. The energy level at the output of each filter is estimated at fixed intervals and transmitted to the receiver. In digital implementations the energy estimates are the average squared values of the filter outputs; in analogue implementations they are sampled values of the envelope. An estimate is generally produced 50 times every second, and a decision is made on whether the speech in the segment is voiced or unvoiced. Voiced sounds include /a/, /e/ and /o/; unvoiced sounds include /s/ and /f/. Voiced sounds have a periodic structure, as can be seen in the short-time spectrum in the figure below; strictly speaking the structure is quasi-periodic rather than exactly periodic. This near-periodicity of voiced speech is caused by the vibration of the vocal cords. The shape of the spectral envelope that fits the short-time spectrum of voiced speech is related to the transfer characteristics of the vocal tract and to the spectral tilt due to the glottal pulse. The envelope has peaks, which are the resonant modes of the vocal tract; there are three to five resonances below 5 kHz. The first three usually appear below 3 kHz and are very important in speech synthesis and perception. These peaks are often called formants. The period of the fundamental harmonic is called the pitch period. Unvoiced sounds have a noise-like structure.

2.1.4 The Linear Predictive Coder Standard LPC-10


In this coder the vocal tract is modelled as a linear filter whose output $y_n$ is related to the input $\epsilon_n$ by

$$y_n = \sum_{i=1}^{M} b_i\, y_{n-i} + G\,\epsilon_n$$

where $G$ is the gain of the filter, the $b_i$ are the vocal tract filter coefficients and $M$ is the order of the filter.
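The sketch below implements this synthesis equation directly, assuming Python with NumPy; the coefficient values and the excitation are placeholders rather than LPC-10 parameters. Each output sample is a weighted sum of previous output samples plus the scaled excitation sample.

```python
import numpy as np

def lpc_synthesis(excitation, b, G=1.0):
    """All-pole LPC synthesis filter:
        y[n] = sum_{i=1..M} b[i] * y[n-i] + G * e[n]
    where b holds the vocal tract filter coefficients and G is the gain."""
    M = len(b)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        for i in range(1, M + 1):
            if n - i >= 0:
                y[n] += b[i - 1] * y[n - i]
        y[n] += G * excitation[n]
    return y

# Illustrative (not LPC-10) coefficients driving a random excitation.
y = lpc_synthesis(np.random.randn(180), b=[1.3, -0.6], G=0.5)
```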

2.1.5 LPC Transmitter-Receiver


At the transmitter a decision has to be made on whether the speech segment is voiced or unvoiced. This decision is made on the basis of the number of zero crossings within the speech segment: voiced speech has more energy, since its samples have larger amplitudes, while unvoiced speech contains higher frequencies and consequently crosses the zero axis more often than voiced speech; both voiced and unvoiced speech have average values close to zero. After the analysis is done at the transmitter, the parameters, including this decision and the pitch period, are sent to the receiver. The speech at the transmitter is sampled at 8000 samples per second and broken into segments of 180 samples, that is, 22.5 ms of speech per segment. To estimate the pitch period, LPC-10 uses an algorithm known as the Average Magnitude Difference Function (AMDF). To summarize, the parameters that need to be transmitted are the voiced/unvoiced indication, the pitch period and the coefficients of the vocal tract filter.
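The two measurements described above can be sketched as follows, assuming Python with NumPy; the voiced/unvoiced threshold and the lag search range are illustrative choices, not the values used by LPC-10.

```python
import numpy as np

def zero_crossings(frame):
    """Count sign changes; unvoiced speech crosses the zero axis more
    often than voiced speech."""
    return int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))

def amdf_pitch(frame, min_lag=20, max_lag=160):
    """Average Magnitude Difference Function: the lag minimising the mean
    |x[n] - x[n-lag]| is taken as the pitch period estimate (in samples)."""
    amdf = [np.mean(np.abs(frame[lag:] - frame[:-lag]))
            for lag in range(min_lag, max_lag)]
    return min_lag + int(np.argmin(amdf))

frame = np.random.randn(180)            # 22.5 ms of "speech" at 8000 samples/s
voiced = zero_crossings(frame) < 60     # illustrative threshold, not LPC-10's
pitch_period = amdf_pitch(frame)
```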

2.1.6 Code Excited Linear Prediction (CELP)


A basic problem in all speech coding systems is the reproduction of natural-sounding speech at the receiver, because the human ear is very sensitive to pitch errors. Several efforts have been made, and one approach introduced by Atal and Schroeder in 1984 has improved the quality of the received speech. The method is known as Code Excited Linear Prediction (CELP). CELP coders are generally capable of producing medium-rate and even low-rate speech of adequate communications quality. With this method the encoder has a codebook containing a variety of excitation signals. For each speech segment the encoder searches this codebook for the excitation vector which generates the synthesized speech that best matches the segment to be encoded.
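A toy version of this analysis-by-synthesis search is sketched below, assuming Python with NumPy; the random codebook, the small filter and the use of a plain rather than perceptually weighted error are simplifications of what a real CELP coder does.

```python
import numpy as np

def synthesize(excitation, b, G=1.0):
    """Small all-pole synthesis filter (same form as the LPC equation above)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = sum(b[i - 1] * y[n - i] for i in range(1, len(b) + 1) if n - i >= 0)
        y[n] = acc + G * excitation[n]
    return y

def celp_search(target, codebook, b):
    """Analysis-by-synthesis: return the index of the excitation vector whose
    synthesized output is closest (in mean squared error) to the target."""
    errors = [np.mean((target - synthesize(cv, b)) ** 2) for cv in codebook]
    return int(np.argmin(errors))

codebook = np.random.randn(64, 40)      # 64 stand-in excitation vectors
best_index = celp_search(np.random.randn(40), codebook, b=[1.3, -0.6])
```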

2.1.7 CCITT G.728 LD-CELP


Because there is some processing at the transmitting end, there is also a delay during the encoding process. This can be analysed as follows: we first wait to receive a speech segment of a given length and then extract its parameters. The time needed to extract the parameters is comparable to the time needed to receive a segment, so the overall delay is roughly the time needed to receive two segments. The CCITT-approved Recommendation G.728 deals with these issues. G.728 describes an algorithm for coding speech signals at 16 kbit/s using Low-Delay Code Excited Linear Prediction (LD-CELP) (ITU G.728 [20]). The LD-CELP encoder and decoder are shown in the figures below.

LD-CELP is also based on the analysis-by-synthesis idea used in CELP; however, backward adaptation of the predictors and of the gain achieves an algorithmic delay of 0.625 ms. Since the speech is sampled at 8000 samples per second, the 16 kbit/s rate corresponds to 2 bits per sample. In order to reduce the delay, the speech segment size has to be reduced significantly; the G.728 recommendation uses a segment of five samples. This means that only 5 × 2 = 10 bits are available for encoding the parameters of each segment. The algorithm obtains the vocal tract filter parameters in a backward-adaptive manner, by analysing previously decoded speech; similarly, the excitation gain is updated using the gain information in the previously quantized excitation. As noted above, backward adaptation achieves an algorithmic delay of 0.625 ms and leaves all ten bits free to encode the excitation sequence. The frame size is then 2.5 ms, while the subframe is 0.625 ms. LD-CELP does not use long-term prediction (LTP); instead, the order of the short-term predictor is increased to fifty (p = 50). The hybrid window on which the autocorrelation analysis for LP is based allows the autocorrelation sequences to be computed with single-precision integer arithmetic.

Only the index into the excitation codebook is transmitted. Ten bits can index 1024 sequences, but searching 1024 arbitrary excitation sequences every 0.625 ms would be computationally very demanding. The G.728 algorithm therefore uses a product codebook, in which each excitation sequence is represented by a normalized excitation sequence and a gain; the final excitation is the product of the normalized excitation and the gain. Of the ten bits, seven are used as an index into a shape codebook with 128 sequences and three are used to encode the gain.

G.728 encoder operation. For each input segment the encoder finds the one of the 1024 candidate codevectors stored in the excitation codebook that minimizes a frequency-weighted mean squared error. A group of five consecutive samples, taken at 125 µs intervals, is called a vector or codevector of the signal, and a group of four vectors forms one adaptation cycle. Each codevector is passed through a gain scaling unit and a synthesis filter, and the ten-bit codebook index of the best codevector is transmitted to the decoder. The excitation gain, the synthesis filter coefficients and the perceptual weighting are updated periodically.

G.728 decoder operation. The decoder also operates on a segment-by-segment basis. When it receives the ten-bit index it performs a table look-up to find the corresponding codevector in the excitation codebook. This codevector is then passed through a gain scaling unit and a synthesis filter to produce the decoded signal vector, which is finally processed to enhance the perceptual quality.
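The product (gain-shape) codebook idea can be illustrated as follows, assuming Python with NumPy. The codebook sizes match the 7-bit shape / 3-bit gain split described above, but the shape vectors and gain values are placeholders, and for brevity the error is computed directly on the excitation rather than on the output of the gain scaling unit and synthesis filter as G.728 actually does.

```python
import numpy as np

shapes = np.random.randn(128, 5)                           # 7-bit shape codebook (5-sample vectors)
shapes /= np.linalg.norm(shapes, axis=1, keepdims=True)    # normalized excitation sequences
gains = np.array([0.2, 0.5, 1.0, 2.0, -0.2, -0.5, -1.0, -2.0])  # 3-bit gain codebook

def product_codebook_search(target):
    """Search all 128 x 8 = 1024 gain/shape combinations and return the pair
    of indices minimising the squared error; 7 + 3 = 10 bits identify the result."""
    best, best_err = (0, 0), np.inf
    for i, shape in enumerate(shapes):
        for j, g in enumerate(gains):
            err = np.sum((target - g * shape) ** 2)
            if err < best_err:
                best, best_err = (i, j), err
    return best

shape_index, gain_index = product_codebook_search(np.random.randn(5))
```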

2.1.8 Commonly Used Speech Codecs


Speech codecs are divided into three classes: waveform codecs, source codecs and hybrid codecs. Waveform codecs are used at high bit rates and give very good quality speech. Source codecs operate at low bit rates, but the speech they produce sounds unnatural. Hybrid codecs try to fill the gap between waveform codecs and source codecs, giving more natural-sounding speech at intermediate rates. CELP is such a coder: like all hybrid coders it combines the features of traditional vocoders with the waveform-matching features of waveform coders. The figure shows how the speech quality of these codecs varies with the bit rate.

2.1.9 Speech Coding and mobile communications


Recent advances in speech coding after 1980 pushed mobile telephony towards the low-rate algorithms discussed above. In North America an 8 kbit/s hybrid coder has been selected, while in Europe the Groupe Spécial Mobile (GSM) has adopted a standard that uses a 13 kbit/s regular pulse excitation algorithm.

2.2 Quantization and information theory


2.2.1 Quality of a communication System
Shannon [1] proposed the following: with a particular system P(x,y) we define the rate $R_1$ of generating information for a given quality $v_1$ of reproduction as the minimum of R when v is kept fixed at $v_1$ and $P_x(y)$ is varied. That is,

$$R_1 = \min_{P_x(y)} \iint P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)}\,dx\,dy$$

subject to the constraint

$$v_1 = \iint P(x,y)\,\rho(x,y)\,dx\,dy,$$

where $\rho(x,y)$ is the measure of the quality (distortion) of reproduction.

2.2.2 Scalar quantization: the quantization problem


Using scalar quantization to compress/encode speech introduces a lot of distortion, because the encoder represents all the source outputs that fall into a particular interval by the same codeword mapped to that interval. I will try to define the design problem clearly here. Consider a source that generates numbers between -5.0 and +5.0, and suppose we have previously defined the intervals and the codewords mapped to them, as in the table at the end of this document. All values from 0.5 up to, but not including, 1.0 would be assigned the codeword 101; thus the input values 0.6 and 0.95 would be assigned the same codeword 101. At the other end, the decoder reconstructs based on the same mapping: an input value of 0.95 takes the code value 101, which maps to the reconstruction value 0.5, so the decoder decodes it as 0.5. This introduces an error of 0.95 - 0.5 = 0.45; an input of 0.95 to the quantizer results in an output of 0.5. This is the basic problem of all scalar quantizers.
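A minimal sketch of this mapping is given below, in Python; the interval boundaries and the reconstruction value 0.5 follow the worked example above, and the single-entry codeword table is a hypothetical stand-in for the full table at the end of the document.

```python
def encode(x):
    """Map an input in [0.5, 1.0) to the codeword '101'; the other intervals
    of the hypothetical codeword table are omitted for brevity."""
    if 0.5 <= x < 1.0:
        return "101"
    raise ValueError("value outside the illustrated interval")

def decode(codeword):
    """The decoder maps codeword '101' back to the reconstruction value 0.5."""
    return {"101": 0.5}[codeword]

x = 0.95
y = decode(encode(x))     # both 0.6 and 0.95 decode to the same value, 0.5
error = x - y             # 0.95 - 0.5 = 0.45, the quantization error
```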

The difference between the input x and the output y is called the quantization error, and is sometimes referred to as quantization distortion or quantization noise. It is interesting to find the variance of this difference (x - y). The variance measures the deviation from the mean value; because the deviation may be negative, we take its square, which is why we speak about the mean squared quantization error $\sigma^2$. The variance is the mean of the squared deviations. Although the amplitude levels of speech define a discrete random variable, they are modelled with continuous distributions because this simplifies the analysis. The variance of a discrete set is given by $\sigma^2 = \sum_i p_i\,(x_i - y_i)^2$, where $y_i$ is the reconstruction value for $x_i$; the standard deviation $\sigma$ is its square root. For a continuous random variable the variance is given by $\sigma^2 = \int_{-\infty}^{+\infty} (x - y)^2 f(x)\,dx$, where f(x) is the pdf of x, and the standard deviation is again the square root of this quantity.

2.2.3 Uniform Quantization


In uniform quantization all intervals are of the same size. The samples are represented, according to the interval into which they fall, by reconstruction values that are also evenly spaced. This constant spacing is called the step and is characterised by its step size. Two common kinds of uniform quantizer are the midrise quantizer, which does not have zero as one of its output levels, and the midtread quantizer, which does. When we use a fixed-length code in which each codeword is n bits long, the number of reconstruction levels is $2^n$. The signal-to-noise ratio can be shown to be approximately 6.02n dB, so for every additional bit in the quantizer we get an increase in the SNR of 6.02 dB.
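The 6.02 dB-per-bit behaviour can be checked empirically with the sketch below, assuming Python with NumPy; the uniformly distributed test signal and the input range are illustrative choices.

```python
import numpy as np

def uniform_quantize(x, n_bits, xmax=1.0):
    """n-bit uniform quantizer over [-xmax, xmax): 2**n_bits evenly spaced
    reconstruction levels, one at the midpoint of each interval."""
    levels = 2 ** n_bits
    step = 2 * xmax / levels
    index = np.clip(np.floor((x + xmax) / step), 0, levels - 1)
    return -xmax + (index + 0.5) * step

x = np.random.uniform(-1.0, 1.0, 100_000)
for n_bits in (4, 6, 8):
    q = uniform_quantize(x, n_bits)
    snr_db = 10 * np.log10(np.mean(x ** 2) / np.mean((x - q) ** 2))
    print(n_bits, round(snr_db, 2))    # rises by roughly 6.02 dB per extra bit
```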

2.2.4 PCM (Pulse Code Modulation)


Speech is sampled 8000 times per second and each sample is then quantized. PCM can be thought of as the conversion of an analogue signal to a digital one (A/D conversion). Analogue signal amplitudes can take any value over a continuous range, whereas digital amplitudes can take only a finite number of values. If the amplitude of the analogue signal ranges from -L to +L, then by partitioning this range (-L, +L) into n subintervals, each subinterval has a width of k = 2L/n. Each sample is approximated by the midpoint of the subinterval in which it falls, so we have n levels to represent the signal, and representing it in binary form requires $\log_2 n$ bits per sample. For good speech quality it was found that, using non-uniform quantization, 8 bits per sample are sufficient. PCM is the simplest method but also the most expensive in terms of data rate, since there is no mechanism for exploiting signal redundancy (correlation).
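The arithmetic above can be written out as a short sketch in Python; L, n and the 8 kHz sampling rate follow the text, while the function names are purely illustrative.

```python
import math

def pcm_parameters(L=1.0, n=256, fs=8000):
    """Uniform PCM over [-L, +L]: subinterval width k = 2L/n, log2(n) bits
    per sample, and the resulting bit rate."""
    k = 2 * L / n
    bits_per_sample = math.log2(n)          # 8 bits for n = 256 levels
    bit_rate = fs * bits_per_sample         # 8000 * 8 = 64000 bits per second
    return k, bits_per_sample, bit_rate

def pcm_encode(x, L=1.0, n=256):
    """Approximate a sample by the midpoint of the subinterval it falls into."""
    k = 2 * L / n
    index = min(int((x + L) / k), n - 1)
    return -L + (index + 0.5) * k
```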

2.2.5 DPCM (Differential Pulse Code Modulation)


This is the process of encoding a signal using knowledge of its past behaviour up to a certain point in time. Prediction is then used to obtain estimates of the future values of the signal, and the difference between the actual signal and its prediction is what is encoded.

2.2.6 ADPCM (Adaptive Differential Pulse Code Modulation)


Adaptive Differential Pulse Code Modulation, instead of quantizing the speech signal directly, quantizes the difference between the signal and a prediction made from it. If the prediction is good, this difference has a lower variance than the signal itself and can be quantized with fewer bits. During decoding the quantized difference is added to the predicted signal, so the final reconstructed signal is the sum of the prediction and the difference.
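A very small sketch of this idea is given below, assuming Python with NumPy; the fixed first-order predictor and the coarse uniform quantizer are placeholders for the adaptive predictor and quantizer of a real ADPCM coder.

```python
import numpy as np

def quantize_diff(d, step=0.05, levels=8):
    """Coarse uniform quantizer for the prediction error (difference) signal."""
    index = int(np.clip(np.round(d / step), -(levels // 2), levels // 2 - 1))
    return index * step

def dpcm_encode_decode(x, a=0.9):
    """Predict each sample from the previous reconstructed sample, quantize
    only the difference, and rebuild the signal as prediction + difference."""
    recon = np.zeros_like(x)
    prev = 0.0
    for n in range(len(x)):
        prediction = a * prev
        d_hat = quantize_diff(x[n] - prediction)
        recon[n] = prediction + d_hat
        prev = recon[n]
    return recon

reconstructed = dpcm_encode_decode(np.random.randn(160) * 0.1)
```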
