
Packet Loss and Compression Effects on Vocal Recognition

Pedro Mayorga O.1,2, Laurent Besacier3, Ana M. Hernández C.1,2


1
Instituto Tecnológico de Mexicali, 2Facultad de Ingeniería de la UABC, 3GEOD Team-CLIPS
Laboratory, UMR CNRS 5524, INPG-UJF
1,2
Mexicali B.C., México, 3Grenoble Cedex 9, France.
pedromayorga@hotmail.com, ahmxli@yahoo.com

Abstract

This paper deals with the effects of packet loss and voice data compression on vocal recognition over IP connections. One contribution of this paper consists in diagnosing more precisely the problems due to compression and packet loss for two different recognition tasks: automatic speech recognition and speaker recognition. Another contribution is the proposal of recovery techniques to improve the robustness of these systems under significant packet loss conditions. The diagnosis revealed a more important degradation due to compression in the speaker verification task. The experimental results show that interleaving at the transmitter, combined with interpolation at the receiver, proves to be the most efficient technique.

Keywords –– Packet loss, robustness, VoIP.

1. Introduction

It is now possible to access a distant vocal recognition server in different ways, from mobile phones or from other communication terminals. However, remote vocal recognition faces several constraints: firstly, lost packets, and secondly, the degradation introduced when speech coders are applied for compression. Today the most popular speech coders for voice transmission over IP (VoIP) are G.723.1, G.729, G.728, G.726/7 and G.711. This set of coders is also used in video transmission and is part of the H.323 standard. There are several software packages for videoconferencing which can also be used for voice transmission on the Internet, for example Microsoft's NetMeeting™ [1][2]. Microsoft's NetMeeting uses the H.323 standard.

In interactive speech applications over the Internet, the UDP transport protocol is used to carry speech or audio packets; nevertheless, a data packet can be lost and UDP does not have any error recovery mechanism. In the literature, it has been shown that packet loss degrades speech recognition performance more dramatically than speech compression does [3][4][5]. On the other hand, for another task like speaker verification, the effect of packet loss is not very important, while noise and speech compression effects are more crucial [6][7]. Different strategies can be proposed to access a distant vocal recognition server [8][9]. Recently, one solution was proposed in which speech spectral analysis (the extraction of acoustic vectors) is performed on the client terminal, so that only the acoustic features are transmitted to the distant vocal recognition server. This approach avoids speech compression degradation, but does not solve the problem of packet loss over the Internet.

2. Methodology

1) Databases used: We conducted a series of speech recognition experiments with 120 recorded sentences focused on a reservation and tourist information task (the CSTAR120 database), used in the CSTAR project [10].

For the speaker verification task, the XM2VTS database [11] was used. In acquiring XM2VTS, 295 volunteers from the University of Surrey visited a recording studio four times at approximately one-month intervals. On each visit (session) two recordings (shots) were made. The first shot consisted of speech while the second consisted of rotating head movements. This multimodal database is used by many partners of COST Action 275 [12]. The work described in this paper was done on its speech part, where the volunteers were asked to read three sentences twice. The three sentences remained the same throughout all four recording sessions, and a total of 7080 speech files were made available on four CD-ROMs. The audio, which had originally been stored as mono, 16-bit, 32 kHz PCM wave files, was

Proceedings of the Electronics, Robotics and Automotive Mechanics Conference (CERMA'06)


0-7695-2569-5/06 $20.00 © 2006

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR. Downloaded on February 20, 2009 at 05:43 from IEEE Xplore. Restrictions apply.
downsampled to 8 kHz. This is the input sampling frequency required by the speech codecs used for VoIP.

2) Codecs used: H.323 is a standard commonly used for transmitting voice and video over IP networks. The audio codecs used in this standard are G.711, G.722, G.723, G.728 and G.729. We conducted our experiments with the codec that has the lowest bit rate, G.723.1 (6.4 and 5.3 kbit/s), and the one with the highest bit rate, G.711 (64 kbit/s).

3) Speech recognition experiments: The French speech recognition system RAPHAEL was used. RAPHAEL uses the JANUS-III toolkit from CMU [13]. The context-dependent acoustic model (750 CD codebooks, 16 Gaussians each) is trained on a corpus containing 12 hours of continuous speech from 72 speakers extracted from the Bref80 database [14]. The system uses 24-dimensional LDA features obtained from 43-dimensional acoustic vectors (13 MFCC, 13 ΔMFCC, 13 ΔΔMFCC, E, ΔE, ΔΔE, and a zero-crossing parameter) extracted every 10 ms. The vocabulary (5k words) contains nearly 5500 phonetic variants of 2900 distinct words; it is specific to the tourist reservation and information domain. The trigram language model used in our experiments was computed from Internet documents, because it has been shown that they provide a very large amount of training data for spoken language modeling [15].

Speaker verification experiments with the ELISA system: The ELISA consortium groups several laboratories working on speaker verification. One of the main objectives of the consortium is to emphasize the assessment of performance. In particular, the consortium has developed a common speaker verification system which has been used by participants in various NIST speaker verification evaluation campaigns [16].

The ELISA system is a complete framework designed for speaker verification. It is a GMM-based system including audio parameterization as well as score normalization techniques for speaker verification.

For the purpose of our experiments, the Lausanne protocol (configuration 2), already defined for the XM2VTS database, is adopted. There are 199 clients in the XM2VTS DB. The training of the client models is carried out using the full sessions 1 and 2 of the client part of XM2VTS [17]. 398 client test accesses are obtained using the full session 4 (both shots) of the client part. 111440 impostor accesses are obtained using the impostor part of the database (70 impostors x 4 sessions x 2 shots x 199 clients = 111440 impostor accesses). The 25 evaluation impostors of XM2VTS are used to develop a world model. The text-independent speaker verification experiments are conducted in matched conditions (same training/test conditions).

The ELISA system on XM2VTS is based on the LIA system presented at the NIST 2002 speaker recognition evaluation. The speaker verification system uses 32 parameters (16 LFCC + 16 ΔLFCC). Silence frame removal is applied before the vectors are centered (CMS) and variance-normalized.

For the world model, a 128-Gaussian-component GMM was trained using Switchboard II phase II data (8 kHz landline telephone) and then adapted (MAP [18], means only) on XM2VTS data (the 25-evaluation-impostor set). The client models are 128-Gaussian-component GMMs obtained by adapting (MAP, means only) this world model. The decision logic is based on the conventional log-likelihood ratio (LLR).

4) Packet loss and compression effects over IP networks: This section proposes a methodology for evaluating voice packets over IP networks. The idea is to duplicate an existing database (XM2VTS or CSTAR120) used for vocal recognition by passing its speech signals through different coders and different network conditions representative of what can occur over the Internet.

Packet loss simulation and speech compression: There are two main transport protocols used on IP networks: UDP and TCP. While the UDP protocol does not allow any recovery of transmission errors, TCP includes some error recovery processes. However, the transmission of speech via TCP connections is not very realistic, due to the requirement for real-time (or near real-time) operation in most speech-related applications [19]. As a result, the choice is limited to UDP, which involves the packet loss problem. The process of audio packet loss can be simply characterized using a Gilbert model [20] consisting of two states (Fig. 1).

Figure 1. Gilbert model (state 0 with self-loop probability 1-p, state 1 with self-loop probability 1-q, and transition probabilities p and q between them).

One of the states (state 1) represents a packet loss and the other (state 0) represents the case where packets are correctly transmitted. The transition probabilities in this statistical model, as shown in Fig.

1, are represented by p and q. In other words, p is the probability of going from state 0 to state 1, and q is the probability of going from state 1 to state 0.

Different values of p and q define the different packet loss conditions that can occur on the Internet. The probability that n consecutive packets are lost is given by p(1-q)^(n-1). If (1-q) > p, then the probability of losing a packet is greater after having already lost a packet than after having successfully received one. This is generally the case in data transmission on the Internet, where packet losses occur as bursts. Note that p + q is not necessarily equal to 1. Different values of p and q, representing different network conditions, were used in our experiments.

Two degraded versions of each database (CSTAR120 in the speech recognition case and XM2VTS in the speaker verification case) were obtained by applying the G.711 and G.723.1 codecs alone, without any packet loss.

Six degraded versions of XM2VTS were obtained using simulated packet loss conditions: two conditions (average/bad) and three speech qualities (clean/G711/G723.1). The simulated average and bad network conditions considered in this study correspond to 9% and 30% speech packet losses, respectively. Each packet contained 30 ms of speech, which is consistent with the duration proposed in RTP (the real-time protocol used under H.323).

The CSTAR120 database (in the speech recognition case) was duplicated into several versions, according to the Gilbert model with 5 different conditions (Table I).

Table I. Different packet loss conditions

condition   1      2      3      4      5
p           0.10   0.05   0.07   0.20   0.25
q           0.14   0.06   0.10   0.40   0.62

Real-conditions packet loss: In order to investigate the effects of real network conditions, it was decided to play and record the whole database (CSTAR120 in the speech recognition case and XM2VTS in the speaker verification case) through the network. This was carried out by playing the speech dataset into a computer set up for videoconferencing. For this purpose, a transatlantic connection was established between France and Mexico using videoconferencing software. The microphone on the French site was then replaced by the audio output of a computer playing the speech signal database. Due to numerous network breakdowns, the transmission of the speech signals had to be conducted over several different connections established on different days and at different times. This, of course, provided variations in network conditions that occur in the case of real applications.

5) Reconstruction strategies: The typical mechanisms against packet loss belong to two classes. Firstly, the Automatic Repeat reQuest (ARQ) mechanisms, based on the retransmission of packets that did not arrive at their destination; unfortunately, these mechanisms cannot be used for real-time audio transmission [3][8]. Secondly, the Forward Error Correction (FEC) mechanisms, based on the transmission of redundant information [3][8], like the XOR technique. Another approach in audio transmission is the technique of interleaving. This mechanism is based on the redistribution of samples, in such a way that if a packet is lost, the data can be reconstructed by a repair mechanism [3], for example interpolation, Lagrange interpolation, the repetition technique, or the XOR technique.

Interleaving: The interleaving technique distributes the effect of lost packets in order to reduce bursting effects over IP connections; that is, the information of a speech segment is spread over the other packets. The original data units are not transmitted in the sequential order produced by the coder, but are interleaved by the transmitter as shown in Fig. 2. The data units are regrouped in crossed form before transmission, in such a way that the units are redistributed and separated, putting distance between the lost ones. At the receiver, they are rearranged into their original sequence. As can be seen, the loss of a packet results in the loss of several speech units distributed over the other packets.

Repetition: The repetition technique consists in replacing the lost packets with copies of the last received packet. Here there is a tradeoff between low complexity and reasonably good performance [8]. The repetition process can be observed in Fig. 3.

Figure 2. Interleaving technique (units 1..16 are sent as 1 5 9 13 | 2 6 10 14 | 3 7 11 15 | 4 8 12 16; losing the second packet leaves the isolated gaps 1 _ 3 4 5 _ 7 8 9 _ 11 12 13 _ 15 16 after de-interleaving).

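The Gilbert channel used above to build the degraded database versions is straightforward to reproduce. The sketch below is an illustrative Python reconstruction, not the authors' code; the function names and the fixed seed are our own choices.

```python
import random

def gilbert_trace(p, q, n_packets, seed=0):
    """Simulate the two-state Gilbert channel of Fig. 1.

    State 0 = packet received, state 1 = packet lost.
    p = P(0 -> 1), q = P(1 -> 0); returns a list of 0/1 loss flags.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    state = 0
    trace = []
    for _ in range(n_packets):
        if state == 0:
            if rng.random() < p:
                state = 1
        elif rng.random() < q:
            state = 0
        trace.append(state)
    return trace

def burst_probability(p, q, n):
    """Probability that n consecutive packets are lost: p * (1-q)^(n-1)."""
    return p * (1 - q) ** (n - 1)

# "Bad network" condition of Table III: p = 0.25, q = 0.4.
trace = gilbert_trace(0.25, 0.4, 200_000)
loss_rate = sum(trace) / len(trace)
# The stationary loss rate of this chain is p / (p + q), about 0.38 here.
```

The sketch makes the burstiness argument concrete: whenever (1-q) > p, a loss is more likely immediately after a loss than after a successful reception, so losses cluster into bursts rather than occurring independently.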
Figure 3. Repetition technique (packets 1 2 3 4 | 5 6 7 8 | 9 10 11 12 | 13 14 15 16; when the second packet is lost, it is replaced by a copy of the first, giving 1 2 3 4 1 2 3 4 9 10 11 12 13 14 15 16).

Figure 4. Interpolation technique (a lost Packet I is estimated from its neighbours: Packet I = ((Packet I-1) + (Packet I+1))/2).

Simple interpolation: Simple interpolation, or calculation of the average, consists in interpolating using the packets after and before the lost packet, as shown in Fig. 4. The advantage of this technique is its simplicity; nevertheless, its efficiency decreases as the number of lost packets increases.

3. Results

A. Degradation with real transmissions

This experiment was performed to show the influence of the degradation due to real transmissions over IP; different audio bitstreams were used: G723, G711 and PCM (no codec). The speech recognition performance was assessed, and a summary of the results is shown in Table II.

Table II. Results of WER and packet loss rate (PLR) in real VoIP conditions

Audio bitstream                G723   G711   PCM
Mean PLR                       31.8   29.8   30.5
Mean WER                       81.8   62.9   53.5
Correlation coef. (WER, PLR)   0.28   0.49   0.64

In real VoIP conditions, three problems add up: noise due to our experimental transmission protocol (not quantified here), degradation due to speech compression, and packet losses. Comparable PLRs were found for the three conditions: 31.8% for G723, 29.8% for G711 and 30.5% for PCM, which means that on average the same quantity of signal is lost. But, as can be seen in Table II, the highest WER (Word Error Rate) is for G723, with an average of 81.8%, then 62.9% for G711, and 53.5% for PCM. Thus, for the same packet loss rate, the higher the compression level, the higher the WER. This difference may be due to the effect of the compression itself, but also to the fact that in real transmissions with G723 (the highest compression degree), one lost packet represents a bigger quantity of consecutive voice information than when the G711 codec or no codec (PCM) is used. The lost information therefore occurs as broad bursts for G723, whereas it is more spread out for the G711 and PCM data. The correlation between WER and PLR was also measured, and the results show that real conditions do not lead to the ideal, predictable results obtained in simulated conditions (in which a correlation of 0.98 was obtained), since the correlation between WER and PLR is lower (0.64 instead of 0.98) and tends to decrease with additional factors like speech compression.

B. Reconstruction performance

With recovery techniques, it is easier to recover isolated losses than losses of long duration. The interleaving technique has the advantage of cutting long-duration losses (burst losses) into smaller pieces (small time-disjoint intervals). Another advantage is that interleaving does not increase the network load. Unfortunately, if a great number of consecutive packets are interleaved, the introduced delay might be prohibitive.

Combined with the interleaving technique, other techniques can be used to estimate the characteristics of the lost voice units. Interpolation and repetition have as their main advantages simplicity and low delay. When near-neighbour units are used to estimate lost units, they give very good performance. With these techniques we regain about 77% of the recognition performance.

C. Speaker verification results

Table III. Results in EER % of speaker verification experiments

Network condition                        Clean    G711     G723
No packet loss                           0.25 %   0.25 %   2.68 %
Average condition (p=0.1, q=0.7)         0.25 %   0.25 %   6.28 %
Bad network condition (p=0.25, q=0.4)    0.50 %   0.75 %   9 %

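To make the repair mechanisms of Figs. 2-4 concrete, here is an illustrative Python sketch of interleaving, repetition, and simple interpolation using the 16-unit example of Fig. 2. This is our own reconstruction; the helper names are not from the paper.

```python
def interleave(units, depth):
    """Column-wise reordering, as in Fig. 2:
    [1..16] with depth 4 -> [1,5,9,13, 2,6,10,14, 3,7,11,15, 4,8,12,16]."""
    return [units[i] for d in range(depth) for i in range(d, len(units), depth)]

def deinterleave(units, depth):
    """Inverse mapping performed at the receiver (None marks a lost unit)."""
    out = [None] * len(units)
    pos = 0
    for d in range(depth):
        for i in range(d, len(units), depth):
            out[i] = units[pos]
            pos += 1
    return out

def repair_by_repetition(units):
    """Replace each lost unit by the last correctly received one (Fig. 3)."""
    out, last = [], None
    for u in units:
        if u is None:
            u = last
        else:
            last = u
        out.append(u)
    return out

def repair_by_interpolation(units):
    """Replace each lost unit by the average of its two neighbours (Fig. 4)."""
    out = list(units)
    for i in range(1, len(units) - 1):
        if units[i] is None and units[i-1] is not None and units[i+1] is not None:
            out[i] = (units[i-1] + units[i+1]) / 2
    return out

units = list(range(1, 17))              # 16 speech units
sent = interleave(units, 4)             # 4 packets of 4 units each
received = sent[:4] + [None] * 4 + sent[8:]   # second packet lost in transit
gaps = deinterleave(received, 4)        # burst becomes 4 isolated gaps
repaired = repair_by_interpolation(gaps)
```

With depth-4 interleaving, losing one 4-unit packet leaves four isolated single-unit gaps instead of one 4-unit burst, which is exactly the situation where neighbour-based interpolation or repetition works best.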
Comparing these results with those for speech recognition, it can be said that speaker verification performance is far less sensitive to packet loss. On the other hand, the last column of Table III shows that speaker verification performance is adversely affected by compression effects.

4. Conclusion

In our experiments, speaker verification performance is more sensitive to compression effects than to packet loss; on the other hand, speech recognition is more sensitive to packet loss effects. Recovery strategies prove effective when interleaving is used to distribute the lost speech units, combined with reconstruction techniques like interpolation or repetition.

5. References

[1] F. Metze, J. McDonough and H. Soltau, Speech recognition over NetMeeting connection, Eurospeech, September 2001, Scandinavia.
[2] L. Besacier, C. Bergamini, D. Vaufreydaz, E. Castelli, The effect of speech and audio compression on speech recognition performance, IEEE Multimedia Signal Processing Workshop, Cannes, France, October 2001.
[3] P. Mayorga-Ortiz, R. Lamy, L. Besacier, Recovering of packet loss for distributed speech recognition, Proc. EUSIPCO 2002, Toulouse, France, Sep. 2002.
[4] B. Milner and S. Semnani, Robust speech recognition over IP networks, ICASSP 2000, Istanbul (Turkey).
[5] N. W. D. Evans, J. S. Mason, R. Aukenthaler and R. Staper, Assessment of speaker verification degradation due to packet loss in the context of wireless mobile devices, COST 275 Workshop on Biometrics over the Internet, Rome, Italy, November 7-8, 2002.
[6] L. Besacier, P. Mayorga, J.-F. Bonastre, C. Fredouille, Methodology for evaluating speaker verification robustness over IP networks, COST 275 Workshop on Biometrics over the Internet, Rome, Italy, November 7-8, 2002.
[7] L. Besacier, P. Mayorga, J.-F. Bonastre, C. Fredouille, S. Meignier, Overview of compression and packet loss effects in speech biometrics, IEE Proceedings Vision, Image & Signal Processing, Special Issue on Biometrics on the Internet, Vol. 150, No. 6, December 2003.
[8] C. Perkins, O. Hodson, V. Hardman, A survey of packet loss recovery techniques for streaming audio, IEEE Network Magazine, pp. 40-48, Sept./Oct. 1998.
[9] M. Yajnik, S. Moon, J. Kurose and D. Towsley, Measurement and modeling of temporal dependence in packet loss, IEEE Infocom'99, New York, USA, March 1999.
[10] http://www.c-star.org.
[11] http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb.
[12] http://cost.cordis.lu.
[13] T. Zeppenfeld, M. Finke, K. Ries, M. Westphal, A. Waibel, Recognition of conversational telephone speech using the Janus speech engine, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, 1997.
[14] L. F. Lamel, J. L. Gauvain, M. Eskenazi, BREF, a large vocabulary spoken corpus for French, Eurospeech, Genoa, Italy, Vol. 2, pp. 505-508, 24-26 September 1991.
[15] D. Vaufreydaz, M. Akbar, M. Rouillard, J. Caelen, Internet documents: a rich source for spoken language modeling, ASRU'99, Keystone, Colorado (USA), pp. 277-280.
[16] I. Magrin-Chagnolleau, G. Gravier, and R. Blouet for the ELISA consortium, Overview of the 2000-2001 ELISA consortium research activities, 2001: A Speaker Odyssey - The Speaker Recognition Workshop, pp. 67-72, Crete, Greece, June 2001.
[17] J. Luettin and G. Maître, Evaluation protocol for the XM2VTSDB database (Lausanne protocol), Technical Report 05, IDIAP Communication, Martigny, Valais, Switzerland, October 1998.
[18] J.-L. Gauvain and C.-H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Trans. Speech Audio Process., April 1994, 2, pp. 291-298.
[19] U. Black, Voice over IP, Prentice Hall, 2001.
[20] J.-C. Bolot and S. Fosse-Parisis, Adaptive FEC-based error control for Internet telephony, Proc. IEEE Infocom'99, New York, NY, USA, March 1999.

