1767 1913 1 RV

Review of "Gaussian mixture model based quantization of line spectral frequencies for adaptive multirate speech codec".
General comment

The manuscript describes an approach for optimizing the quantization of LSP using Huffman coding and GMM. The main criticism is that the contribution of the work is not clearly defined. There are so many papers on AMR speech coders that it is very difficult to know whether the reported approaches are novel.

[Authors]: The idea of using GMM-based spectral envelope quantizers has already been described in several papers [Hedelin & Skoglund, 2000; Subramaniam & Rao, 2001; Subramaniam & Rao, 2003]. However, applying such a concept to a real codec introduces the problem of the fixed code length available for spectral envelope coding, which is at odds with the concept of entropy coding. When a GMM-based spectral envelope quantizer is implemented in a typical CELP-based codec, the interaction between the excitation model (fixed and adaptive codebook) and the quantized spectral envelope model (LPC/LSF) comes to the fore. The closed-loop nature of CELP codecs makes the analysis of the two inseparable. For this reason, we have implemented and evaluated the proposed GMM-based spectral envelope quantizer in a real CELP-based codec. We have chosen the AMR codec as a typical representative of such codecs, as it is widely used in GSM and UMTS systems. We propose two techniques to adapt the entropy coding to the fixed-rate modes of the AMR codec, which represent the main novelty of this paper compared to the previous work. The performance of such a quantizer has been analyzed by evaluating the incurred spectral envelope distortion and also by measuring the performance of the entire CELP codec using the PESQ measure. We have updated the Introduction chapter of the manuscript with this information, in order to emphasize and define the contribution of the work more clearly.
The authors compare the performance with a 'referent' coder, but only on page 11 is there a reference to the 3GPP technical specification, which also makes the source code available, so one may conclude that the referent coder is probably the one indicated in the first reference. You should state at the outset what the referent coder is. However, why do the authors use that coder as a reference? Why did the authors not take some more recent work as a reference? Perhaps because there is no code available? In that case I would spend some time, at the initial stage of the research, adapting the available code to at least one more recent version and then start from that, in order to eventually show that the proposed algorithm represents a clear advance in the field. It is clear that if you take an old implementation of the coder it is easy to get some improvement. In other words, the authors should clearly indicate what the referent coder is and explain why they compare with exactly that coder.

[Authors]: Your comment was kindly accepted. The referent AMR codec is now clearly specified in the Introduction chapter. We have chosen the AMR codec because of its prevalence in today's telecom systems (GSM and UMTS), but the proposed concept can be applied to any other constrained-resolution predictive or memoryless LSF coder. We use the AMR codec for concept verification purposes only. That way we obtained relevant results without designing an entire codec from scratch, i.e. only by replacing the spectral envelope quantization part. This approach makes it possible to objectively compare the LSF quantizers in the proposed and referent codecs. For the first manuscript revision, we used the v5.2.0 floating-point version of the referent AMR codec (3GPP TS 26.104), dated 06/2003. It is a version that we had experimented with earlier, so we used it for the purpose of this manuscript as well. Its spectral envelope quantizer is identical to the one used in the latest version of the referent AMR codec (v9.0.0, dated 12/2009), i.e. the spectral envelope quantizer in the referent codec was not changed through the subsequent codec revisions. To be more specific, between the latest version (v9.0.0) and the v5.2.0 used in our initial research there are only minor differences concerning the Discontinuous Transmission (DTX) functionality and some bit order changes in comfort noise frames. An excerpt from the revision history is included below:
Date        Version  Change
2004-01-06  5.3.0    Correction on the implementation of the interface of decoder.c
2004-01-06  6.0.0    Correction on the default behaviour of the unix makefile
2004-04-01  6.1.0    Correction of floating point AMR DTX functionality
2007-06-21  7.0.0    Bit order of Mode Indication in AMR comfort noise frames
2008-12-18  8.0.0    Version for Release 8 (Upgraded unchanged from Rel-7)
2009-12-18  9.0.0    Version for Release 9 (Automatic upgrade from previous Release without technical change)
However, in order to verify the results, we have repeated the experiments on the latest referent AMR codec revision (v9.0.0). As expected, the new results appear to be identical to the previous results.

I find that some parts are not sufficiently described. For example, the ECSQ indicated in Fig. 1 and Fig. 2 and referenced as Gyorgy & Linder should be better described: not all readers may know what the constraints are, and they would find at least a brief description useful.

[Authors]: We have added a brief description of Entropy Constrained Scalar Quantization (ECSQ) along with a short discussion of its usage in typical speech codecs. We placed it in Chapter 3 ("GMM-based LSF Vector Quantizer Description"), in the subchapter named "Scalar Quantization of the Transformed Components".

Regarding the performance evaluation, I find it suspicious that PESQ did not give significant improvements (the reported differences are only noise).

[Authors]: It was shown that, in terms of the PESQ score, the modified codec versions perform almost equally to the referent codec when using an equal number of bits per LSF vector, and similarly to it at the reduced bitrates at which they achieve the same average SD as the referent AMR codec. However, further reduction of the spectral envelope coding rate shows a clear PESQ score degradation. This is visible in Figure 4, which has been added to the manuscript along with this explanation. The relatively flat slope of the PESQ score curves shows that the CELP codec is able to produce satisfactory overall results even when spectral envelope quantization causes a relatively large difference between the ideal envelope (determined by the LPC analysis) and the envelope described by the quantized LSF vector. This is a consequence of the closed-loop structure of the CELP codec. Such a codec is able to find an excitation that synthesizes a speech segment close enough to the input sequence, even with a large mismatch in the corresponding spectral envelopes. This effect is more visible for higher-rate codecs, as they use a high percentage of their total rate for excitation coding (87% for the 10.2 Kb/s mode), which gives them the ability to find an appropriate excitation to compensate for the imperfection of the spectral envelope model. That is why the curve slope becomes steeper for lower-rate codecs, as well as for lower spectral envelope coding rates.
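The 87% figure quoted above can be checked with a short bit-budget calculation. The sketch below is only an illustration: it assumes a 20 ms frame and 26 bits per LSF vector in the 10.2 Kb/s mode, neither of which is stated in this response, and it treats every bit not spent on the LSF vector as excitation-related. The function name is hypothetical.

```python
# Bit-budget sketch for one AMR mode.
# Assumptions (not taken from this response): 20 ms frames and
# 26 bits per LSF vector in the 10.2 Kb/s mode.
FRAME_MS = 20


def non_lsf_share(mode_kbps: float, lsf_bits: int) -> float:
    """Fraction of the frame bits that is not spent on the LSF vector."""
    frame_bits = mode_kbps * FRAME_MS  # Kb/s times ms gives bits per frame
    return (frame_bits - lsf_bits) / frame_bits


share = non_lsf_share(10.2, 26)
print(f"bits per frame: {10.2 * FRAME_MS:.0f}, non-LSF share: {share:.1%}")
```

Under these assumptions the frame carries 204 bits, of which 178 are left after the LSF vector, which gives roughly 87%.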
[Figure 4 plot: PESQ score vs. rate (1 LSF vector/frame). Vertical axis: PESQ score, 3.1 to 4.0. Horizontal axis: average bitrate (bits/frame), 16 to 28. Legend: ref. AMR, 10.2 Kb/s, 7.95 Kb/s, 7.4 Kb/s, 6.7 Kb/s, 5.9 Kb/s, 5.15 Kb/s, 4.75 Kb/s.]
Figure 4. PESQ score vs. spectral envelope coding rate for the mq version of the GMM-based LSF VQ. The curves represent the different AMR codec modes which calculate 1 LSF vector/frame.

Why do the authors base all the bit rate reduction considerations on SD? It is known that SD has no subjective implications.

[Authors]: Despite the fact that SD is a poor match to perceptually based error models, LSF quantizers are most often evaluated using the SD measure. Experiments have shown that, even with open-loop coders, if the average SD is below 1 dB and the number of outliers (with SD higher than 2 dB) is less than 2%, the difference between the ideal and the quantized spectral envelope is not perceivable. Because of the closed-loop coder structure described earlier, such requirements are considered a conservative measure, i.e. the CELP coder is able to cover even bigger differences, as shown by our experiments, which evaluate the system using both measures (SD and PESQ) simultaneously. However, SD can definitely be used to compare two different spectral envelope coders, for an isolated analysis of the distortion that they introduce, without considering the excitation coding or the closed loop. The coder that introduces less SD represents a more accurate LPC model and thereby improves the CELP coder performance. Such an SD measure is also in accordance with the spectral envelope coder (the method of selecting a vector from the codebook), which uses the weighted Euclidean distance between the original and quantized LSF vectors, with the weighting factors according to (3GPP TS 26.090). Those weighting factors actually approximate an ordinary SD difference over the linear frequency scale.

In conclusion, it appears that the only clearly interesting feature of the approach is the computational complexity. However, the computational and memory complexities are not adequately reported or emphasized.
It seems that the best-performing coder is 1.5 times less complex (than the referent coder? please specify). What about the decoder? It is not clear.

[Authors]: We have kindly accepted your comment and expanded the analysis of computational complexity and memory requirements in the manuscript. Due to the manuscript space limitation, we have decided to present the actual computational and memory requirements for both the referent and the modified AMR codec versions, but only for the mode using the 10,2 Kb/s bitrate. The comparison is made for both modified codec versions, comprising the mq and 2best GMM-based spectral envelope quantizers. The results are summarized in Table 2, which has also been added to the manuscript for a better overview and easier comparison.
Table 2. Computational complexity and memory requirements comparison between the LSF VQs used in the referent and modified AMR codec versions

LSF VQ                                        Side     Computational complexity (flops)  Memory requirements (floats)
SVQ (referent AMR codec)                      encoder  17405                             4352
SVQ (referent AMR codec)                      decoder  0                                 4352
GMM-based, mq ver. (modified AMR codec)       encoder  10527                             2660
GMM-based, 2best ver. (modified AMR codec)    encoder  3990                              2660
GMM-based, mq ver. (modified AMR codec)       decoder  270                               4780
GMM-based, 2best ver. (modified AMR codec)    decoder  270                               4780
The mq quantizer version appears to be about 1.65 times less complex than the SVQ which is used for the same task in the referent AMR encoder. The 2best version shows even lower computational requirements: in comparison with the SVQ, it needs 4.63 times fewer flops to quantize an LSF vector. On the decoder side, the computational complexity of the LSF VQ is the same for both modified codec versions. To decode the quantized vector they need 270 flops, which is a significant increase in decoding complexity compared to the simple reading of three indexed subvectors in the SVQ case. However, this increased LSF decoding complexity is completely negligible in comparison to the complexity of the whole AMR decoder.

The memory requirements are also the same for both modified codec versions. On the encoder side, the GMM-based LSF quantizers need about 1.6 times less storage than the SVQ of the referent AMR codec. On the decoder side, the GMM-based LSF quantizers have memory requirements similar to those of the SVQ of the referent AMR codec. We would like to emphasize that about 60% of the encoder memory requirements and about 80% of the decoder memory requirements in the GMM-based LSF quantizers are dedicated to the Huffman tables. As these mainly contain integers representing symbols and row indices, they can be stored efficiently in narrower memory locations. For example, if we assume that the SVQ codebook entries are stored in 32-bit locations (float), then storing the Huffman table elements in 8-bit locations reduces the above-mentioned GMM-based LSF quantizer memory requirements further by a factor of approximately 2.
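The encoder-side ratios quoted in this response can be reproduced with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: the input values are those read from Table 2 and the text for the 10.2 Kb/s mode, the 60% Huffman share is the approximate figure given above, and the helper name `mixed_storage_bytes` is hypothetical.

```python
# Encoder-side figures as read from Table 2 and the text (10.2 Kb/s mode).
SVQ_ENC_FLOPS = 17405
MQ_ENC_FLOPS = 10527
SVQ_MEM_FLOATS = 4352
GMM_ENC_MEM_FLOATS = 2660
HUFFMAN_FRACTION_ENC = 0.60  # "about 60%" of encoder memory, per the text


def mixed_storage_bytes(entries, huffman_fraction, float_bytes=4, huffman_bytes=1):
    """Bytes needed when Huffman entries use narrow cells and the rest use floats."""
    huffman_entries = entries * huffman_fraction
    other_entries = entries - huffman_entries
    return other_entries * float_bytes + huffman_entries * huffman_bytes


# How many times fewer flops the mq quantizer needs than the SVQ.
flops_ratio = SVQ_ENC_FLOPS / MQ_ENC_FLOPS
# How many times less storage the GMM-based encoder needs, all entries as floats.
memory_ratio = SVQ_MEM_FLOATS / GMM_ENC_MEM_FLOATS

# Extra saving from storing Huffman entries in 8-bit instead of 32-bit cells.
all_float_bytes = GMM_ENC_MEM_FLOATS * 4
mixed_bytes = mixed_storage_bytes(GMM_ENC_MEM_FLOATS, HUFFMAN_FRACTION_ENC)
narrowing_gain = all_float_bytes / mixed_bytes

print(round(flops_ratio, 2), round(memory_ratio, 2), round(narrowing_gain, 2))
```

With these inputs the complexity ratio comes out near 1.65 and the storage ratio near 1.64, in line with the text, while the 8-bit Huffman storage gives a further encoder-side reduction of roughly 1.8, consistent with the stated approximate factor of 2.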
Specific comments

Please indicate 10,2 Kb/s instead of 10k2, or 12,2 Kb/s instead of 12k2, and so on.

[Authors]: Corrected in the manuscript as suggested.

For the described reasons, I recommend a major revision of the manuscript.
