The Speex Codec Manual
Jean-Marc Valin
December 8, 2007
Copyright © 2002-2007 Jean-Marc Valin/Xiph.org Foundation
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
Contents

1 Introduction to Speex
   1.1 Getting help
   1.2 About this document

2 Codec description
   2.1 Concepts
   2.2 Codec
   2.3 Preprocessor
   2.4 Adaptive Jitter Buffer
   2.5 Acoustic Echo Canceller
   2.6 Resampler

4 Command-line encoder/decoder
   4.1 speexenc
   4.2 speexdec
A Sample code
   A.1 sampleenc.c
   A.2 sampledec.c

D Speex License
1 Introduction to Speex
The Speex codec (http://www.speex.org/) exists because there is a need for a speech codec that is open-source and
free from software patent royalties. These are essential conditions for being usable in any open-source software. In essence,
Speex is to speech what Vorbis is to audio/music. Unlike many other speech codecs, Speex is not designed for mobile phones
but rather for packet networks and voice over IP (VoIP) applications. File-based compression is of course also supported.
The Speex codec is designed to be very flexible and support a wide range of speech quality and bit-rate. Support for very
good quality speech also means that Speex can encode wideband speech (16 kHz sampling rate) in addition to narrowband
speech (telephone quality, 8 kHz sampling rate).
Designing for VoIP instead of mobile phones means that Speex is robust to lost packets, but not to corrupted ones. This is
based on the assumption that in VoIP, packets either arrive unaltered or don’t arrive at all. Because Speex is targeted at a wide
range of devices, it has modest (adjustable) complexity and a small memory footprint.
All the design goals led to the choice of CELP as the encoding technique. One of the main reasons is that CELP has long
proved that it could work reliably and scale well to both low bit-rates (e.g. DoD CELP @ 4.8 kbps) and high bit-rates (e.g.
G.728 @ 16 kbps).
1.1 Getting help

There are several ways to get help or information about Speex:

• This manual
• Other documentation on the Speex website (http://www.speex.org/)
• Mailing list: Discuss any Speex-related topic on speex-dev@xiph.org (not just for developers)
• IRC: The main channel is #speex on irc.freenode.net. Note that due to time differences, it may take a while to get
someone, so please be patient.
• Email the author privately at jean-marc.valin@usherbrooke.ca only for private/delicate topics you do not wish to discuss publicly.
Before asking for help (mailing list or IRC), it is important to first read this manual (OK, so if you made it here it’s already
a good sign). It is generally considered rude to ask on a mailing list about topics that are clearly detailed in the documentation.
On the other hand, it’s perfectly OK (and encouraged) to ask for clarifications about something covered in the manual. This
manual does not (yet) cover everything about Speex, so everyone is encouraged to ask questions, send comments, feature
requests, or just let us know how Speex is being used.
Here are some additional guidelines related to the mailing list. Before reporting bugs in Speex to the list, it is strongly
recommended (if possible) to first test whether these bugs can be reproduced using the speexenc and speexdec (see Section 4)
command-line utilities. Bugs reported based on 3rd party code are both harder to find and far too often caused by errors that
have nothing to do with Speex.
2 Codec description
This section describes Speex and its features in more detail.
2.1 Concepts
Before introducing the Speex features, here are some speech coding concepts that will help in understanding the rest of the manual. Although some are general speech/audio processing concepts, others are specific to Speex.
Sampling rate
The sampling rate, expressed in Hertz (Hz), is the number of samples taken from a signal per second. For a sampling rate of Fs kHz, the highest frequency that can be represented is Fs/2 kHz (Fs/2 is known as the Nyquist frequency). This is a fundamental property in signal processing, described by the sampling theorem. Speex is mainly designed for three different sampling rates: 8 kHz, 16 kHz, and 32 kHz, referred to respectively as narrowband, wideband, and ultra-wideband.
Bit-rate
When encoding a speech signal, the bit-rate is the number of bits per unit of time required to encode the speech. It is measured in bits per second (bps), or more commonly in kilobits per second. It is important not to confuse kilobits per second (kbps) with kilobytes per second (kBps).
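To make the distinction concrete: a narrowband frame covers 20 ms (50 frames per second), so a codec producing, say, 38-byte frames runs at 38 × 8 × 50 = 15200 bps, i.e. 15.2 kbps but only 1.9 kBps. A trivial helper (not part of the Speex API) for this arithmetic:

```c
/* Bit-rate in bits per second for a codec that produces
 * frames_per_second frames of bytes_per_frame bytes each. */
int bitrate_bps(int bytes_per_frame, int frames_per_second)
{
    return bytes_per_frame * 8 * frames_per_second;
}
```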
Quality (variable)
Speex is a lossy codec, which means that it achieves compression at the expense of fidelity of the input speech signal. Unlike
some other speech codecs, it is possible to control the tradeoff made between quality and bit-rate. The Speex encoding process
is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality
parameter is an integer, while for variable bit-rate (VBR), the parameter is a float.
Complexity (variable)
With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is
performed with an integer ranging from 1 to 10 in a way that’s similar to the -1 to -9 options to gzip and bzip2 compression
utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about five times higher than for complexity 1. In practice, the best trade-off is between
complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones.
Perceptual enhancement
Perceptual enhancement is a part of the decoder which, when turned on, attempts to reduce the perception of the noise/distortion produced by the encoding/decoding process. In most cases, perceptual enhancement makes the sound further from the original objectively (e.g. considering only SNR), yet it still sounds better (a subjective improvement).
2.2 Codec
The main characteristics of Speex can be summarized as follows:
• Fixed-point implementation
2.3 Preprocessor
This part refers to the preprocessor module introduced in the 1.1.x branch. The preprocessor is designed to be used on the
audio before running the encoder. The preprocessor provides three main functionalities:
• noise suppression
• automatic gain control (AGC)
• voice activity detection (VAD)
The denoiser can be used to reduce the amount of background noise present in the input signal, providing higher-quality speech whether or not the denoised signal is encoded with Speex (or at all). However, using the denoised signal with the codec has an additional benefit: speech codecs in general (Speex included) tend to perform poorly on noisy input, and the coding process itself tends to amplify the noise. The denoiser greatly reduces this effect.
Automatic gain control (AGC) is a feature that deals with the fact that the recording volume may vary by a large amount
between different setups. The AGC provides a way to adjust a signal to a reference volume. This is useful for voice over
IP because it removes the need for manual adjustment of the microphone gain. A secondary advantage is that by setting the
microphone gain to a conservative (low) level, it is easier to avoid clipping.
The voice activity detector (VAD) provided by the preprocessor is more advanced than the one directly provided in the
codec.
2.6 Resampler
In some cases, it may be useful to convert audio from one sampling rate to another: for mixing streams that have different sampling rates, for supporting sampling rates that the soundcard doesn't support, for transcoding, etc. That is why the Speex project now includes a resampler. The resampler can convert between any two arbitrary rates (as long as their ratio is a rational number), with control over the quality/complexity tradeoff.
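For instance, converting 44100 Hz to 48000 Hz amounts to working with the reduced fraction 147/160 of the two rates. A small sketch of that reduction (an illustration only; this is not the actual Speex resampler code):

```c
/* Greatest common divisor (Euclid's algorithm). */
unsigned int gcd(unsigned int a, unsigned int b)
{
    while (b != 0) {
        unsigned int t = b;
        b = a % b;
        a = t;
    }
    return a;
}

/* Reduce an input/output rate pair to a ratio num/den in lowest terms. */
void reduce_ratio(unsigned int in_rate, unsigned int out_rate,
                  unsigned int *num, unsigned int *den)
{
    unsigned int g = gcd(in_rate, out_rate);
    *num = in_rate / g;
    *den = out_rate / g;
}
```

For 44100 and 48000, the GCD is 300, giving the ratio 147/160.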
3 Compiling and Porting
Compiling Speex under UNIX/Linux or any other platform supported by autoconf (e.g. Win32/cygwin) is as easy as typing:
% ./configure [options]
% make
% make install
The options supported by the configure script include:

--prefix=<path> Specifies the base path for installing Speex (e.g. /usr)

--enable-shared/--disable-shared Whether to compile shared libraries

--enable-static/--disable-static Whether to compile static libraries

--disable-wideband Disable the wideband part of Speex (typically to save space)

--enable-valgrind Enable extra hints for valgrind for debugging purposes (do not use by default)

--enable-sse Enable use of SSE instructions (x86/float only)

--enable-fixed-point Compile Speex for a processor that does not have a floating-point unit (FPU)

--enable-arm4-asm Enable assembly specific to the ARMv4 architecture (gcc only)

--enable-arm5e-asm Enable assembly specific to the ARMv5E architecture (gcc only)

--enable-fixed-point-debug Use only for debugging the fixed-point code (very slow)

--enable-epic-48k Enable a special (and non-compatible) 4.8 kbps narrowband mode (broken in 1.1.x and 1.2beta)

--enable-ti-c55x Enable support for the TI C55x family

--enable-blackfin-asm Enable assembly specific to the Blackfin DSP architecture (gcc only)

--enable-vorbis-psycho Make the encoder use the Vorbis psycho-acoustic model. This is very experimental and may be removed in the future.
3.1 Platforms
Speex is known to compile and work on a large number of architectures, both floating-point and fixed-point. In general, any
architecture that can natively compute the multiplication of two signed 16-bit numbers (32-bit result) and runs at a sufficient
clock rate (architecture-dependent) is capable of running Speex. Architectures on which Speex is known to work (it probably
works on many others) are:
• TI C6xxx
• TriMedia (experimental)
Operating systems on top of which Speex is known to work include (it probably works on many others):
• Linux
• µClinux
• MacOS X
• BSD
• Other UNIX/POSIX variants
• Symbian
The source code directory includes additional information for compiling on certain architectures or operating systems in README.xxx files.
If you are going to be writing assembly, then the following functions are usually the first ones you should consider optimising:
• filter_mem16()
• iir_mem16()
• vq_nbest()
• pitch_xcorr()
• interp_pitch()
The filtering functions filter_mem16() and iir_mem16() are implemented in direct form II transposed (DF2T). However, for architectures based on multiply-accumulate (MAC) units, DF2T requires frequent reloads of the accumulator, which can make the code very slow. For these architectures (e.g. Blackfin and Coldfire), a better approach is to implement those functions in direct form I (DF1), which is easier to express in terms of MAC operations. When doing that, however, it is important to make sure that the DF1 implementation still produces the same filter values as the original DF2T implementation. This is necessary because the filter is time-varying and must compute exactly the same values (not counting machine rounding) on any encoder or decoder.
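As an illustration of the DF2T structure being discussed, here is a float sketch of an all-pole filter in that form (the real iir_mem16() works in 16/32-bit fixed point; this is a simplified stand-in, not the Speex source). A DF1 replacement must produce the same outputs and equivalent state:

```c
/* All-pole filter 1/A(z) in direct form II transposed (DF2T).
 * a[] holds the ord denominator coefficients a1..a_ord; mem[] holds
 * ord state values that persist across calls. */
void iir_df2t(const float *a, int ord, const float *x, float *y,
              int len, float *mem)
{
    int i, j;
    for (i = 0; i < len; i++) {
        /* Output = input plus the first state value. */
        float yi = x[i] + mem[0];
        /* Shift the state, folding in the feedback terms -a[j]*y. */
        for (j = 0; j < ord - 1; j++)
            mem[j] = mem[j + 1] - a[j] * yi;
        mem[ord - 1] = -a[ord - 1] * yi;
        y[i] = yi;
    }
}
```

Note how each output needs only one state load per coefficient but writes the state back every sample; on a MAC architecture, the DF1 form y[n] = x[n] - a1*y[n-1] - ... - a_ord*y[n-ord] instead accumulates products without touching intermediate state.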
Speex also has several methods for allocating temporary arrays. When using a compiler that supports C99 properly (as of 2007,
Microsoft compilers don’t, but gcc does), it is best to define VAR_ARRAYS. That makes use of the variable-size array feature
of C99. The next best is to define USE_ALLOCA so that Speex can use alloca() to allocate the temporary arrays. Note that on
many systems, alloca() is buggy so it may not work. If none of VAR_ARRAYS and USE_ALLOCA are defined, then Speex
falls back to allocating a large “scratch space” and doing its own internal allocation. The main disadvantage of this solution is that it is wasteful: it needs to allocate enough memory for the worst-case scenario (worst bit-rate, highest complexity setting, ...), and by default the memory isn't shared between multiple encoder/decoder states. Still, if the “manual” allocation is the only option left, there are a few things that can be improved. By overriding the speex_alloc_scratch() call in os_support.h, it is possible to always return the same memory area for all states. In addition, by redefining NB_ENC_STACK and NB_DEC_STACK (or the wideband equivalents), it is possible to allocate memory only for a scenario that is known in advance. In this case, it is important to measure the amount of memory required for the specific sampling rate, bit-rate, and complexity level being used.
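A minimal sketch of the shared-scratch override (assumptions: the exact signature of speex_alloc_scratch() should be checked against your version of os_support.h, and SCRATCH_SIZE must be sized from the measured worst case, not the placeholder used here):

```c
#define SCRATCH_SIZE 24000  /* placeholder; must cover the measured worst case */

static char scratch_area[SCRATCH_SIZE];

/* Hand every encoder/decoder state the same scratch area. This is
 * safe only if the states are never used concurrently (e.g. all
 * encoding/decoding happens in a single thread). */
void *speex_alloc_scratch(int size)
{
    (void)size;  /* assumed (unchecked here) to fit in SCRATCH_SIZE */
    return scratch_area;
}
```

A production version should at least assert that size <= SCRATCH_SIZE.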
4 Command-line encoder/decoder
The base Speex distribution includes a command-line encoder (speexenc) and decoder (speexdec). Those tools produce and
read Speex files encapsulated in the Ogg container. Although it is possible to encapsulate Speex in any container, Ogg is the
recommended container for files. This section describes how to use the command line tools for Speex files in Ogg.
4.1 speexenc
The speexenc utility is used to create Speex files from raw PCM or wave files. It can be used by calling:
speexenc [options] input_file output_file
The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout. The valid options are:
--narrowband (-n) Tell Speex to treat the input as narrowband (8 kHz). This is the default

--wideband (-w) Tell Speex to treat the input as wideband (16 kHz)

--ultra-wideband (-u) Tell Speex to treat the input as “ultra-wideband” (32 kHz)

--quality n Set the encoding quality (0-10), default is 8

--bitrate n Encoding bit-rate (use bit-rate n or lower)

--vbr Enable VBR (Variable Bit-Rate), disabled by default

--abr n Enable ABR (Average Bit-Rate) at n kbps, disabled by default

--vad Enable VAD (Voice Activity Detection), disabled by default

--dtx Enable DTX (Discontinuous Transmission), disabled by default

--nframes n Pack n frames in each Ogg packet (this saves space at low bit-rates)

--comp n Set the encoding speed/quality tradeoff. The higher the value of n, the slower the encoding (default is 3)

-V Verbose operation, print the bit-rate currently in use

--help (-h) Print the help

--version (-v) Print version information
Speex comments
--comment Add the given string as an extra comment. This may be used multiple times.

--author Author of this track.

--title Title for this track.
4.2 speexdec
The speexdec utility is used to decode Speex files and can be used by calling:

speexdec [options] input_file [output_file]

The value ’-’ for input_file or output_file corresponds respectively to stdin and stdout. Also, when no output_file is specified, the decoded audio is played to the soundcard. The valid options are:
5 Using the Speex Codec API (libspeex)
The libspeex library contains all the functions for encoding and decoding speech with the Speex codec. When linking on a
UNIX system, one must add -lspeex -lm to the compiler command line. One important thing to know is that libspeex calls are
reentrant, but not thread-safe. That means that it is fine to use calls from many threads, but calls using the same state from
multiple threads must be protected by mutexes. Examples of code can also be found in Appendix A and the complete API
documentation is included in the Documentation section of the Speex website (http://www.speex.org/).
5.1 Encoding
In order to encode speech using Speex, one first needs to:
#include <speex/speex.h>
Then in the code, a Speex bit-packing struct must be declared, along with a Speex encoder state:
SpeexBits bits;
void *enc_state;
The two are initialized by:
speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);
For wideband coding, speex_nb_mode will be replaced by speex_wb_mode. In most cases, you will need to know the frame
size used at the sampling rate you are using. You can get that value in the frame_size variable (expressed in samples, not
bytes) with:
speex_encoder_ctl(enc_state,SPEEX_GET_FRAME_SIZE,&frame_size);
In practice, frame_size will correspond to 20 ms when using 8, 16, or 32 kHz sampling rate. There are many parameters that
can be set for the Speex encoder, but the most useful one is the quality parameter that controls the quality vs bit-rate tradeoff.
This is set by:
speex_encoder_ctl(enc_state,SPEEX_SET_QUALITY,&quality);
where quality is an integer value ranging from 0 to 10 (inclusively). The mapping between quality and bit-rate is described
in Fig. 9.2 for narrowband.
Once the initialization is done, for every input frame:
speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);
where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *) where the encoded frame
will be written, MAX_NB_BYTES is the maximum number of bytes that can be written to byte_ptr without causing an overflow
and nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes). Before calling speex_bits_write,
it is possible to find the number of bytes that need to be written by calling speex_bits_nbytes(&bits), which returns
a number of bytes.
It is still possible to use the speex_encode() function, which takes a (float *) for the audio. However, this would make a future port to an FPU-less platform (like ARM) more complicated. Internally, speex_encode() and speex_encode_int() are
processed in the same way. Whether the encoder uses the fixed-point version is only decided by the compile-time flags, not at
the API level.
After you’re done with the encoding, free all resources with:
speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);
That’s about it for the encoder.
5.2 Decoding
In order to decode speech using Speex, you first need to:
#include <speex/speex.h>
You also need to declare a Speex bit-packing struct
SpeexBits bits;
and a Speex decoder state
void *dec_state;
The two are initialized by:
speex_bits_init(&bits);
dec_state = speex_decoder_init(&speex_nb_mode);
For wideband decoding, speex_nb_mode will be replaced by speex_wb_mode. If you need to obtain the size of the frames
that will be used by the decoder, you can get that value in the frame_size variable (expressed in samples, not bytes) with:
speex_decoder_ctl(dec_state, SPEEX_GET_FRAME_SIZE, &frame_size);
There is also a parameter that can be set for the decoder: whether or not to use a perceptual enhancer. This can be set by:
speex_decoder_ctl(dec_state, SPEEX_SET_ENH, &enh);
where enh is an int with value 0 to have the enhancer disabled and 1 to have it enabled. As of 1.2-beta1, the default is now
to enable the enhancer.
Again, once the decoder initialization is done, for every input frame:
speex_bits_read_from(&bits, input_bytes, nbBytes);
speex_decode_int(dec_state, &bits, output_frame);
where input_bytes is a (char *) containing the bit-stream data received for a frame, nbBytes is the size (in bytes) of that
bit-stream, and output_frame is a (short *) and points to the area where the decoded speech frame will be written. A NULL
value as the second argument indicates that we don’t have the bits for the current frame. When a frame is lost, the Speex
decoder will do its best to "guess" the correct signal.
As for the encoder, the speex_decode() function can still be used, with a (float *) as the output for the audio. After you’re
done with the decoding, free all resources with:
speex_bits_destroy(&bits);
speex_decoder_destroy(dec_state);
5.3 Codec Options (speex_*_ctl)

“Just because there’s an option for it doesn’t mean you have to turn it on” – me
The Speex encoder and decoder support many options and requests that can be accessed through the speex_encoder_ctl and
speex_decoder_ctl functions. These functions are similar to the ioctl system call and their prototypes are:
void speex_encoder_ctl(void *encoder, int request, void *ptr);
void speex_decoder_ctl(void *decoder, int request, void *ptr);
Despite the availability of these functions, the defaults are good for many applications, and optional settings should only be used when one understands them and knows that they are needed. A common error is to attempt to set many unnecessary settings.
Here is a list of the values allowed for the requests. Some apply only to the encoder or only to the decoder. Because the last argument is of type void *, the _ctl() functions are not type-safe and should thus be used with care. The type spx_int32_t is the same as the C99 int32_t type.
SPEEX_SET_ENH‡ Set perceptual enhancer to on (1) or off (0) (spx_int32_t, default is on)
SPEEX_GET_FRAME_SIZE Get the number of samples per frame for the current mode (spx_int32_t)
SPEEX_SET_QUALITY† Set the encoder speech quality (spx_int32_t from 0 to 10, default is 8)
SPEEX_GET_QUALITY† Get the current encoder speech quality (spx_int32_t from 0 to 10)
SPEEX_SET_MODE† Set the mode number, as specified in the RTP spec (spx_int32_t)
SPEEX_GET_MODE† Get the current mode number, as specified in the RTP spec (spx_int32_t)
SPEEX_SET_VBR† Set variable bit-rate (VBR) to on (1) or off (0) (spx_int32_t, default is off)
SPEEX_GET_VBR† Get variable bit-rate (VBR) status (spx_int32_t)
SPEEX_SET_VBR_QUALITY† Set the encoder VBR speech quality (float 0.0 to 10.0, default is 8.0)
SPEEX_GET_VBR_QUALITY† Get the current encoder VBR speech quality (float 0 to 10)
SPEEX_SET_COMPLEXITY† Set the CPU resources allowed for the encoder (spx_int32_t from 1 to 10, default is 2)
SPEEX_GET_COMPLEXITY† Get the CPU resources allowed for the encoder (spx_int32_t from 1 to 10, default is
2)
SPEEX_SET_BITRATE† Set the bit-rate to use the closest value not exceeding the parameter (spx_int32_t in bits per
second)
SPEEX_GET_BITRATE Get the current bit-rate in use (spx_int32_t in bits per second)
SPEEX_SET_SAMPLING_RATE Set real sampling rate (spx_int32_t in Hz)
SPEEX_GET_SAMPLING_RATE Get real sampling rate (spx_int32_t in Hz)
SPEEX_RESET_STATE Reset the encoder/decoder state to its original state, clearing all memories (no argument)
SPEEX_SET_VAD† Set voice activity detection (VAD) to on (1) or off (0) (spx_int32_t, default is off)
SPEEX_MODE_FRAME_SIZE Get the frame size (in samples) for the mode
SPEEX_SUBMODE_BITRATE Get the bit-rate for a submode number specified through ptr (integer in bps).
Finally, applications may define custom in-band messages using mode 13. The size of the message in bytes is encoded with
5 bits, so that the decoder can skip it if it doesn’t know how to interpret it.
6 Speech Processing API (libspeexdsp)
As of version 1.2beta3, the non-codec parts of the Speex package are now in a separate library called libspeexdsp. This library
includes the preprocessor, the acoustic echo canceller, the jitter buffer, and the resampler. In a UNIX environment, it can be
linked into a program by adding -lspeexdsp -lm to the compiler command line. Just like for libspeex, libspeexdsp calls are
reentrant, but not thread-safe. That means that it is fine to use calls from many threads, but calls using the same state from
multiple threads must be protected by mutexes.
6.1 Preprocessor
In order to use the Speex preprocessor, you first need to:
#include <speex/speex_preprocess.h>
Then, a preprocessor state can be created as:
SpeexPreprocessState *preprocess_state = speex_preprocess_state_init(frame_size, sampling_rate);
and it is recommended to use the same value for frame_size as is used by the encoder (20 ms).
For each input frame, you need to call:
speex_preprocess_run(preprocess_state, audio_frame);
where audio_frame is used both as input and output. In cases where the output audio is not useful for a certain frame, it is
possible to use instead:
speex_preprocess_estimate_update(preprocess_state, audio_frame);
This call will update all the preprocessor internal state variables without computing the output audio, thus saving some CPU
cycles.
The behaviour of the preprocessor can be changed using:
speex_preprocess_ctl(preprocess_state, request, ptr);
which is used in the same way as the encoder and decoder equivalent. Options are listed in Section 6.1.1.
The preprocessor state can be destroyed using:
speex_preprocess_state_destroy(preprocess_state);
6.2 Acoustic Echo Canceller

In order to use the echo canceller, you first need to:

#include <speex/speex_echo.h>

Then, an echo canceller state can be created with:

SpeexEchoState *echo_state = speex_echo_state_init(frame_size, filter_length);

where frame_size is the amount of data (in samples) processed at once and filter_length is the length (in samples) of the echo cancelling filter (also known as the tail length). Echo cancellation is then performed for each frame by calling:

speex_echo_cancellation(echo_state, input_frame, echo_frame, output_frame);
where input_frame is the audio as captured by the microphone, echo_frame is the signal that was played in the
speaker (and needs to be removed) and output_frame is the signal with echo removed.
One important thing to keep in mind is the relationship between input_frame and echo_frame. It is important that,
at any time, any echo that is present in the input has already been sent to the echo canceller as echo_frame. In other words,
the echo canceller cannot remove a signal that it hasn’t yet received. On the other hand, the delay between the input signal
and the echo signal must be small enough, because otherwise part of the echo cancellation filter is wasted. In the ideal case, your code would look like:
write_to_soundcard(echo_frame, frame_size);
read_from_soundcard(input_frame, frame_size);
speex_echo_cancellation(echo_state, input_frame, echo_frame, output_frame);
If you wish to further reduce the echo present in the signal, you can do so by associating the echo canceller to the prepro-
cessor (see Section 6.1). This is done by calling:
speex_preprocess_ctl(preprocess_state, SPEEX_PREPROCESS_SET_ECHO_STATE, echo_state);
in the initialisation.
As of version 1.2-beta2, there is an alternative, simpler API that can be used instead of speex_echo_cancellation(). When audio capture and playback are handled asynchronously (e.g. in different threads or using the poll() or select() system call), it can be difficult to keep track of which input_frame comes with which echo_frame. Instead, the playback context/thread can simply call:
speex_echo_playback(echo_state, echo_frame);
every time an audio frame is played. Then, the capture context/thread calls:
speex_echo_capture(echo_state, input_frame, output_frame);
for every frame captured. Internally, speex_echo_playback() simply buffers the playback frame so it can be used by
speex_echo_capture() to call speex_echo_cancel(). A side effect of using this alternate API is that the playback audio is
delayed by two frames, which is the normal delay caused by the soundcard. When capture and playback are already synchronised, speex_echo_cancellation() is preferable since it gives better control over the exact input/echo timing.
The echo cancellation state can be destroyed with:
speex_echo_state_destroy(echo_state);
It is also possible to reset the state of the echo canceller so it can be reused without the need to create another state with:
speex_echo_state_reset(echo_state);
6.2.1 Troubleshooting
There are several things that may prevent the echo canceller from working properly. One of them is a bug (or something suboptimal) in the code, but there are many others you should consider first:
• Using a different soundcard for capture and playback will not work, regardless of what you may think. The only exception is if the two cards can be made to have their sampling clocks “locked” on the same clock source. Otherwise, the clocks will always have a small amount of drift, which will prevent the echo canceller from adapting.
• The delay between the record and playback signals must be minimal. Any signal played has to “appear” on the playback
(far end) signal slightly before the echo canceller “sees” it in the near end signal, but excessive delay means that part of
the filter length is wasted. In the worst situations, the delay is such that it is longer than the filter length, in which case,
no echo can be cancelled.
• When it comes to echo tail length (filter length), longer is *not* better. Actually, the longer the tail length, the longer it
takes for the filter to adapt. Of course, a tail length that is too short will not cancel enough echo, but the most common
problem seen is that people set a very long tail length and then wonder why no echo is being cancelled.
• Non-linear distortion cannot (by definition) be modeled by the linear adaptive filter used in the echo canceller and thus
cannot be cancelled. Use good audio gear and avoid saturation/clipping.
Also useful is reading Echo Cancellation Demystified by Alexey Frunze [1], which explains the fundamental principles of echo
cancellation. The details of the algorithm described in the article are different, but the general ideas of echo cancellation
through adaptive filters are the same.
As of version 1.2beta2, a new echo_diagnostic.m tool is included in the source distribution. The first step is to define
DUMP_ECHO_CANCEL_DATA during the build. This causes the echo canceller to automatically save the near-end, far-end, and output signals to files (aec_rec.sw, aec_play.sw, and aec_out.sw). These are exactly what the AEC receives and outputs.
From there, it is necessary to start Octave and type:
echo_diagnostic(’aec_rec.sw’, ’aec_play.sw’, ’aec_diagnostic.sw’, 1024);
The value of 1024 is the filter length and can be changed. Some (hopefully) useful messages will be printed, and the echo-cancelled audio will be saved to aec_diagnostic.sw. If even that output is bad (almost no cancellation), then there is probably a problem with the playback or recording process.
[1] http://www.embeddedstar.com/articles/2003/7/article20030720-1.html
6.3 Jitter Buffer

jitter_buffer_remaining_span(state, remaining);
The second argument is used to specify that we are still holding data that has not been written to the playback device.
For instance, if 256 samples were needed by the soundcard (specified by desired_span), but jitter_buffer_get()
returned 320 samples, we would have remaining=64.
6.4 Resampler
Speex includes a resampling module. To make use of the resampler, it is necessary to include its header file:
#include <speex/speex_resampler.h>
For each stream that is to be resampled, it is necessary to create a resampler state with:
SpeexResamplerState *resampler;
resampler = speex_resampler_init(nb_channels, input_rate, output_rate, quality, &err);
where nb_channels is the number of channels that will be used (either interleaved or non-interleaved), input_rate is the
sampling rate of the input stream, output_rate is the sampling rate of the output stream and quality is the requested quality
setting (0 to 10). The quality parameter controls the quality/complexity/latency tradeoff: a higher quality setting means less
noise/aliasing, but higher complexity and higher latency. Usually, a quality of 3 is acceptable for most desktop uses, while
quality 10 is recommended for professional audio work. Quality 0 usually has a decent sound (certainly better than resampling
by linear interpolation), but artifacts may be heard.
The actual resampling is performed using
err = speex_resampler_process_int(resampler, channelID, in, &in_length, out, &out_length);
where channelID is the ID of the channel to be processed. For a mono stream, use 0. The in pointer points to the first sample
of the input buffer for the selected channel and out points to the first sample of the output. The sizes of the input and output
buffers are specified by in_length and out_length respectively. Upon completion, these values are replaced by the number of
samples read and written by the resampler. Unless an error occurs, either all input samples will be read or all output samples
will be written to (or both). For floating-point samples, the function speex_resampler_process_float() behaves similarly.
It is also possible to process multiple channels at once.
To be continued...
7 Formats and standards
Speex can encode speech in both narrowband and wideband and provides different bit-rates. However, not all features need
to be supported by a certain implementation or device. In order to be called “Speex compatible” (whatever that means), an
implementation must implement at least a basic set of features.
At the minimum, all narrowband modes of operation MUST be supported at the decoder. This includes the decoding of
a wideband bit-stream by the narrowband decoder¹. If present, a wideband decoder MUST be able to decode a narrowband
stream, and MAY either be able to decode all wideband modes or be able to decode the embedded narrowband part of all
modes (which includes ignoring the high-band bits).
For encoders, at least one narrowband or wideband mode MUST be supported. The main reason why all encoding modes
do not have to be supported is that some platforms may not be able to handle the complexity of encoding in some modes.
¹ The wideband bit-stream contains an embedded narrowband bit-stream which can be decoded alone.
8 Introduction to CELP Coding
Do not meddle in the affairs of poles, for they are subtle and quick to leave the unit circle.
Speex is based on CELP, which stands for Code Excited Linear Prediction. This section attempts to introduce the principles
behind CELP, so if you are already familiar with CELP, you can safely skip to section 9. The CELP technique is based on
three ideas:
1. The use of a linear prediction (LP) model to model the vocal tract
2. The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model
3. The search performed in closed-loop in a “perceptually weighted domain”
This section describes the basic ideas behind CELP. This is still a work in progress.
The linear prediction model predicts the signal x[n] as a weighted sum of its N previous samples:

    y[n] = ∑_{i=1}^{N} a_i x[n−i]

where y[n] is the linear prediction of x[n]. The prediction error is thus given by:

    e[n] = x[n] − y[n] = x[n] − ∑_{i=1}^{N} a_i x[n−i]
The goal of the LPC analysis is to find the best prediction coefficients ai which minimize the quadratic error function:
    E = ∑_{n=0}^{L−1} e[n]² = ∑_{n=0}^{L−1} ( x[n] − ∑_{i=1}^{N} a_i x[n−i] )²
That can be done by making all derivatives ∂E/∂a_i equal to zero:

    ∂E/∂a_i = ∂/∂a_i ∑_{n=0}^{L−1} ( x[n] − ∑_{j=1}^{N} a_j x[n−j] )² = 0
For an order-N filter, the filter coefficients a_i are found by solving the N × N linear system Ra = r, where

        [ R(0)      R(1)      ···  R(N−1) ]
    R = [ R(1)      R(0)      ···  R(N−2) ]
        [  ⋮         ⋮         ⋱     ⋮    ]
        [ R(N−1)    R(N−2)    ···  R(0)   ]

    r = [ R(1)  R(2)  ···  R(N) ]ᵀ

with R(m), the auto-correlation of the signal x[n], computed as:

    R(m) = ∑_{i=0}^{N−1} x[i] x[i−m]
Because R is Hermitian Toeplitz, the Levinson-Durbin algorithm can be used, making the solution to the problem O(N²)
instead of O(N³). Also, it can be proven that all the roots of A(z) are within the unit circle, which means that 1/A(z) is always
stable. This is true in theory; in practice, because of finite precision, two techniques are commonly used to make sure the
filter is stable. First, we multiply R(0) by a number slightly above one (such as 1.0001), which is equivalent to adding noise
to the signal. Second, we can apply a window to the auto-correlation, which is equivalent to filtering in the frequency domain,
reducing sharp resonances.
[Figure: frequency response (dB) over 0-4000 Hz of a speech signal, the corresponding LPC synthesis filter, and the reference shaping filter; the vertical axis spans −40 to 30 dB.]
8.6 Analysis-by-Synthesis
One of the main principles behind CELP is called Analysis-by-Synthesis (AbS), meaning that the encoding (analysis) is
performed by perceptually optimising the decoded (synthesis) signal in a closed loop. In theory, the best CELP stream would
be produced by trying all possible bit combinations and selecting the one that produces the best-sounding decoded signal.
This is obviously not possible in practice for two reasons: the required complexity is beyond any currently available hardware
and the “best sounding” selection criterion implies a human listener.
In order to achieve real-time encoding using limited computing resources, the CELP optimisation is broken down into
smaller, more manageable, sequential searches using the perceptual weighting function described earlier.
9 Speex narrowband mode
This section looks at how Speex works for narrowband (8 kHz sampling rate) operation. The frame size for this mode is 20 ms,
corresponding to 160 samples. Each frame is also subdivided into 4 sub-frames of 40 samples each.
Also, many design decisions were based on the original goals and assumptions:
• Minimizing the amount of information extracted from past frames (for robustness to packet loss)
• Dynamically-selectable codebooks (LSP, pitch and innovation)
• Sub-vector fixed (innovation) codebooks
The adaptive codebook contribution is a three-tap prediction over the past excitation:

    e_p[n] = g_0 e[n−T−1] + g_1 e[n−T] + g_2 e[n−T+1]

where g_0, g_1 and g_2 are the jointly quantized pitch gains and e[n] is the codec excitation memory. It is worth noting that when
the pitch period T is smaller than the sub-frame size, we repeat the excitation at a period T. For example, when n − T + 1 ≥ 0, we
use n − 2T + 1 instead. In most modes, the pitch period is encoded with 7 bits in the [17, 144] range, and the pitch gains
are vector-quantized using 7 bits at higher bit-rates (15 kbps narrowband and above) and 5 bits at lower bit-rates (11 kbps
narrowband and below).
Many current CELP codecs use moving average (MA) prediction to encode the fixed codebook gain. This provides slightly
better coding at the expense of introducing a dependency on previously encoded frames. A second difference is that Speex
encodes the fixed codebook gain as the product of a global excitation gain g_frame and a sub-frame gain correction g_subf.
This increases robustness to packet loss by eliminating the inter-frame dependency. The sub-frame gain correction is encoded
before the fixed codebook is searched (it is not closed-loop optimized) and uses between 0 and 3 bits per sub-frame, depending
on the bit-rate.
The third difference is that Speex uses sub-vector quantization of the innovation (fixed codebook) signal instead of an
algebraic codebook. Each sub-frame is divided into sub-vectors of lengths ranging between 5 and 20 samples. Each sub-
vector is chosen from a bitrate-dependent codebook and all sub-vectors are concatenated to form a sub-frame. As an example,
the 3.95 kbps mode uses a sub-vector size of 20 samples with 32 entries in the codebook (5 bits). This means that the
innovation is encoded with 10 bits per sub-frame, or 2000 bps. On the other hand, the 18.2 kbps mode uses a sub-vector size
of 5 samples with 256 entries in the codebook (8 bits), so the innovation uses 64 bits per sub-frame, or 12800 bps.
So far, no MOS (Mean Opinion Score) subjective evaluation has been performed for Speex. In order to give an idea of
the quality achievable with it, table 9.2 presents my own subjective opinion of it. It should be noted that different people
will perceive quality differently and that the person who designed the codec often has a bias (one way or another) when
it comes to subjective evaluation. Finally, it should be noted that for most codecs (including Speex) the encoding quality
sometimes varies depending on the input. Note that the complexity is only approximate (within 0.5 mflops, and using the lowest
complexity setting). Decoding requires approximately 0.5 mflops in most modes (1 mflops with perceptual enhancement).
where a_1 and a_2 depend on the mode in use and a_3 = (1/r)(1 − (1 − r·a_1)/(1 − r·a_2)) with r = 0.9. The second part of the enhancement consists
10 Speex wideband mode (sub-band CELP)
For wideband, the Speex approach uses a quadrature mirror filter (QMF) to split the band in two. The 16 kHz signal is thus
divided into two 8 kHz signals, one representing the low band (0-4 kHz), the other the high band (4-8 kHz). The low band is
encoded with the narrowband mode described in section 9 in such a way that the resulting “embedded narrowband bit-stream”
can also be decoded with the narrowband decoder. Since the low band encoding has already been described, only the high
band encoding is described in this section.
A Sample code
This section shows sample code for encoding and decoding speech using the Speex API. The commands can be used to encode
and decode a file by calling:
% sampleenc in_file.sw | sampledec out_file.sw
where both files are raw (headerless) files with 16 bits per sample (in the machine's native endianness).
A.1 sampleenc.c
sampleenc takes a raw 16 bits/sample file, encodes it and outputs a Speex stream to stdout. Note that the packing used is not
compatible with that of speexenc/speexdec.
      for (i=0;i<FRAME_SIZE;i++)
         input[i]=in[i];

      /*Flush all the bits in the struct so we can encode a new frame*/
      speex_bits_reset(&bits);

      /*Encode the frame*/
      speex_encode(state, input, &bits);
      /*Copy the bits to an array of char that can be written*/
      nbBytes = speex_bits_write(&bits, cbits, 200);

      /*Write the size of the frame first. This is what sampledec expects but
        it's likely to be different in your own application*/
      fwrite(&nbBytes, sizeof(int), 1, stdout);
      /*Write the compressed data*/
      fwrite(cbits, 1, nbBytes, stdout);

   }

   /*Destroy the encoder state*/
   speex_encoder_destroy(state);
   /*Destroy the bit-packing struct*/
   speex_bits_destroy(&bits);
   fclose(fin);
   return 0;
}
A.2 sampledec.c
sampledec reads a Speex stream from stdin, decodes it and outputs it to a raw 16 bits/sample file. Note that the packing used
is not compatible with that of speexenc/speexdec.
Listing A.2: Source code for sampledec
#include <speex/speex.h>
#include <stdio.h>

/*The frame size is hardcoded for this sample code but it doesn't have to be*/
#define FRAME_SIZE 160
int main(int argc, char **argv)
{
   char *outFile;
   FILE *fout;
   /*Holds the audio that will be written to file (16 bits per sample)*/
   short out[FRAME_SIZE];
   /*Speex handles samples as float, so we need an array of floats*/
   float output[FRAME_SIZE];
   char cbits[200];
   int nbBytes;
   /*Holds the state of the decoder*/
   void *state;
   /*Holds bits so they can be read and written to by the Speex routines*/
   SpeexBits bits;
   int i, tmp;

   /*Create a new decoder state in narrowband mode*/
   state = speex_decoder_init(&speex_nb_mode);
   /*Set the perceptual enhancement on*/
   tmp=1;
   speex_decoder_ctl(state, SPEEX_SET_ENH, &tmp);

   outFile = argv[1];
   fout = fopen(outFile, "w");

   /*Initialization of the structure that holds the bits*/
   speex_bits_init(&bits);
   while (1)
   {
      /*Read the size encoded by sampleenc, this part will likely be
        different in your application*/
      fread(&nbBytes, sizeof(int), 1, stdin);
      fprintf (stderr, "nbBytes: %d\n", nbBytes);
      if (feof(stdin))
         break;

      /*Read the "packet" encoded by sampleenc*/
      fread(cbits, 1, nbBytes, stdin);
      /*Copy the data into the bit-stream struct*/
      speex_bits_read_from(&bits, cbits, nbBytes);

      /*Decode the data*/
      speex_decode(state, &bits, output);

      /*Copy from float to short (16 bits) for output*/
      for (i=0;i<FRAME_SIZE;i++)
         out[i]=output[i];

      /*Write the decoded audio to file*/
      fwrite(out, sizeof(short), FRAME_SIZE, fout);
   }

   /*Destroy the decoder state*/
   speex_decoder_destroy(state);
   /*Destroy the bit-stream struct*/
   speex_bits_destroy(&bits);
   fclose(fout);
   return 0;
}
B Jitter Buffer for Speex
Listing B.1: Example of using the jitter buffer for Speex packets
#include <speex/speex_jitter.h>
#include "speex_jitter_buffer.h"

#ifndef NULL
#define NULL 0
#endif


void speex_jitter_init(SpeexJitter *jitter, void *decoder, int sampling_rate)
{
   jitter->dec = decoder;
   speex_decoder_ctl(decoder, SPEEX_GET_FRAME_SIZE, &jitter->frame_size);

   jitter->packets = jitter_buffer_init(jitter->frame_size);

   speex_bits_init(&jitter->current_packet);
   jitter->valid_bits = 0;

}

void speex_jitter_destroy(SpeexJitter *jitter)
{
   jitter_buffer_destroy(jitter->packets);
   speex_bits_destroy(&jitter->current_packet);
}

void speex_jitter_put(SpeexJitter *jitter, char *packet, int len, int timestamp)
{
   JitterBufferPacket p;
   p.data = packet;
   p.len = len;
   p.timestamp = timestamp;
   p.span = jitter->frame_size;
   jitter_buffer_put(jitter->packets, &p);
}

void speex_jitter_get(SpeexJitter *jitter, spx_int16_t *out, int *current_timestamp)
{
   int i;
   int ret;
   spx_int32_t activity;
   char data[2048];
   JitterBufferPacket packet;
   packet.data = data;

   if (jitter->valid_bits)
   {
C IETF RTP Profile
AVT
Internet-Draft (Intended status: Standards Track)
G. Herlein; J. Valin, University of Sherbrooke; A. Heggestad
April 22, 2007 (Expires: October 24, 2007)
Copyright Notice
Abstract
Editors Note
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 5
3. RTP usage for Speex . . . . . . . . . . . . . . . . . . . . . 6
3.1. RTP Speex Header Fields . . . . . . . . . . . . . . . . . 6
3.2. RTP payload format for Speex . . . . . . . . . . . . . . . 6
3.3. Speex payload . . . . . . . . . . . . . . . . . . . . . . 6
3.4. Example Speex packet . . . . . . . . . . . . . . . . . . . 7
3.5. Multiple Speex frames in a RTP packet . . . . . . . . . . 7
4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 9
4.1. Media Type Registration . . . . . . . . . . . . . . . . . 9
4.1.1. Registration of media type audio/speex . . . . . . . . 9
5. SDP usage of Speex . . . . . . . . . . . . . . . . . . . . . . 11
6. Security Considerations . . . . . . . . . . . . . . . . . . . 14
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 15
8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
8.1. Normative References . . . . . . . . . . . . . . . . . . . 16
8.2. Informative References . . . . . . . . . . . . . . . . . . 16
Authors’ Addresses . . . . . . . . . . . . . . . . . . . . . . . . 17
Intellectual Property and Copyright Statements . . . . . . . . . . 18
1. Introduction
Speex is based on the CELP [CELP] encoding technique with support for
either narrowband (nominal 8kHz), wideband (nominal 16kHz) or ultra-
wideband (nominal 32kHz). The main characteristics can be summarized
as follows:
o Free software/open-source
o Variable complexity
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC2119 [RFC2119] and
indicate requirement levels for compliant RTP implementations.
Payload Type (PT): The assignment of an RTP payload type for this
packet format is outside the scope of this document; it is
specified by the RTP profile under which this payload format is
used, or signaled dynamically out-of-band (e.g., using SDP).
Marker (M) bit: The M bit is set to one to indicate that the RTP
packet payload contains at least one complete frame.
The RTP payload for Speex has the format shown in Figure 1. No
additional header fields specific to this payload format are
required. For RTP based transportation of Speex encoded audio the
standard RTP header [RFC3550] is followed by one or more payload data
blocks. An optional padding terminator may also be used.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| one or more frames of Speex .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| one or more frames of Speex .... | padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
octets and the total number of Speex frames SHOULD be kept less than
the path MTU to prevent fragmentation. Speex frames MUST NOT be
fragmented across multiple RTP packets.
An RTP packet MAY contain Speex frames of the same bit rate or of
varying bit rates, since the bit-rate for a frame is conveyed in band
with the signal.
The encoding and decoding algorithm can change the bit rate at any 20
msec frame boundary, with the bit rate change notification provided
in-band with the bit stream. Each frame contains both "mode"
(narrowband, wideband or ultra-wideband) and "sub-mode" (bit-rate)
information in the bit stream. No out-of-band notification is
required for the decoder to process changes in the bit rate sent by
the encoder.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| ..speex data.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex data.. |0 1 1 1 1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Speex codecs [speexenc] are able to detect the bitrate from the
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP Header |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| ..speex frame 1.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex frame 1.. | ..speex frame 2.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ..speex frame 2.. |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
4. IANA Considerations
This section describes the media types and names associated with this
payload format. The section registers the media types, as per
RFC4288 [RFC4288]
Required parameters:
None
Optional parameters:
Encoding considerations:
Interoperability considerations:
None.
Restrictions on usage:
This media type depends on RTP framing, and hence is only defined
for transfer via RTP [RFC3550]. Transport within other framing
protocols is not defined at this time.
Change controller:
Note that the RTP payload type code of 97 is defined in this media
definition to be ’mapped’ to the speex codec at an 8kHz sampling
frequency using the ’a=rtpmap’ line. Any number from 96 to 127 could
have been chosen (the allowed range for dynamic types).
The value of the sampling frequency is typically 8000 for narrow band
operation, 16000 for wide band operation, and 32000 for ultra-wide
band operation.
If for some reason the offerer has bandwidth limitations, the client
may use the "b=" header, as explained in SDP [RFC4566]. The
following example illustrates the case where the offerer cannot
receive more than 10 kbit/s.
Examples:
a=fmtp:97 mode=any;mode=1
The offerer may indicate that it wishes to send variable bit rate
frames with comfort noise:
In the example below the ptime value is set to 40, indicating that
Note that the ptime parameter applies to all payloads listed in the
media line and is not used as part of an a=fmtp directive.
Care must be taken when setting the value of ptime so that the RTP
packet size does not exceed the path MTU.
6. Security Considerations
7. Acknowledgements
The authors would like to thank Equivalence Pty Ltd of Australia for
their assistance in attempting to standardize the use of Speex in
H.323 applications, and for implementing Speex in their open source
OpenH323 stack. The authors would also like to thank Brian C. Wiles
<brian@streamcomm.com> of StreamComm for his assistance in developing
the proposed standard for Speex use in H.323 applications.
The authors would also like to thank the following members of the
Speex and AVT communities for their input: Ross Finlayson, Federico
Montesino Pouzols, Henning Schulzrinne, Magnus Westerlund.
8. References
[speexenc]
Valin, J., "Speexenc/speexdec, reference command-line
encoder/decoder", Speex website http://www.speex.org/.
Authors’ Addresses
Greg Herlein
2034 Filbert Street
San Francisco, California 94123
United States
Email: gherlein@herlein.com
Jean-Marc Valin
University of Sherbrooke
Department of Electrical and Computer Engineering
University of Sherbrooke
2500 blvd Universite
Sherbrooke, Quebec J1K 2R1
Canada
Email: jean-marc.valin@usherbrooke.ca
Alfred E. Heggestad
Biskop J. Nilssonsgt. 20a
Oslo 0659
Norway
Email: aeh@db.org
Intellectual Property
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.
Acknowledgment
D Speex License
Copyright 2002-2007 Xiph.org Foundation
Copyright 2002-2007 Jean-Marc Valin
Copyright 2005-2007 Analog Devices Inc.
Copyright 2005-2007 Commonwealth Scientific and Industrial Research
Organisation (CSIRO)
Copyright 1993, 2002, 2006 David Rowe
Copyright 2003 EpicGames
Copyright 1992-1994 Jutta Degener, Carsten Bormann
- Neither the name of the Xiph.org Foundation nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
E GNU Free Documentation License
Version 1.1, March 2000
Copyright (C) 2000 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone
is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
0. PREAMBLE
The purpose of this License is to make a manual, textbook, or other written document "free" in the sense of freedom: to assure
everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommer-
cially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being
considered responsible for modifications made by others.
This License is a kind of "copyleft", which means that derivative works of the document must themselves be free in the
same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documen-
tation: a free program should come with manuals providing the same freedoms that the software does. But this License is not
limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a
printed book. We recommend this License principally for works whose purpose is instruction or reference.
2. VERBATIM COPYING
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this
License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies,
and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct
or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in
exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
3. COPYING IN QUANTITY
If you publish printed copies of the Document numbering more than 100, and the Document’s license notice requires Cover
Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the
front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher
of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may
add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of
the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit
reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-
readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a publicly-accessible computer-
network location containing a complete Transparent copy of the Document, free of added material, which the general network-
using public has access to download anonymously at no charge using public-standard network protocols. If you use the latter
option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this
Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an
Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number
of copies, to give them a chance to provide you with an updated version of the Document.
4. MODIFICATIONS
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided
that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document,
thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must
do these things in the Modified Version:
• A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous
versions (which should, if there were any, be listed in the History section of the Document). You may use the same title
as a previous version if the original publisher of that version gives permission.
• B. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in
the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if
it has less than five).
• C. State on the Title page the name of the publisher of the Modified Version, as the publisher.
• D. Preserve all the copyright notices of the Document.
• E. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.
• F. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified
Version under the terms of this License, in the form shown in the Addendum below.
• G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document’s
license notice.
• H. Include an unaltered copy of this License.
• I. Preserve the section entitled "History", and its title, and add to it an item stating at least the title, year, new authors, and
publisher of the Modified Version as given on the Title Page. If there is no section entitled "History" in the Document,
create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item
describing the Modified Version as stated in the previous sentence.
• J. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document,
and likewise the network locations given in the Document for previous versions it was based on. These may be placed
in the "History" section. You may omit a network location for a work that was published at least four years before the
Document itself, or if the original publisher of the version it refers to gives permission.
• K. In any section entitled "Acknowledgements" or "Dedications", preserve the section’s title, and preserve in the section
all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.
• L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the
equivalent are not considered part of the section titles.
• M. Delete any section entitled "Endorsements". Such a section may not be included in the Modified Version.
• N. Do not retitle any existing section as "Endorsements" or to conflict in title with any Invariant Section.
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no
material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this,
add their titles to the list of Invariant Sections in the Modified Version’s license notice. These titles must be distinct from any
other section titles.
You may add a section entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by
various parties–for example, statements of peer review or that the text has been approved by an organization as the authoritative
definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text,
to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover
Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for
the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not
add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for
or to assert or imply endorsement of any Modified Version.
5. COMBINING DOCUMENTS
You may combine the Document with other documents released under this License, under the terms defined in section 4
above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original
documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced
with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each
such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if
known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license
notice of the combined work.
In the combination, you must combine any sections entitled "History" in the various original documents, forming one section
entitled "History"; likewise combine any sections entitled "Acknowledgements", and any sections entitled "Dedications". You
must delete all sections entitled "Endorsements."
6. COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents released under this License, and replace the
individual copies of this License in the various documents with a single copy that is included in the collection, provided that
you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you
insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim
copying of that document.
7. AGGREGATION WITH INDEPENDENT WORKS
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a
volume of a storage or distribution medium, does not as a whole count as a Modified Version of the Document, provided no
compilation copyright is claimed for the compilation. Such a compilation is called an "aggregate", and this License does not
apply to the other self-contained works thus compiled with the Document, on account of their being thus compiled, if they
are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than
one quarter of the entire aggregate, the Document's Cover Texts may be placed on covers that surround only the Document
within the aggregate. Otherwise they must appear on covers around the whole aggregate.
8. TRANSLATION
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of
section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you
may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You
may include a translation of this License provided that you also include the original English version of this License. In case
of a disagreement between the translation and the original English version of this License, the original English version will
prevail.
9. TERMINATION
You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any
other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights
under this License. However, parties who have received copies, or rights, from you under this License will not have their
licenses terminated so long as such parties remain in full compliance.
Index
bit-rate, 33, 35
CELP, 6, 26
complexity, 7, 8, 32, 33
constant bit-rate, 7
discontinuous transmission, 8, 17
DTMF, 7
echo cancellation, 20
error weighting, 28
fixed-point, 10
in-band signalling, 18
Levinson-Durbin, 27
libspeex, 6, 15
line spectral pair, 30
linear prediction, 26, 30
narrowband, 7, 8, 30
Ogg, 24
open-source, 8
patent, 8
perceptual enhancement, 8, 16, 32
pitch, 27
preprocessor, 19
RTP, 24
sampling rate, 7
speexdec, 14
speexenc, 13
standards, 24
tail length, 20
ultra-wideband, 7