



Jorge Castellanos

A Thesis Submitted to the Graduate

Faculty of the University of California, Santa Barbara
in Partial Fulfillment of the
Requirements for the Degree of
Master of Science
Major Subject: Media Arts and Technology, Concentration in
Multimedia Engineering

Approved by the
Examining Committee:

Curtis Roads, Thesis Adviser

JoAnn Kuchera-Morin, Member

Stephen T. Pope, Member

Tobias Höllerer, Member

Xavier Amatriain, Member

University of California, Santa Barbara

Santa Barbara, California

September 2006
(For Graduation September 2006)

© Copyright 2006
Jorge Castellanos
All Rights Reserved


LIST OF FIGURES

ACKNOWLEDGMENTS

ABSTRACT

1. Introduction
   1.1 Motivation and Goals
   1.2 Scope
   1.3 Related Work
   1.4 Organization

I Foundations

2. Sound Properties and Perception
   2.1 Sound Properties
      2.1.1 Digital Representation of Sound
   2.2 Sound Perception
      2.2.1 Sound Localization Distance Cues

3. Spatialization Techniques
   3.1 Sound Positioning
      3.1.1 Amplitude Panning VBAP
      3.1.2 Ambisonic Rendering Technique
      3.1.3 Binaural Audio Rendering Technique
      3.1.4 WFS
      3.1.5 Hybrid Techniques
   3.2 Additional Spatialization Properties
      3.2.1 Distance Modeling
      3.2.2 Object Size and Radiation Pattern
   3.3 Room Acoustics Modeling
      3.3.1 Geometric Modeling Methods
      3.3.2 Wave Based Methods
      3.3.3 Statistical Methods
      3.3.4 Hybrid Modeling Methods
      3.3.5 Reverberation

4. Object Oriented Programming and Design
   4.1 Object Oriented Programming
      4.1.1 The Unified Modeling Language
   4.2 Object Oriented Design Patterns
   4.3 Software Engineering and Design
      4.3.1 Usability Engineering
      4.3.2 Models

II System Analysis & Design

5. User Analysis
   5.1 Target Users
      5.1.1 Primary Users
      5.1.2 Secondary Users
   5.2 Use Scenarios
   5.3 Use Cases
   5.4 System Requirements

6. Models
   6.1 Interaction Models
   6.2 System Meta-Model
   6.3 System Model
      6.3.1 Spatial Data
      6.3.2 Spatial Processing
      6.3.3 Speaker Layout
      6.3.4 Panner
      6.3.5 Distance Simulator
      6.3.6 Acoustic Modeler
   6.4 High-level Model
      6.4.1 Spatializer
      6.4.2 Auralizer

III Implementation

7. System Implementation
   7.1 CSL Overview
      7.1.1 CSL Core
      7.1.2 Connections
      7.1.3 Input / Output
   7.2 Spatial Audio Subsystem
      7.2.1 Position
      7.2.2 Spatial Sound Sources
      7.2.3 Loudspeaker Setup
      7.2.4 Panning Vector Base Amplitude Panning Binaural Panner Ambisonic Panner
      7.2.5 Distance Simulation
   7.3 Future Work

8. Conclusions

LITERATURE CITED


1.1 The Allosphere

2.1 As sound propagates it covers a larger area, reducing the energy at a rate that follows the inverse square law.

2.2 Sound reflection

2.3 Sound diffraction: waves wrap around objects smaller than their wavelength.

2.4 Doppler Effect.

2.5 Equal loudness curves.

2.6 Cone of confusion.

3.1 Three dimensional VBAP [Pulkki, 1997]

3.2 Huygens Principle

3.3 Sound Cones used to represent sound radiation of objects [Mutanen, 2002]

3.4 Image Source Method

4.1 Example of a UML Class Diagram

4.2 A simplified class diagram of the Observer Design Pattern

4.3 Class diagram of the Facade Design Pattern

6.1 Basic elements in 4MS: Processing objects, represented by boxes, are connected through ports (round connection slots), defining a synchronous data flow from left to right, or through controls (square slots), defining a vertical event-driven mechanism [Amatriain, 2004].

6.2 System Model Overview. SpatialProcessing and SpatialData (top) constitute the base of the Model.

6.3 SpeakerLayout class diagram.

6.4 ActiveSpeakerLayout can be used in the graph as a Processing, accounting for different audio configurations.

6.5 Panner class diagram. Panner inherits from both SpatialProcessing and Observer (top) and handles any number of SpatialData objects.

6.6 Concrete subclasses of the Panner base class.

6.7 Distance Simulator Class Diagram

6.8 Acoustic Modeler Class Diagram

6.9 Spatializer Class Diagram

6.10 Auralizer Class Diagram

7.1 BinauralPanner Class Diagram

7.2 Ambisonic Subsystem Class Diagram.


Thanks to my parents for supporting me in whatever I do. To my brother for sharing his view of life. Special thanks to my girlfriend, Ioana Pieleanu, for her help proofreading this thesis, sending me numerous corrections and suggestions, and for sharing her life with me. Thanks to all my colleagues in the Media Arts and Technology program for their support, comments, ideas, and friendship, including but not limited to: Graham Wakefield, Will Wolcott, Melissa DeBartolomeo, Sara Stuckey, Daniel Mintz, Lance Putnam, Alex Norman, Ryan Avery, Eric Newman, August Black, and Bo Bell. Finally, I would like to thank Curtis Roads, Xavier Amatriain, JoAnn Kuchera-Morin, Tobias Höllerer and Stephen T. Pope for their guidance during my studies at UCSB.


The increasing prevalence of multimedia systems has led to the development of multiple spatial audio rendering techniques. Their primary goal is to provide accurate localization of sounds for a realistic and immersive acoustical experience. This work elaborates on the design principles and practical implementation of a general-purpose spatial audio rendering framework. The integration of different audio spatialization techniques into one engine or framework would provide a single, simple interface to any system, regardless of the loudspeaker or room setup. Such a system would silently adapt itself to an existing audio setup, improving and facilitating the user's experience. The objective of such a framework is to facilitate the implementation of audio and/or multimedia applications used as research tools, art, and/or entertainment.


The use of virtual reality and immersive systems, and of multimedia in general, has grown and continues to grow rapidly. Soon these systems will constitute part of the everyday life of most people. This calls for the development of applications that will facilitate the use and understanding of this technology. An important component of such systems is the audio rendering engine.
As a consequence of the increased attention toward multimedia systems,
multiple spatial audio rendering techniques (also commonly called 3D audio)
have been developed. Their primary goal is to allow for accurate localization of
sounds, for an improved and possibly realistic immersive acoustical experience.
So far, none of the existing techniques perform well under all circumstances.
Quoting Dave Malham [Malham, 1998]:

Even the best systems in use today for sound spatialization are rela-
tively crude, allowing for little more than the creation of an illusion,
sometimes very good, more often poor.

The different techniques complement each other to some extent, each addressing particular aspects of spatial hearing under different physical setups and space limitations. Ideally, a single technique would work for any audio system configuration and in any environment; such a system is not achievable. However, under certain constraints one system could solve most issues.
As V. Pulkki wrote in relation to spatialization techniques [Pulkki, 1997]:

A natural improvement would be a virtual sound source positioning

system that would be independent of the loudspeaker arrangement
and could produce virtual sound sources with maximum accuracy
using the current loudspeaker configuration.

The design proposed in this thesis attempts to address this need by integrating many of the current audio spatialization techniques into one system.


The integration of all of these techniques into one engine or framework would provide a single, simple interface to any system, regardless of the loudspeaker or room setup. Such a system would silently adapt itself to an existing audio setup, improving and facilitating the user's experience.

1.1 Motivation and Goals

Current virtual reality and multimedia systems usually adopt a single spatial audio rendering technique, limiting the possible equipment setups in the physical space. It is generally difficult for users to adapt their audio systems to each specific spatialization technique. Therefore, users end up using different variations of such systems, which in turn produce poor spatial audio results.
Manufacturers and developers place constraints on audio systems in order to facilitate development and to provide a controlled environment, so that users experience proper audio rendering. In turn, users do not necessarily pay much attention to the specifics of installing and positioning the system's components, causing poor system performance. Hence, there is a strong need for spatial audio techniques that allow for flexibility in system setups. A solution to this problem has motivated most of the research and development presented in this document. In particular, one system was considered throughout the design: the Allosphere.
The Media Arts and Technology (MAT) department at the University of California, Santa Barbara (UCSB) has been developing a spherical, fully immersive, interactive virtual reality audio/video system that will serve as an advanced research instrument: the Allosphere. The instrument includes an audio reproduction system with an array of 400 to 500 loudspeakers [Pope, 2005]. While this work will not elaborate on the specifics of the Allosphere, additional information can be found at [all, 2006].
An integrated and adaptive spatial audio interface would be very beneficial for such a system. Composers could use it without being experts in the field. Software developers would enjoy the benefits of a framework with a single interface to the spatial audio engine. Scientists could perform tests for their research using any available surround sound technique without having to program, or be familiar with, the sound reproduction system or technique. Spatial audio researchers would be able to extend the system by simply adding new algorithms to those already implemented (i.e., the system will provide sound source management, simplifying the implementation of new techniques).

Figure 1.1: The Allosphere
The overall goal is to ease multimedia software development by designing a system model of a 3D audio engine. This description should act as a model from which concrete systems, libraries, or frameworks are implemented. As a proof of concept, a basic implementation of the design is discussed in the third section of this document.
Finally, music composers have many different tools for multi-channel audio reproduction, but each of them has proved to be difficult to use, inflexible, and not scalable, and requires in-depth knowledge of the different spatial audio methods. Most 3D audio rendering systems implement at most a couple of methods, but none provides an integrated interface to the many possibilities. This system should serve as a step toward a simple tool for composers.

1.2 Scope
This work elaborates on the design principles and practical implementation of a general-purpose spatial audio rendering framework. The design should fully describe how the different pieces of the framework would interact if it were to be implemented. Such a description is presented more as a specification, avoiding strong ties to a particular programming language. Keeping the design language-independent allows for future implementations in other languages and/or platforms.
The objective of the framework is to facilitate the implementation of audio and/or multimedia applications used in research, art, and/or entertainment. We aim to provide an engine so that users of any type, such as artists, scientists, software engineers, and possibly others, can focus their attention on their projects without spending time re-implementing or understanding a complicated 3D audio engine. Hopefully such an engine will encourage the use of multi-speaker setups.
This project was divided into two stages: system design and system implementation. The system design was the primary focus of the project, since a solid design is a key component for future improvements. The system was implemented as part of a larger C++ synthesis framework; the implementation should be taken as an example of a concrete instance of the model described. The implementation issues encountered are also discussed in Chapter 7, serving as a guide for future implementations.
Only a basic system was implemented; full documentation will be provided to facilitate its usage and to clarify implementation details for future improvements. Contributions from future students are expected in order to build a robust framework for multimedia development.

1.3 Related Work

A very large number of multimedia and virtual reality systems already exist, with varying ease of use, quality, and efficiency, depending on the intended target application. The vast majority of these libraries have been developed for video games, while a few others are non-real-time, with the goal of accurately simulating the acoustical properties of an arbitrary space.
The spatial audio libraries most used for the development of multimedia applications are Microsoft's DirectX, OpenAL, Java3D and X3D. These APIs (Application Programmer's Interfaces) comply with a set of guidelines defined by the Interactive Audio Special Interest Group (IASIG). The first of these guideline documents, "3D Audio Evaluation and Rendering Guidelines" (I3DL1) [I3D, 1998], defines the minimal functionality to be implemented in a 3D audio platform; it also explains and describes which techniques are or are not considered 3D audio. The second document, "Interactive 3D Audio Rendering Guidelines" (I3DL2) [I3D, 1999], defines a more robust and complete feature set (room acoustics, sound occlusion, sound obstruction), allowing for an improved 3D experience. Following is a short description of some of the libraries mentioned above. For a more detailed summary of these APIs, refer to R. Väänänen's PhD thesis [Väänänen, 2003].

• OpenAL: A cross-platform software library for 3D sound. Implementations for each computer platform have to conform to the OpenAL specification [OpenAL, 2005], which sets the minimum requirements. OpenAL's API similarity to OpenGL makes it an attractive interface for game or computer graphics development.

• DirectSound3D: A component of Microsoft's DirectX API. It provides a single interface for working with different sound cards, and offers sound processing effects as well as 3D audio capabilities. The 3D audio capabilities can be extended with Creative's EAX (see below).

• Java3D: An extension of the Java programming language for creating interactive 3D applications. The Java3D API structures programming around scene graphs. Its spatial sound properties are more advanced than those offered by the other APIs, enabling reverberation, air absorption, and frequency-dependent directivity definitions. Java3D 1.3 and later conform to the I3DL2 guidelines.

• X3D: The successor of VRML 2.0 (Virtual Reality Modeling Language, version 2). The specification includes some spatial sound functionality, such as sound source position, orientation, and direction. Recent specifications such as MPEG-4 have borrowed ideas from VRML.

The 3D audio software libraries described above use a simplistic approach to sound source modeling. This is partly because the quality is good enough for the target applications (such as video games), where accuracy is not mandatory. The capabilities of these libraries can be improved via a set of extensions, such as EAX (Environmental Audio Extensions) by Creative.
EAX is an extension to OpenAL and DirectSound. It is designed for
game development, to add realism to the scenes. The primary mechanism for
enhancing the virtual experience is by use of reverberation models, as well as
object occlusion and obstruction. The API conforms to the guidelines in I3DL2
[I3D, 1999].
One weakness of these libraries is their limited support for multi-loudspeaker setups other than the standard configurations (e.g., stereo, 5.1, 7.1). The design we propose overcomes this limitation, offering flexible and customizable loudspeaker configurations.
Other 3D audio rendering systems have been developed at universities and research centers. Most of these are built as complete systems, not as a library or framework to be reused, and are thus less flexible than the ones presented above. Below is a sample of the different systems that have been developed.

• IRCAM Spat: A real-time tool for sound spatialization. 3D audio is rendered using binaural and transaural techniques. An interesting aspect is that positions and room acoustics are attached to musical events, and not to the particular objects that might produce them. To a certain extent, this approach could be closer to what actually happens in reality: in a sense, the spatial information is the result of the event, not of the object. A drawback of this approach is that every event has to be tagged with spatial information, making the creation of events inconvenient.

• blue-c: A fully immersive reality system [Naef et al., 2002]. The audio
engine was designed to support live audio input, room simulation and
accurate localization.

• DIVA (Digital Interactive Virtual Acoustics): A real-time environment, developed at the Helsinki University of Technology, for a full audio-visual experience. The system integrates all processes, from sound synthesis to room acoustics and spatial audio reproduction [Savioja, 2000].

In addition to the systems mentioned above, very precise acoustic software and hardware systems exist. These tend to be expensive, often require special hardware, and might not run in real time (with live input). The design proposed in this document is meant to be a layer below such applications, allowing for the easy creation of tools by combining the flexibility of low-level software libraries with the pre-defined behaviors of end-user systems.

1.4 Organization
This document is divided into three parts. The first part, consisting of Chapters 2 to 4, serves as a review of and introduction to the basic subjects needed to understand the later chapters of the thesis. Chapter two introduces the basic properties, digital representation, and perception of sound. Chapter three presents a survey of the current techniques used for audio spatialization. Chapter four provides a brief introduction to Object Oriented Programming (OOP) and Design Patterns.
Part two contains Chapters 5 and 6, describing the design of the framework. Chapter five presents the results of a user analysis performed to better understand the system. Chapter six presents the design of the different models that make up the system. These chapters constitute the main focus of the thesis.
Finally, in the third section, Chapter 7 provides an overview of CSL (the CREATE Signal Library), followed by a description of the system implemented.


Sound Properties and Perception

This chapter introduces the basic properties, digital representation, and perception of sound. This knowledge is required to understand some of the design decisions presented later in the document. The introduction below is written in simple language, avoiding technical terms and using as little mathematics as possible, making it a gentle tutorial on the world of sound. For more detailed information, the reader should consult the references given in the text.

2.1 Sound Properties

As sound waves propagate away from the source, the sound energy is distributed over an increasing area (see Figure 2.1), and therefore the sound level in a particular direction decreases proportionally with distance. Under free field conditions, given the sound power level Lw of an omnidirectional source, the sound pressure level (SPL) measured at a distance r (in meters) from the source is

SPL = Lw + 10 log10(1 / (4πr²)) dB

Therefore, in a free field the sound energy of a point source decreases in inverse proportion to the square of the distance. In addition, sound energy is further attenuated during the sound's propagation through friction with the air molecules (this affects high frequency sounds in particular).

Figure 2.1: As sound propagates it covers a larger area, reducing the energy at a rate that follows the inverse square law.


In an enclosed space, sound waves hit the perimeter surfaces and are either absorbed, reflected back into the space, or a combination of both. Sound absorption is the property of a surface material to transform sound energy into a different type of energy, such as heat or mechanical energy. The sound energy can be absorbed partially or totally, as a function of the properties and/or dimensions of the material. The absorption coefficient (α) indicates what fraction of the total incident sound energy is absorbed rather than reflected; given the reflection factor R of the surface,

α = 1 − |R|² (2.1)

Sound reflections can be either specular, diffusive, or a combination of both, depending on the dimensions of the physical irregularities of the surface the sound hits compared to the sound's wavelength. At one extreme, if the sound hits a surface that is infinitely large and smooth, it will reflect specularly, where the angle of reflection from the wall equals the angle of incidence (see Figure 2.2).

Figure 2.2: Sound reflection

At the other extreme, if sound hits a surface with texture irregularities that are fairly large compared to the wavelength (but not much larger than it), the sound will scatter in different directions, as opposed to resulting in one specular reflection. Note that under these conditions the sound energy in each particular direction is considerably diminished, as the energy of the incident sound is no longer reflected back exclusively in a single direction. In general, however, sound is reflected as a combination of both conditions, partly specular and partly scattered.

When sound waves encounter an object whose dimensions are relatively

small compared to their wavelength, the sound diffracts around the object (see
Figure 2.3).

Figure 2.3: Sound diffraction: waves wrap around objects smaller than their wavelength.

To summarize, the behavior of sound encountering an obstacle varies with the frequency of the sound waves, and depends on the dimensions and material of the surfaces or objects encountered.
Depending on the physical conditions of an enclosure (as per the description above), sound waves bounce around the room's surfaces until the sound energy is entirely attenuated or transformed into a different type of energy. The time necessary for the sound to decay to inaudibility in a given space (by standard, a 60 dB level decay) is known as the reverberation time (RT or T60). The reverberation time can vary from close to 0 seconds in an anechoic chamber (where theoretically there are no reflections from the room's surfaces) to values as long as 8 seconds or more in large hard-surfaced spaces, such as baroque churches.
The speed of sound in air depends on temperature (and is independent of frequency), as per the following formula: c = 331.4 + 0.6 · T m/s, where T is the temperature in degrees Celsius [Kuttruff, 1979]. Slight speed differences due to temperature changes cannot be avoided; however, for demonstration purposes we generally consider a temperature of 20° Celsius, which results in a speed of sound of approximately 343 m/s. When a sound

Figure 2.4: Doppler Effect.

source is moving relative to the receiver, the time necessary for the sound to arrive at the listener also varies continuously. These changes cause an apparent change of pitch, commonly known as the Doppler effect (see Figure 2.4). If the
source and receiver approach each other, the apparent pitch increases, while
if they move away from each other the pitch is perceived as decreasing. The
perceived frequency can be obtained as follows:

fapparent = fsource (c + vobserver )/c (2.2)

where fsource is the frequency of the sound source, vobserver is the speed at which the observer moves toward the source, and c is the speed of sound.

2.1.1 Digital Representation of Sound

Sound is converted to digital form by capturing periodic samples of the sound pressure. The frequency at which these samples are captured depends on the highest frequency to be stored: in order for a given frequency to be subsequently reproduced, a minimum of two samples per cycle is necessary. The simplest way to think about it is that the sampling frequency has to be at least twice the highest frequency to be captured. Otherwise, the reconstruction of the sound becomes ambiguous, a phenomenon known as aliasing.
Therefore, to a computer, sound is just an array of numbers representing the pressure offset at fixed time intervals. The importance of this fact is that any process applied to the audio is actually a mathematical operation on a sequence of numbers. For example, a simple low-pass filter works by averaging each sample with the previous one. More in-depth introductory texts on this topic can be found in [Roads, 1996] and [Pohlmann, 2005].

2.2 Sound Perception

Many physical phenomena occur that we humans are unaware of, because our senses do not perceive them. The field that studies sound perception is called psychoacoustics. When developing new technologies, this field of science can be advantageous, allowing us to reproduce only those aspects humans will perceive. The spatialization techniques described later in this thesis take full advantage of such knowledge.
Differentiating loudness from sound intensity, and understanding their relationship, is very important when manipulating sound. Loudness is the term that describes our perception of sound intensity. For example, sounds of different frequencies might have the same intensity, yet the auditory system does not necessarily perceive them as equally loud: loudness is frequency dependent. Measurements showing this relationship have been performed for a long time. The most famous studies are those by Fletcher and Munson, resulting in the Fletcher-Munson curves. These equal-loudness contours are plotted on a graph with frequency on the x axis and sound pressure level on the y axis. The contours indicate the pressure levels at different frequencies that are perceived as equally loud (see Figure 2.5).
When multiple "copies" of the same sound arrive at a listener almost simultaneously, the very first sound to reach the ears is the one the brain uses to determine the location of the sound source.¹ This effect is called the Precedence Effect (also known as the Haas effect or the law of the first wavefront).

¹ This is a simplified description. In reality, depending on the delay between sounds, localization can become unstable and yield erroneous results.

Figure 2.5: Equal loudness curves.

2.2.1 Sound Localization

The auditory system makes use of different mechanisms to localize sound.
The two main cues are the difference in the sound level reaching the two ears, known as the Inter-aural Level Difference (ILD), and the difference in the arrival time of the sound at each ear, known as the Inter-aural Time Difference (ITD). Oftentimes these two cues conflict with each other, providing ambiguous localization; in such cases, we make use of other cues. The localization cues are briefly described below:
ILD, also known as IID (Inter-aural Intensity Difference), is thought to
be the primary localization cue for sounds between approximately 1500 Hz and 6
kHz. Low frequency waves (those with a wavelength larger than the human
head) diffract around the head, and are therefore largely unaffected by the
shadowing effect of the head.
ITD occurs when a sound reaches one ear before the other. This condition occurs any
time a sound source is not located dead center in front of the listener. For low
frequency sounds, ITD is the primary cue (strictly speaking, the main cue for low
frequencies appears to be the phase difference rather than the time difference).
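The magnitude of the ITD can be approximated with Woodworth's classic spherical-head formula. The following Python sketch is purely illustrative; the formula, the head radius, and the function names are textbook assumptions, not part of this thesis:

```python
import math

def itd_seconds(azimuth_deg, head_radius=0.0875, speed_of_sound=343.0):
    """Approximate ITD for a rigid spherical head (Woodworth's formula).

    azimuth_deg is the source angle from straight ahead; the default
    head radius (8.75 cm) is a common textbook value.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius / speed_of_sound) * (theta + math.sin(theta))

# A source directly to one side (90 degrees) yields the maximum ITD,
# about 0.65 ms for an average head; a frontal source yields zero.
max_itd = itd_seconds(90.0)
front_itd = itd_seconds(0.0)
```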

ILD and ITD cues do not always provide the auditory system with the necessary
localization information. A particular issue, known as the Cone of
Confusion, occurs towards the sides of the listener: within a conical region
roughly normal to the ears (see Figure 2.6), the ILD and ITD are almost
identical due to the (quasi) symmetry of the head. A common manifestation of
this problem is front-back reversal. Studies have shown that head movement
can contribute to improve localization. Such dynamic adjustments permit the
brain to compare different ITD and ILD values, which helps disambiguate conflicting
cues [Begault, 1994].

Figure 2.6: Cone of confusion.

Spectral Cues: Just as sound reflects off the objects present in a room,
the listener also interacts with incoming sound; depending on the angle of incidence,
sound will reflect off different surfaces of the listener's body. In addition,
the complex shapes of the different ear cavities (concha, auditory canal,
and fossa) create resonances at different frequencies, providing information
about the origin and type of sound. Spectral cues are primarily used for
high frequency sounds (above 6 kHz), and when the ITD and ILD are ambiguous.
They are also the primary cues for discerning, to some extent, the
elevation of sounds.
 Distance Cues

The localization cues mentioned above do not provide any distance infor-
mation. The primary distance cue is intensity loss with increasing distance (see
Section 2.1 of this chapter). In addition to the attenuation in intensity, the air
particles also absorb some energy, further reducing the amplitude (as described
in Section 2.1).
Keep in mind, however, that these cues are of little use if the listener
has no prior knowledge of the sound levels at the source (see the cognitive
cues described below).
Reverberation, on the other hand, does not decay in level as fast as the
direct sound. In general, because the built-up reflections fill the room, the
reverberation level is similar anywhere in the room, with an average loss of 3
dB per doubling of distance [Gardner, 1992] (this value obviously changes
depending on the geometry and composition of the space). The ratio of direct
to reverberant levels is an important distance cue: the farther away a sound
source is, the higher the reverberation level relative to the level of the direct sound.

Lrev = 1 − ( Dref / (Ds + Dref) )^2                  (2.3)
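The different decay rates of the direct and reverberant sound can be illustrated numerically. The sketch below assumes the inverse-square law for the direct sound (a loss of 6 dB per doubling of distance, see Section 2.1) and the average 3 dB per doubling reported by [Gardner, 1992] for the reverberation; the function names are hypothetical:

```python
import math

def direct_level_db(distance, ref_distance=1.0):
    # Inverse-square law: the direct sound loses 6 dB per doubling.
    return -20.0 * math.log10(distance / ref_distance)

def reverb_level_db(distance, ref_distance=1.0):
    # Diffuse reverberation loses only about 3 dB per doubling [Gardner, 1992].
    return -10.0 * math.log10(distance / ref_distance)

# The direct-to-reverberant ratio therefore drops by about 3 dB for
# every doubling of distance, which the auditory system uses as a cue.
for d in (1.0, 2.0, 4.0, 8.0):
    ratio_db = direct_level_db(d) - reverb_level_db(d)
    print(d, round(ratio_db, 1))
```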
Cues obtained through senses other than hearing can also help localize
sound. Having visual contact with the emitting object, or with the environment,
helps the listener form expectations about the character of the sound and
construct a more accurate localization. The tactile perception of vibration can
also assist the auditory system.
Cognitive cues also play a primary role in sound localization. Prior
knowledge of a space and/or a sound allows comparison with previously heard
sounds of known characteristics. John Chowning presents a great example
in [Chowning, 2000]:

A listener faces two singers, one at a distance of 1 m and the other
at a distance of 50 m. The closer singer produces a pp [whispering]
tone followed by the distant singer who produces a ff [loud] tone.
Otherwise the tones have the same pitch, the same timbre, and are
of the same duration. The listener is asked which of the two tones
is the louder.

In spite of the fact that the distant tone would be louder than the closer one, the
listener is capable of discerning which one is closer thanks to previous knowledge
of the sound properties.
Spatialization Techniques

This chapter presents a survey of the current techniques used for audio spatial-
ization. Even though the main focus is on sound positioning techniques, room
modeling is also considered, as it constitutes a very important aspect of spatial
audio. These explanations should provide the reader with the knowledge needed
to understand the design decisions taken.

3.1 Sound Positioning

The techniques for positioning sound in space take different approaches.
The first attempts to reproduce the signal at the listener's ears (e.g. the Binaural
technique). A second approach attempts to recreate the full soundfield
over a larger listening area (e.g. the Wavefield Synthesis technique); this second
approach requires a loudspeaker-based system. A third approach aims to introduce
just the elements that the auditory system needs to perceptually determine
the location of a sound (e.g. Ambisonics). Florian Hollerweger's Master's thesis
[Hollerweger, 2006] is an excellent, in-depth review of spatial audio techniques.
Each approach is briefly described below:

3.1.1 Amplitude Panning

This technique changes the perceived position of a sound by modifying
the amplitude of the same signal fed to each loudspeaker. A very common
and simple use of amplitude panning is in stereo audio systems: each loudspeaker
receives a copy of the original sound, but with a different gain. The
perceived location of the sound (the location of the virtual sound source) can be
simulated by taking advantage of the fact that level differences (ILD)
at the ears are one of the primary localization mechanisms.
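A minimal sketch of stereo amplitude panning might look as follows. The sine/cosine (constant-power) pan law used here is one common choice; the text above does not prescribe a specific law, and the names are illustrative:

```python
import math

def stereo_pan_gains(pan):
    """Constant-power stereo panning.

    pan is in [0, 1]: 0 = hard left, 1 = hard right, 0.5 = center.
    The sine/cosine law keeps left^2 + right^2 constant, so the total
    acoustic power does not dip as the source moves across the image.
    """
    angle = pan * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

# A centered source feeds both loudspeakers equally (about 0.707 each).
left, right = stereo_pan_gains(0.5)
```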
The amplitude panning technique assumes that the listener is at a fixed
position, in the center of the coordinate system. For best results, loudspeakers
must be placed no more than 60° apart. The farther a virtual source is from
a loudspeaker, the less stable the created image is. This condition is due to the
frequency dependency of ILD.

 VBAP
Vector Base Amplitude Panning (VBAP) is based on the amplitude panning
technique described above, but formulated for an arbitrary, varying number of
loudspeakers, and for two dimensional (i.e. only horizontal) or three dimensional
(i.e. periphonic) positioning. In addition to the regular amplitude panning
previously described, this technique restates the problem using vectors and vector
bases, simplifying the calculations and making them computationally more efficient
[Pulkki, 1997].

Figure 3.1: Three dimensional VBAP [Pulkki, 1997]

Similar to stereo panning, only two loudspeakers are needed to position
a virtual source at any point between them (generally used for pantophonic
rendering). In the same way, at least three loudspeakers are needed for
periphonic rendering. These loudspeakers form a triangle, and the virtual source
can be positioned anywhere inside it (see Figure 3.1). In order to position a
sound source using a multi-loudspeaker setup, the corresponding loudspeaker
pair or triplet first has to be determined. Subsequently, the sound source
gains are calculated only for those two or three loudspeakers. The gains can be
obtained by multiplying the position (vector) of the sound source by the inverse
of the matrix of loudspeaker vectors:

 −1
lk1 lk2 lk3
gains = pT L−1
 
mnk = [ p 1 p 2 p 3 ]  lm1 lm2 lm3 
  (3.1)
ln1 ln2 ln3

The resulting gains have to be normalized before applying them to the
incoming signals.
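The gain computation of Equation 3.1 can be sketched for the two-dimensional (pantophonic) case, where the loudspeaker base reduces to a 2x2 matrix. This pure-Python example is illustrative; the function name and the constant-power normalization are assumptions, not the thesis' implementation:

```python
import math

def vbap2d_gains(source_azimuth_deg, spk1_deg, spk2_deg):
    """Two-dimensional VBAP: gains = p^T L^-1 for one loudspeaker pair.

    Angles are azimuths in degrees; the listener sits at the origin.
    """
    p = (math.cos(math.radians(source_azimuth_deg)),
         math.sin(math.radians(source_azimuth_deg)))
    l1 = (math.cos(math.radians(spk1_deg)), math.sin(math.radians(spk1_deg)))
    l2 = (math.cos(math.radians(spk2_deg)), math.sin(math.radians(spk2_deg)))
    # Solve g1*l1 + g2*l2 = p by inverting the 2x2 loudspeaker matrix.
    det = l1[0] * l2[1] - l1[1] * l2[0]
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (p[1] * l1[0] - p[0] * l1[1]) / det
    # Normalize so that g1^2 + g2^2 = 1 (constant power).
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm

# A source centered between loudspeakers at -30 and +30 degrees
# (a standard stereo layout) receives equal gains in both channels.
g_left, g_right = vbap2d_gains(0.0, -30.0, 30.0)
```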

3.1.2 Ambisonic Rendering Technique

Ambisonic was originally developed in the seventies as a recording technique
[Gerzon, 1975]. It evolved into a method for encoding recorded or
simulated spatial sound, in which the signals are encoded into four
channels (for first order Ambisonic encoding) that can later be decoded
for playback over a flexible number of loudspeakers. Since its introduction,
the Ambisonic format has evolved, and higher order encoding and decoding
are now common (known as Higher Order Ambisonics, or HOA). Increasing
the encoding order increases the size of the listening area [Daniel, 2001] over
which the soundfield is accurately recreated, but at the same time it requires a
larger number of loudspeakers and more computing power.
Ambisonic can also be thought of as an amplitude panner in which the signal
fed to each loudspeaker is weighted. In this case, all loudspeakers
contribute and receive their own weight, so instead of producing virtual sound
sources (as amplitude panning techniques do), Ambisonic attempts to recreate
the soundfield at the listening spot/area.

In Ambisonic, the sound signals are encoded together with the directional
information of the wave by decomposing the soundfield into spherical harmonics.
The encoded signals can then be manipulated (in the Ambisonic domain),
transferred, or stored, while the directional information is carried throughout
the process. By applying a transformation matrix, these encoded signals can be
decoded to any number of loudspeakers3, preferably placed regularly around the
listener. An advantage of the encoded signals is that they can be manipulated
as one unit (rotated, tilted, etc.), independently of the number of audio signals.
3.1.3 Binaural Audio Rendering Technique

The Binaural technique processes the sound with filters that simulate the
filtering performed by the human hearing system. These filters
are obtained by recording an impulse at different angles from a model of a
human head4. The resulting audio files are known as Head Related Impulse
Responses (or HRIR).

These recordings (or impulse responses) capture the sound filtering caused
by the presence of the human body as an obstacle in the sound's path (including
the head, torso, pinnae, etc.). This transformation of the sound
has different characteristics depending on the sound's direction relative to the
listener's position. Anechoic sound files can later be convolved with such an
impulse response, thus embedding the frequency dependent directional characteristics
into the new sound. Equations 3.2 and 3.3 show the convolution of the
input signal x(t) with the impulse response h(t):

y(t) = x(t) ∗ h(t) (3.2)

3 For optimal decoding, the number of loudspeakers should be equal to or larger than the
number of encoded channels.
4 Customized recordings provide better results (i.e. recordings performed by inserting
microphones in the ear canal of the subject).

When dealing with discrete (sampled) signals, equation 3.2 can be expressed
as follows:

         N-1
y(n) =    Σ   x(n − m) · h(m)                  (3.3)
         m=0
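Equation 3.3 translates directly into code. The following naive O(N·M) sketch is for illustration only; a real-time renderer would typically use FFT-based (fast) convolution instead:

```python
def convolve(x, h):
    """Direct-form convolution of a signal x with an impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for m in range(len(h)):
            # Accumulate x(n - m) * h(m), skipping out-of-range indices.
            if 0 <= n - m < len(x):
                y[n] += x[n - m] * h[m]
    return y

print(convolve([1, 2, 3], [1, 1]))  # [1.0, 3.0, 5.0, 3.0]
```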

The Binaural audio rendering technique works best when rendered to
headphones; many video games use Binaural audio to produce 3D audio
effects over headphones. Binaural techniques can also be translated
to loudspeaker based systems (known as Transaural audio). This translation
involves an extra step, consisting of removing the cross-talk from each loudspeaker
to the opposite ear. The different cross-talk cancellation methods have
been summarized by W. Gardner in [Gardner, 1997].

3.1.4 WFS
Wave Field Synthesis, often referred to as Holophony (as the acoustical
analogy to holography), aims to fully reconstruct the physical properties of a
soundfield. In other words, instead of reproducing the elements related to
the perceptual localization mechanisms of the auditory system
(as VBAP does), WFS attempts to reconstruct the soundfield that would exist
if the source were present at the specified physical location. This can be explained
through the Huygens principle, which states that each point on a wavefront
may be regarded as the source of secondary waves, and that the surface
tangent to the secondary waves can be used to determine the future position of
the wavefront.
WFS replaces those secondary sources with loudspeakers (see Figure 3.2).
Thus, any soundfield can be reconstructed by replacing each point of the wavefront
with a loudspeaker. In reality, the number of loudspeakers is limited,
quantizing the wavefront into discrete points. Just as with digital audio sampling,
a minimum number of points is needed to avoid aliasing. In the case
of WFS, the number of loudspeakers required to truly reconstruct a soundfield
covering the human audible range is very large, on the order of thousands: M.
Gerzon proposed that about 400,000 loudspeakers would be needed in a listening
space with a 2 meter diameter [Gerzon, 1974]. Current systems make use

Figure 3.2: Huygens Principle

of a smaller number of loudspeakers, limiting them to work within a reduced
frequency bandwidth (usually with accurate spatial resolution up to 1
kHz) and for just a few degrees of coverage around the listener.

Under ideal conditions, WFS appears to be the most accurate option for
audio spatialization. In practice, current technology is not yet ready for its
widespread use. Recent research has shown possible solutions for deploying
the large numbers of loudspeakers needed for WFS, using speaker
panels with exciters that radiate sound [Boone, 2004]. Computing power is also
an issue, although this limitation is likely to disappear in the near future.

3.1.5 Hybrid Techniques

In an attempt to overcome the deficiencies of the techniques just presented,
different combinations of and extensions to the main techniques have been proposed.
For instance, a Binaural system can make use of the Ambisonic technique to
position sounds, as well as to perform rotation and other manipulations on the
entire soundfield (e.g., for head tracking). The output of the Ambisonic decoder
is then processed by the Binaural engine, which is freed from making dynamic
changes to the HRTFs.

3.2 Additional Spatialization Properties

Sound localization is not all there is to spatialization. Sound events have
other characteristics that can also be modeled and reproduced. Some of the
important properties that should be considered for an accurate representation
of a sound are described below.

3.2.1 Distance Modeling

Distance is usually modeled by applying filters and delays to the signal,
to simulate the different distance cues used by the auditory system. For exam-
ple, air absorption can be modeled as a low-pass filter that changes its cutoff
frequency and attenuation based on the distance between the source and the
listener. For more accurate models of air absorption, an FIR (Finite Impulse
Response) filter is computed with the desired frequency response. The Doppler
effect (which also contributes to distance simulation of moving sound sources)
is usually implemented using a variable delay line, where the time of the delay
is a function of distance.
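These distance cues can be sketched as simple functions of distance. The inverse-distance attenuation follows Section 2.1; the linear cutoff roll-off for air absorption and its constants are illustrative assumptions rather than a measured absorption model, and both function names are invented:

```python
def distance_attenuation(distance, ref_distance=1.0):
    """Inverse-distance amplitude attenuation (see Section 2.1).

    Sources inside the reference distance are clamped to unity gain.
    """
    return ref_distance / max(distance, ref_distance)

def air_absorption_cutoff(distance, base_cutoff_hz=20000.0,
                          hz_per_meter=150.0):
    """Distance-dependent low-pass cutoff simulating air absorption.

    A more accurate model would compute an FIR filter fitted to
    measured absorption data, as described above.
    """
    return max(base_cutoff_hz - hz_per_meter * distance, 1000.0)

# Doubling the distance halves the amplitude, and a distant source
# also loses high-frequency content through the falling cutoff.
gain = distance_attenuation(2.0)          # 0.5
cutoff = air_absorption_cutoff(10.0)      # 18500.0 Hz
```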

3.2.2 Object Size and Radiation Pattern

The positioning techniques described above ignore the size of the object
emitting the sound, as well as the directivity pattern of the source. In other
words, all objects are considered omnidirectional point sources, where omnidirectional
signifies that sound propagates equally from the source in all
directions, and point source signifies that the source is small relative
to the wavelengths of the sound. Ignoring such properties allows for computationally
inexpensive systems, and works well whenever a fine and accurate representation of
the physical conditions is not necessary.
The sound radiation pattern of an object can be represented and modeled
in many different ways. Simple approaches have been used for real-time
applications [OpenAL, 2005], [I3D, 1998], where the radiation is represented by
an inner cone and an outer cone, between which a simple cross-fade is applied
(see Figure 3.3). There are other approaches; the Ambisonic O-format, for
instance, encodes the radiation pattern following the principles of the Ambisonic
B-format [Menzies, 2002].

Figure 3.3: Sound Cones used to represent sound radiation of objects

[Mutanen, 2002]

3.3 Room Acoustics Modeling

The spatialization techniques presented above only attempt to place the
sound at a particular point in space. The geometry and surface properties of the
room where the sound event is assumed to occur are mostly expressed through
the character of the reverberation (see Chapter 2, Section 2.1). Accurate modeling
of sound propagation in a room is computationally very intensive. To
this end, different methods have been developed with different trade-offs between
efficiency and accuracy. Some are suitable for real-time use, while others are
meant to be used off-line, whenever a high degree of accuracy is needed.
The following methods are the most common:

3.3.1 Geometric Modeling Methods

These methods treat sound propagation as if it consisted of rays of light that
reflect off the surfaces of the space. The sound rays that reach the "listener" are
used to build an impulse response of the room. Most real-time systems use
geometric methods for acoustical simulation of spaces, because of
their simplicity and generality.
Image Source Method works by finding only those rays that will actually
reach the listener. This is achieved by assuming that an exact image
(thus the name) of the room is mirrored on the other side of each wall.
The vector that points from the image source to the listener then has the
actual direction and length a reflected ray would take (see Figure 3.4). The
Image Source method is efficient and simple but not necessarily the most
convincing; it underperforms especially at low frequencies.
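The mirroring step of the Image Source method can be sketched for a rectangular (shoebox) room. This minimal example computes only the six first-order images; a full implementation would recurse to higher orders and test visibility, and the function name is invented:

```python
def first_order_images(source, room):
    """First-order image sources for a 3D shoebox room.

    source is (x, y, z); room is (width, depth, height) with walls at
    0 and at the given dimension along each axis. Each image is the
    source mirrored across one wall; its distance to the listener gives
    the delay and attenuation of the corresponding reflection.
    """
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = list(source)
            img[axis] = 2.0 * wall - img[axis]  # mirror across the wall
            images.append(tuple(img))
    return images

# Six first-order images for a source in a 4 x 5 x 3 m room:
print(len(first_order_images((1.0, 2.0, 1.5), (4.0, 5.0, 3.0))))  # 6
```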

Figure 3.4: Image Source Method

Ray Tracing works by following the path of a very large number of rays
(or particles) that travel away from the sound source. When a ray hits a surface,
it reflects and continues its path, until it finds a receiver.
Beam Tracing works by following the path of sets of rays (beams),
creating transmission and reflection beams at polygon intersections. This
method is more complex than the previous ones, but it needs fewer virtual
sources than the Image Source method, and does not suffer from the sampling
problems specific to ray tracing [Funkhouser et al., 1998].

3.3.2 Wave Based Methods

Finite Element Modeling (FEM) and Boundary Element Modeling
(BEM): These two methods are based on solutions to the wave equation.
They are primarily used to compute low frequency components in simple environments,
as the computing time increases rapidly with frequency
[Funkhouser, 2001]. Even though these models are more accurate
[Begault, 1994] than the geometry based methods, their use is usually limited to
non real-time calculations.

3.3.3 Statistical Methods

These methods are generally used as a complement to geometrical techniques,
providing a fast and efficient computation of the late portion of a room response.
The density of reflections is usually modeled with a Gaussian (or similar)
exponentially decaying random signal. The decay is usually modeled with a
frequency dependent function. Numerous methods have been proposed, ranging
from banks of low-pass and comb filters to waveguide meshes and Feedback
Delay Networks (FDN). A good summary of these algorithms can be found in
[Gardner, 2001].
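The statistical late-reverberation model described above (an exponentially decaying Gaussian random signal) can be sketched directly; the RT60 and sample-rate values are illustrative, and a practical reverberator would also make the decay frequency dependent:

```python
import math
import random

def late_reverb_ir(length, sample_rate=44100, rt60=1.5):
    """Late reverberation modeled as exponentially decaying Gaussian noise.

    The decay constant is chosen so that the envelope falls by 60 dB
    (a factor of 1000) after rt60 seconds, matching the usual RT60
    definition of reverberation time.
    """
    decay = math.log(1000.0) / (rt60 * sample_rate)
    return [random.gauss(0.0, 1.0) * math.exp(-decay * n)
            for n in range(length)]

# A short late-reverb tail that could be convolved with a dry signal:
tail = late_reverb_ir(4410, sample_rate=44100, rt60=0.5)
```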

3.3.4 Hybrid Modeling Methods

Hybrid models have been proposed [Rindel, 2000], where one method (e.g.
Image Source) is used for the very first reflections, while another (e.g. ray
tracing) is used for later stages. A very common approach is also to use
statistical methods for the late portion of the reverberation.

3.3.5 Reverberation
Simulating the effect of reverberation can be approached either from a
physical or a perceptual point of view. Numerous algorithms have
been proposed [Gardner, 2001]. In recent years, convolution reverberators have
gained popularity. These work by recording an impulse response of an actual
space and then convolving it with the audio signal. An advantage of this
approach is that it is very realistic. A disadvantage compared to other approaches
is that changes to the space are not possible.
The physical simulation approach attempts to compute the impulse response
that would result from a particular space, based on a computer model
of the space's geometry. Usually, these calculations are performed using one of
the geometric techniques presented in Section 3.3, in particular Ray Tracing or
Beam Tracing. The Image Source method can also work for simple rooms; for
more complex rooms, it is practical only for calculating the lower order reflections.
The perceptual approach gained popularity in real-time applications be-
cause of the efficiency and simplicity of the algorithms. Room properties are
expressed as a set of subjective parameters, like dry/wet, damp level, etc.
A common approach to reverberation is modeling it in two separate stages:
early reflections, and late, or diffuse reverberation. The early reflections are usu-
ally geometrical calculations, while the late reverberation is generally computed
using a statistical approach.
The late reverberation is a cloud of reflected sound energy traveling in
all directions. The sound intensity of the late reverb (or diffuse reverberation)
is partially independent from the location of the sound source in the room
[Gardner, 1992].

Even convolution reverberators have added controls that modify parameters of the room
by modifying the impulse response.
Object Oriented Programming and Design

This chapter provides a brief introduction to Object Oriented Programming
(OOP) to be used as a reference and to clarify technical terms used in later
chapters. Some of the Software Design Patterns used are also described. The
information in this chapter is by no means a complete guide to these subjects;
if the reader is unfamiliar with them, consulting the works cited in the text
would be beneficial.

4.1 Object Oriented Programming

The common use of the word "object" refers to anything that is tangible,
e.g. a book, a bicycle, a table, etc. In software engineering lingo, the meaning is
not much different, except for the fact that anything can be thought of as an
object. A musical note, for example, could be an object.
Class: In OOP, a class can be thought of as a stencil from which new
objects take their properties and behavior. Objects that share properties and
behavior are said to belong to the same class.
Inheritance: When an object inherits from another object (the parent),
the characteristics (state) and behavior of the parent object will also be present
in the child. Any object can have children, and each of these children can have
more children. Objects are therefore hierarchically classified from general to
particular.
In the inheritance tree, not all classes are meant to be instantiated. Instead,
their role is to provide a general, abstract representation of a group or
class of objects. Oftentimes they are used to provide a homogeneous interface.
These classes are referred to as Abstract.
The following analogy might better illustrate the concepts of objects, classes
and inheritance: a Road Bicycle and a Mountain Bicycle can both be subclasses
of (or inherit from) class Bicycle. The mountain bike Samantha owns is an instance
(an actual object) of the Mountain Bicycle class. Class Bicycle can be modeled
as an Abstract class providing an interface and behavior common to all bicycles.
Composition: Most objects are made out of many smaller objects; e.g.,
a bicycle is composed of wheels, a saddle, a frame, cranks, brakes, etc. Following
that line of thought, we could say that an Audio Reproduction System is an
object composed of other objects, like a player, loudspeakers, etc., and each of
those might be composed of other objects.
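The bicycle analogy can be expressed directly in code. The following Python sketch illustrates inheritance and composition; the class and attribute names are invented for the example:

```python
class Bicycle:
    """Abstract parent class providing the common interface."""
    def describe(self):
        raise NotImplementedError  # subclasses must implement this

class Wheel:
    def __init__(self, diameter_mm):
        self.diameter_mm = diameter_mm

class MountainBicycle(Bicycle):
    """Inherits the Bicycle interface and is composed of Wheel objects."""
    def __init__(self):
        self.wheels = [Wheel(559), Wheel(559)]  # composition
    def describe(self):
        return "mountain bike with %d wheels" % len(self.wheels)

# Samantha's mountain bike is an instance (an actual object) of the
# MountainBicycle class, and is also a Bicycle by inheritance.
samantha_bike = MountainBicycle()
print(samantha_bike.describe())            # mountain bike with 2 wheels
print(isinstance(samantha_bike, Bicycle))  # True
```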

4.1.1 The Unified Modeling Language

The concepts explained in the previous section can be represented in many
different ways. A common and simple way to communicate and design models
is by using a Modeling Language. UML (Unified Modeling Language) is a
well known and widely used language. Since its appearance in 1997, UML
has grown, gaining complexity. UML is used in many different ways, ranging
from software design to using it as a programming language (known as Model
Driven Architecture or MDA) [Fowler, 2003]. The use of UML in this work is
limited to Class diagrams (one of many diagram types described in the UML 2
specification), as a way to better illustrate the described models. The diagram
below shows the basic notation used in class diagrams.

[Diagram: a class box for ClassB listing a public attribute, a private attribute, and an operation with its parameter and return types; annotations illustrate inheritance (ClassB derives from ClassA), composition (ClassB is made of one or more ClassX objects), aggregation (ClassB contains zero or more ClassY objects), notes attached to elements, and the convention that an abstract class name is written in italics.]

Figure 4.1: Example of a UML Class Diagram


4.2 Object Oriented Design Patterns

Design patterns describe a common engineering problem and present a
design solution that is proven to work. Recurring design issues are not necessarily
identical, but are oftentimes very similar; design patterns are usually
written in a way that allows for variations of the problem. Knowledge of
the most common patterns is a very useful tool for any software engineer. Additionally,
they serve as an easy way to communicate a design without the need
to explain it (assuming both sides are familiar with the particular pattern).
The design patterns described below are used in the system design pro-
posed in this thesis. A detailed view on these and other patterns, can be found
in [Gamma et al., 1995].
Observer (Used in the Speaker Layout class): Allows an object to inform
or notify any number of other objects about something. The object that sends
the notification, which we will refer to as the Model (often called the Subject),
needs to know nothing about the receivers of the notification (the Observers).
This is possible by letting the objects that want to be notified register
themselves as Observers of the Model object. Subsequently, when the Model
wants to tell the Observers something, it steps one by one through the list
of Observers, sending the notification.
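A minimal sketch of the Observer pattern follows. The thesis applies this pattern to the Speaker Layout class, but the concrete classes shown here are invented for illustration:

```python
class Model:
    """The object being observed (often called the Subject)."""
    def __init__(self):
        self._observers = []
    def add(self, observer):
        self._observers.append(observer)
    def remove(self, observer):
        self._observers.remove(observer)
    def notify_observers(self, event):
        # Step one by one through the list, sending the notification.
        for observer in self._observers:
            observer.update(event)

class LayoutObserver:
    """A hypothetical observer that records notifications it receives."""
    def __init__(self):
        self.events = []
    def update(self, event):
        self.events.append(event)

layout = Model()
panner = LayoutObserver()
layout.add(panner)
layout.notify_observers("layout changed")
print(panner.events)  # ['layout changed']
```

The Model never needs to know the concrete type of its observers; anything implementing update() can register itself.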

[Diagram: an abstract Model class with add(observer), remove(observer) and notifyObservers() operations, associated with an abstract Observer class declaring update(); ConcreteModel and ConcreteObserver subclasses implement notify() and update(), and notifyObservers() calls update() on every registered observer.]

Figure 4.2: A simplified class diagram of the Observer Design Pattern

Strategy (Used in the Panner classes): Provides a unified interface to a
group of related algorithms, thus making them interchangeable. A client can
have an algorithm that takes different forms (or behaves differently) if desired.
To accomplish this, an abstract Strategy class acts as the parent of concrete
subclasses that each implement an algorithm. The client then holds a reference
to the Strategy (the base class), allowing the algorithm to be changed dynamically.
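A minimal sketch of the Strategy pattern applied to panning follows. The concrete panner classes are invented for illustration and do not reflect the actual Panner classes of the proposed framework:

```python
import math

class Panner:
    """Abstract Strategy: a unified interface for panning algorithms."""
    def gains(self, pan):
        raise NotImplementedError

class LinearPanner(Panner):
    def gains(self, pan):
        return 1.0 - pan, pan

class ConstantPowerPanner(Panner):
    def gains(self, pan):
        angle = pan * math.pi / 2.0
        return math.cos(angle), math.sin(angle)

class Client:
    """Holds a reference to the abstract Strategy, so the concrete
    algorithm can be swapped at run time."""
    def __init__(self, panner):
        self.panner = panner

client = Client(LinearPanner())
left, right = client.panner.gains(0.5)   # (0.5, 0.5)
client.panner = ConstantPowerPanner()    # swap the algorithm dynamically
left, right = client.panner.gains(0.5)   # both about 0.707
```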
Facade (Used in Spatializers): Simplifies the interaction with a subsystem
by defining an interface that otherwise would be complex to use. When a
particular task requires the client to deal with many different classes that
work together, a Facade can provide a better interaction: the Facade class
deals with all those classes and lets the client deal only with the Facade.
Oftentimes this pattern is used as a layering mechanism providing a single
entry point to a subsystem (see Figure 4.3).



Figure 4.3: Class diagram of the Facade Design Pattern

Decorator: The Decorator design pattern presents a simple way to add
functionality dynamically to any object. A decorator works by wrapping another
object and posing as though it were that object. The functionality of the original
object is still performed by it, as the decorator forwards any requests to the
wrapped object.
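A minimal sketch of the Decorator pattern follows; the audio-flavored class names are invented for illustration:

```python
class AudioSource:
    """A trivial component producing a constant sample."""
    def next_sample(self):
        return 1.0

class GainDecorator:
    """Wraps another object, poses as it, and forwards requests,
    adding a gain stage on the way."""
    def __init__(self, wrapped, gain):
        self._wrapped = wrapped
        self._gain = gain
    def next_sample(self):
        # Forward the request to the wrapped object, then decorate it.
        return self._gain * self._wrapped.next_sample()
    def __getattr__(self, name):
        # Forward every other request untouched.
        return getattr(self._wrapped, name)

quiet = GainDecorator(AudioSource(), 0.5)
print(quiet.next_sample())  # 0.5
```

Because the decorator exposes the same interface, clients can use the decorated object anywhere the original would be used, and decorators can be stacked.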

4.3 Software Engineering and Design

Software development can be approached in many different ways, and no
approach is ideal for every project or situation. Two of the most popular development
processes are the waterfall and iterative approaches. In the Waterfall process
each phase depends on the previous one: the results of one phase or stage
are used as the starting point of the next. A common project could be
divided into the following phases: Functional Analysis, Design, Implementation
and Testing.
The Iterative Design process is based on the idea of a repeating cycle of
phases, refining the design with each iteration. As a consequence of the cyclic
process, requirements appear and change during development. Instead of
working straight toward a final product, preview versions of the system (known
as Prototypes) are built, allowing further analysis and testing for refinement in the
next iteration. Documentation is also generated (or modified) at each iteration,
building up to a complete design.
Prototyping is one of the main advantages iterative design processes offer. The
prototypes can be used for testing, quantitative analysis, user analysis, etc.,
providing feedback. The results of these analyses can then be used in the next
iteration, improving the design.

4.3.1 Usability Engineering

The term Usability Engineering refers to the different methods for determining
how a system will be used. The resulting documents are used during design and
development with the goal of building a better system. In software engineering,
usability has been approached in a variety of ways: a) performance and efficiency,
b) cognition and mental models, and c) collaboration [Rosson and Carroll, 2001].
This work paid most attention to the cognitive approach.
A common approach to user analysis is the Use Scenario. A use scenario is
like a story that describes a situation in which the system would be used. It is
also common to have Problem Scenarios, where the story presents a situation
in which a particular problem has to be addressed. By analyzing these scenarios,
system requirements can be obtained based on user needs and actual usage,
rather than just guessed.
Use Cases are also commonly used in the analysis phase and consist of
a step-by-step description of the user interaction with the system. Use Cases
are used for finding the functional requirements of a system. These are often
derived from the Use Scenarios.

4.3.2 Models
As already mentioned, the analysis stages help produce a list of system
requirements from which designers build a model or models of the system. These
models are then used to build the actual system. In this context a model refers
to a general description of the system.

A system can have many models, each with a different purpose. Users
have their own models of how a system should work. These mental models
can be classified into two kinds: those that model how the system works (system
models) and those that model how people interact with the system (interaction
models). Oftentimes it is better to separate the interaction model from the system
model. In this way, the interaction model can be designed keeping in mind the
mental model of the user. When these two models match, interaction with
the system is intuitive and easy, as the user can predict the effects of his or
her actions.
Donald Norman [Norman, 2002] captures the importance of making the
distinction between the system and interaction models in the following sentence:

"... after all, scissors, pens and light switches are pretty simple
devices. There's no need to understand the underlying physics or
chemistry of each device we own, simply the relationship between
the controls and the outcome."

System Analysis & Design

User Analysis

This chapter presents the results of a thorough user analysis performed to better
understand the system. One of the primary objectives in the development
of the proposed system was to design it with usability in mind. Most software
libraries are designed considering only functionality and then efficiency,
oftentimes ignoring or barely considering the target developers as users. The
assumption is that the program will be used by a scientist, and therefore usability
does not matter. Such assumptions have produced many unfriendly libraries
and programming languages.
A standard Usability Engineering analysis was performed, borrowing only
those aspects that apply to framework development. For example, Use Scenarios
cannot be written in the standard way, as there is no real-time interaction with
the system. The information described in this section is later used to generate
a set of requirements for the system design, so that user needs are satisfied.

5.1 Target Users

The design of our framework is mainly targeted at software developers
who work with multimedia content. As the primary users, most decisions will be
weighted towards their needs. However, this does not mean that other possible
users are left unattended. This section describes the needs of the different
user types that were considered in the design. The users considered as the
main target are classified as Primary, while other potential users are classified
as Secondary.

5.1.1 Primary Users

Users included in this category are the most likely to find an adaptive
spatial audio framework useful. Considering that the author’s primary goal
is to facilitate multichannel audio software development, the primary target


users will be multimedia developers. Strictly speaking, only those with software
programming knowledge and an interest in developing an audio application
would constitute the user base. The list below shows the primary target users,
followed by a description for each item.

• Software Developers: e.g. audio tool development.

• Spatial Audio Researchers: e.g. adding new components, or improving
the system.

Software Engineers / Developers Software developers tend to look
for software libraries that are simple to use, but at the same time are modular,
flexible, efficient, powerful and up-to-date with changes in technology. In the
author's vision, a high-level API that successfully hides the kernel information
is usually preferred. Having said this, we should also note that every developer
might need to interact with the system at different levels, depending on the
program's use.
Software development, in this context, covers a broad range of uses. On
one hand, video game programmers need a real-time system that does not have
to provide the most accurate rendering. At the other extreme, an acoustician
might need a very accurate representation of a space. The framework's design
should not favor either side; it should offer different levels of interaction as
needed.
Spatial Audio Researchers The background and knowledge of this type
of user can vary significantly. Spatial audio research has been approached by
physicists, computer scientists, musicians, mathematicians, or any combination
of the above. Nonetheless, it is assumed that an individual in search of extending
a spatial audio framework will have basic mathematics knowledge, programming
experience, and probably music and/or physics knowledge. Such a user would
need a flexible, modular system. Improving, extending or manipulating the
system should not require familiarity with its entire functionality, and changes
to one area should be possible without affecting other areas.

5.1.2 Secondary Users

Even though limiting the group of users to those mentioned in the previous
section could simplify the framework's design, we believe that the system could
prove helpful to other users as well. The user groups presented below will
probably have little programming knowledge and a basic or no understanding
of the different spatialization techniques. Regardless, given the flexibility of
software programming versus using a commercial application, more and more
users in the following list are likely to write custom programs in order to achieve
their goals.
Classifying these users as Secondary does not mean they are given less
consideration than the primary users. The design will attempt to satisfy all
target users, but in the event of a tradeoff, primary users will have precedence.
One of the reasons these user sectors are classified as Secondary is that most
of the time they prefer to use pre-built software applications, as opposed to
building them themselves.

• Acousticians: e.g. room simulation.

• Scientists: e.g. auditory displays (data sonification).
• Musicians: e.g. spatial manipulation in contemporary music.
• Media Artists: e.g. installations using audio.

In general, all these users would require the following:

• Real-time positioning control, possibly at the cost of spatial accuracy
(with the exception of acoustical simulations, where accuracy is key).

• A simple programming interface. Most of these users have very little
programming knowledge.

• Most of the inner workings of the spatialization techniques should be
hidden. The framework should act as a black box that magically
distributes sound to the speakers.

• Portability and cross-platform support. This framework would be useless
if it could not be integrated into other projects. Users should be able to
use it as a spatialization library.

5.2 Use Scenarios

The following Problem Scenarios represent a range of possible situations
where the framework proposed in this document would be of great use. These
scenarios were considered during the design and development stages. Each
scenario possesses different requirements that are to be addressed during the
design process.

• Scenario 1: Video Game Development

Ramon, a software developer at a video game company, just started work-
ing on what will be the next-generation game. 3D audio is a key component
for a truly immersive experience. The game is about the revenge of Vruz
Lea. During the day, the character has to fight on the streets for money,
while at night he fights underground, in search of the person who killed his
father. The night scenes specifically require an accurate representation of
sound location: enemies can attack from behind in the dark, and Vruz Lea's
main guide is his hearing.

• Scenario 2: A new algorithm for spatial audio has to be implemented
and tested

Bob, a PhD student, developed a new technique for audio spatialization.
In theory, this technique is more efficient and provides better quality than
previous methods. In order to prove his theory, Bob has to run a set of
perceptual tests, comparing previous techniques with the new one being
proposed.
• Scenario 3: Contemporary Music

Lancelot, a composer of electronic music, owns a studio where he
spends most of his time composing. The audio system in the studio
consists of seven speakers around the listener, one above the head, and a
sub-woofer. Lancelot has been commissioned to write a piece for a new
48-channel audio system. He will not have access to this audio system
until a day before the premiere.

All these scenarios have solutions with current technology, but the solution
is not always the most intuitive or most convenient one.

5.3 Use Cases

The following use cases are written assuming that programming is an
interactive task, where calling a function performs the action at that instant
in time (much like an interpreted programming language). This assumption
does not affect the design for those languages that have to be compiled before
seeing the results.
Each use case incrementally adds functionality to the basic task presented
in the very first use case. Thus, more complex system requirements can refer
to simpler cases to perform the necessary tasks. The end goal (which indeed
makes use of the simpler, more basic goals) is simulating the acoustics of a
physical space.
Use Case 1:
The following use case shows the steps a developer has to take in order
to successfully modify a sound so that it appears to come from a particular
position.
Goal: Place a sound at a fixed position around the listener.

• Developer: Creates a "sound object".

• System: Sound file is loaded.
• Developer: Specifies the position of the sound
(e.g. mySound.setPosition(x, y, z)).
• Developer: Triggers or plays the sound.
• System: Loads the default speaker configuration.
• System: Sound starts playing at the specified location.

Result: The sound should be perceived from the specified position.

The use case presented above is simple and intuitive. It reduces the usage
of the framework to specifying the position of a sound. This case is analogous
to moving, in real space, an object capable of producing sound. Stationary
sound sources (as demonstrated above) can be helpful in some situations, but
most of the time, sound sources move. Cars, birds, people, etc., constantly
change their position. The following use case variation describes the program
interaction in such situations.

Variation to Use Case 1 Alter the position of a sound by means of an
external device, or an on-screen control. The developer would then have
to set the position every time the setting of the external control changes.

Use Case 2: In addition to positioning sound in space, some users might be
interested in specifying the shape, size, or structure of the sound source, which
translates into different sound radiation patterns. In the previous use cases we
assumed that sound sources were omnidirectional.
Goal: Perform a simulation of the acoustics of a space.

• Developer: Specifies the position and number of speakers of the audio
system.
• Developer: Creates a "sound file object".
• System: Sound file is loaded.
• Developer: Sets the position of the object.
• Developer: Creates an Auralizer, providing a file with the description
of a room.
• System: Loads and parses the room file (scene).
• System: Runs an analysis of the user's audio system setup and loads the
appropriate spatializer.
• Developer: Connects the sound file to the Auralizer.
• Developer: Starts the simulation.
• System: Starts playing the sound.

In conclusion, the goal of the proposed project is to design a framework

that looks and behaves along the lines of the use cases presented in this section.

5.4 System Requirements

A set of system requirements was generated based on the user analysis
results. These requirements define the interaction models and helped make
informed design decisions for the system models. Some of the requirements
considered include:

• Simple, easy-to-use interface.

• Flexible - should accommodate the different uses.
• Extensible - sound spatialization technology is still an area of continuous
development, requiring frequent updates to the system. These updates
should be simple to perform, and should not affect unmodified sectors of
the system.
• Modular - in addition to promoting flexibility, modularity offers a high
degree of control over a system.
• Real-time and non-real-time processing.
• Different levels of accuracy (directly related to efficiency).
• Efficient.

Flexibility and ease of use, two major requirements, conflict with each
other: flexibility oftentimes adds complexity, producing a hard-to-use system.
Chapter 6 presents a solution to the problem, based on the idea of providing
different interaction interface levels. The high level interaction model should
hide the system model, presenting an intuitive and simple interface. Lower
level interaction models would provide increasing levels of control, sacrificing
some of the simplicity of the higher level models.

The models presented below show different system representations designed to
satisfy the different user needs. [Lidwell et al., 2003] classifies mental models in
two groups: those that model how the system works (system models) and those
that model how people interact with the system (interaction models). In most
cases, users should only have to deal with the interaction model (or its
implementation, which should be as close to the ideal model as possible).

6.1 Interaction Models

The interaction model was designed with ease of use and intuitiveness in
mind at all times. The model aims to follow the goal stated in [Norman, 2002]:
well-designed objects should be easy to interpret and understand, and contain
visible clues to their operation. When designing a framework, the visible clues
cannot yet be implemented (as users handle code, not actual objects), but a
good mapping of the real behaviors of objects to their abstract representation
in the software framework should promote a good design.
Intuitively, people expect to interact with a computer system just as they
do with objects in their everyday life. The most natural interaction between a
human and any object is direct manipulation. One generally expects to associate
the localization of a sound with the physical presence of its source. In a virtual
space created by a computer system, the most intuitive action for locating a
sound in the virtual 3D space is to manipulate a virtual/imaginary object that
is considered its source. Therefore, setting the virtual object's position in space
should be the only operation needed for assigning the sound's location.
Most users are likely to be comfortable using the aforementioned inter-
action model, as it is very close to their everyday interaction with real objects.
The problem is that, due to limitations of current spatialization techniques when
positioning sounds (see Chapter 3), some users might want to specify a partic-
ular technique. Depending on the intended use of the framework and the user's
background, some users will expect a different type of interaction.
A user with no audio knowledge might prefer the most natural
model, where setting the object's position takes care of rendering the sound
from that position. To achieve this, an expert system behind the scenes might
run an analysis of the user's setup and choose the most adequate spatialization
technique.
More experienced users could choose a particular rendering technique; the
proposed model does not account for this condition. In such a situation, users
will have to deal directly with the system model.

6.2 System Meta-Model

During the first design stages, the CREATE Signal Library (CSL) was
taken as the primary model upon which the Spatial Audio framework proposed
in this document would be built. It wasn't until some time later that it was
brought to the attention of the author that considering, or defining, a
system meta-model would be useful. The obvious choice was the meta-model
upon which CSL itself was built.
Just months prior to this project, a group of Media Arts and Technology
(MAT ) students at the University of California, Santa Barbara (UCSB) (the au-
thor of this thesis included), together with and under the guidance of professors
Stephen Pope and Xavier Amatriain, revised CSL [Pope et al., 2006], incorpo-
rating many of the concepts proposed by X. Amatriain in [Amatriain, 2004].
Amatriain proposes an Object Oriented Meta-Model for Digital Signal
Processing (4MS). The goal of 4MS is to offer a generic system meta-model
that can be used to design multimedia processing frameworks. 4MS combines
the advantages of the object-oriented paradigm, system engineering techniques
and graphical models of computation [Amatriain and Pope, ].
4MS classifies signal processing objects into two categories: Processing
objects that operate on data and control messages, and Processing Data objects
that hold the data to be processed. Communication between Processing objects

is done using Ports (the Ports mechanism is analogous to the input and output ports
found in a physical device). The state of the Processing objects is handled by
an asynchronous Control mechanism.

Figure 6.1: Basic elements in 4MS: Processing objects, represented
by boxes, are connected through ports (round connection
slots), defining a synchronous data flow from left to
right, or through controls (square slots), defining a
vertical event-driven mechanism [Amatriain, 2004].

Processing Data Processing Data objects passively hold the data that
Processing objects modify, offering "a homogeneous interface to media data"
[Amatriain, 2004]. These objects are further classified, defining different types
of data. In this particular design, Processing Data objects would carry audio
data (i.e. a buffer of samples).
Processing Processing objects are the main building blocks in a 4MS-
based system. Processing objects encapsulate an algorithm meant to perform
an action on the given data, transforming its state/essence. Data processing is
done synchronously, as opposed to control data, which is sent as asynchronous
events. The state of the Processing object is handled by a Configuration
mechanism.
Processing objects are designed as composites, able to hold other
Processing objects, which in turn can have other Processing objects inside (see
figure 6.1). Ultimately, complex signal graphs can be encapsulated in a single
Processing object.

6.3 System Model

This project's design can be easily described with the 4MS meta-model:
SpatialData is a kind of ProcessingData with added information about the
sound's spatial properties (such as position and orientation), Panners and
Spatializers are Processing objects, and other classes (such as the SpeakerLayout)
act as objects for their Configuration. Figure 6.2 below serves as an overview of
the whole system model. Detailed descriptions of each component are presented
in the subsequent sections.

Figure 6.2: System Model Overview. SpatialProcessing and Spatial-
Data (top) constitute the base of the Model; below them sit
the SpeakerLayout, Spatializer, Simulator, Auralizer and
Modeler classes.

6.3.1 Spatial Data

The SpatialData is a special kind of ProcessingData that, in addition to
storing the audio data, carries extra information required by the SpatialProcess-
ing objects to perform their processing. The extra data stored can include
information about the position, orientation, shape and size of the object that
produces the sound. When no information is specified, a set of default values
should be used. For example, if a Panner does not receive any information
regarding the object's position, a front-center location is assumed.

6.3.2 Spatial Processing

SpatialProcessing is a special kind of Processing, differing in that it accepts
only SpatialData as input for processing, not just any kind of ProcessingData.
Limiting the connections to SpatialData ensures that the data being processed
has the information needed to be spatialized.

6.3.3 Speaker Layout

The Speaker Layout class keeps and manages a list of all the loudspeakers
that constitute the user’s audio reproduction system. Each of these loudspeakers
is represented by the Loudspeaker class, which stores information about the
device. The minimum information needed for a basic spatial audio system is
the position of each loudspeaker relative to a center (0 azimuth, 0 elevation)
position. Additionally, speakers could hold their frequency response, directivity
pattern or other properties.
The SpeakerLayout also keeps track of the channel (or speaker index) to
which the loudspeaker is connected. Clients of the class (e.g., Panner, see below)
can ask for the number of speakers in the system, as well as the properties of
each loudspeaker. The figure below shows the collaboration diagram of the
SpeakerLayout class.
SpeakerLayout clients can register as observers in order to receive notifi-
cations of any changes to the layout, and can thus update any information
that depends on the audio system setup. In other words, any client of the
SpeakerLayout that wants to know when and if anything changes in the layout
has to register itself as an observer.

Figure 6.3: SpeakerLayout class diagram. The SpeakerLayout manages
a default layout (defaultSpeakerLayout(), setDefaultSpeaker-
Layout()) and offers numSpeakers(), addSpeaker() and
positionOfSpeaker(index); each speaker exposes position(),
azimuth(), elevation() and distance().
To simplify usage, a default layout is created when instantiating any object
that uses the speaker layout, so that if no layout is set, the default layout is
employed. Any layout can be set as the user default, so that later instances of
objects can reuse it without the need to set it manually for each instance.
The use of multiple layouts is allowed by assigning the desired layout to the
object that will use it. In that case, the object will use the assigned layout
instead of the default.

Figure 6.4: ActiveSpeakerLayout, inheriting from both SpeakerLayout
and Processing, can be used in the graph as a Processing,
accounting for different audio configurations.

NOTE: The speaker layout should offer an optional "optimal" layout, compen-
sating for any delay due to radius dissimilarities between the speakers' physical
locations and the center position. When asked for a speaker position, it re-
turns the position with the necessary delay. The delay is obtained by finding
the farthest Loudspeaker, and moving (i.e. delaying) all speakers that far back.
This operation is performed by the DistanceSimulator.

6.3.4 Panner
Sound positioning received particular attention during the analysis and
design phases, as it plays a major role in the design. The traditional, already
familiar panning model, where the sound position is set using the pan pot of a
sound mixing console, proved to be an inconvenient model for sound positioning
when dealing with multimedia content. The model proposed, already described
in the Interaction Model section, wasn't welcomed by all users, as expected. Still,
it is sometimes better to have people learn a new model that is clear and consistent
than to use a familiar model that does not fit [Lidwell et al., 2003].
The Panner abstract class is the base of any SpatialProcessing object capa-
ble of modifying a signal so that it appears to be placed at a particular position
in the listening space. The name Panner was chosen due to the familiarity
most people have with the concept of stereo panning. However, in this model,
the panner's capabilities were extended to cover the full 3D periphery, as opposed
to only the 60° in the horizontal plane covered by a typical stereo panner. As
with conventional 2D panners, distance information is not necessarily calculated
at this level (i.e. by the panner).6

Figure 6.5: Panner class diagram. Panner inherits from both Spatial-
Processing and Observer (top), keeps a SpeakerLayout, and
handles any number of SpatialData objects (panners can
process multiple sources at a time); its interface includes
setSpeakerLayout(), addSource(), removeSource() and
numSources().

Concrete subclasses should implement the different spatialization tech-
niques (Vector Base Amplitude Panner, Binaural Panner, Ambisonic Panner,
etc.). Keeping a common base class allows for the dynamic substitution of
panning techniques, as described by the Strategy design pattern
[Gamma et al., 1995]. The classes described below (see Section 6.4.1) are
implemented using this pattern. As a consequence of allowing such dynamism,
the distance processing had to be decoupled from panning and performed by a
separate Processing object.

6 A conventional stereo or surround panner does not take distance into consideration.
In that case, panning is assumed to happen at the distance of the speakers.


Figure 6.6: Concrete subclasses of the Panner base class: Ambisonic,
Amplitude and Binaural panners, with GFormat, Stereo,
Surround and Quad panners below them.

The Panner is just the abstract class from which particular spatialization
techniques can be implemented. For example, a VBAP implementation inherits
from the Panner, adding its particular algorithm to be performed on the sound.
For simplicity, a Panner should be able to process multiple sources, not
just one as other Processing objects do. In a way, a Panner can be seen as
a multichannel mixer of spatial sources, in that it receives any number of input
data to process and outputs a "mixed down" version.

6.3.5 Distance Simulator

Even though it makes sense, from a system standpoint, to have a separate
processor simulate each individual distance cue, it is simpler for users to think
of distance as one single process. In general, therefore, all distance cues are
encapsulated into a single distance simulator. Unless a particular effect is needed,
there is no reason to use the distance cues separately.
Distance Cue The processes required to simulate the effects distance
produces on sound are just common filters and other simple manipulations.
Each of the factors that affect the sound wave on its path to the listener has
to be modeled as a separate process. These processes have nothing in
common other than the fact that their operations are a function of the distance.

Figure 6.7: Distance Simulator class diagram. A DistanceSimulator
holds one or more DistanceCues, each offering
compute(distance) and process(data); concrete cues include
AirAbsorptionCue and DistanceAttenuationCue.
A DistanceCue is a Processing that sets the parameters of other proces-
sors based on the distance value. For example, air absorption is a frequency-
dependent attenuation phenomenon that can be implemented using a low-pass
filter. The only role of the "Air Absorber" Processing is to translate the distance
effect into the cutoff frequency value for the low-pass filter.
Flyweight pattern: The number of sound sources to be rendered can
range from a single source to thousands. DistanceCues are Flyweights, which
means that, in order to accomplish their function, they must be given all the
required information on each call.
If desired, a Processing can easily be built from a DistanceCue object. The
Processing version would just wrap around the DistanceCue and, every time
it is called, supply the appropriate information. In other words, DistanceCues
can be treated as "passive", memoryless algorithms: each one just performs its
process and nothing more.

6.3.6 Acoustic Modeler

AcousticModeler provides a unified interface for calculating the acoustic
response of a space. RayBasedModeler, WaveBasedModeler and StochasticMod-
eler are AcousticModeler subclasses representing the different approaches to
calculating the acoustic response of spaces.
An AcousticModeler keeps the description of a space (room, scene graph,
etc.), used to perform the calculations. Additionally, the positions of the sound
source and the listener must be known. Each AcousticModeler can represent
one listener.
Figure 6.8: Acoustic Modeler class diagram. The AcousticModeler
keeps a scene graph (Room) and a listener Position, and
offers setListener(), setRoom() and reflectionInfoForSource();
its subclasses cover wave-based (e.g. BEM), stochastic, and
ray-based (e.g. ray tracing) approaches.

AcousticModeler differs from the previously described classes in that it is not
a SpatialProcessing. Instead of modifying an audio signal, it provides the data
needed for further processing. Concrete subclasses of AcousticModeler could
also inherit from Processing, allowing them to be used as Processors. For example,
an AcousticModeler that calculates the late reverberation (a subclass
of the StochasticModeler class) could also have a version that is a Processing,
thus providing both a passive version and an active version that does modify a
ProcessingData stream.

6.4 High-level Model

The objects presented in this section are meant to simplify the use of the
underlying system. Processes that are meant to work together are grouped
so that clients access them as one entity. Thus, clients do not have to handle
and configure many small elements; all the logic is performed by these higher-
level objects. These objects can also be seen as interface layers that effec-
tively provide the interaction models described in Section 6.1, offering the right
balance of flexibility and customizability versus simplicity.

6.4.1 Spatializer
The Spatializer acts as a Facade, hiding the underlying system from its
clients. The user does not need to manage panning and distance
simulation for each audio source; the Spatializer accomplishes these tasks. Each
time a sound source (SpatialData) is added to the Spatializer, it internally
creates a DistanceSimulator and adds it to the Panner. Another advantage is
that the Spatializer frees the client from knowing about the concrete panning
techniques, by using an object (the SpeakerLayoutExpert) that can employ
heuristics to find the most adequate panning technique by analyzing the user's
loudspeaker setup (if the user doesn't specify the desired Panner).


Figure 6.9: Spatializer class diagram. The Spatializer aggregates one
Panner and any number of SpatialData sources, and offers
setPanningMode(type).

The SpeakerLayoutExpert is one of the primary contributions of this work.
Given a speaker layout, it chooses the most appropriate panner (i.e. the one that
will produce the best perceptual results). Fine-tuning the system will require
many perceptual tests; with the current knowledge of each technique, however,
the object can at least perform a simple analysis where, depending on the
number of loudspeakers and the regularity of the layout, a proper technique
is chosen.

6.4.2 Auralizer
The Auralizer is a special kind of Spatializer, with the extra functionality of
calculating room acoustics. An Auralizer has knowledge of the properties of
a space (scene graph) and makes use of an AcousticModeler to obtain the acoustic
response of the modeled space. The room response is then spatialized, providing
a full acoustical experience. Only one listener can be specified per Auralizer,
so one Auralizer per listener is required when multiple listeners are needed.



Figure 6.10: Auralizer class diagram. The Auralizer holds a Room and
offers setRoom(Room), setRoom(String sceneFile) and
setListenerPosition(Position); the setRoom() methods
delegate the request internally.

A different way to see an Auralizer is as a reverb that is calculated based
on the description of a physical space (e.g. a room). Auralizer subclasses can
implement different levels of auralization. For example, a real-time Auralizer
contains a LateReverb as a member, so that only the very first reflections are
calculated based on the room's properties, while the later section is calculated
stochastically. In this case, the parameters of the late reverb have to be set
according to the room's properties.


System Implementation

This chapter provides an overview of CSL (the CREATE Signal Library), fol-
lowed by a description of the system implemented as a proof of concept of the
aforementioned design. The code is carefully documented using Doxygen-
formatted comments, and the documentation is included with CSL in HTML
form. This chapter is not meant to replace the documentation, but to serve as a
complement, particularly for those who want to improve or extend the system.
See Future Work at the end of this chapter for ideas on improvements.

7.1 CSL Overview

CSL is a C++ library for (audio) signal synthesis and processing. CSL was
designed to be used in several ways, such as for the development of stand-alone
interactive sound synthesis programs, as a plug-in library for other applications
or plug-in hosts, or as a back-end DSP library [Pope and Ramakrishnan, 2003].
CSL has evolved through several re-writes since 1998; version 4 was implemented
by UCSB graduate students in the 2005-06 academic year [Pope et al., 2006].

7.1.1 CSL Core

The two primary components in CSL are Buffers and UnitGenerators,
which are analogous to the ProcessingData and Processing concepts discussed
in Section 6.2. A UnitGenerator periodically receives a Buffer (block) of audio
samples and is expected to modify or use the given data.
Audio data in CSL is represented as a multi-channel sample buffer. Buffers
are passive (they do not perform any operations on their data), with the exception
of being capable of allocating memory for the samples they hold. Buffers are
designed to store sample data in a convenient way, allowing optimal communi-
cation among UnitGenerators.


A UnitGenerator acts as the parent of all objects that modify or generate
audio data; all objects that want to be part of the DSP graph have to inherit
from this class. UnitGenerators are aware of their sample rate and number of
channels. Additionally, they keep a list of other UnitGenerators connected to
their outputs.
CSL is built around the void nextBuffer(Buffer *outputBuffer); me-
thod, which gets called by other UnitGenerators requesting to fill a buffer with
samples. UnitGenerator subclasses override this method to implement their
particular synthesis or processing algorithm.
Processing objects in CSL are called Effects. An Effect is a UnitGenerator
with an input port (see Controllable below). Other UnitGenerators can be
connected to its input to get processed.
UnitGenerator has additional mechanisms for handling multiple outputs
and for notifying dependent objects (Observers) when new samples are com-
puted. More extensive documentation can be found with the CSL distribution.
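As a sketch of this pull-model idiom, a minimal UnitGenerator and Effect might look as follows. The names mirror the CSL concepts described above, but the types are deliberately simplified assumptions (single channel, no sample rate or output lists), not the actual CSL class definitions:

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for CSL's multi-channel sample buffer (one channel here).
struct Buffer {
    std::vector<float> samples;
    explicit Buffer(std::size_t n) : samples(n, 0.0f) {}
};

// Parent of all objects that generate or modify audio data.
struct UnitGenerator {
    // Called by downstream objects requesting a block of samples.
    virtual void nextBuffer(Buffer& outputBuffer) = 0;
    virtual ~UnitGenerator() = default;
};

// A trivial source that outputs a constant value.
struct Constant : UnitGenerator {
    float value;
    explicit Constant(float v) : value(v) {}
    void nextBuffer(Buffer& out) override {
        for (float& s : out.samples) s = value;
    }
};

// An Effect is a UnitGenerator with an input port: it pulls samples from
// its input, then processes them in place.
struct Gain : UnitGenerator {
    UnitGenerator& input;
    float amount;
    Gain(UnitGenerator& in, float g) : input(in), amount(g) {}
    void nextBuffer(Buffer& out) override {
        input.nextBuffer(out);                     // pull from the input...
        for (float& s : out.samples) s *= amount;  // ...then process in place
    }
};
```

Requesting a buffer from the Gain object pulls the whole chain, which is exactly how a CSL DSP graph is driven from its output end.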

7.1.2 Connections
Connections between UnitGenerators are hidden from the user. The class
Controllable takes care of making the connections between input and output
ports. The user does not have to deal with Ports; instead, connections are made
by setting one UnitGenerator as the input of another. In most
cases the input can be given as a parameter to the constructor.

// UnitGenerator that loads a sound file
SoundFile cheesyGuitar( "/audio/mePlaying.wav" );

// Connects the guitar sound file to the reverb
Reverb myReverb( cheesyGuitar );

Class Controllable is a mix-in that adds the Ports mechanism to any
UnitGenerator. Controllable holds a map of Port objects that represent
the inputs, and manages the naming and processing flow of dynamic inputs.
The network of interconnected modules is known as the DSP graph.

7.1.3 Input / Output

CSL handles audio data synchronously. The IO (Input / Output) objects
are in charge of scheduling and making these periodic calls at the appropriate
time. Class IO provides a common interface for all IO objects. For example,
csl::CAIO is a concrete derived class of IO that handles communication with
Apple’s CoreAudio, allowing communication with sound cards. Other subclasses
of IO include csl::FileIO for reading and writing to files, and csl::RemoteIO for
remote communication.

7.2 Spatial Audio Subsystem

CSL’s Spatial Audio Framework can be seen as a direct implementation
of the models presented in the previous chapter. Spatializers and Auralizers,
provide an intuitive and simple interface to users. Lower level components
provide flexibility.

7.2.1 Position
The position of any object is currently represented as a point in three-
dimensional space; the internal representation uses Cartesian coordinates. A
better representation would store both the Cartesian and polar forms. The
tradeoff is memory usage versus accuracy and performance: for some operations
the Cartesian representation is ideal, but oftentimes the values have to be
converted.

7.2.2 Spatial Sound Sources

The implementation of spatial data in CSL differs from the model pro-
posed in Chapter 6. Instead of using a SpatialBuffer as the model suggests,
the spatial attributes of the sound source are attached to the UnitGenerator
(thus making it a spatial UnitGenerator). This special kind of UnitGenerator
is named SpatialSource, and it is implemented using the Decorator design
pattern, delegating its position and spatial information.
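A minimal sketch of this Decorator arrangement is shown below. The types are hypothetical simplifications (the real CSL SpatialSource carries more spatial state than a bare position and inherits the full UnitGenerator interface); the point is only the shape of the pattern: wrap an existing source, add spatial state, and delegate the audio work to the wrapped object:

```cpp
#include <vector>

struct Position { float x = 0, y = 0, z = 0; };

// Any object that can fill a block of samples (stand-in for UnitGenerator).
struct Source {
    virtual void nextBuffer(std::vector<float>& out) = 0;
    virtual ~Source() = default;
};

// A trivial concrete source used for illustration.
struct Silence : Source {
    void nextBuffer(std::vector<float>& out) override {
        for (float& s : out) s = 0.0f;
    }
};

// Decorator: adds spatial state to an existing Source without modifying it.
struct SpatialSource : Source {
    Source& wrapped;
    Position pos;
    explicit SpatialSource(Source& s) : wrapped(s) {}
    void setPosition(float x, float y, float z) { pos = {x, y, z}; }
    Position position() const { return pos; }
    void nextBuffer(std::vector<float>& out) override {
        wrapped.nextBuffer(out);  // audio is delegated to the decorated source
    }
};
```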

7.2.3 Loudspeaker Setup

The SpeakerLayout class is implemented following the model described
in Chapter 6. The default layout and its accessors are declared static, ensuring
that only one default can be set and that Panners can access it when needed.
For optimal spatial audio rendering, the distance from the center of the
listening space to each loudspeaker should be the same. If the loudspeakers
are placed at different distances, class SpeakerLayout provides a method void
normalizeSpeakerDistances(float radius); that normalizes the loudspeaker
distances to the given radius or, if none is provided, to the distance of the
farthest loudspeaker in the setup. This method, however, does not compensate
for these distance modifications; it only changes the values of the speakers,
holding the real radii in a separate array.
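The normalization just described can be sketched as follows. This is an illustrative re-implementation rather than the CSL code; the Speaker type and the convention of returning the original radii are assumptions made for the sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Speaker { float x, y, z; };

// Scale each speaker's position so that all lie at the same radius from the
// center, returning the original (real) radii so they are not lost -- as
// SpeakerLayout keeps them in a separate array. If radius <= 0, the farthest
// speaker's distance is used as the target.
std::vector<float> normalizeSpeakerDistances(std::vector<Speaker>& speakers,
                                             float radius = 0.0f) {
    std::vector<float> realRadii;
    for (const Speaker& s : speakers)
        realRadii.push_back(std::sqrt(s.x * s.x + s.y * s.y + s.z * s.z));

    float target = radius;
    if (target <= 0.0f)
        for (float r : realRadii) target = std::max(target, r);

    for (std::size_t i = 0; i < speakers.size(); ++i) {
        if (realRadii[i] == 0.0f) continue;     // speaker at the origin
        float scale = target / realRadii[i];
        speakers[i].x *= scale;
        speakers[i].y *= scale;
        speakers[i].z *= scale;
    }
    return realRadii;
}
```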
Common layouts were implemented as SpeakerLayout subclasses for con-
venience. Class StereoSpeakerLayout creates a layout with two speakers,
positioned 30° to the left and right of center. These layouts are not fixed, so
they can be used as a starting point for building a more complex layout.

// create a new SpeakerLayout [optionally a file path can be specified]
SpeakerLayout myAudioSetup;

// add two loudspeakers, at +30 and -30 degrees from the center of
// the listening space, and no elevation
myAudioSetup.addSpeaker(30.f, 0.f);
myAudioSetup.addSpeaker(-30.f, 0.f);

Class csl::ActiveSpeakerLayout is the "active" version of class Speaker-
Layout. It can be inserted in the DSP graph, so that it compensates for speaker
distances and remaps channels based on speaker numbers.

7.2.4 Panning
Class Panner is a pure abstract base class that provides management ser-
vices to its subclasses; adding and removing sound sources is handled by this
class. The Panner class inherits from both UnitGenerator and Observer. In
addition, it is intended to play the role of the Strategy in the design pattern
of the same name.
Panners have a virtual void *cache(); method that has to be imple-
mented by subclasses, returning a pointer to an object that stores the state of
the spatial sources. In this way, if the position of an object does not change, the
concrete Panner can request the previous state and use it without re-calculating
parameters when not needed.
As an Observer, csl::Panner registers itself with class csl::SpeakerLayout,
which sends a notification every time the layout changes. The Panner class
receives the notification and calls void speakerLayoutChanged();. Panner
subclasses should implement speakerLayoutChanged(); rather than the
Observer's update method, which the base class already handles.
All concrete Panners in CSL have some code, or are based on code, written
by previous students in the Media Arts and Technology program. The VBAP
Panner, for example, borrows from the work of Doug McCoy [McCoy, 2005];
the Binaural Panner is based on a VST plug-in written by Ryan Avery; and
the Ambisonic subsystem was ported and extended from the Ambisonic li-
brary built by Graham Wakefield, Florian Hollerweger, and Jorge Castellanos
[Hollerweger, 2006].

Vector Base Amplitude Panning

The VBAP technique requires building a set of loudspeaker triplets (for
periphonic panning) or pairs (for pantophonic panning) based on the user's
audio setup. This task has been assigned to a class (csl::SpeakerSetLayout)
which builds a layout of csl::SpeakerSet objects. SpeakerSet is a record class
that holds which speakers belong to the set, along with the inverse of the
loudspeaker matrix (pre-calculated for performance reasons). The primary
reason for decoupling this task from the VBAP Panner was so that only one set
of triplets exists, regardless of the number of VBAP objects.
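For the pantophonic (2D) case, the gain computation that the pre-inverted matrix enables can be sketched as below. This is an illustrative stand-alone version of Pulkki's pairwise VBAP, not the CSL implementation; the inverse of the 2x2 matrix of speaker unit vectors is what a SpeakerSet would pre-compute:

```cpp
#include <array>
#include <cmath>

struct Vec2 { float x, y; };

// Compute the gains for one loudspeaker pair: g = inverse(L) * p, where the
// columns of L are the speaker unit vectors and p is the source direction.
// The gains are then normalized to unit energy.
std::array<float, 2> vbapPairGains(Vec2 spk1, Vec2 spk2, Vec2 source) {
    // Closed-form inverse of the 2x2 column matrix L = [spk1 spk2].
    float det = spk1.x * spk2.y - spk2.x * spk1.y;
    float g1 = ( spk2.y * source.x - spk2.x * source.y) / det;
    float g2 = (-spk1.y * source.x + spk1.x * source.y) / det;
    // Energy normalization: g1^2 + g2^2 = 1.
    float norm = std::sqrt(g1 * g1 + g2 * g2);
    return { g1 / norm, g2 / norm };
}
```

A source centered between two symmetric speakers yields equal gains of 1/sqrt(2), the familiar constant-power pan law.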
An additional feature is the distinction between 2D and 3D audio
setups. The VBAP constructor accepts an optional parameter specifying whether
it should render as 2D (kPantophonic) or 3D (kPeriphonic). When not specified,
it automatically decides the rendering mode according to the loudspeaker posi-
tions. The pantophonic mode is far simpler and more efficient, and it is used by
VBAP subclasses such as csl::StereoPanner and csl::SurroundPanner.

Binaural Panner

The BinauralPanner uses a StereoSpeakerLayout by default, assuming the
two loudspeakers are headphones (ignoring the actual loudspeaker positions). A
TransauralPanner could extend the BinauralPanner if required, but it is not
currently implemented.
At creation time, the BinauralPanner uses the HRTFDatabase class to
load the HRTF data from a folder containing the HRIRs. The HRTFDatabase
currently loads a set of HRIRs obtained from the IRCAM database [irc, ]. Each
HRTF in the database is represented by an HRTF object containing the HRTF
position and the HRTF in the frequency domain. A simple improvement to the
HRTFDatabase class would be to allow loading files from any other database.

Figure 7.1: BinauralPanner Class Diagram

In the DSP loop, after getting the current position of the sound source,
the corresponding HRTF has to be determined. Currently, the HRTF closest
to the sound source position is chosen, but ideally a set of them would be
interpolated in order to obtain a more accurate result. Finally, the HRTFs and
the input data are multiplied (in the frequency domain this is equivalent to
time-domain convolution) using the low-latency block-wise FFT method.
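The core of that filtering step can be sketched as follows. The FFTs themselves and the overlap bookkeeping of the block-wise method are omitted, and the Spectrum type is a simplification made for the sketch: once the input block and the HRIR are in the frequency domain, the convolution reduces to a bin-by-bin complex multiplication:

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Apply one HRTF to one block of input, both already in the frequency
// domain. Multiplication here is equivalent to convolving the time-domain
// block with the HRIR.
Spectrum applyHRTF(const Spectrum& input, const Spectrum& hrtf) {
    Spectrum out(input.size());
    for (std::size_t k = 0; k < input.size(); ++k)
        out[k] = input[k] * hrtf[k];   // per-bin complex multiply
    return out;
}
```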
Ambisonic Panner

The AmbisonicPanner is a container class that manages Ambisonic
encoding, Ambisonic decoding, mixing of Ambisonic-encoded audio, as well as
other sound field operations performed in the Ambisonic domain. Each of these
processes is performed by a corresponding class, described below.

Ambisonic Subsystem

Every class in the Ambisonic subsystem can be used independently of the
AmbisonicPanner class. They are designed to work as Effects, capable of
processing any UnitGenerator. The use of the AmbisonicPanner is still advised
for most situations, as it provides a simpler interface.
AmbisonicUnitGenerator, the superclass of all classes in the Ambisonic sub-
system, simplifies the interaction with and manipulation of Ambisonic data by
adding the information of the encoded Ambisonic order. Thus, the order has to be
specified only once (usually at the encoding stage) and further manipulations
of the signals will not require the user to specify the order again. The Ambison-
icUnitGenerator class also provides some utility methods (see Figure 7.2).


Figure 7.2: Ambisonic Subsystem Class Diagram.

AmbisonicEncoder acts as an Effect, taking a UnitGenerator as input. It
encodes the given audio to the specified Ambisonic order. The current imple-
mentation goes up to third order.
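As an illustration of what the encoder computes per sample at first order, the standard B-format equations can be written as below. This is a generic sketch of first-order encoding, not the CSL code (the function name is hypothetical, and higher orders add further spherical harmonic components):

```cpp
#include <array>
#include <cmath>

// First-order (B-format) encoding of one mono sample. Azimuth and
// elevation are in radians; the W channel carries the conventional
// 1/sqrt(2) weighting.
std::array<float, 4> encodeFirstOrder(float sample,
                                      float azimuth, float elevation) {
    const float w = 0.70710678f;  // 1/sqrt(2)
    return {
        sample * w,                                        // W (omni)
        sample * std::cos(azimuth) * std::cos(elevation),  // X (front-back)
        sample * std::sin(azimuth) * std::cos(elevation),  // Y (left-right)
        sample * std::sin(elevation)                       // Z (up-down)
    };
}
```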
AmbisonicDecoder implements different decoding methods and flavours
(described in [Daniel, 2001]), which can be specified in the constructor. The
speaker layout is also an optional argument in the constructor; if no speaker
layout is specified, the default layout is used. If an AmbisonicUnitGenerator is
given as input, the order is obtained automatically from it. If a plain
UnitGenerator is to be decoded, the order must be specified by the client.
AmbisonicMixer is a utility class that merges (or mixes) any number of
AmbisonicUnitGenerators (UnitGenerators whose output data is in the Am-
bisonic domain). When encoding multiple sound sources, the output of each
encoder has to be passed through an AmbisonicMixer before it can be decoded.
AmbisonicRotator applies operations to ambisonically encoded audio,
manipulating the entire sound field. The implemented operations are rotation,
tilt, and tumble. The AmbisonicRotator is usually placed right before the
AmbisonicDecoder, allowing global changes to the sound field.
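The rotation operation, on a first-order signal, amounts to a 2D rotation of the X and Y components while W and Z are left untouched (tilt and tumble are the analogous rotations about the other two axes). The following is an illustrative stand-alone function, not the CSL implementation:

```cpp
#include <array>
#include <cmath>

// Rotate a first-order B-format sample set {W, X, Y, Z} about the vertical
// axis by `angle` radians. W (omni) and Z (vertical) are invariant; X and Y
// rotate like an ordinary 2D vector.
std::array<float, 4> rotateZ(const std::array<float, 4>& bfmt, float angle) {
    float c = std::cos(angle), s = std::sin(angle);
    return {
        bfmt[0],                     // W unchanged
        bfmt[1] * c - bfmt[2] * s,   // X'
        bfmt[1] * s + bfmt[2] * c,   // Y'
        bfmt[3]                      // Z unchanged
    };
}
```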

7.2.5 Distance Simulation

DistanceSimulator is a SpatialUnitGenerator that manages a list of Dis-
tanceCues. A DistanceCue represents an algorithm that simulates a particular
distance cue. All DistanceCue-derived classes must implement the following
two methods:
void compute(float distance);
void process(Buffer &inputBuffer);
The DistanceSimulator, when called to process a block of samples, it-
erates through its list of DistanceCue objects, calling compute first (recalculating
data based on distance if needed) and then process.
Additionally, a DistanceSimulator is capable of turning a simple Unit-
Generator into a SpatialUnitGenerator. In other words, DistanceSimulators
can accept a simple UnitGenerator as input, and the position is set directly on
the DistanceSimulator.
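A hypothetical concrete DistanceCue following this two-method contract might look like the sketch below. The 1/d attenuation law and the unit clamp radius are illustrative choices, not the CSL implementation; the point is the split between compute (recalculation, only when the distance changes) and process (per-block application):

```cpp
#include <algorithm>
#include <vector>

// Illustrative distance cue: inverse-distance (1/d) amplitude attenuation.
// The gain is recomputed only in compute() and applied in process(), matching
// the contract DistanceSimulator relies on.
struct InverseDistanceCue {
    float gain = 1.0f;

    void compute(float distance) {
        // Clamp so sources inside a unit radius are not amplified.
        gain = 1.0f / std::max(distance, 1.0f);
    }

    void process(std::vector<float>& buffer) {
        for (float& s : buffer) s *= gain;
    }
};
```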

7.3 Future Work

The following list presents some of the areas that need attention in the
current CSL implementation. The items are not presented in any particular
order and have no dependencies among themselves.

• WFS Wave field synthesis is one of the primary techniques that any good
spatialization framework should offer. Even though it is impractical in most
current situations, improved technology could make it widely available.

• HRTF decimation A processor-friendly version of the binaural panner that
reduces the HRTF size at instantiation.

• Interpolation of HRTFs The BinauralPanner does not interpolate HRTFs;
the current implementation selects the HRTF closest to the provided po-
sition. Simple interpolation should be easy to implement, and it has been
thoroughly documented in the literature [Duraiswami et al., 2004].

• Arbitrary HRTF databases The HRTFDatabase class shouldn't be
limited to any particular HRTF set; any set of files should serve as an
HRTF database.

• Higher-order Ambisonics The current Ambisonic implementation works
up to third order. For most practical situations third order is sufficient,
but for research and perceptual evaluations it is important to have higher
orders available.
• NFC-HOA Implementation of the Near Field Compensated Ambisonic
technique [Daniel, 2003], [Daniel and Moreau, 2004].

• Ambisonic O-format An approach to sound radiation by encoding an
object's radiation pattern [Menzies, 2002].

• Distance cues Additional cues implemented with different techniques and
levels of accuracy.

• Cross-talk cancellation Transaural reproduction is limited by its reduced
sweet spot. In spite of that, when used in a controlled environment, the
results can be of very high quality. It would be even better if the canceler
adjusted itself dynamically to the position of the listener (as proposed by
W. Gardner [Gardner, 1997]).

• Room de-convolution If the response of a room is known a priori, it is

possible to de-convolve this response from the output signals, allowing for
a clean reproduction of the simulated room.

• Adaptive spatial audio rendering The proposed system takes into
consideration the number of loudspeakers and their positions in order to
choose the most appropriate spatialization technique. Other elements could
also be important in choosing the right technique; further investigation
into this matter would be of much use.

The design principles and practical implementation of a general-purpose spatial
audio rendering framework were presented in the previous chapters of this docu-
ment. Different audio spatialization techniques were integrated into a framework
providing a single, simple interface to any system, regardless of the loudspeaker
or room setup. The system can silently adapt itself to an existing audio setup,
loading the most appropriate spatialization technique for that setup and improv-
ing the user's experience. The objective of such a framework is to facilitate the
implementation of audio and/or multimedia applications used as research tools,
art, and/or entertainment.
The design proved successful in providing a framework that simplifies mul-
timedia application development: writing a program that accurately positions
sound takes just a few minutes and less than twenty lines of code. The current
implementation is capable of positioning sound sources using the most current
audio spatialization techniques. Simple distance models were implemented, but
sound radiation was left unaddressed. The adaptive engine implementation is
simple and would benefit from perceptual tests performed on the different spa-
tialization techniques. The code and documentation have been added to the
current CSL distribution and can be obtained from the CSL site [Pope, ].
The current model is ready to be used in any multimedia framework, pro-
viding a simple yet powerful interface to spatial audio rendering. Nevertheless,
further testing and possible redesign of certain aspects of the framework would
be beneficial for future implementations.



[irc, ] IRCAM HRTF database.

[I3D, 1998] (1998). 3D Audio Rendering and Evaluation Guidelines, Level 1.0.

[I3D, 1999] (1999). Interactive 3D Audio Rendering Guidelines, Level 2.0.


[all, 2006] (2006). AlloSphere@CNSI.

[Amatriain, 2004] Amatriain, X. (2004). An Object Oriented Metamodel for

Digital Signal Processing - with a focus on Audio and Music. PhD thesis,
Universitat Pompeu Fabra.

[Amatriain and Pope, ] Amatriain, X. and Pope, S. T. An object-oriented

metamodel for multimedia processing systems. To be published at the ACM
Transactions on Multimedia Computing, Communications and Applications.

[Begault, 1994] Begault, D. R. (1994). 3D Sound for Virtual Reality and

Multimedia. NASA, National Aeronautics and Space Administration.

[Boone, 2004] Boone, M. M. (2004). Multi-actuator panels (maps) as

loudspeaker arrays for wave field synthesis.

[Chowning, 2000] Chowning, J. M. (2000). Digital sound synthesis, acoustics,
and perception: A rich intersection. In Proceedings of the COST G-6
Conference on Digital Audio Effects (DAFX-00).

[Daniel, 2001] Daniel, J. (2001). Représentation de champs acoustiques,

application à la transmission et à la reproduction de scènes sonores
complexes dans un contexte multimédia. PhD thesis, Université Paris 6.

[Daniel, 2003] Daniel, J. (2003). Spatial sound encoding including near field
effect: Introducing distance coding filters and a viable, new ambisonic
format. In Journal of the Audio Engineering Society: 23rd International
Conference, Copenhagen, Denmark.

[Daniel and Moreau, 2004] Daniel, J. and Moreau, S. (2004). Further study of
sound field coding with higher order ambisonics. In Journal of the Audio
Engineering Society, Berlin, Germany.


[Duraiswami et al., 2004] Duraiswami, R., Zotkin, D. N., and Gumerov, N. A.
(2004). Interpolation and range extrapolation of HRTFs. In IEEE ICASSP.

[Fowler, 2003] Fowler, M. (2003). UML Distilled: A Brief Guide to the

Standard Object Modeling Language. Addison-Wesley, third edition.

[Funkhouser, 2001] Funkhouser, T. (2001). A beam tracing method for

interactive architectural acoustics.

[Funkhouser et al., 1998] Funkhouser, T., Carlbom, I., Elko, G., Pingali, G.,
Sondhi, M., and West, J. (1998). A beam tracing approach to acoustic
modeling for interactive virtual environments. In Proceedings of the
International Conference on Computer Graphics and Interactive Techniques
(SIGGRAPH).

[Gamma et al., 1995] Gamma, E., Helm, R., Johnson, R., and Vlissides, J.
(1995). Design Patterns: Elements of Reusable Object-Oriented Software.
Addison-Wesley.

[Gardner, 1992] Gardner, W. G. (1992). The virtual acoustic room. Master’s

thesis, Massachusetts Institute of Technology.

[Gardner, 1997] Gardner, W. G. (1997). 3-D Audio Using Loudspeakers. PhD

thesis, Massachusetts Institute of Technology.

[Gardner, 2001] Gardner, W. G. (2001). Applications of Digital Signal

Processing to Audio and Acoustics. Springer.

[Gerzon, 1974] Gerzon, M. (1974). Surround-sound psychoacoustics. Wireless
World.

[Gerzon, 1975] Gerzon, M. (1975). Ambisonics part two: Studio techniques.

Studio sound and broadcast engineering, 17(8).

[Hollerweger, 2006] Hollerweger, F. (2006). Periphonic sound spatialization in

multi-user virtual environments. Master’s thesis, Institute of Electronic
Music and Acoustics (IEM).

[Kuttruff, 1979] Kuttruff, H. (1979). Room Acoustics. Applied Science

Publishers Ltd, London, U.K, 2nd edition.

[Lidwell et al., 2003] Lidwell, W., Holden, K., and Butler, J. (2003).
Universal Principles of Design: 100 Ways to Enhance Usability, Influence
Perception, Increase Appeal, Make Better Design Decisions, and Teach
Through Design. Rockport Publishers.

[Malham, 1998] Malham, D. G. (1998). Sound spatialisation. In Proceedings

of the First COST-G6 Workshop on Digital Audio Effects (DAFX98).

[McCoy, 2005] McCoy, D. (2005). Ventriloquist: A performance interface for

real-time gesture-controlled music spatialization. Master’s thesis, University
of California, Santa Barbara.

[Menzies, 2002] Menzies, D. (2002). W-panning and o-format, tools for object
spatialization. In AES 22 International Conference on Virtual, Synthetic
and Entertainment Audio.

[Mutanen, 2002] Mutanen, J. (2002). I3DL2 and Creative EAX. Technical
report, Creative.

[Naef et al., 2002] Naef, M., Staadt, O., and Gross, M. (2002). Spatialized
audio rendering for immersive virtual environments. ACM.

[Norman, 2002] Norman, D. (2002). The Design of Everyday Things. Basic
Books.


[OpenAL, 2005] OpenAL (2005). OpenAL 1.1 Specification and Reference.

[Pohlmann, 2005] Pohlmann, K. (2005). Principles of Digital Audio.

McGraw-Hill, 5th edition.

[Pope, ] Pope, S. T. CSL (the CREATE Signal Library).

[Pope, 2005] Pope, S. T. (2005). Audio in the ucsb cnsi allosphere. Technical
report, Media Arts and Technology, University of California, Santa Barbara.

[Pope et al., 2006] Pope, S. T., Amatriain, X., Putnam, L., Castellanos, J.,
and Avery, R. (2006). Metamodels and design patterns in csl 4. To be
presented at the International Computer Music Conference.

[Pope and Ramakrishnan, 2003] Pope, S. T. and Ramakrishnan, C. (2003).

The create signal library (sizzle): Design, issues, and applications. In
Proceedings of the 2003 International Computer Music Conference (ICMC
2003). International Computer Music Association.

[Pulkki, 1997] Pulkki, V. (1997). Virtual sound source positioning using
vector base amplitude panning. Journal of the Audio Engineering Society,
45(6).

[Rindel, 2000] Rindel, J. H. (2000). The use of computer modeling in room
acoustics.


[Roads, 1996] Roads, C. (1996). The computer music tutorial. The MIT Press.

[Rosson and Carroll, 2001] Rosson, M. B. and Carroll, J. M. (2001). Usability
Engineering: Scenario-Based Development of Human Computer Interaction.
Morgan Kaufmann, 1st edition.

[Savioja, 2000] Savioja, L. (2000). Modeling techniques for virtual acoustics.

Master’s thesis, Helsinki University of Technology: Laboratory of Acoustics
and Audio Signal Processing.

[Väänänen, 2003] Väänänen, R. (2003). Parametrization, Auralization, and

Authoring of Room Acoustics for Virtual Reality Applications. PhD thesis,
Helsinki University of Technology.