
Romain Boonen, AEDS 911 SAE Institute Brussels September 2013

Three-Dimensional Hearing: The Body, The Brain And The Machine

TABLE OF CONTENTS

0 FOREWORD

1 THE BODY: FUNDAMENTALS OF THE PHYSICS OF SOUND
1.1. THE PHYSICS OF SOUND: A BRIEF INTRODUCTION
1.1.1. THE NATURE OF SOUND
1.1.2. THE IMPEDANCE OF A MEDIUM
1.1.3. THE FOURIER ANALYSIS
1.1.4. LINEARITY OF A SYSTEM

2 THE BODY: PHYSIOLOGY OF THE EAR
2.1. THE OUTER EAR
2.2. THE MIDDLE EAR
2.3. THE INNER EAR AND THE COCHLEA
2.3.1. ANATOMY OF THE COCHLEA
2.3.1.1. GENERALITIES
2.3.1.2. ESSENTIAL MECHANICS OF THE COCHLEA
2.3.1.3. THE ORGAN OF CORTI
2.3.2. PHYSIOLOGICAL FUNCTIONING OF THE COCHLEA
2.3.3. SCALAE AS FLUID COMPARTMENTS: PERILYMPH AND ENDOLYMPH

3 THE BRAIN: THE CENTRAL AUDITORY NERVOUS SYSTEM AND FOCUS ON HUMAN SOUND SOURCE LOCALIZATION
3.1. ASCENDING PATHWAYS OF THE AUDITORY NERVE
3.1.1. THE AUDITORY NERVE
3.1.2. COCHLEAR NUCLEI
3.1.2.1. THE VENTRAL COCHLEAR NUCLEUS
3.1.2.2. THE DORSAL COCHLEAR NUCLEUS
3.1.3. THE SUPERIOR OLIVARY COMPLEX
3.1.4. THE LATERAL LEMNISCUS
3.1.5. THE INFERIOR COLLICULUS
3.1.6. THE THALAMUS MEDIAL GENICULATE BODY
3.1.7. THE AUDITORY CORTEX
3.2. SOUND SOURCE LOCALIZATION
3.2.1. THE HORIZONTAL PLANE
3.2.2. THE VERTICAL PLANE
3.2.3. DISTANCE FROM THE SOURCE

4 THE MACHINE: REQUIRED BACKGROUND
4.1. INTRODUCTION TO THE MACHINE
4.2. HEAD-RELATED TRANSFER FUNCTIONS
4.3. CHANNELS VS OBJECTS
4.4. CONVOLUTION
4.5. INTRODUCTION TO DIGITAL FILTERS

5 THE MACHINE: IMPLEMENTATION STRATEGY
5.1. DESIGN GOALS
5.2. OVERVIEW OF THE CHANNEL-BASED MODEL
5.3. OVERVIEW OF THE OBJECT-BASED MODEL
5.4. CHANNELS VS. OBJECTS: CONCLUSION

0 FOREWORD
When SAE Brussels presented this thesis project and its requirements to their students, the goal was clearly stated: the research was meant for them to specialize in a specific field by allowing them to do their research on a topic that was studied in class, and from there take their knowledge to the next level in order to establish a specialty that would be a helpful skill for initiating a career. This project was extremely appealing to me, because I saw in it a great opportunity to gain some level of expertise in a field I had been attracted to: binaural 3D sound.

As I was starting to gather basic information about this field of study, it did not take me very long to realize that the complexity of the matter resided in the fact that it required at least a certain degree of expertise in several different technical fields, the most important ones being digital signal processing, psychoacoustics, programming, mathematics and physics. At that time, all I had was my current (basic) expertise in sound engineering but, through the motivation I had, I felt empowered and determined to stick with binaural 3D sound, enough to patiently gather all of the required knowledge and eventually tackle that complexity in order to make it my own field of expertise.

This paper is the result of nine months of dedicated work and is composed of three main parts that I decided to respectively entitle The Body, The Brain and The Machine. The majority of the time spent on this paper was dedicated to learning the skills necessary to create the models presented in the last part of the paper, and to interconnect them. In this context I attended two days of conferences in May 2013 about binaural 3D sound for broadcasters at the EBU headquarters in Geneva, Switzerland. During that time a presentation was given by Poppy Crum (Dolby) about the neuroscience behind spatial binaural sound, entitled "Neural Sensitivities: HRTF Representations In The Auditory Pathway". From then on, I grew so intrigued and fascinated by the way humans hear (and more generally, the way the brain works) that this presentation marked a turning point in my research. In fact, I decided to include two large sections dedicated to human hearing physiology and psychoacoustics (respectively entitled The Body and The Brain), which were notably meant to provide my readers with the background knowledge necessary to gain insights into the binaural implementation models presented further on.

Indeed, the ultimate purpose of the last part of this work (The Machine) is to present and compare binaural implementation models respectively based on the two current digital sound technologies, namely the channel-based model and the object-based model. The purpose is to theoretically assess their qualities and drawbacks for binaurally reproducing 3D sound. It is worth mentioning that The Machine is the subject of an AES Convention paper that I shall present on October 17, 2013 in New York City.

First of all, I would like to thank our team of helpful supervisors as well as our inspirational teachers (and really the entire SAE Institute Brussels) for giving their students the wonderful opportunity to let them do their thing. As far as I am concerned, I know that every day spent at this school has contributed to making me a more inspired individual and in the end a better person. What else could I have asked for?
I would like to give a special thank you to Robin Reumers for the time he invested in supporting the last sections of this work. I would like to acknowledge his deep understanding of all things audio (and more) as well as his creative technical problem-solving skills that turned out to be very useful more than once. I would also like to thank my family, my lover Gracie and my six brothers for the support they have provided me with throughout these several long months that actually passed quicker than I ever thought. I feel blessed!

1 THE BODY: FUNDAMENTALS OF THE PHYSICS OF SOUND

1.1. THE PHYSICS OF SOUND: A BRIEF INTRODUCTION


The purpose of this section is to give a brief introduction to some fundamentals of sound physics that will turn out to be quite useful in our understanding of human hearing physiology. Let us dive in.

1.1.1. THE NATURE OF SOUND

What is a sound wave? A sound wave is a vibrational movement of air molecules around their initial positions. It is important to realize that the propagation of sound waves is very different from the wind phenomenon, where molecules flow over large distances! A sound wave is defined by two main attributes: its frequency and its amplitude (or intensity). The frequency of a sound wave refers to the number of waves passing a specific point in space every second. Frequency, which is commonly specified in Hertz (Hz) or cycles per second (c/s), carries the subjective correlate of pitch when perceived by most living organisms. The amplitude of a sound wave refers to the magnitude of the vibrating movement of an air molecule, and carries the subjective correlate of loudness.¹ However, other parameters depend directly upon those two main attributes: a sound wave's frequency defines its period (i.e. the length of time necessary for the wave to execute one full cycle of compression and decompression of the air molecules) and its wavelength (the distance in meters covered during one period), whereas a sound wave's amplitude dictates the consequent air pressure variation, air velocity and air displacement. The air pressure relates to the level of compression of the air molecules. When speaking of pressure, it is important to clarify that it is the atmospheric pressure of the free field through which the sound wave is travelling that varies. However, a sound wave travelling in a free field has proportionally very little impact on the average atmospheric pressure of this free field; indeed, a level as high as 140 dB SPL makes the general atmospheric pressure vary by only about 0.6%. The air velocity relates to the rate of change of position of the air molecules, whereas air displacement relates to the distance of displacement of those molecules around their equilibrium position. Sound waves can be described using other parameters as well, and it is possible to easily relate them to each other using simple equations. For example, in the case of a sinusoid, its peak pressure (p) above the mean atmospheric pressure and its velocity (v) relate to each other in the following way:

p = z · v

where z represents the impedance of the medium.

¹ Both the pitch and loudness perceptions will be discussed further on in this work, as such phenomena belong to the realm of the brain and not the physical world per se.
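To make these relations concrete, here is a minimal sketch (an illustrative addition, not part of the original text) that evaluates the period, wavelength and particle velocity of a 1 kHz tone in air, assuming round textbook values for the speed of sound and the impedance of air:

```python
# Illustrative only: relations between frequency, period, wavelength,
# pressure and particle velocity in air (assumed c = 343 m/s, z = 413 N.s/m3).
c = 343.0          # speed of sound in air at 20 degrees C (m/s)
z_air = 413.0      # specific acoustic impedance of air (N.s/m3)

f = 1000.0                      # frequency (Hz)
period = 1.0 / f                # T = 1 / f        -> 1 ms
wavelength = c / f              # lambda = c / f   -> 0.343 m

p_rms = 1.0                     # 1 Pa rms corresponds to roughly 94 dB SPL
v_rms = p_rms / z_air           # particle velocity from p = z * v -> about 2.4 mm/s

print(f"period = {period * 1e3:.1f} ms, wavelength = {wavelength:.3f} m")
print(f"particle velocity at ~94 dB SPL: {v_rms * 1e3:.2f} mm/s")
```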

1.1.2. THE IMPEDANCE OF A MEDIUM

The impedance of a medium is an appropriate concept to address at this point because it will prove to be of great importance in the physiology of hearing, in the form of "impedance jumps". The impedance of a medium can be thought of as its resistance. For example, water has a much higher impedance than air, because the pressure required to produce a sound of a given intensity in water is much higher than the pressure required to produce a sound of the same intensity in air, which is simply due to the fact that the density of water molecules is much higher than the density of air molecules. It will thus require proportionally more energy to give those water molecules some velocity. In the SI system, the impedance (z) is measured in (N/m2)/(m/s), or N.s/m3.

In order to gain some understanding of our water/air example, which will matter later on in this work, let us associate it with some figures. The impedance of air at room temperature (20°C) is about 413 N.s/m3 whereas the impedance of water is about 1.5 x 10^6 N.s/m3, which means that the impedance of water is about 3632 times higher than that of air. When a sound wave propagating in air at 20°C meets a water surface, the impedance mismatch is such that only a small fraction of the incident wave's intensity is transmitted into the aquatic medium: most of the sound wave is reflected at the water surface, according to the laws of physics. The following formula allows us to calculate the proportion of an incident wave propagating in a medium of impedance z1 that will be transmitted into a second medium of impedance z2:

T = 4 z1 z2 / (z1 + z2)^2

When plugging our air and water impedance values into the equation, the output translates into a transmission of our incident sound wave into the water medium of only about 0.11%. Converted into decibels, this equals roughly a 30 dB attenuation. In a further section, we will address how the middle ear manages to transmit a signal to the brain despite such an attenuation.
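For reproducibility, here is a minimal sketch (an illustrative addition, not part of the original text) that evaluates this transmission coefficient for the air-to-water case:

```python
import math

def transmission_coefficient(z1, z2):
    """Fraction of the incident intensity transmitted across a boundary
    between media of impedances z1 and z2 (normal incidence)."""
    return 4.0 * z1 * z2 / (z1 + z2) ** 2

z_air = 413.0      # N.s/m3, air at 20 degrees C
z_water = 1.5e6    # N.s/m3

t = transmission_coefficient(z_air, z_water)
print(f"transmitted fraction: {t * 100:.2f} %")        # ~0.11 %
print(f"attenuation: {-10 * math.log10(t):.1f} dB")    # ~29.6 dB
```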

1.1.3. THE FOURIER ANALYSIS

A further concept to be introduced is the Fourier transform, which will not only help us understand the hearing physiology in the scope of the cochlea; its implications are so broad that it will be mentioned again in the fifth part of this work (The Machine: Implementation Strategy), which concerns signal processing. Joseph Fourier showed that it was possible to decompose a complex signal into a sum of sine waves. That process is called the Fourier analysis and it allows us to bridge the gap between a signal's time and frequency domains. When it is plotted with RMS value on the y axis and frequencies on the x axis (as is generally the case), it is possible to easily know which frequencies compose a complex signal, as well as their respective intensities (see Figure 1).

Figure 1. Decomposition of a square wave ys into a series of sine waves y1, y2, y3 etc. using Fourier analysis. Retrieved from: http://ffden2.phys.uaf.edu/212_spring2011.web.dir/daniel_randle/ on Sept. 18, 2013.

The analysis of an infinite sine wave represented in the time domain will result in a single line in the frequency domain, indicating that wave's frequency and intensity. Very similarly, the analysis of an infinite square wave will show the fundamental with its odd harmonics. However, this model is obviously purely theoretical, as no signal is infinite. What happens with finite signals? The indication of the different frequencies will broaden and turn into bands, whose breadths are inversely proportional to the length of the input signal. The longer the signal, the better the resolution we are able to get in the frequency domain. One may be tempted to ask: "why sine waves?". For several reasons: sine waves are quite easy to handle mathematically. They also happen to represent the oscillation of quite a few physical systems, therefore being very present in natural phenomena. But probably the most interesting reason why Fourier analysis proves to be of great importance in hearing physiology is that our ear actually performs this process constantly, although to a limited extent. As mentioned above, this feature will be discussed in the cochlea section. The reverse process (taking many sinusoids and adding them together to form a complex signal) is called synthesis, but it will not be as useful as the Fourier analysis in the scope of this work and will therefore not be addressed.
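As an illustration of the analysis described above (a sketch added here, not part of the original text, using numpy), the following snippet computes the spectrum of a finite square-wave segment and shows that the energy sits at the fundamental and its odd harmonics:

```python
import numpy as np

fs = 8000                       # sampling rate (Hz)
f0 = 100                        # square-wave fundamental (Hz)
t = np.arange(0, 1.0, 1 / fs)   # 1 second of signal

square = np.sign(np.sin(2 * np.pi * f0 * t))   # square wave built from the sign of a sine

spectrum = np.abs(np.fft.rfft(square)) / len(square)
freqs = np.fft.rfftfreq(len(square), 1 / fs)

# Print the strongest components: expect 100 Hz, 300 Hz, 500 Hz, ...
strongest = np.argsort(spectrum)[-5:][::-1]
for i in strongest:
    print(f"{freqs[i]:6.1f} Hz  amplitude {spectrum[i]:.3f}")
```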
1.1.4. LINEARITY OF A SYSTEM

The last concept to be introduced in this section is linearity. This notion will be useful to describe and properly understand the different stages of the auditory system. A system is referred to as linear when it verifies two properties: superposition and homogeneity. A system is non-linear when one of those conditions is not fulfilled. Mathematically, those properties are respectively defined as follows. The superposition property states that, for two different inputs x and y, both belonging to the domain of the function f:

f(x + y) = f(x) + f(y)

Put in plain words, this equation tells us that the result of two or more inputs plugged in at the same time is the same as the addition of the results from the inputs plugged into the system separately. The homogeneity property states that for any input x in the domain of function f and for any real number k:

f(kx) = k f(x)

This equation tells us that if the input is scaled by a factor k, the output will be scaled by the same factor k as well. The fact that a system is linear implies one more important property: the frequencies contained in the output of the system were present in the input signal in the first place! Indeed, a linear system does not generate new frequency components.
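The "no new frequencies" property is easy to verify numerically. The sketch below (an illustrative addition, not part of the original text) passes a pure tone through a linear gain and through a squaring device, a simple non-linearity, and compares which frequencies appear at the output:

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)     # pure 440 Hz tone

linear = 3.0 * x                    # linear system: homogeneity and superposition hold
nonlinear = x ** 2                  # squaring device: non-linear, creates new components

def active_frequencies(signal, threshold=1e-3):
    """Return the frequencies whose spectral amplitude exceeds the threshold."""
    spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return freqs[spectrum > threshold]

print("linear system    :", active_frequencies(linear))      # [440.] only
print("non-linear system:", active_frequencies(nonlinear))   # [0., 880.]: new frequencies appear
```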

2 THE BODY: PHYSIOLOGY OF THE EAR
The auditory system is the sensory system that allows humans to perform the mechanoelectrical transduction of sound waves into neural action potentials. This highly complex system is situated outside (for the pinna) and inside the temporal bone (shown in red in Figure 2). The human hearing system comprises three main parts: the outer, middle and inner ears. Their anatomies and roles will be investigated in this section. Nonetheless, it is worth noting that the great complexity of the physiological side of human hearing will only allow us to scratch the surface of this most fascinating topic in the scope of this work.

Figure 2. The temporal bone represented in red. Retrieved from: http://commons.wikimedia.org/wiki/File:Temporal_bone.png in July, 2013.

2.1. THE OUTER EAR
The outer ear consists of a partially cartilaginous shape called the pinna, which comprises a resonant cavity called the concha; the concha forms the entry of the ear canal (also called the meatus), which leads to the tympanic membrane (also referred to as the eardrum). The outer ear fulfills two main roles: it helps localize sound sources and it increases the intensity of the incoming sound waves. The pinna holds a paramount role in this paper because it is one of the main actors in our ability to localize sounds. Indeed, the pinna's shape (which is very individual and can be quite different from one person to another) spectrally modifies the incoming sound waves in order to give the brain the cues needed to assess sound sources' positions on the vertical plane. The second important aspect of the pinna is to funnel the waves reaching it into the ear canal. This process increases the intensity of the sound waves reaching the eardrum by about 15 to 20 dB in the 2.5 kHz range (Wiener and Ross, 1946), in the form of resonances produced either by the association of the concha and the meatus (2.5 kHz resonance), or the concha alone (5.5 kHz resonance). It is possible to measure the influence of the pinna on the waves coming from a sound source at a known azimuth, elevation and distance. This information, which is extremely valuable in the scope of this paper, is called the Head-Related Transfer Function (HRTF) and will be discussed in further sections.
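As a rough sanity check on the resonance figure quoted above (an illustrative approximation added here, not a claim from the original text), the concha plus ear canal can be crudely modelled as a tube closed at one end, whose first resonance sits at c/4L; assuming a combined effective length of about 3.4 cm gives a value close to the 2.5 kHz resonance mentioned:

```python
c = 343.0    # speed of sound in air (m/s)
L = 0.034    # assumed effective length of concha + ear canal (m); illustrative value only

f_first_resonance = c / (4 * L)     # first resonance of a tube closed at one end
print(f"first resonance ~ {f_first_resonance:.0f} Hz")   # ~2500 Hz
```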

2.2. THE MIDDLE EAR

The middle ear consists of the ossicles (the malleus, the incus and the stapes, which is also known as the stirrup) and acts as an intermediate step between the eardrum and the cochlea in the way of an impedance transformer. Indeed, the purpose here is to turn acoustical energy into mechanical energy. Being attached to the eardrum, the malleus, which is itself attached quite rigidly to the incus, vibrates at the same rate as the tympanic membrane, and the association of those two bones transmits the force to the stapes (about the size of a grain of rice), which is connected to the cochlea's oval window that will be discussed in the next section. Interestingly enough, those three small bones stop growing very early in a newborn's life, making them the same size as an adult's.

Figure 3. Cross section of the temporal bone, revealing the main parts involved in the outer, middle and inner ears. Retrieved from: http://www.directhearingaids.co.uk/index.php/33/how-hearing-balance-work-together/ in August, 2013.

As mentioned in the previous paragraph, the role of the middle ear is to transform the impedance from the large, low-impedance eardrum to the small, high-impedance oval window. Without this middle ear section, the reflections due to the impedance jump would be so high that only a fraction of the incident wave would manage to enter the oval window, and the subsequently perceived level would be much lower. The ossicles thus allow this energy attenuation to be substantially reduced. At this point, it is worth mentioning that the actual functioning of the impedance-transforming process is quite complex and, since a thorough explanation of it would not substantially help the proper understanding of the following sections, it shall remain only superficially covered. However, it can be noted that this impedance-transforming process is supported by two principles. The first one is that, since the stapes footplate in the oval window is much smaller than the eardrum where the vibrations are coming from, it is logical to state that the energy is going to concentrate in a smaller area, thus effectively increasing the pressure at the oval window. The actual increase is calculated by the ratio of the two areas. The second principle, though less prominent, is caused by the lever action of the incus. Being smaller than the malleus, the incus allows the force to be increased and the velocity transmitted to the stapes to be decreased.

What about the linearity of transmission of the ossicles? Guinan and Peake (1967) found that the stapes movement increased proportionally to the input up to 130 dB SPL for frequencies below 2 kHz and up to about 140 to 150 dB SPL for frequencies above. Those results thus seem to point towards linearity of transmission in the ossicles up to those intensities and, although the system of measurement used in that specific research would only have allowed detection of 10-20% of odd harmonics, there are likely to be no significant harmonics or intermodulation products at lower intensities. However, it is worth mentioning that the suggested linearity of the middle ear may be affected by static pressures applied to the ear. Indeed, such pressures would make the joint connecting the malleus and the incus more rigid and stretch the ligament connecting the stapes to the oval window. Another element also influences the linearity of the middle ear beyond 75 dB SPL: the middle ear muscles. Two main striated muscles attached to the ossicles act as protection against damage to the inner ear. The tensor tympani is attached to the malleus (on the eardrum's side) whereas the stapedius muscle is attached to the stapes. When sound pressure levels at frequencies below 1-2 kHz become too high, the middle ear muscles contract and increase the rigidity of movement of the ossicles. However, their action is quite complex and they have been shown to have repercussions in the high frequencies as well. It would therefore be correct to say that humans are equipped with multiband compressors right in their ears. Wever and Vernon (1955) actually showed that this muscle contraction reflex keeps the intensity of the stimulus reaching the cochlea quite constant for low frequencies beyond the reflex threshold (around 75 dB SPL), effectively acting as a multi-band brickwall limiter!
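To give a feel for the magnitude of this impedance transformation, here is a small sketch using typical textbook values; the specific numbers are assumptions chosen for illustration, not figures taken from this thesis: an effective eardrum area of roughly 55 mm², a stapes footplate of roughly 3.2 mm² and an ossicular lever ratio of about 1.3.

```python
import math

# Assumed, typical order-of-magnitude values (illustrative only, not from this thesis)
area_eardrum = 55e-6       # effective eardrum area (m^2)
area_footplate = 3.2e-6    # stapes footplate area (m^2)
lever_ratio = 1.3          # approximate malleus/incus lever advantage

pressure_gain = (area_eardrum / area_footplate) * lever_ratio
gain_db = 20 * math.log10(pressure_gain)
print(f"pressure gain ~ x{pressure_gain:.0f} (~{gain_db:.0f} dB)")   # roughly x22, ~27 dB
```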

Figure 4. Detail of the middle ear. Retrieved from: http://cueflash.com/decks/PHYSIOLOGY_OF_AUDITION_-_54 in August, 2013.


2.3. THE INNER EAR AND THE COCHLEA

After the middle ear comes the inner ear, which is composed of the cochlea and the bony labyrinth, which itself contains the vestibular system. The vestibular system is responsible for the sense of spatial orientation and balance. We shall focus on the cochlea, which is the central piece of our auditory system and by far the most complex one. Its intrinsic role is to convert the physical vibrations received from the action of the ossicles into electrical information that the brain can recognize as sounds, and its basic understanding will require some chemical and electrical explanations.

2.3.1. ANATOMY OF THE COCHLEA

2.3.1.1. GENERALITIES

Anatomically, the cochlea is a coiled tube separated lengthways into three sections known as the scala vestibuli, the scala media and the scala tympani. Those three scalae spiral together from the base of the cochlea (the larger side) to the apex (the narrower, pointy side), keeping their proportions throughout their turns. The cochlea's size is about 1 cm in width and 5 mm in height. The proportions of the scala media being smaller than those of the outer scalae, the outer scalae are led to have a common separation, which is an osseous surface called the spiral lamina. This surface is situated close to the modiolus, which consists of the spongy bone around which the scalae turn approximately two and a half times. The modiolus contains the spiral ganglion, which shall be mentioned again later on. The Reissner's membrane separates the scala vestibuli from the scala media whereas the basilar membrane divides the scala media from the scala tympani. The basilar membrane notably serves as the surface on which lies the organ of Corti, which contains the auditory transducers that are called hair cells. The scalae contain fluids called the perilymph (outer scalae) and endolymph (scala media). The two outer scalae meet at the apex of the cochlea in an opening called the helicotrema, allowing the perilymph to connect. The scala media is a closed cavity whose endolymph does not directly interact with the exterior.

Figure 5. Cross section of the cochlea, providing a good view of the three scalae as well as a detailed view of the contents of the scala media. Retrieved from: see image.


2.3.1.2. ESSENTIAL MECHANICS OF THE COCHLEA

When the stapes' vibrations are transmitted to the oval window, they produce a displacement of the fluids within the scala vestibuli, which is transmitted to the scala tympani through the helicotrema. This phenomenon allows the basilar membrane to be displaced in a wave-like movement, along with the organ of Corti that is attached to it in the scala media, effectively allowing the hair cells to be stimulated and to transmit electrical impulses on to the brain.

2.3.1.3. THE ORGAN OF CORTI

The organ of Corti's hair cells amount to about 15,000 in each human ear and come in two kinds: the inner hair cells (IHC), in one row situated on the modiolus side of the cochlea (i.e. toward the inside), and the outer hair cells (OHC), arranged in three to five rows (increasing toward the apex). The hair cells are found inside the reticular lamina. From each hair cell stick out the stereocilia², which are the part of the hair cell that acts as the initial sensory transducer. Stereocilia are made out of long filaments whose stiffness allows them to stand on the lamina and act as levers in response to mechanical deflections. The longer of the OHCs' stereocilia are embedded in the undersurface of a gelatinous body called the tectorial membrane. Being attached on one side only (toward the modiolus) above the organ of Corti and the basilar membrane, the tectorial membrane deflects the hair cells according to the movements of the basilar membrane. On inner hair cells, the stereocilia are arranged in three to five nearly straight rows, while on outer hair cells they are arranged in three to five V-shaped rows.

2.3.2. PHYSIOLOGICAL FUNCTIONING OF THE COCHLEA

When sound waves reach the eardrum, its vibrations are transmitted to the oval window through the ossicles and the stapes. When vibrating, the membrane of the oval window initiates a wave of movement of the cochlear fluids, transmitting the motion towards the round window. This phenomenon causes the cochlear partition (i.e. the basilar membrane and the organ of Corti) to move according to this transmitted wave's position and patterns, effectively revealing the frequency content of the stimulus to the brain once the hair cells are stimulated and the information is sent over to the ascending pathway. G. von Békésy pioneered cochlear research, and a lot of the current knowledge on this matter is owed to his studies and experiments described in Békésy (1960). He analysed the movements of the cochlear partition in human cadavers, was able to plot the travelling wave patterns and drew conclusions from them. As can be seen on the scheme, the amplitude of movement of the cochlear partition is contained within an amplitude envelope, never exceeding it. Some of Békésy's important findings can be summarized as follows:

(1) As we have seen, vibrations of the stapes at any frequency allow a specific travelling wave to be initiated within the cochlear fluids. The travelling wave's pattern and its peak location in the cochlear duct depend on the frequency of the stimulus brought by the stapes.
² It is important to mention that the specialized literature speaks of the stereocilia both as "stereocilia" and "hair cells". We can comprehend the intended meaning according to the context, either evoking the whole cell (stereocilia plus the part contained in the reticular lamina), or only the actual stereocilia.


(2) If the stapes is vibrating at a given frequency, the wave pattern evolving in the cochlear fluids will always have an amplitude contained in a sort of envelope that gradually increases from the base to a certain point in the cochlea (called the resonance point), and then rapidly decreases from that resonance point to the apex.

(3) The peak of a travelling wave induced by the vibration of a high-frequency stimulus will be localized near the base of the cochlea, whereas peaks from low-frequency stimuli will be situated near the apex. It is also worth mentioning that Békésy experimented with what happened when he opened the cochleae at certain points to measure the vibration at those points as the frequency induced in the stapes was varied (at constant peak-to-peak membrane displacements³). What he found is that the slopes of the amplitudes near the apex (where low frequencies peak) were shallower than those near the base (where high frequencies peak). This observation allows us to draw the conclusion that the cochlea effectively acts as a low-pass filter because, for the same amount of low- and high-frequency energy transmitted to the cochlear fluids, the cochlear partition seems to react with more movement to low-frequency stimuli.

(4) Lastly, Békésy's findings show that, as previously stated, the cochlea analyzes incoming stimuli in a way very similar to a Fourier analysis, effectively acting as a prism for stimuli, decoding the stapes' time-domain information into frequency-domain information to be transmitted to the brain through the organ of Corti and the auditory nerve (which shall be mentioned later on).

Let us exemplify Békésy's findings. If the incoming stimulus is a single pure sine wave, the fluid vibration will have a sharp peak confined to a narrow region of the basilar membrane, whose peak amplitude across the membrane will depend on its frequency. When the basilar membrane is stimulated by the travelling waves, the tectorial membrane, which sits above the hair cells, also moves. Its movement helps deflect the organ of Corti's hair cells, to which the tectorial membrane is partially attached. The consequent hair cell deflections transmit electrochemical impulses to the auditory nerve, which will then transmit the information to the appropriate parts of the brainstem. Summarizing, it would be correct to say that the stimulation of certain patterns of specific hair cells sends specific frequency information to the brain, which is then eventually decoded as pitch. Moreover, the more intense the hair cells' deflection, the stronger the intensity signal transmitted to the brain, which is eventually decoded as loudness.

The range of frequencies that is transmitted to the listener's brain depends on physical limitations notably contained in the cochlear partition (which shall be evoked in the next paragraph), the cochlear fluids, the hair cells and the complex interactions of those elements. The product of those interactions raises the question of the linearity of response of the cochlea. Indeed, it was found that when the stimulus intensity is raised, the response of movement of the cochlear partition can increase in smaller proportions. Much in the same manner as the middle ear muscles, the non-linearity of the cochlea actually allows it to act as a multi-band compressor, most effective in the peak regions of the travelling waves of frequencies of 10 kHz and higher. As a side note, it is interesting to note that this feature seems to vanish after death.

Let us go back to this question of the physical limitation of the movement amplitude of the cochlea. In other words, why is it that, physically, the cochlear partition is not able to be stimulated in a linear way? It is due to the common action of two rather simple phenomena that can incidentally be found repeatedly in Nature: the stiffness and mass limitations. In order to explain them, let us refer back to the cochlear amplitude envelope described by Békésy. The first phenomenon, referred to as the stiffness limitation, explains why the cochlear partition is not able to be fully stimulated from the base up to the resonance point situated closer to the apex side. The cochlear partition was actually shown to be relatively rigid near the base, gradually becoming more compliant as it progresses to the apex. This stiffness is the main factor preventing the cochlear partition from moving freely according to the way it is stimulated. The second phenomenon, called the mass limitation, gives the reason why the cochlear partition's amplitude potential rapidly decreases from that resonance point on to the apex: although the cochlear partition is now more compliant than it was at the base, its larger mass and inertia limit its amplitude of movement.

³ Constant peak velocities correspond to constant sound pressure levels. This means that, no matter the frequency at which the stapes is vibrating, the same amount of energy is transmitted to the cochlear fluids.
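To connect the base-to-apex mapping described above with actual frequencies, the following sketch uses Greenwood's commonly cited place-frequency approximation for the human cochlea; the formula and its constants are an illustrative addition, not figures taken from this thesis:

```python
def greenwood_frequency(x):
    """Approximate characteristic frequency (Hz) at relative position x
    along the human cochlea, with x = 0 at the apex and x = 1 at the base
    (Greenwood's place-frequency function, human constants)."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {x:.2f} -> ~{greenwood_frequency(x):7.0f} Hz")
# apex ~20 Hz ... base ~20 kHz: low frequencies peak near the apex,
# high frequencies near the base, as in Bekesy's observations.
```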
2.3.3. SCALAE AS FLUID COMPARTMENTS: PERILYMPH AND ENDOLYMPH

The two outer scalae, the scala vestibuli and the scala tympani, contain perilymph, whereas the scala media contains endolymph. Chemically, those two extracellular fluids are quite different from each other. Let us give them a bit of a closer look. Contained in the outer scalae, the perilymph is very similar to most other extracellular fluids because it is mainly composed of the sodium cation (Na+) and, to a much lesser extent, the potassium cation (K+). Its electric potential is positive, and was reported by Johnstone and Sellick (1972) to be +7 mV in the scala tympani and +5 mV in the scala vestibuli, i.e. close to ground potential. The endolymph is contained in the scala media. Unlike the perilymph, its chemical composition is mainly K+ and, to a lesser extent, Na+. Endolymph is a very unique kind of extracellular fluid, for two reasons: (1) given its general composition, endolymph is very comparable to intracellular fluids, and (2) its very high positive potential (referred to as the endocochlear potential, declining from +100 mV at the base to +83 mV at the apex) has not been found in any other extracellular fluid. Its chemical and electrical uniqueness therefore points at a very specific role played within the cochlea. Indeed, according to the investigations that have been undertaken, the endolymph's aforementioned characteristics were found to play important roles in mechanotransduction as well as in the mechanical amplification of the travelling waves propagating in the cochlea.


3 THE BRAIN: THE CENTRAL AUDITORY NERVOUS SYSTEM AND FOCUS ON HUMAN SOUND SOURCE LOCALIZATION

3.1. ASCENDING PATHWAYS OF THE AUDITORY NERVE

Once the hair cells' deflections have produced electrochemical impulses travelling through the auditory nerve fibres in the spiral ganglion, those impulses are sent through several different parallel pathways that shall be introduced in this section. This way of operating allows the brain to simultaneously extract multiple features of the stimulus that will prove to be of great importance in order for it to create a representation of the so-called "auditory object". For example, sound localization on the horizontal plane relies mainly on interaural time and level differences (respectively ITD and ILD), while sound localization on the vertical plane notably requires the complex analysis of the stimuli's spectra. However, a proper analysis of spectral information precludes a reliable analysis of time information, thus showing the need for parallel pathways of stimulus analysis. This section is meant to give an idea of those different pathways used by the brain to interpret the electrochemical stimuli received from the hair cells, as well as presenting the different important brain areas where the information is treated, mentioning their cell compositions and their purposes. It is well worth mentioning that it is difficult to explain each section's functions individually because the auditory nervous system is organized hierarchically. Indeed, the information analysed in the lower stages of the process is sent over to higher stages that basically analyse the data and only send on to the next stages the information that is relevant at the time, in order for the auditory cortex to eventually represent this auditory object we hear. The resolution of the representation of this auditory object thus increases as the different pieces of information analysed in the lower stages are put together and made sense of in the higher stages. For reference, a very simplified plan of the different stages of the ascending pathways would go as follows:

(1) After hair cell deflections in the left cochlea, the electrochemical impulses are transmitted to the left cochlear nerve (auditory nerve) situated in the modiolus.

(2) The output fibres of the cochlear nerve branch. One end enters the left ventral cochlear nucleus (VCN), while the second end enters the left dorsal cochlear nucleus (DCN).

(3) The outputs of the left VCN enter both the left and right superior olivary complexes (SOC); this fibre pattern is referred to as the trapezoid body. The left DCN outputs directly enter the right lateral lemniscus nucleus (LLN).

(4) The outputs of the left and right SOC respectively enter the left and right LLN.

(5) From that point on, the left and right parts of the brain no longer connect contralaterally (with the other side of the brain). The left LLN connects with the left inferior colliculus (IC).

(6) The left IC connects to the left medial geniculate body (MGB) in the thalamus.


(7) The left MGB connects with the auditory cortex. Both parts are able to communicate back and forth.

Figure 6. Representation of the ascending pathways of the central auditory nervous system. Retrieved from: http://origin-ars.els-cdn.com/content/image/1-s2.0-S1527336908001347-gr3.jpg in September 2013.

3.1.1. THE AUDITORY NERVE


The auditory nerve is situated in the modiolus, on the inner side of the cochlea. Its afferent fibres are situated at the base of the hair cells, transporting the electrical impulses from the cells to the auditory nerve and then on to the brainstem. The efferent fibres are placed in roughly the same locations, and they allow the brainstem to influence the cochlea. Both the efferent and afferent fibres lead to the spiral ganglion in the modiolus. Inner hair cells and outer hair cells are innervated completely differently. In short, we can say that there are two types of afferent fibres: Type I (also called radial fibres, comprising 90 to 95% of them) and Type II (also referred to as outer spiral fibres, comprising the remainder of the afferent fibres). Each inner hair cell receives about 20 to 30 Type I fibres (according to Liberman et al., 1990) whereas each outer hair cell receives about six Type II fibres. Every Type I fibre is connected to only one hair cell, but Type II fibres branch and end up innervating about ten outer hair cells. However, outer hair cells are not only connected to those few Type II fibres (compared to the innervation of inner hair cells by Type I fibres); they are linked to other synapses coming from different afferent fibres as well.

3.1.2. COCHLEAR NUCLEI

The cochlear nerve sends information to two entities:

3.1.2.1. THE VENTRAL COCHLEAR NUCLEUS

Because it specializes in the analysis of time and intensity information, the ventral cochlear nucleus contributes mainly to the pathway of binaural localization (on the horizontal plane). Other contributions are made to the pathway of sound identification. The ventral cochlear nucleus is itself composed of two areas:

The Anteroventral Cochlear Nucleus (AVCN):

The AVCN contains a type of cell called bushy cells (named for the bushy patterns of their dendrites), known for their ability to rapidly and reliably transmit the impulses they receive to the next stage. There are spherical and globular bushy cells. Spherical bushy cells transmit to the superior olivary complex the information about the stimulus' time of arrival. There, this time information will be compared to the time-of-arrival information coming from the other ear. On the other hand, the globular bushy cells handle intensity information. Just like the spherical bushy cells, globular bushy cells send this information to the superior olivary complex, where the intensity information from both ears is to be compared. The AVCN is thus responsible for sending to the higher stages information that will turn out to be very useful in the scope of binaural sound localization in the horizontal plane.

The Posteroventral Cochlear Nucleus (PVCN):

The PVCN's structure is slightly more complex than the AVCN's in the sense that it comprises four types of cells: globular bushy cells, octopus cells and two types of stellate cells (T-stellate and D-stellate, in 95% and 5% proportions respectively).


Octopus cells are useful for two main reasons. Firstly, they have a pattern of response called the onset response because of their ability to fire very strongly at the onset of a new stimulus. Secondly, they have an extremely high resolution of response for transients in ongoing stimuli (they can detect more than 500 transients per second!). Moreover, their spectral range of action is very wide. Therefore, it is thought that octopus cells are specialized in the extraction of temporal fluctuations in complex broadband stimuli such as the human voice. T-stellate cells fire repetitively when they receive stimuli corresponding to a sustained tone burst. However, their firing rate is not related to the frequency of the tone. They send this information to several different areas that are part of this ascending pathway. D-stellate cells shall not be described here. Summarizing, we can say that the PVCN therefore contributes to two pathways: binaural sound localization (on the horizontal plane specifically) as well as sound identification.

3.1.2.2. THE DORSAL COCHLEAR NUCLEUS

The dorsal cochlear nucleus contributes greatly to the pathways of sound identification as well as binaural sound localization (but on the vertical plane this time). It is composed of three layers, but we shall only focus on the second and most important one, the pyramidal cell layer. The pyramidal cells (also called fusiform cells) project primarily to the contralateral inferior colliculus (i.e. on the other side of the brain) through the lateral lemniscus nucleus. Unfortunately, studying them is a complex endeavour because of their strong vulnerability to anaesthesia. However, we do know that their response patterns contribute to the sound identification pathway. Since this work tends to focus on the localization of sound, those responses will not be presented and we shall focus a bit more on the pyramidal cells' contributions to the binaural localization pathway. It is known that notches in the spectral content of the stimuli can strongly drive the pyramidal cells if the frequency of the notch is close to the frequency to which the cell is tuned. Those notches are produced by the pinna and their frequencies are strongly influenced by the elevation angle of the sound source. Evidence for this explanation was found in cats that had lesions in those parts of the brainstem, when researchers realized they could no longer make reflex orientations of their heads upwards towards the position of the sound source (Sutherland et al., 1998a, b and 2000). Indeed, it is thought that pyramidal cells play an important role in this unlearned action, as it was still possible for cats to learn to discriminate between sound sources at different elevations using a behavioural conditioning task. Therefore, the dorsal cochlear nucleus certainly plays a role in the binaural localization of sound sources on the vertical plane, but it cannot be the only one.

3.1.3. THE SUPERIOR OLIVARY COMPLEX

Two trends can be distinguished among the output streams coming out of the cochlear nuclei: the binaural localization pathway is served by the ventral stream, which is itself divided into one section relaying intensity information and a second one relaying time information, while the identification pathway is served by the dorsal stream. The dorsal stream is sent directly to the inferior colliculus (through the lateral lemniscus), while the ventral streams enter the superior olivary complexes on both sides of the brain. The first one, conveying intensity information, enters the lateral superior olive (LSO) along with the same stream coming from the other ear, where the intensity information conveyed in the streams of both ears will be compared. Much in the same way, the time information from both ears will reach both medial superior olives (MSO), one on each side of the brain, where timing information will be compared. The seminal Jeffress model (Jeffress, 1948) suggests an explanation for this correlation process and will be presented in a further section.

The LSO contains cells of the IE type. In this terminology, the first letter represents the response of the contralateral ear (I = Inhibitory⁴) and the second letter represents the response of the ipsilateral one (E = Excitatory⁵). In order to exemplify this concept, rough trends can be given as follows. Firstly, the ipsilateral, excitatory ear alone is presented with a tone. The IE cell's firing rate is maximal. Then a tone is introduced to the contralateral, inhibitory ear. As the intensity of that second tone is raised, the firing rate decreases until it reaches a value close to zero, when the contralateral tone's intensity equals the intensity of the ipsilateral one. As will be explained later on in this work, ILDs are mostly relevant for high frequencies. That is the reason why the LSO is mostly reactive to high frequencies. On the other hand (as previously mentioned), the MSO receives streams coming from the bushy cells of the AVCN on both sides, conveying timing information. Thanks to the spherical bushy cells' ability to fire almost instantly, the nucleus is able to very reliably compare both ears' times of arrival and thus retrieve valuable information for binaural localization on the horizontal plane. How does it work? The MSO has a very thin, sheet-like structure and is composed of a single layer of fusiform cells, most sensitive to low frequencies. Although we will not develop this topic too much, we can simplify and say that the timing information is compared thanks to the fact that each fusiform cell is tuned to fire maximally at a given, characteristic delay between both times of arrival. The treatment of the information about localization on the horizontal plane in the higher stages of the afferent pathways is thus dependent upon the amount of activity fired by each fusiform cell. It should also be mentioned that a minority of EE cells is contained in the LSO, allowing it to analyze not only intensity information (its specialty) but timing information as well. Similarly, a minority of IE cells is contained in the MSO, allowing it to process not only timing information but also intensity information.
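As a rough computational analogy to the coincidence-detection idea behind the Jeffress model (an illustrative sketch added here, not a description of the actual neural circuitry), an interaural time difference can be estimated by cross-correlating the two ear signals and picking the lag at which coincidence is maximal:

```python
import numpy as np

fs = 48000                                  # sampling rate (Hz)
t = np.arange(0, 0.05, 1 / fs)
source = np.sin(2 * np.pi * 500 * t)        # low-frequency tone (ITDs matter most below ~1.5 kHz)

true_itd_samples = 12                       # 12 samples = 250 microseconds at 48 kHz
left = source
right = np.roll(source, true_itd_samples)   # the far ear receives the signal slightly later

# Each candidate lag plays the role of one coincidence detector tuned to a characteristic delay.
lags = np.arange(-40, 41)
coincidence = [np.dot(left, np.roll(right, -lag)) for lag in lags]
best_lag = lags[int(np.argmax(coincidence))]
print(f"estimated ITD ~ {best_lag / fs * 1e6:.0f} microseconds")   # ~250 us
```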

3.1.4. THE LATERAL LEMNISCUS

The lateral lemniscus is a tract through which run the ascending pathways from the superior olivary complex to the inferior colliculus. Two major nuclei are contained within the tract, known as the ventral and dorsal nuclei of the lateral lemniscus (respectively VNLL and DNLL). Although a majority of fibres are connected to one of the nuclei, some of them simply run through the tract, entering the inferior colliculus directly. The VNLL is part of the monaural sound identification stream⁶, receiving its inputs from axons of the contralateral ventral cochlear nucleus as well as other nuclei that were not mentioned in this work for the sake of simplicity. Since it receives inputs from neither the MSO nor the LSO, it does not appear to play any role in the binaural localization pathway. It projects ipsilaterally to the inferior colliculus in a complex pattern. Although experts are still unsure about the actual role of the VNLL, Langner (2005) speculated that it could potentially be able to extract harmonic relations between stimuli. The DNLL is part of the binaural sound localization pathway, receiving outputs from the ipsilateral MSO, the LSOs from both sides and the contralateral cochlear nucleus. Its role is mainly inhibitory, eventually enhancing the lateralization of sound sources that was previously created in the lower stages of the ascending pathway. It is interesting to note that, due to the lasting effect of its inhibitory projections (which will not be explained here), its role also helps enhance lateralization by suppressing echoes in echoic environments (Pecka et al., 2007).

⁴ Definition of inhibitory: slow down or prevent (a process, reaction, or function) or reduce the activity of (an enzyme or other agent).
⁵ Definition of excitatory: characterized by, causing, or constituting excitation.
⁶ The stream that deals with the identification of sounds retrieved by a single ear, as opposed to binaural information that would have previously been interpreted in the superior olivary complex.

3.1.5. THE INFERIOR COLLICULUS

The inferior colliculus could be seen as the most important data center of the lower parts of the brainstem. Here, the vast majority of the information previously treated in the different pathways (mainly sound localization, sound identification, and their respective sub-pathways) is connected, and the image of the auditory object that will eventually be perceived in the auditory cortex starts to strongly refine. Of course, this paramount stage of synthesis of basic elements entails a whole new level of complexity. Situated close to the superior colliculus (which is itself the important integrative reflex center of the visual nervous system), the inferior colliculus is composed of three divisions: the central nucleus, the external nucleus and the dorsal cortex. The central nucleus (ICC) is innervated mainly by fibres running through the lateral lemniscus, while the external nucleus and dorsal cortex receive fibres that do not run through the tract. Those two are in charge of treating information surrounding the auditory system, only indirectly bringing an improvement to the eventual auditory object. Instead, the extra-lemniscal pathway (as it is referred to) also comprises multisensory stimuli.

The ICC receives information from all four sources of binaural localization: the LSO (center of analysis of intensity differences), the MSO (center of analysis of timing differences), the DNLL (responding to both cues) and the DCN (the dorsal cochlear nucleus, retrieving the information needed for the proper localization of sounds on the vertical plane). The ICC is tonotopically organized in laminae (thin layers of organic tissue). Said differently, all of the fibres carrying different information related to a common characteristic frequency will meet on the same layer of the ICC. Studies carried out in recent years have been able to suggest some of the interactions of the four sources of binaural localization in the IC. Indeed, Loftus et al. (2004) showed that the low-frequency laminae (where ITDs dominate) receive inputs from the ipsilateral MSO (processing of ITDs) but also, interestingly, from the ipsilateral LSO (processing of ILDs). On the other hand, the high-frequency laminae were shown to receive inputs mainly from the DCN (which, as a reminder, processes the high-frequency notches used in localization on the vertical plane) and, of course, the LSO. It is worth mentioning that, thanks to recent anatomical evidence, researchers have grown to believe in the existence of further maps of information processing (apart from the spectral one). Indeed, the laminae are two-dimensional and the spectral organization covers only one axis. Some have suggested that this second dimension of the laminae would be home to a map of periodicity detection, but we have yet to prove that claim. Another option (that shall be investigated further on) points at a topological map dealing with the phase correlation of stimuli in the process of creating the perceived auditory space.

The external nucleus receives inputs from the contralateral cochlear nucleus (including its DCN), the ICC, the auditory cortex (on a descending pathway), as well as somatosensory⁷ input from the dorsal columns and the trigeminal nuclei. The dorsal cortex receives information from the contralateral inferior colliculus as well as descending inputs from the auditory cortex. Although we do not know for sure the roles of those two nuclei, many have suggested that the nature of the input received in the external nucleus points at an auditory and somatosensory integrative area allowing it to launch the required reflexes triggered by certain sounds. This feature of the auditory system is part of the so-called diffuse or extra-lemniscal system that was previously mentioned.

3.1.6. THE THALAMUS MEDIAL GENICULATE BODY

The medial geniculate body is the last auditory relay before the stimuli enter the auditory cortex and, within the scope of the descending pathways, acts as an intermediary between this auditory cortex and the rest of the subcortical nuclei. Moreover, those ascending and descending connections point at a grouping of the medial geniculate body and the auditory cortex as a functional unit. Divided into three different sections, the medial geniculate body has only one section that seems to be involved in the lemniscal auditory pathway: the ventral section. We will not focus too much on the other two, less specific areas of the medial geniculate body. The ventral section mainly collects information from the ICC (just seen previously). Similarly to the ICC, the ventral section of the medial geniculate body is tonotopically organized in a laminar structure, and it was suggested that a further functional organization underlies the specific ranges of frequencies (the functional groups were termed "slabs"). The purpose of this ventral section is said to be to further sharpen frequency resolution. The other two sections of the medial geniculate body are the medial and dorsal divisions. Being part of the extra-lemniscal pathway and receiving visual as well as somatosensory information, it is worth mentioning that their responses can change as a result of learning.

3.1.7. THE AUDITORY CORTEX

The auditory cortex is the functional unit where all of the previously gathered information will be assembled in order to form an auditory object in the listener's mind. Since the very large complexity of this unit barely allows us to scratch its surface within the scope of this work, the main information here will be covered less precisely and more abstractly. The auditory cortex consists of a core unit, surrounded by a belt and a para-belt. The core unit, mainly receiving inputs from the specific, lemniscal system, is itself composed of three main sections: the primary receiving area (AI), a secondary area (AII) and further association areas. Although the information integration processes in the auditory cortex are the same for every human, the actual neural responses will be
7

Definition from Oxford American Dictionaries: relating to or denoting a sensation (such as pressure, pain, or warmth) that can occur anywhere in the body, in contrast to one localized at a sense organ (such as sight, balance, or taste).


notably a function of the listener's genetics and previous exposure to the stimulus. Moreover, more activity is detected for stimuli of current significance to the listener in his environment. Once the information is analyzed by the core unit, it continues on to the belt and para-belt for further examination. If we summarize the stimulus' course along the brainstem (from the auditory nerve to the actual representation of the auditory object) in terms of what (sound identification) and where (sound localization) pathways, we notice that at first those two streams separate right before entering the cochlear nuclei (for better analysis of the involved cues), then progressively reunite in the lemniscal tract up to and including AI. There, as mentioned, the association of spatial and identity-specific information drives the neural activity in a unique way, effectively representing the auditory object to the listener. Indeed, different objects are represented through different (though overlapping) patterns of neural activity in the auditory cortex. We hear! Coincidences of previously experienced patterns of neural activity facilitate the integration of known stimuli. Again, the where and what streams segregate into discrete pathways: on one hand, both the identity and localization of the sound are transmitted to a further dorsal pathway in the brain (enabling preparation for a potential consequent motor response), forming a where or do stream, while a what stream continues on a ventral pathway on to several different parts of the brain.

3.2. SOUND SOURCE LOCALIZATION


As we discussed in the preceding section, the process used by the brain to create a sense of auditory space relies on several different features according to the type of signal presented to the ears. Indeed, on one hand, localization on the horizontal plane relies on ITDs and ILDs, effectively using the superior olivary complex's ability to make sense of the interaural correlation. On the other hand, localization on the vertical plane as well as judgement of the distance to the sound source mainly rely on the analysis of the spectral content of those sound sources. As a reminder, this information is analyzed in the dorsal cochlear nucleus. Therefore, it would be correct to summarize a little and write that localization on the horizontal plane mainly uses signals in the time domain, while the vertical plane as well as the judgement of distance to the sound source use frequency-domain signals. This section aims at briefly presenting those processes. 3.2.1. THE HORIZONTAL PLANE

Out of the three dimensions (horizontal, vertical and distance), localization on the horizontal plane is the best understood. The early findings of Lord Rayleigh in his Duplex Theory (1907) arguably form the core tenet of knowledge in binaural hearing. He was the first to give an explanation for the ILD and ITD phenomena, which contribute to intracranial images assimilated to lateralization of the sound source, i.e. movement of the sound source to the left or right of the listener. ILDs arise because of the physical dimensions of an incoming sound: very simply, the high-frequency content of sounds coming from the side contralateral to an ear is reflected by the head, creating an acoustic shadow. This reflection of high frequencies diminishes the energy content of the sound reaching the contralateral ear, thus creating a difference of level with the signal reaching the ear on the same side as the sound source. As we now know, those ILDs are correlated in the lateral superior olives.

On the other hand, ITD processing relies on a model that has been held as the reference for over 60 years: the Jeffress model, first introduced in 1948. It aims to explain the manner in which the time information in binaural signals is correlated in the bilateral medial superior olives. It consists of an array of neurons serving as coincidence detectors, firing maximally when reached by stimuli from both ears. Those detectors are innervated by axons of variable length that effectively create a system of delay lines, allowing one stimulus to reach all of the detectors but at different times (as can be seen in Figure 7). This arrangement allows a topographic map of ITDs to be created since, when a given detector is simultaneously reached by the stimuli from both ears, its firing corresponds to a given spatial position of the sound source.

Figure 7. Representation of the model presented by Jeffress (1948). From McAlpine (2005).

Interestingly, Stevens & Newman (1936) reported that human subjects showed the fewest azimuth sound source localization errors for frequencies below 1.5kHz and above 5kHz, not only indicating that the brain must therefore use two localization mechanisms (respectively ITDs under 1.5kHz and ILDs above 5kHz, thus backing up Rayleigh's work in the process), but also that the confusion reported between 1.5kHz and 5kHz must indicate that those mechanisms act simultaneously within that band. Those results are quite consistent with physiological reports made later on: the phase-locking of the stimuli in the auditory nerve declines for frequencies above 3kHz and is reduced to practically nothing around 4 or 5kHz, and the medial superior olive that was discussed earlier on within the scope of the ITD ascending pathway contains more low-best-frequency neurones, while the lateral superior olive (ILD pathway) contains more high-best-frequency neurones. Phase-locking? Interestingly, neurones in the MSO are actually not sensitive to time differences per se but rather rely on interaural phase differences between the two ears' inputs (McAlpine, 2005). As useful and intuitive as it is, the Jeffress model seems to be nothing but a model. Indeed, the researchers who applied themselves to finding anatomical evidence of the delay-lines concept presented by Jeffress (Smith et al., 1993; Beckius et al., 1999) never found anything convincing enough to validate the model as factual. However, ethological evidence (in the barn owl, whose hearing is incredibly developed and subject to extensive research) has encouraged many to believe that those interaural time correlations are actually the result of topological maps in which specialized neurones are tuned to fire maximally at a given phase (without the delay lines, that is), effectively giving the auditory object its azimuth. It is suggested that such a map would be placed orthogonal or parallel to the tonotopic map that was previously discussed in the section dealing with the central nucleus of the inferior colliculus. However, some observations seem to deny this claim and it is an ongoing discussion among experts. As a side note, it has been stated that the brain relies on ILD and ITD correlation to assess the lateralization of a sound source. But what is meant by that? Interaural correlation actually refers to how similar or dissimilar the signals of a given sound source reaching the left and right ears are. Two equations are often encountered in specialized literature, both yielding what is referred to as an index of correlation.
They are formally known as the normalized covariance and the normalized correlation. Basically, when the analyzed features of two signals are perfectly similar, they hold a correlation index of 1.0. However, it is not my intention to burden this paper with mathematically complex computational models of interaural correlation, so I shall not dig any deeper.
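Purely for illustration, and without dwelling on those models, one common convention (a sketch; windowing and normalization details vary across the literature) expresses the normalized correlation of the left- and right-ear signals $x_l[n]$ and $x_r[n]$ at an interaural lag $\tau$ as

\rho(\tau) = \frac{\sum_{n} x_l[n]\, x_r[n+\tau]}{\sqrt{\sum_{n} x_l^2[n]\;\sum_{n} x_r^2[n]}},

the index of correlation then being the maximum of $\rho(\tau)$ over the physiologically plausible range of lags, so that two identical signals indeed yield 1.0.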

Other kinds of computational models for the binaural processing of 3D audio content will follow soon enough in the Machine chapters. Let us discuss Lord Rayleigh's findings again for a bit. Apart from the idea of the acoustic shadow formed by the head on the side contralateral to the sound source position, which is relevant at high frequencies, Rayleigh also introduced the concept of the cone of confusion. The prestigious Oxford Reference website defines this cone of confusion as follows: "A cone-shaped set of points, radiating outwards from a location midway between an organism's ears, from which a sound source produces identical phase delays and transient disparities, making the use of such binaural cues useless for sound localization. Any cross-section of the cone represents a set of points that are equidistant from the left ear and equidistant from the right ear."8 The cone of confusion is thus the cone that could be drawn around a listener's ear containing points whose ILD and ITD values are identical, such that the listener could get confused as to the actual position of the sound source.

Figure 8. Representation of Rayleigh's cone of confusion. 2007 HowStuffWorks.

A related psychoacoustical phenomenon that has left a number of binaural simulation experts wondering is the front-back confusion.9 What does it consist of? Quite simply, front-back confusions consist in the listener's inability to decide whether the sound source emanates from up front or from behind him/her, or more so in localizing a sound up front when it emanates from behind and vice versa. They are thought to be mostly produced by confusing ITD values for sound sources belonging to this cone of confusion we just discussed. Indeed, for any azimuth up front, the same ITD value exists for a sound source placed at the back. However, the occurrence of such confusions can be greatly diminished when the listener is able to rotate his/her head. Indeed, this new dynamic cue modifies the perceived ITD and ILD values, helping the listener's brain make a more informed decision as to where in space it should place the incoming stimulus. For example, if a sound source is presented at an azimuth of +20° on the center-right of a listener and a front-back confusion occurs whereby the listener localizes the sound source at +160°, then slightly turning his/her head towards the right will decrease the perceived ITD and ILD values, effectively allowing the listener to localize the sound source at its actual position. However, as suggested in Wallach (1940) and shown in Wightman and Kistler (1999), the actual movement of the head is not necessary for diminishing front-back confusion. For example, if the listener is placed on a rotating platform while receiving stimuli from a static sound source, the listener does not need

8 Definition retrieved from http://www.oxfordreference.com/view/10.1093/oi/authority.20110810104643902 on Sept. 17, 2013.
9 It is worth mentioning that such confusions mainly happen in experimental conditions and rarely under normal conditions of everyday life. However, it is important to mention them as they will show to be of great importance in the binaural virtualization of content discussed in the Machine section.


to rotate his/her head in order to extract the relevant information provided by the dynamic cue, as long as he/she is aware of the direction of his/her relative movement. The correct positioning of sound sources situated at the rear of the listener's head also depends on another, stationary factor that is only enhanced by the dynamic one we just discussed. I am speaking of the phenomenon happening when a sound source is situated behind the listener and the high-frequency content is not able to diffract around the pinna, resulting in a form of low-pass filtering. As suggested by Wightman and Kistler (1997a), front-back differences are mostly indicated by level differences in the 4-6kHz region. 3.2.2. THE VERTICAL PLANE

A good starting point in the discussion of localization on the vertical plane is to look at the results obtained in Butler and Humanski (1991). Listeners were seated by a vertical arch of seven loudspeakers, fixed to the beam from 0° to 90° and positioned in increments of 15°. The testing was organized under six different conditions: in Conditions 1 and 2 the listeners were presented respectively with 3kHz low-pass then high-pass noise bursts originating in the LVP (lateral vertical plane), and they were able to localize sounds binaurally, i.e. using both their ears. In Conditions 3 and 4, the same noises were presented binaurally but this time originating from the MVP (median vertical plane). Conditions 5 and 6 were similar to Conditions 1 and 2, only the listeners' localization abilities were tested monaurally, i.e. using one ear only10. The researchers found that in Condition 1 (when listeners were presented the low-pass noise in the LVP) they were very capable of localizing the sound sources. This result was expectable given the previous discussion we had: the listeners relied on the availability of binaural information. However, the listeners performed poorly at assessing the sound sources' elevation in the MVP (Condition 3) with the same low-pass noise. Indeed, no cue was available to relate to that elevation, as the pinnae's filtering abilities that could have provided the necessary information only appear at higher stimulus frequencies (Searle et al., 1975).

Figure 9. Apparent elevation of the sound sources plotted against their actual elevations. From Butler and Humanski (1992).

On the other hand, when the listeners were presented the high-pass noise bursts (in Conditions 2 and 4) they performed (substantially) better, especially in the MVP. Therefore, it seems clear that localization on the vertical plane depends mostly on the pinna's ability to distort the stimuli's high-frequency content into peaks and notches (mostly between 4kHz and 16kHz notably, see Blauert, 1969) according to the sound source's elevation. This applies mostly to localization in the MVP, which is, as

10

Naturally, monaural testing makes it possible to isolate the cue related to high-frequency content from the ILDs and ITDs, which are binaural cues.


we shall see, a great area of potential improvement in the binaural reproduction of 3D content. 3.2.3. DISTANCE FROM THE SOURCE

The ability of a listener to estimate the distance from a sound source (without visual capture) depends on his/her ability to (mostly unconsciously) determine the way the original signal has changed through the propagation process, according to three main factors: the relative intensity of the sound, the damping effect (i.e. the relative intensity of high-frequency content) and the direct-to-reverberant energy ratio. This section holds the purpose of briefly presenting those three important parameters. We will then discuss the influence of visual capture of the sound source over the estimation of its distance to the subject, as well as how the accuracy of this estimate directly relates to the subject's previous exposure to the perceived auditory object and room acoustics. The first factor, the relative intensity of a sound, does not quite come as a surprise, as it is well known that sound waves propagating in free space lose 6dB in sound pressure every time their distance from the source doubles. Therefore, judgments of distance increase systematically when the relative sound pressure reaching the eardrums is decreased. This feature points toward a system of internal reference: the expectation of intensity of a given auditory object is compared to the actual occurrence, and the comparison of this occurrence against our internal scale allows us to estimate the distance to the sound source of the given auditory object. The second factor is closely related to the first one, but is to be considered as distinct nonetheless. It deals with the damping effect, i.e. the amount of high-frequency energy that diminishes as a function of distance due to atmospheric absorption. Coleman (1968) supported this notion by showing that a low-pass-filtered signal (with a gentle slope) was consistently localized further away from the subject than the same signal unaltered. Finally, the third main cue used by listeners to assess the distance from a sound source is the ratio of energies along the direct (i.e. direct field) and indirect (i.e. diffused field) paths to the receiver. This ratio can be called the direct-to-reverberant energy ratio. The higher the ratio, the closer the estimated sound source position, and vice versa. Research has shown that estimations of the distance to sound sources vary greatly when single-modality (either auditory or visual) estimates are compared. For example, listeners tend to largely underestimate those distances when the actual position of the sound source is more than about a meter away. When estimating the distance of a sound source, one would expect that combining vision and hearing would always improve the localization. Not so much. Indeed, Gardner (1968) found an effect (which he termed the proximity-image effect) that selects the closest plausible visible location as the apparent sound source position, even though the actual source might be meters further away. It is worth mentioning that in Gardner's study this effect was reported under anechoic-chamber settings, thus preventing reverberation from bringing further information to the listener, but a subsequent study (Mershon et al., 1980) actually found that this proximity-image effect works almost as efficiently in reverberant environments as in anechoic ones, whereas another (Zahorik, 2001) concluded that this effect is not as definite as previously thought and that methodological differences did not allow scientific conclusions to be drawn from the comparison between the studies.
This last study also noted that, throughout the experiment, listeners seemed to have improved their localization


skills within the given environment, suggesting quite clearly that, as one would expect, it is possible for the brain to learn from experience in order to perform its tasks more accurately.
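For reference, the first of those three cues can be summarized by the standard free-field relation (a textbook formula, not tied to any particular study cited here):

L(r) = L(r_0) - 20\,\log_{10}\!\left(\frac{r}{r_0}\right)\ \text{dB},

where $r_0$ is a reference distance; doubling $r$ indeed lowers the level by about 6dB, which is the drop that the listener's internal expectation of intensity is thought to be compared against.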


4 THE MACHINE: REQUIRED BACKGROUND


4.1. INTRODUCTION TO THE MACHINE

The ultimate purpose of the two Machine sections that are about to unfold is to outline two different ways to apprehend the offline binaural conversion of three-dimensional audio content. The first section serves as an introduction and holds the purpose of presenting several concepts that will show to be relevant in this technical endeavour, while the second section will actually present and compare both suggested models. The first model relies on channels while the second relies on objects. Because the research was based on the assumption that both models were to reach the same design goals specified in section 5.1, the purpose of the present section is to explore means to reach those design goals. However, the actual testing of the presented models is not encompassed in the scope of this study and will potentially form the object of further work. Sound recording's history has taught us that ever since the appearance of the first sound-related technologies in the 19th century, the driving force behind this industry's evolution through the decades has been the effort to improve the sense of immersion brought to the listener. In that perspective, a highly non-exhaustive list of seminal technical improvements would include the following technologies: stereophonic sound, first patented by A.D. Blumlein in 1931 (see reference), the general improvement of analog circuits' linearity throughout the 20th century (transducers included), Disney's Fantasia technology "Fantasound" in the 40's, then quadraphonic sound in the early 70's, which ushered in the Dolby and DTS 5.1 surround sound systems, largely permitted by the digital technology revolution started in the early 70's and matured more profoundly in the 2000's. The 2000's saw the democratization of higher sampling rates as well as higher bit-depth resolution, permitting a more realistic experience of sound. The 2010's are now witnessing the "3D momentum" (with products such as Barco's Auro 3D and Dolby's Atmos), where the topic has been studied by experts for years even though audiences are only starting to get acquainted with the recent, booming technologies. However, although the commercial success of the new generation of 3D sound products is increasing and a fair number of movie directors report being satisfied with this new creative aspect of moviemaking, the money and space investments needed for consumers to acquire surround (let alone 3D) sound systems in their homes seem to deter them from investing in the immersive experience. Are 3D sound technologies then confined to the movie theaters? Binaural technologies hold the potential to negate this idea, not only allowing consumers to enjoy 3D sound in the comfort of their homes, but also bringing it to the mobile experience.

4.2. HEAD-RELATED TRANSFER FUNCTIONS


Head-Related Transfer Functions (HRTFs) are a mathematical attempt to isolate, in the form of a filter, the transfer functions containing all of the previously seen cues necessary for the human brain to localize a sound source at a given point in space. They thus comprise both the ITD (encoded in the filter's phase spectrum) and the


IID (encoded in the filter's overall power), as well as the ear's frequency response corresponding to the position where the stimulus was played back relative to the listener or mannequin. The measurement is made by playing a known stimulus through a loudspeaker placed in a free field, whose position is stipulated at a given azimuth (θ), elevation (φ) and distance (Cheng & Wakefield, 2001). The impulse response is generally captured by small microphones placed in the listener's ears. HRTFs are oftentimes specified as minimum-phase FIR filters. This characteristic becomes very useful notably in the case of HRTF interpolation, where an FIR filter can mimic the attributes of an HRTF and is reported to give perceptually acceptable results (Kulkarni et al., 1995). In the case of real-time processing, such practices are of paramount importance in order to be able to output immersive audio, and research in that field has become increasingly important in recent years. It is well worth mentioning that the quality of the immersive experience is going to depend directly on the quality of the HRTFs. Indeed, since each individual's pinnae and body are unique, personalized HRTFs should always be used. However, although the measurement is quite fast and rather straightforward, not everyone can have their own HRTFs measured, as the procedure notably requires specific gear as well as a calibrated multichannel system. That is the reason why several experts in the field have applied themselves to studying the different physical factors influencing HRTF measurements, in order to gain the understanding needed to build general HRTF databases that would eventually allow listeners to choose the HRTFs that best suit them. Indeed, it is not required for listeners to have their own HRTFs in order to have perfect localization abilities, as some studies have shown that it is possible for humans to adapt to another way of localizing sounds.
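As a purely illustrative sketch (in MATLAB, the language of the appendix), the two cues said above to be embedded in an HRTF pair can be estimated from a measured pair of head-related impulse responses. The names hrirL, hrirR (assumed to be 256-sample column vectors for one direction) and fs are placeholders, not part of any particular database's format.

% Broadband ILD estimate: ratio of the RMS values of the two impulse responses, in dB.
ild_dB = 20*log10( sqrt(mean(hrirL.^2)) / sqrt(mean(hrirR.^2)) );

% Broadband ITD estimate: lag of the maximum of the interaural cross-correlation,
% computed here with conv (flipping one response) so that no toolbox is required.
xc = conv(hrirL, flipud(hrirR));
[~, iMax] = max(abs(xc));
itdSamples = iMax - length(hrirR);   % 0 means no interaural delay
itd_ms = 1000 * itdSamples / fs;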

4.3. CHANNELS VS OBJECTS

The two current sound technologies are the channel-based model and the object-based one. Their purposes are similar and their outcomes in the immersion realm are relatively close. However, their perspectives are quite different from each other, and their suitability varies according to the application they are put to. Unsurprisingly, the channel-based model holds channels as the reference. In the recording and/or mixing process, each channel is attributed one signal, which is meant to be reproduced over a speaker placed at the same relative position as when the recorder/mixer approved the content. Therefore, the use of standard speaker layouts has become widely accepted in order to provide a reference system for the audience to enjoy the content the way it was meant to be. The inconvenience of this channel-based technology is that mixes approved in one given speaker layout cannot translate into another one without using up- or down-mixing, thus a priori forcing content providers to mix their materials several times, adding to the production costs (although some mixing techniques can be used to overcome this limitation to a certain extent, which will not be discussed in the present paper). The object-based model does not rely on the same channel concept, but rather handles sounds as objects. Each object is assigned a position coordinate on axes X, Y, Z, which varies according to a timecode. The purpose of this system is to be able to automatically rescale the reproduced mix to the available system layout, thus allowing better flexibility. However, the reproduction of object-based spatial audio requires the use of decoders in order to render the sounds according to the current system setup. This characteristic of the object-based model can be problematic in some cases, as it raises the question of the absence of a true referential master.
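As a purely illustrative sketch of the difference in data layout (MATLAB, with hypothetical file names; the coordinate table format is an assumption, not any published object-audio specification):

% Channel-based: one fixed signal per loudspeaker of an agreed layout.
channels.L = wavread('mix-L.wav');
channels.C = wavread('mix-C.wav');
channels.R = wavread('mix-R.wav');

% Object-based: one signal plus time-stamped coordinates, rendered at playback
% to whatever speaker layout (or binaural system) is available.
object.audio  = wavread('dialogue.wav');
object.coords = [ 0.0   1.0  0.2  0.0 ;    % [time(s)  X  Y  Z]
                  0.1   0.8  0.3  0.0 ;
                  0.2   0.6  0.4  0.1 ];   % the renderer maps/interpolates these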

4.4. CONVOLUTION
The most common way to render spatial audio over headphones using HRTFs requires the use of convolution. Let us process a monaural signal x[i] so that a listener can localize it at a given azimuth θ and elevation φ. The result of the processing yields yl[n] and yr[n], which are to be played back simultaneously on a pair of headphones, respectively by the left and right membranes. The HRTF database used in this research is the ARI database [6], which contains 256-sample-long HRTFs. Since HRTFs are often referred to as minimum-phase FIR filters, let d_{min,l,θ,φ} and d_{min,r,θ,φ} be the minimum-phase impulse responses measured for azimuth θ and elevation φ.
y_l[i] = \sum_{j=0}^{M-1} d_{\mathrm{min},l,\theta,\varphi}[j]\; x[i-j]

y_r[i] = \sum_{j=0}^{M-1} d_{\mathrm{min},r,\theta,\varphi}[j]\; x[i-j]

Great! But can we explain what convolution is in plain words? Just like addition is the mathematical operation of combining two numbers into a third one, convolution is the operation that allows us to combine two signals (the input signal and the impulse response) into a third one (the output signal). Its symbol is the star *, which should not be confused with the multiplication symbol used in computer programs! Practically, the expression x[n] * h[n] = y[n] could be translated as: signal x[n] is convolved with impulse response h[n], resulting in output y[n]. At this point, it is already worth mentioning that convolution is commutative, i.e. x[n] * h[n] = h[n] * x[n] = y[n]. An impulse is a signal whose points are all zeros, besides one. The delta function, expressed δ[n], is a normalized impulse, i.e. its only nonzero sample is situated at index zero and has a value of one. When the delta function enters a given linear system, the output is called an impulse response, h[n]. However, any impulse can be expressed as a shifted and scaled version of the delta function; for instance, let us consider signal d[n], composed of a sample that has a value of -2 at index n=4, and whose other samples are all zeros. Signal d[n] is thus a delta function, only shifted to the right by 4 samples and multiplied by -2. Therefore, d[n] = -2δ[n-4]. We say that an impulse response is the definition of a system of convolution because, when its identity is known, we know how any signal is going to react when passed through the system. Actually, the impulse response is the system. Convolution being a paramount building block of digital signal processing, it is worth noting that the term used to refer to the impulse response of a system can vary according to the application. Indeed, it is called a point-spread function in the field of image processing, or a kernel if the considered system is a filter. As previously mentioned, our HRTFs are considered to be filters to their input files, therefore kernel is the right term to use in our case. In most practical cases, the input files of convolution are several thousands of samples long, while the impulse responses are usually much shorter. In our case, the input files are going to be the audio signals, while our kernels will be the HRTFs, which are, as mentioned, 256 samples long. The output files will be the same audio signals, but spatialized so the brain is able to localize them at the intended spot in space. The number of samples contained in those output files will show to be of great importance in the proposed models presented in further sections. Fortunately, the formula used to calculate this number is very simple; the number of samples in the

32

output files equals the number of samples in the input file, plus the number of samples contained in the kernel, minus one. While convolution can certainly be approached from several different perspectives, the short introduction presented here will merely allow one. The point of view that shall be focused on is called the input-side algorithm, and it will teach us how the input signal contributes to the making of the output signal. Although the input-side algorithm perspective does not provide a good mathematical understanding of convolution, it does allow us to gain some conceptual insight into the process of convolution, which is exactly what we are aiming for in this current section. Let us use a simple example of convolution for a 9-point input signal x[n] and a 4-point impulse response h[n].11 The input signal can be decomposed into discrete samples that can then be considered shifted and scaled versions of a delta function. Therefore, when looking at sample x[2] (the sample situated at index 2 in x), which has a value of two, we see that it can be expressed as 2δ[n-2] because it corresponds to a delta function multiplied by 2 and shifted two indexes to the right. After passing through the system, this component of x called x[2] becomes 2h[n-2]. We can visually verify this concept in the second box of Figure 10, where the little diamonds serve as place holders in each box and are just added zeros, while the squares represent the actual contributions from each point of the input signal x[n].

Figure 10. Representation of the convolution between signal x[n] and impulse response h[n] yielding signal y[n]. Taken from "Digital Signal Processing - A Practical Guide for Engineers and Scientists" written by Steven Smith.

Very briefly, the input-side algorithm works as follows (see Figures 10 and 11): once the vectors are placed into their respective arrays (x[] for the input file, h[] for the impulse response) and the usual programming practices are taken care of in the script (notably zeroing the output array y[], because it serves as an accumulator and therefore needs to be reinitialized before each execution), two for loops are initiated. The first loop goes through every single index of x[] to individually look at all of the input signal's samples. For each of them (still associated with modified delta functions), a second, inner loop calculates a shifted and scaled version of the impulse response contained in h[]. Each result is then added to the output array y[].
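The following minimal sketch (MATLAB, the language used in the appendix) implements the input-side algorithm exactly as just described. The 9-point input and 4-point kernel hold arbitrary example values; they are not the ones plotted in Figure 10.

x = [0 -1 -1.2 2 1.4 1.4 0.6 0 -0.6];   % example 9-point input signal
h = [1 -0.5 -0.25 -0.1];                % example 4-point impulse response (kernel)

Ny = length(x) + length(h) - 1;         % output length: Nx + Nh - 1 = 12 points
y  = zeros(1, Ny);                      % the output array acts as an accumulator

for i = 1:length(x)                     % outer loop: every sample of the input...
    for j = 1:length(h)                 % ...adds a shifted, scaled copy of h to y
        y(i+j-1) = y(i+j-1) + x(i)*h(j);
    end
end

% Sanity check against MATLAB's built-in convolution:
% max(abs(y - conv(x, h)))   % should be (numerically) zero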

11

The schemes used in this example were taken from the excellent The Scientist and Engineer's Guide to Digital Signal Processing written by Steven W. Smith (1997).


Figure 11. Representation of the input-side algorithm. Taken from "Digital Signal Processing - A Practical Guide for Engineers and Scientists" written by Steven Smith.

4.5. INTRODUCTION TO DIGITAL FILTERS


This section gives a very short introduction to the very large topic of digital filters. The goal is to convey some insight into the way our filter kernels (HRTFs) are going to interact with our input signal. Every filter is characterized by three main attributes: the impulse response (i.e. its filter kernel), the step response and the frequency response. It is worth noting that all three of those attributes actually represent the same information, only described from different perspectives. Indeed, it is no problem to convert the information found in the impulse response to obtain the step response or the frequency response: integration12 of the impulse response yields the step response, whereas taking the DFT (by means of the FFT algorithm) of this IR yields the filter's frequency response. The realm of filtering is one of decisions; indeed, there is no such thing as a perfect filter. Therefore, a filter's characteristics are to be adapted according to its function. Much in the same way as in the central auditory nervous system, where the stimuli from the cochlear nerve were split into two distinct pathways respectively handling time and frequency information (for the good reason that such a system was necessary in order to preserve the features of the stimuli relevant to each stream), filters are not able to perform well in both the time and frequency domains at the same time. Therefore, the step response can be focused on if the application requires high time-domain resolution, while the frequency response can be improved if the filter is to be
12

Or, to be mathematically correct, taking the running sum. Indeed, integration is an operation that applies to continuous signals only, whereas the running sum is the appropriate term when dealing with discrete signals.


used in an application demanding good frequency resolution. Let us take a closer look at their respective parameters. First of all, what is the step response? And, before that, what is a step function? In order to answer this second question, it can be useful to note how we, humans, interpret signals: our brain is capable of dividing stimuli into regions of similar characteristics (such as noise, then high amplitudes, then low amplitudes, etc.) by identifying the turning points between those regions, i.e. the points that separate them.

Figure 12. Representations of the good and poor characteristics of a filter designed for a timedomain application. Taken from "Digital Signal Processing - A Practical Guide for Engineers and Scientists" written by Steven Smith.

That is exactly what step functions are: turning points between zones of similar characteristics. The step response (which can also be found by taking the running sum of the impulse response) results from feeding a step function into a given system, i.e. in our case a filter. Basically, the step response shows us how the step function was affected by the filter. The step response is characterized by three main parameters: risetime, overshoot and linearity. In order to design a filter for use in the time domain, the risetime needs to be shorter than the spacing of the events, in order to provide good resolution. The step response should not overshoot, because overshoot distorts the amplitude of samples in the signal. Lastly, the linearity of the filter is determined by the fact that the upper half of the step response is a point reflection of its lower half (see Figure 12).


Figure 13. Representations of the good and poor characteristics of a filter designed for a frequency-domain application. Taken from "Digital Signal Processing - A Practical Guide for Engineers and Scientists" written by Steven Smith.

Filters used for applications in the frequency domain have three main parameters: roll-off, passband ripple and stopband attenuation. A fast roll-off allows a narrow transition band13, there should be no passband ripple so that the signal we want to keep is not affected by the filtering, and the stopband attenuation should be maximal. Now that we can look at digital filters a little more clearly, we can look at the two possible ways to filter an input signal: convolution and another process called recursion. Convolution allows us to create FIR (Finite Impulse Response) filters, while recursion yields IIR (Infinite Impulse Response) filters. In theory, FIR filters are fantastic in our case because they have the great feature of not distorting the phase of our input signal, which will show to be of paramount importance in further sections.

13

A transition band is the band situated between the pass-band and the stop-band, i.e. the band it takes to go from -3dB of the pass-band to the stop-band. It is worth noting that while this claim is correct in the analog realm, transition bands in the digital realm were never really standardized and are often stipulated in percentages (99%, 70.7% which equals -3dB, 50%, etc.).
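To tie this section together, here is a minimal sketch (MATLAB) of how the three equivalent descriptions of a filter mentioned earlier relate in practice: the running sum of the impulse response gives the step response, and its DFT (via the FFT) gives the frequency response. The kernel h and the sampling rate below are placeholders standing in for one of our 256-sample HRTFs and its rate.

h  = rand(256,1) - 0.5;               % placeholder 256-sample kernel
fs = 44100;                           % assumed sampling rate

stepResponse = cumsum(h);             % running sum of the IR = step response

Nfft = 1024;                          % zero-padded FFT for a smoother curve
H    = fft(h, Nfft);                  % DFT of the IR = frequency response
f    = (0:Nfft/2-1)' * fs/Nfft;       % frequency axis in Hz
magnitude_dB = 20*log10(abs(H(1:Nfft/2)) + eps);

subplot(2,1,1); plot(stepResponse);    % inspect risetime, overshoot, linearity
subplot(2,1,2); plot(f, magnitude_dB); % inspect roll-off, ripple, stopband attenuation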


5 THE MACHINE: IMPLEMENTATION STRATEGY


5.1. DESIGN GOALS
The goal of the converting algorithm is to output stereo, spatialized files that are as immersive as possible for the listener. Although the script is offline and the computing time is not of paramount importance, the algorithm must be written efficiently enough to allow future changes for possible real-time adaptation. Human psychoacoustics considerations being of extreme importance for the immersive quality of the output files, they must be placed at the center of the design goals. Practically, those physiological considerations translate into the following propositions: (1) The converter must provide as many virtual sound source positions as possible. The MAA (Minimum Audible Angle) is the basic metric of the relative localization ability of the listener, and is thought of as the smallest angle detectable by humans in azimuth or elevation for a sound source (Letowski & Letowski, 2012). Therefore, the MAA is a good indicator of the resolution of the auditory localization system. On the azimuthal plane, humans showed the ability to discriminate changes of only 1° or 2° in the frontal position when wide-band stimuli and low-frequency tones were played (Grothe et al., 2010). Those values were reported to increase to 8-10° at 90° and decrease again to 6-7° at the rear (Letowski & Letowski, 2012). The MAA reported on the elevation plane is about 3-9° in the frontal hemisphere, and almost twice as large in the rear hemisphere (at 60° in elevation) (Letowski & Letowski, 2012). However, it is well worth mentioning that the MAA does not quantify absolute localization judgements, but only relative ones. The reported measurements were much larger for the average error in absolute localization for a broadband source: 5° for the frontal and about 20° for the lateral position (Hofman & Van Opstal, 1998). This valuable information can be useful in estimating the relevance of the ARI HRTF database used in the present research. The HRTFs were measured in incremental steps of 2.5° in the azimuthal range of ±45° and of 5° outside this range. The elevation was measured in increments of 5°. From these facts, we can draw the conclusion that the ARI database has good enough resolution for HRTF interpolation not to be considered in the case of the present research. (2) The converter's time reference must be short enough to provide good resolution in the sound sources' movement. Humans' ability to perceive sound motion relies on a series of cues, the main ones being the radial and the angular velocities (Letowski & Letowski, 2012). The radial velocity is the one at which sound sources move towards or away from the listener, directly affecting the sound intensity as well as inducing Doppler shifts in sound frequency. On the other hand, the angular velocity represents the velocity at which sounds rotate around the listener and is perceived through monaural and binaural localization cues. Although the radial velocity has little impact in the present research because the ARI HRTF database does not include ear-source distance variations, the angular velocity turns out to be very useful information to work with. The MAMA (Minimum Audible Movement Angle), which is the primary metric used in reporting perceived sound source motion, is defined as the smallest angular distance the sound source has to travel for its direction of motion to be detected. It could therefore be thought of as the detection threshold for movement.
The MAMA is the smallest in the listener's frontal plane and increases as the sound source moves away to the sides of the head. Indeed, a minimum duration of


150-200 ms in the 0-60° range of observation angles was reported, and the durations increased by ~25-30% at larger angles for sound sources moving at low velocities. 150ms thus seemingly being the shortest time for a human to perceive sound source motion, the time reference for the models was chosen to be inferior to that value, namely 100ms, so that excellent movement resolution could be obtained. (3) The converter must induce minimal phase artifacts.

5.2. OVERVIEW OF THE CHANNEL-BASED MODEL


The converter is fed audio files in the form of channels. The simplest form of algorithm would be to convolve those channels with the HRTFs corresponding to the physical positions of the speakers in the required layout, effectively creating static, "virtual speakers". However, the sense of immersion induced by this technique is limited because of the very few sound source positions available. In order to meet the first requirement of our design goals, two specific categories of sounds are to be distinguished: static sounds and sounds evolving in space. The implementation of the following process is suggested in order for the program to distinguish between both categories. Nonetheless, it is worth noting that the presentation of such a process does not aim at providing an exhaustive and precise implementation strategy for the channel-based algorithm, but rather is used as a means to recognize the type of processing required for the binaural auralization of 3D channel-based content. First of all, a timecode is applied to the audio content in order to provide it with a time reference system. As justified in section 5.1, the time reference is the tenth of a second. The signals contained in every channel are then analyzed to reveal their frequency domains, in order to have knowledge of the energy contained in each of their sub-bands. However, because the signals dealt with in the present application are usually non-stationary, the proposed method requires the use of wavelet transforms as opposed to the traditional Fourier transform, which is not suitable in this case because of its lack of precision in revealing a non-stationary signal's temporal structure. Through the scaling and time shifting of the mother wavelet function, the input signal can effectively be analyzed to reveal its spectral content as intended. The next step in the process leads to a complex comparison system of all of the channels' sub-bands' RMS values with a windowing time of 100ms. The purpose of such a system is to monitor the spectral activity of the channels in order to draw conclusions about the spatial evolution of sounds from one channel to the other. Each channel's sub-bands are compared to the corresponding sub-bands of the channels played back on speakers whose physical positions are adjacent to the analyzed channel, 100ms later. Let us exemplify this idea using Barco's Auro 11.1 speaker layout (L, C, R, Ls, Rs, HL, HC, HR, HLs, HRs, VoG and LFE), with the analysis of the Right channel's sub-band centered on 1kHz. This Right channel's sub-band's RMS value at time 00:00:00:10 will be compared to the RMS values of the same sub-bands (i.e. centered on 1kHz) of the C, HC, HR, HRs and Rs channels at time 00:00:00:20. For simplicity's sake, let us continue expressing the present idea with only one of R's adjacent channels, namely C. When comparing the RMS value of R's sub-band centered around 1kHz with C's, three different outcomes can be expected: the RMS value of R's 1kHz sub-band at 00:00:00:10 can be either greater, equal or smaller than C's at 00:00:00:20. Depending on the reached outcome, logical conclusions can be drawn from these observations: if R's sub-band is quieter than C's, chances are that whatever sound contains energy at 1kHz is evolving from the right to the

center. Inversely, if R's sub-band is louder than C's, the sound is evolving from the center speaker to the right one. If the energy contained in R's sub-band is the same as C's, the sound scene is likely to be static at that time. However, this form of spectral tracking is far from infallible: for example, it could not track sounds whose spectral content evolves along with their positions in space. Channels are then separated into sub-bands using band-pass FIR filters, and each channel's sub-bands are convolved with HRTFs corresponding to the positions retrieved from the analysis of the channels' spectral content explained in the previous paragraph; static sound sources are convolved with the HRTFs corresponding to the virtual speakers' positions, whereas sound sources evolving in space are convolved with HRTFs corresponding to intermediate positions between those virtual speakers. However, this method of binaural spatialization yields phase distortion of the channels' signals, which goes against proposition (3) of the design goals. This brief explanation allows one to start realizing the complex development procedures required in order to binaurally translate 3D channel-based audio content while aiming to reach the design goal expressed in 5.1 stating "The converter must provide as many virtual sound source positions as possible". Indeed, in order to do so, a system of detection of wave patterns coupled with minimum-phase FIR filters (whose qualitative performance can be very high if properly designed, but which would subsequently show poor computational efficiency) should allow the script to perfectly "crop" every single element of the audio content in order to individually convolve it with the HRTFs corresponding to its position. Although nowadays such a procedure would be impossible to achieve with perfect results, it would effectively turn channel-based content into object-based content.
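The sketch below (MATLAB) illustrates only the sub-band comparison step for one pair of adjacent channels. The file names, the fir1 band-pass design (Signal Processing Toolbox assumed available) and the plain rectangular 100ms windows are illustrative assumptions, not the exact system described above.

[r, fs] = wavread('SPR-R.wav');           % Right channel stem (hypothetical file)
[c, ~ ] = wavread('SPR-C.wav');           % Centre channel stem

b = fir1(256, [900 1100]/(fs/2));         % linear-phase FIR band-pass around 1kHz
rBand = filter(b, 1, r);
cBand = filter(b, 1, c);

win  = round(0.1 * fs);                   % 100ms analysis window
nWin = floor(min(length(rBand), length(cBand)) / win);
rmsR = zeros(nWin,1);  rmsC = zeros(nWin,1);
for k = 1:nWin
    idx = (k-1)*win + (1:win);
    rmsR(k) = sqrt(mean(rBand(idx).^2));  % R's 1kHz sub-band, window k
    rmsC(k) = sqrt(mean(cBand(idx).^2));  % C's 1kHz sub-band, window k
end

% Compare R at window k with C at window k+1, as described in the text:
movingTowardsCentre = rmsR(1:end-1) < rmsC(2:end);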

5.3. OVERVIEW OF THE OBJECT-BASED MODEL

Sounds are considered to be objects, each with its own set of spatial coordinates in regard to the time reference. Since the HRTF measurements from the ARI database only include the direction of the incoming signal and not the ear-source distance (like most of the available HRTF databases), only two axes are relevant in our position coordinate system: azimuth (θ) and elevation (φ). The azimuth parameter has increments of 2.5° from -45° to +45° and of 5° for the rest of the sphere. The elevation parameter has increments of 5° throughout. As explained in 5.1, the time reference is the tenth of a second in order to provide good locational accuracy when the need arises to process objects that quickly evolve in space. The spatial coordinates and the time reference (timecode) for each object are stored in a .txt file. The purpose of the program is to read the object's own .txt file to use its coordinates in regard to the timecode, associate those coordinates with the corresponding HRTF, and convolve this HRTF with the object. For best efficiency, a function handling the coordinates/HRTF association can easily be built into the program so it does not have to be recreated during each execution. During each 100ms window, a number of samples from the object are convolved with the HRTF corresponding to their coordinates. The number of samples depends directly on the sampling rate of the .wav object. For example, an object whose sampling rate is 44100Hz is segmented into pieces every 4410 samples. Those 4410 samples are then convolved with the HRTF corresponding to their coordinates at that point in time. All of ARI's HRTF lengths being equal to 256 samples, according to basic convolution rules the number of samples resulting from this single operation

will thus amount to 4410 + (256-1) = 4665 samples (which will be referred to as "window convolution output samples"). When these window convolution output samples are placed side by side into a vector, a system of overlap ensures that the total amount of samples does not increase as a side effect of the convolution process, and also ensures that the valuable information contained in the tails of the window convolution outputs is not disregarded. In practice, still in the case of an input signal originally sampled at 44,100Hz, the first 4665 samples resulting from the first convolution are laid into a vector, but the second "load" of 4665 samples will start at index 4410 (counting from zero), adding itself to the remaining 255 samples from the first window convolution output. This operation goes on until all of the object's samples are processed. As explained earlier, HRTFs work in pairs. Therefore, for each object, this aforementioned process needs to be carried out twice: once for each transfer function at any given position. Once the convolution process is finished for all of the objects, their vectors can then all be padded with 0s in order for them to have the same vectorial dimensions. After padding, two variables that will be referred to as Left Ear Bucket and Right Ear Bucket are initialized. Those so-called buckets respectively gather all the values stored in the vectors affected by the left-ear HRTFs (in the Left Ear Bucket) and by the right-ear HRTFs (in the Right Ear Bucket). The last step concerns normalization. In this case, the EBU R128 standard was chosen in order to normalize signals to an appropriate level and is implemented through the use of publicly available C++ libraries. Independently from these considerations, a good question to raise is one that addresses the capability of the object-based model to reproduce diffused-field audio. Given the fact that, by definition, one object can only hold a maximum of one position at a given point in time, it would then be impossible to reproduce a soundscape recorded with a microphone array using objects solely. Such sounds belong to the channel-based domain and are commonly referred to as beds. They require the use of up- or downmixing in order to be adapted to the number of speakers available in the object-based reproduction system used.
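The following minimal sketch (MATLAB) shows this windowed convolution with overlap-add for one ear of one object sampled at 44100Hz. The variable obj (the object's mono signal, a column vector) and the helper hrtfForWindow(k), which stands for the coordinate-to-HRTF lookup described above, are hypothetical names used for illustration.

fs   = 44100;
hop  = fs/10;                            % 4410 samples per 100ms window
M    = 256;                              % HRTF length (ARI database)
nWin = floor(length(obj)/hop);
out  = zeros((nWin-1)*hop + hop + M - 1, 1);

for k = 1:nWin
    seg = obj((k-1)*hop + (1:hop));      % the k-th 4410-sample segment
    h   = hrtfForWindow(k);              % HRTF for this window's coordinates
    yk  = conv(seg, h);                  % 4410 + 256 - 1 = 4665 samples
    idx = (k-1)*hop + (1:length(yk));    % hop of 4410 -> 255-sample overlap
    out(idx) = out(idx) + yk;            % overlap-add of the window outputs
end
% The same loop is run a second time with the right-ear HRTFs to obtain the
% other member of the stereo pair.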

5.4. CHANNELS VS. OBJECTS: CONCLUSION

Although the perspectives of the channel- and object-based implementation models presented in this paper are very different from each other, they actually complement each other nicely. Indeed, as we have seen, the channel-based model is well suited to sounds containing a diffused field, since such sounds surround the listener and thus originate from several positions. However, this model does not allow the convenient binaural localization of sound sources evolving in space. Those types of sounds are best handled by the object-based model, which allows the individual sound sources' positions to be easily associated with the corresponding HRTFs. The creation of a hybrid algorithm including the respective strengths of both presented models would be promising and hold much potential for the field of binaural conversion of 3D content. As a side note, another conclusion can be reached: in order to obtain the best possible output quality, one necessarily has to conceive a real-time algorithm. Indeed, as we have seen, good localization possibilities rely on the listener's ability to rotate his/her head in order for his/her brain to improve the auditory objects' localization. Thanks to several different available face-tracking systems (webcams, optical, electromagnetic, etc.), the listener is able to rotate his head while the system

interprets those rotations and maintains the soundstage up front. However, for best results the latency should be minimal and a system of HRTF interpolation should be set up in order to improve the resolution of auditory objects' movements.

6 AFTERWORD
This work served the objective of introducing three-dimensional hearing from three different perspectives, namely the body's, the brain's and the machine's. Although it is obvious that entire books could be written about every little section of this paper, my intention was to provide my readers with a glimpse into several very different concepts that are part of fields of study that usually do not blend with each other. However, it is my firm belief that that is where the future resides: nowadays the complexity of knowledge can be such that researchers become extremely specialized within a single area of their field, therefore sometimes losing the general view. I believe that getting interested in several fields of study is a good way to keep inspired and to keep this notion of ensemble, because in the end everything is interrelated. How fascinating! Let us review what we have gone through. Starting off in Chapter 1 with some physics fundamentals required for the proper understanding of the second chapter, we notably went over the defining concepts of a sound wave and its propagation in air before moving on to discussing the impedance of a medium. The Fourier analysis was then briefly introduced before a word on the mathematical definitions of linearity. Chapter 2 started directing us more toward the subject of this paper. Indeed, the Body section discussed the outer ear, including its composition, roles and a little focus on the important role of the pinna. We then discussed the middle ear and its ossicles, paramount in the impedance transformation process, which is a concept that was introduced in the first chapter. We also went over the action of the muscles of the middle ear that allow the intensity of the sound wave to be reduced as it enters the oval window. Then, the inner ear and the cochlea were introduced. We went over the inner ear's composition while focusing on the cochlea's important role. We saw its anatomy and mechanics and discussed its organ of Corti, containing the hair cells and stereocilia responsible for the electrochemical translation of mechanical phenomena. Still within the cochlea, we saw its physiological functioning with a view on von Békésy's work. The last section of this second chapter discussed the fluids contained in the cochlea's scalae, the perilymph and endolymph, whose very special composition pointed at a very special role. Chapter 3 started where Chapter 2 left us: with electrochemical impulses. Welcome to the brain's domain! We learnt about the different pathways it uses to preserve the precious information that needs to be refined through several different stages before finally being summarized into a so-called auditory object. We hear! A short description of the ascending pathways is provided on pp. 17 and 18 with a very useful figure that helps to visualize our introductory voyage through the central auditory nervous system. A second section within the third chapter allowed us to gain some level of insight into the way we, humans, localize sound sources on the horizontal plane (mainly relying on ITDs and ILDs) and on the vertical plane (mainly relying on spectral cues shaped by the pinna). We also briefly discussed the cues involved in our ability to estimate the distance from a sound source. Chapter 4 initiates a shift in perspective on the three-dimensional hearing topic, and aims at introducing several concepts whose understanding will show to be useful in the discussions provided in Chapter 5. We introduced the concept of the Machine,

which is an attempt to present and compare the implementation strategies of two algorithms (channel- and object-based) for the offline binaural conversion of 3D audio content. We then went over a historical introduction on why binaural conversion is a solution to a situation we are facing in today's world, namely the fact that multichannel sound solutions seem to be meant to remain in the movie theaters, as the cost and space required for consumers to acquire such systems are prohibitive in many cases. We subsequently continued on Chapter 4's main purpose and went over the concept of HRTFs, compared channels and objects, then introduced convolution and digital filters. Chapter 5 brings us to the heart of the matter, with three propositions that constitute the algorithms' common design goals, relying on recent research in psychoacoustics. Section 5.2 overviews the channel-based model, while section 5.3 handles the object-based one, before concluding in section 5.4 on which is best for what application. Chapter 6 is the current chapter. Chapter 7 offers two appendices. The first one is a bonus: it is the MATLAB script for the channel-based beds presented earlier, which I programmed with the help of Thomas Pairon. This is meant to show what the Saving Private Ryan 5.1 audio files (burned on the attached CD) went through. Chapter 8 presents the references I used to write this paper. You will find references to the books, the websites and the cited works. Thank you for your attention!


7 APPENDIX
7.1. APPENDIX A: CHANNEL-BASED MATLAB SCRIPT
clear all; close all;

% OPEN HRTF FILES: read 256-sample HRIRs stored as 16-bit big-endian integers
LHRTFLeft = fopen('L0e030a.dat','r','ieee-be');   % L CHANNEL
LDataLeft = fread(LHRTFLeft,256,'short');
fclose(LHRTFLeft);
LHRTFRight = fopen('L0e210a.dat','r','ieee-be');
LDataRight = fread(LHRTFRight,256,'short');
fclose(LHRTFRight);

CHRTFLeft = fopen('L0e000a.dat','r','ieee-be');   % C CHANNEL
CDataLeft = fread(CHRTFLeft,256,'short');
fclose(CHRTFLeft);
CHRTFRight = fopen('L0e180a.dat','r','ieee-be');
CDataRight = fread(CHRTFRight,256,'short');
fclose(CHRTFRight);

RHRTFLeft = fopen('L0e330a.dat','r','ieee-be');   % R CHANNEL
RDataLeft = fread(RHRTFLeft,256,'short');
fclose(RHRTFLeft);
RHRTFRight = fopen('L0e150a.dat','r','ieee-be');
RDataRight = fread(RHRTFRight,256,'short');
fclose(RHRTFRight);

LsHRTFLeft = fopen('L0e250a.dat','r','ieee-be');  % Ls CHANNEL
LsDataLeft = fread(LsHRTFLeft,256,'short');
fclose(LsHRTFLeft);
LsHRTFRight = fopen('L0e070a.dat','r','ieee-be');
LsDataRight = fread(LsHRTFRight,256,'short');
fclose(LsHRTFRight);

RsHRTFLeft = fopen('L0e110a.dat','r','ieee-be');  % Rs CHANNEL
RsDataLeft = fread(RsHRTFLeft,256,'short');
fclose(RsHRTFLeft);
RsHRTFRight = fopen('L0e290a.dat','r','ieee-be');
RsDataRight = fread(RsHRTFRight,256,'short');
fclose(RsHRTFRight);

% .WAV FILES IMPORT (the same mono stem feeds both ear paths)
LLeft = wavread('SPR-L.wav');
LRight = wavread('SPR-L.wav');


CLeft = wavread('SPR-C.wav');
CRight = wavread('SPR-C.wav');
RLeft = wavread('SPR-R.wav');
RRight = wavread('SPR-R.wav');
LsLeft = wavread('SPR-Ls.wav');
LsRight = wavread('SPR-Ls.wav');
RsLeft = wavread('SPR-Rs.wav');
RsRight = wavread('SPR-Rs.wav');
LFE = wavread('SPR-LFE.wav');

% CONVOLUTIONS: each channel signal is convolved with the HRIR samples read above
LConvHRTFLeft = conv(LLeft,LDataLeft);     % L CHANNEL
LConvHRTFRight = conv(LRight,LDataRight);
CConvHRTFLeft = conv(CLeft,CDataLeft);     % C CHANNEL
CConvHRTFRight = conv(CRight,CDataRight);
RConvHRTFLeft = conv(RLeft,RDataLeft);     % R CHANNEL
RConvHRTFRight = conv(RRight,RDataRight);
LsConvHRTFLeft = conv(LsLeft,LsDataLeft);  % Ls CHANNEL
LsConvHRTFRight = conv(LsRight,LsDataRight);
RsConvHRTFLeft = conv(RsLeft,RsDataLeft);  % Rs CHANNEL
RsConvHRTFRight = conv(RsRight,RsDataRight);

% PADDING: zero-pad every convolved channel to the length of the longest one
TotalLength = [length(LConvHRTFLeft) length(LConvHRTFRight) ...
    length(CConvHRTFLeft) length(CConvHRTFRight) ...
    length(RConvHRTFLeft) length(RConvHRTFRight) ...
    length(LsConvHRTFLeft) length(LsConvHRTFRight) ...
    length(RsConvHRTFLeft) length(RsConvHRTFRight) ...
    length(LFE)];

L = zeros(max(TotalLength),2);
for i = 1:length(LConvHRTFLeft)
    L(i,1) = LConvHRTFLeft(i);
    L(i,2) = LConvHRTFRight(i);
end

C = zeros(max(TotalLength),2);
for i = 1:length(CConvHRTFLeft)
    C(i,1) = CConvHRTFLeft(i);
    C(i,2) = CConvHRTFRight(i);
end


R = zeros(max(TotalLength),2);
for i = 1:length(RConvHRTFLeft)
    R(i,1) = RConvHRTFLeft(i);
    R(i,2) = RConvHRTFRight(i);
end

Ls = zeros(max(TotalLength),2);
for i = 1:length(LsConvHRTFLeft)
    Ls(i,1) = LsConvHRTFLeft(i);
    Ls(i,2) = LsConvHRTFRight(i);
end

Rs = zeros(max(TotalLength),2);
for i = 1:length(RsConvHRTFLeft)
    Rs(i,1) = RsConvHRTFLeft(i);
    Rs(i,2) = RsConvHRTFRight(i);
end

% LFE: zero-pad the unconvolved LFE channel to the same length
LFEPadded = zeros(max(TotalLength),1);
for i = 1:length(LFE)
    LFEPadded(i) = LFE(i);
end

% ASSEMBLY INTO BUCKETS: sum the binaural contributions of all channels
% (the LFE is added to the left bucket)
Bucket = [L(:,1) + C(:,1) + R(:,1) + Ls(:,1) + Rs(:,1) + LFEPadded ...
    L(:,2) + C(:,2) + R(:,2) + Ls(:,2) + Rs(:,2)];

% BUCKET NORMALISATION: divide by the global peak, then by 2 for headroom
BucketNorma = Bucket/max(abs(Bucket(:)))/2;

% EXPORT
wavwrite(BucketNorma,44100,'SPR-5_1_matlab.wav');
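A brief note on the design choice above: time-domain conv is perfectly adequate for 256-tap HRIRs, but for long film stems an FFT-based overlap-add convolution is generally much faster. The fragment below is only a sketch of that alternative; it assumes the Signal Processing Toolbox is available and reuses the variable names defined in the script (LLeft, LDataLeft), with an arbitrary output name.

% Hypothetical alternative to conv(): overlap-add FFT filtering with fftfilt
% (Signal Processing Toolbox). Note that fftfilt returns an output the same
% length as the input, whereas conv also appends the tail of the filter.
LConvHRTFLeftFast = fftfilt(LDataLeft, LLeft);   % filter the L stem with its 256-tap HRIR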


8 REFERENCES
8.1. BIBLIOGRAPHY
Pickles, James O. An Introduction to the Physiology of Hearing. 3rd ed. Leiden: Brill, 2013. Print.

Botte, M.-C. Psychoacoustique et perception auditive. N.p.: INSERM/SFA/CNET, n.d. Print. Série Audition.

Ward, Jamie. The Student's Guide to Cognitive Neuroscience. Hove [England]: Psychology, 2006. Print.

Popper, Arthur N., and Richard R. Fay. Sound Source Localization. New York: Springer, 2005. Print.

Smith, Steven W. The Scientist and Engineer's Guide to Digital Signal Processing. San Diego, CA: California Technical Pub., 1997. Print.

Wang, DeLiang, and Guy J. Brown. "Chapter 5: Binaural Sound Localization." Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken, NJ: Wiley-Interscience, 2006. Print.

Blauert, J. Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge, MA: MIT Press, 1997. Print.

Von Békésy, G. Experiments in Hearing. Trans. and ed. E. G. Wever. New York: McGraw-Hill, 1960. Print.

8.2. WEBSITES
Tewfik, Ted L., M.D. "Auditory System Anatomy." Medscape, n.d. Web. Aug. 2013.

8.3. CITED WORKS


Mershon, D. H., & King, L. E. (1975). Intensity and reverberation as factors in the auditory perception of egocentric distance. Perception & Psychophysics, 18(6), 409-415. doi: 10.3758/BF03204113

Blumlein, A. D. U.K. Patent 394,325, 1931. Reprinted in Stereophonic Techniques, Audio Eng. Soc., New York, 1986.

Kulkarni, A., et al. "On the Minimum-Phase Approximation of Head-Related Transfer Functions." Proc. 1995 IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE catalog no. 95TH8144).

ARI HRTF Database, detailed description, retrieved on 07/17/13 from http://www.kfs.oeaw.ac.at

Beckius GE, Batra R & Oliver DL (1999). Axons from anteroventral cochlear nucleus that terminate in medial superior olive of cat: observations related to delay lines. J Neurosci 19, 3146-3161.

Butler, R. A., & Humanski, R. A. (1992). Localization of sound in the vertical plane with and without high-frequency spectral cues. Perception & Psychophysics, 51(2), 182-186.

Cheng, C. I., & Wakefield, G. H. (2001). Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space. J Audio Eng Soc, 49(4), 231-249.

Goupell, M. G., Majdak, P., & Laback, B. (2009). Median-plane sound localization as a function of the number of spectral channels using a channel vocoder. J. Acoust. Soc. Am., 127(2), 990-1001. doi: 10.1121/1.3283014

Grothe, B., Pecka, M., & McAlpine, D. (2010). Mechanisms of sound localization in mammals. Physiol Rev, 90, 983-1012. doi: 10.1152/physrev.00026.2009

Jeffress, L. (1948). A place theory of sound localization. Journal of Comparative and Physiological Psychology, 41(1), 35-39.

Johnstone, B., & Sellick, P. (1972). The peripheral auditory apparatus. Quarterly Reviews of Biophysics, 5(01), 1-57.

L. Rayleigh, "On Our Perception of Sound Direction," Philosoph. Mag., vol. 13, 1907.

Langner, B., & Black, A. (2005). Using Speech In Noise to Improve Understandability for Elderly Listeners. ASRU 2005, San Juan, Puerto Rico.

Liberman MC, Dodds LW, Pierce S (1990). Afferent and efferent innervation of the cat cochlea: quantitative analysis with light and electron microscopy. J Comp Neurol 301:443-460.

Little, A., Mershon, D., & Cox, P. (1992). Spectral content as a cue to perceived auditory distance. Perception, 21(3), 405-416.

McAlpine, D. (2005). Creating a sense of auditory space. J. Physiol., 556(1), 21-28.

Naguib, M., & Wiley, R. H. (2001). Estimating the distance to a source of sound: mechanisms and adaptations for long-range communication. Animal Behaviour, 62, 825-837. doi: 10.1006/anbe.2001.1860

Nam, J., Kolar, M. A., & Abel, J. S. (2008). On the minimum-phase nature of head-related transfer functions. Audio Engineering Society 125th Convention paper.

P. M. Hofman and A. J. Van Opstal, "Spectro-Temporal Factors in Two-dimensional Human Sound Localization," Journal of the Acoustical Society of America, vol. 103, 2634-2648 (1998).

Pecka M, Zahn TP, Saunier-Rebori B, Siveke I, Felmy F, Wiegrebe L, Klug A, Pollak GD, Grothe B (2007). Inhibiting the inhibition: a neuronal network for sound localization in reverberant environments. J Neurosci 27:1782-1790.


Searle. (1975). The contribution of two ears to the perception of vertical angle in sagittal planes. J. Acoust. Soc. Am., 109(1596), 8 pages.

Stevens, S., & Newman, E. (1936). The localization of actual sources of sound. The American Journal of Psychology, 48(2), 297-306.

T. R. Letowski and S. T. Letowski, Auditory Spatial Perception: Auditory Localization, Army Research Laboratory (ARL), April 2012.

Wallach, H. (1940). The role of head movement and vestibular and visual cues in sound localization. Journal of Experimental Psychology, 27(4), 339-368.

Wever, E. G., & Vernon, G. A. (1955). The threshold sensitivity of the tympanic muscle reflexes. Arch. Otolaryngol., 62, 204-213.

Wiener, F. M., & Ross, D. A. (1946). The pressure distribution in the auditory canal in a progressive sound field. J. Acoust. Soc. Am., 18(2), 401-408.

Wightman, F. L., & Kistler, D. J. (1997). Monaural sound localization revisited. Journal of the Acoustical Society of America, 101, 1050-1063.

Wightman, F. L., & Kistler, D. J. (1999). Resolution of front-back ambiguity in spatial hearing by listener and source movement. J. Acoust. Soc. Am., 105(5), 2841-2853.

Yan-Chen, L., & Cooke, M. (2010). Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources. IEEE Transactions on Audio, Speech, and Language Processing, 18(7), 1793-1805. doi: 10.1109/TASL.2010.2050687

Zahorik, P. (2001). Estimating sound source distance with and without vision. Optometry and Vision Science, 78(5), 270-275. doi: 1040-5488/01/7805-0270/0

