
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 5, JULY 2010

Prosody-Preserving Voice Transformation to Evaluate Brain Representations of Speech Sounds
Purvis Bedenbaugh, Member, IEEE, Diana K. Sarko, Heidi L. Roth, and Eugene M. Martin

Abstract: This study employs a voice transformation to overcome the limitations of brain mapping in the study of brain representations of natural sounds such as speech. Brain mapping studies
of natural sound representations, which present a fixed sound to
many neurons with different acoustic frequency selectivity, are difficult to interpret because individual neurons exhibit considerable
unexplained variability in the dynamical aspects of their evoked
responses. This new approach samples how a single recording responds to an ensemble of sounds, instead of sampling an ensemble
of neuronal recordings. A noise-excited filter-bank analysis and
resynthesis vocoder systematically shifts the frequency band occupied by sounds in the ensemble. The quality of the voice transformation is assessed by evaluating the number of bands the filter
bank must have to support emotional prosody identification. Perceptual data show that emotional prosody can be recognized within
normal limits if the bandwidth of filter-bank channels is less than
or equal to the bandwidth of perceptual auditory filters. Example
physiological data show that stationary linear transfer functions
cannot fully explain the responses of central auditory neurons to
speech sounds, and that deviations from model predictions are not
random. They may be related to acoustic or articulatory features
of speech.
Index Terms: Auditory system, bioelectric potentials, identification, nervous system, speech analysis, speech coding, speech intelligibility, speech processing.

I. INTRODUCTION
THIS report describes a voice transformation approach to
the study of how central auditory neurons respond to, represent, and encode speech sounds and other natural sounds. Understanding the central auditory representation of speech is of
increasing practical importance. First, enhancing central auditory representations is a valuable objective for auditory prosthesis. One role of the central auditory system is to transform a
representation of mainly acoustic features at the cochlea into
a representation that is compatible with brain symbolic processing systems [1]. Prostheses which enhance the interface to

Manuscript received May 17, 2009; revised September 30, 2009. Current version published June 16, 2010. The associate editor coordinating the review of
this manuscript and approving it for publication was Dr. Chun-Hsien Wu.
P. Bedenbaugh is with the Department of Engineering, East Carolina University, Greenville, NC 27858 USA (e-mail:bedenbaughp@ecu.edu).
D. K. Sarko is with the Department of Biology, Vanderbilt University,
Nashville, TN 37240 USA.
H. L. Roth is with the Department of Neurology, University of North Carolina, Chapel Hill, NC 27599 USA.
E. M. Martin is with the Laboratory of Neurobiology and Behavior, Rockefeller University, New York, NY 10065 USA.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2009.2035165

symbolic processing systems aid an important contributor to


successful speech communication. Second, new auditory implants are being developed that directly target stations more central than the cochlea [2]. Optimizing such devices requires improved models of central auditory processing.
Brain mapping has been employed to investigate how natural sounds are represented or encoded by a population of neurons [3]. In such experiments, neuronal recordings are obtained
while a sound is played. The recording electrode is then moved
to a new location within the brain, and new neuronal recordings are obtained while the same sound is played again. An
organizing principle is required to aid interpretation of an ensemble of neuronal recordings so obtained. The best organizing
principle available is the cochleotopic frequency tuning of auditory neurons [4]. This principle can be used to construct neurograms, by arranging collected responses, with time after stimulus onset on the horizontal axis, and the characteristic frequency of the recording on the vertical axis. The magnitude
of responses in the neighborhood of each time-frequency coordinate is color-coded, as in a spectrogram. The cochleotopic
organizing principle, however, does not account for variability
in neuronal temporal dynamics, processes related to short-term
and long-term neuronal plasticity, or the variability in the magnitude of neuronal responses. Different central auditory neurons, even those with the same frequency tuning, can vary considerably in how they respond to a sound. Their responses vary
in magnitude, in timing, and even in the variability of their responses to repeated presentations of the same sound. A neurogram therefore only elucidates features of the central representation that are common to most of the recordings, and average
dynamics.
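As an illustration of the neurogram construction just described, the sketch below arranges binned single-site responses into a frequency-ordered image. It is a minimal sketch with illustrative function and variable names, not code from the study.

```python
import numpy as np

def build_neurogram(psths, characteristic_freqs):
    """Arrange single-site responses into a neurogram.

    psths: list of 1-D arrays, one peri-stimulus time histogram per
        recording site (all on the same post-onset time grid).
    characteristic_freqs: characteristic frequency (Hz) of each site.

    Returns (neurogram, sorted_freqs): a 2-D array with time on the
    horizontal axis and characteristic frequency on the vertical axis,
    ready to be color-coded like a spectrogram.
    """
    order = np.argsort(characteristic_freqs)          # cochleotopic ordering
    neurogram = np.vstack([psths[i] for i in order])  # rows: low CF to high CF
    return neurogram, np.asarray(characteristic_freqs)[order]

# Illustrative use with synthetic responses from three sites.
t = np.arange(0, 0.5, 0.001)                          # 500 ms after stimulus onset
fake_psths = [np.exp(-((t - d) / 0.02) ** 2) for d in (0.02, 0.05, 0.08)]
ngram, cfs = build_neurogram(fake_psths, [8000.0, 1000.0, 4000.0])
print(ngram.shape, cfs)                               # (3, 500) [1000. 4000. 8000.]
```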
In this report, a voice transformation technique is introduced to invert the logical organization of the brain mapping
experiment, circumventing its limitations. In brain mapping,
a fixed sound is presented while recordings are obtained from
an ensemble of sites in the brain with different frequency
tuning. Instead, we employ a voice transformation to generate
an ensemble of sounds, systematically shifted in frequency.
These sounds are presented in a random sequence while a corresponding ensemble of recordings is obtained from each sampled
site within the brain. Sounds are transformed by a filter-bank
analysis and resynthesis technique, with a twist. The filter-bank
analyzes sounds to obtain an ensemble of temporal envelopes,
and those envelopes are used to modulate an ensemble of
band-limited carriers, but the correspondence between the
frequency band from which an envelope is obtained, and the
frequency band of the carrier it modulates, is systematically
shifted, as described in Fig. 1. Filter-bank bandwidths are
chosen such that each shifted sound maps to approximately the same expanse along the basilar membrane in the cochlea.

Fig. 1. Voice transformation architecture. A filter bank derives an ensemble of band-limited signals from a speech sound, and uses a Hilbert transform approach to obtain a temporal envelope for each. The filter-bank also derives an ensemble of band-matched carriers from random noise. The envelopes modulate the carriers, and a transformed signal is synthesized by summing the resulting signals. When the frequency band from which the envelope is derived matches the band from which the carrier is derived, the synthesized sound is an approximation to the original sound, with acoustic frequency resolution determined by the filter bank (dark arrow). When the frequency band from which the envelope is derived is shifted from the band from which the carrier is derived, the synthesized sound is a frequency-shifted rendition of the original sound (gray arrow).


The resulting ensemble of recordings from a single neuron corresponds to how recordings from an ensemble of neurons, differing only in their frequency tuning, and having exactly the same temporal dynamics, would have responded to the original sound, if such an ensemble could be obtained. This voice
transformation permits study of how temporal dynamics, and
short-term and sequence-related plasticity, contribute to the
central auditory processing of speech and other natural sounds.
It can enable such study with data sets that are necessarily
limited in numbers of recording sites and duration of recording
time, such as recordings obtained from human patients during
surgery. The full dynamic range and time course of the auditory
response to the stimulus ensemble can be observed for each
recording.
An additional advantage of this voice transformation approach is that it facilitates evaluation of the transfer function
from the acoustic stimulus to the neuronal recording. This report provides example estimates of stationary transfer functions
(specifically spectro-temporal response fields (STRF)) from
stimulus-response pairings. Because each neuron responds to
the stimulus differently, it is difficult to interpret how such
estimates obtained from brain mapping experiments relate to
acoustic features, or to features of the preceding portion of the
neuronal response itself. The approach developed here allows a
more direct observation of how errors in predictions generated
using the transfer function relate to such features. Data sets
such as those presented here could facilitate the development
and testing of nonstationary and sequence-dependent signal

processing models for the auditory system, which in turn


could facilitate improvements in signal processing for auditory
prostheses.
This approach depends upon the quality of the voice transformation. In order to ensure that the voice transformation was
of sufficient quality, we tested whether processed speech could
support both sentence recognition and emotional prosody recognition. Speech can be recognized with primarily temporal cues,
but emotional prosody recognition also depends upon additional
acoustic features, such as the temporal contour of the fundamental frequency, rate of frequency modulation of the fundamental frequency or formant frequencies, and temporal dilation
of the utterance. This approach to validating the voice transformation depends on the assertions that cues employed in speech
and emotional prosody recognition are among the most ethologically relevant auditory cues, and conversely, if emotional
prosody and speech can be correctly identified from an utterance, then the utterance must contain the most ethologically relevant acoustic cues.
Sentences from a standardized emotional prosody test [5]
were transformed with different filter-bank frequency resolutions. Normal listeners were asked, in separate trials, to identify
the sentence, and to name the emotional prosody in the sentence.
Speech can be recognized by normal listeners with primarily
temporal cues, and greatly reduced acoustic frequency resolution [6]. This may explain why many cochlear implant users,
who receive stimulation at a small number of sites along the
cochlea, recognize speech with high accuracy, often sufficient
to use the telephone. In simulations of hearing with cochlear implants, normal listeners have difficulty recognizing the intonation in modified sentences [7], and in recognizing spectral cues
thought to be critical to emotional prosody identification [8], [9].
The single best cue to vocal affect has long been recognized to
be the fundamental frequency ($F_0$) [10], with its level, range, and variability providing important contributions to prosody perception [11]. More recent work suggests that other cues, such as cues
related to stress pattern and syllabification [12], must also play
an important role. Just as multiple, parallel cues enable very low
audio fidelity speech to be recognized, multiple parallel cues
also contribute to emotional prosody identification. The evaluation of the voice transformation measures the acoustic spectral resolution required to identify the emotions in speech. The
frequency resolution required to identify emotional prosody in
speech serves to identify a lower bound in the number of bands
in the vocoder in order to support the objectives of our physiological studies. Since the vocoder randomizes the carrier phase,
it also examines whether relative phase and FM (fine-structure) are necessary for emotional prosody identification.
The questions of which spectral and temporal cues are required to identify emotional prosody, in and of themselves, are
relevant to the everyday experience of hearing aid users.
Such listeners experience degraded spectral and temporal cues
[13], and subjective outcomes may be improved if emotional
prosody perception is taken into account when hearing aids are
fitted. Such considerations led to the inclusion of an item related to judging someone's mood from their voice in a recent study of hearing aid outcomes [14]. A preliminary report found
that hearing aid users actually improved their emotional prosody

recognition by removing their hearing aids, an effect which may


be related to distortion of temporal cues by the automatic gain
control [15].
In summary, this paper evaluates a voice transformation
approach to the study of how the central auditory system processes natural sounds. The approach inverts the logic of the
traditional brain mapping experiment. The inversion facilitates
evaluation of how populations of neurons with the same temporal dynamics, but different frequency tuning, would respond
to a sound, if such populations could be directly observed.
Psychophysical evaluation of the voice transformation extends
direct measurements of emotional prosody perception beyond
the range of frequency resolutions experienced by cochlear
implant users, to the perceptual frequency resolution of normal
listeners. Other studies have employed high-frequency resolution vocoder techniques to assess cues related to emotional
prosody identification, rather than assessing emotional prosody
identification directly. The psychophysical evaluation methods
eliminated temporal fine-structure, thereby allowing evaluation
of whether fine-structure is essential for emotional prosody
identification.
II. METHODS
A. Voice Transformation
1) Audio Processing for Psychophysical Testing: The voice
transformation employed a noise-excited vocoder, a type of
filter-bank analysis and resynthesis processor. For psychophysical testing in humans, sounds were analyzed by a filter-bank
tiling the frequency band from 100 to 8100 Hz. Analysis
filter-bank bandwidths were chosen such that each band occupied the same proportion of cochlear distance. Filter-banks
were implemented in the frequency domain. The original
sentences were sampled at 44100 samples/s, and padded with zeros so that the Fourier-transformed signal is over-sampled in the frequency domain.
Single-band signals were derived by multiplying the frequency
domain signal by the magnitude of the transfer function, and
its mirror image at frequencies above the Nyquist frequency.
For each band-limited signal, the Hilbert transform was applied
in the time-domain to derive an analytic signal. The envelope
was estimated as the absolute value of the analytic signal. A
random-phase carrier was constructed in the frequency domain
by multiplying the original signal by $e^{j2\pi r}$, where $r$ is a random number between 0 and 1, and $j = \sqrt{-1}$. Each envelope and
carrier were multiplied together in the time domain, and the
energy was normalized to match the energy in the original
signal. The single-band signals were summed to generate the
vocoded signal. The amplitude of all vocoded sentences was
normalized to the same average power.
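The band-wise processing described above can be sketched as follows. This is a simplified stand-in, assuming rectangular band shapes and log-spaced band edges rather than the Gaussian-roll-off filter-bank used in the study; it is intended only to make the envelope and random-phase carrier construction concrete.

```python
import numpy as np
from scipy.signal import hilbert

def vocode_band(x, fs, f_lo, f_hi, rng):
    """One analysis/resynthesis band of a noise-excited vocoder.

    x is the (zero-padded) speech signal, fs the sample rate, and
    (f_lo, f_hi) the band edges in Hz. Returns this band's contribution
    to the vocoded signal. A rectangular band shape is assumed here.
    """
    n = len(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    band = ((freqs >= f_lo) & (freqs < f_hi)).astype(float)

    # Band-limited signal and its Hilbert envelope.
    x_band = np.fft.irfft(np.fft.rfft(x) * band, n)
    envelope = np.abs(hilbert(x_band))

    # Random-phase (noise) carrier occupying the same band.
    phase = np.exp(2j * np.pi * rng.random(len(freqs)))
    carrier = np.fft.irfft(np.fft.rfft(x) * band * phase, n)

    y = envelope * carrier
    # Normalize the synthesized band to the energy of the analysis band.
    y *= np.sqrt(np.sum(x_band ** 2) / (np.sum(y ** 2) + 1e-12))
    return y

# Example: a 4-band vocoder over 100-8100 Hz (log-spaced edges assumed).
rng = np.random.default_rng(0)
fs = 44100
x = rng.standard_normal(fs)                      # stand-in for a sentence waveform
edges = np.geomspace(100, 8100, 5)
vocoded = sum(vocode_band(x, fs, lo, hi, rng)
              for lo, hi in zip(edges[:-1], edges[1:]))
```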
The temporal envelopes were not low-pass filtered, so that envelopes derived from wider band filters had a wider modulation
bandwidth compared to envelopes derived from narrower filters.
Because of this effect, sounds derived from filter-banks with
very many bands seem temporally distorted. This was observed
in an alternative sound set, not employed in the experiments described in this paper. That sound set employed a filter-bank in
which each filter band had an equal bandwidth in Hz. Temporal
distortion was noted anecdotally as filter bandwidth decreased below 15 Hz, but its effect on emotional prosody identification, speaker identification, or sentence identification was not formally assessed. The bandwidth of the temporal envelopes influences the availability and specific form of temporal cues such as stress pattern and the pitch contour. Because the sounds employed in this study were derived by employing a filter-bank with filter bandwidths that are constant in the cochlear distance domain, rather than constant in the linear frequency domain, the available temporal cues vary with filter-band frequency.

Fig. 2. Filter-banks. The bandwidth of the filters is proportional to the equivalent rectangular bandwidth (ERB) of auditory perceptual filters. The staircase curves show filter-bank bandwidth. Shaded regions demarcate powers of two times the ERB curve, beginning with the one-ERB curve at the lower edge of region i. (i) 1-2 ERB. (ii) 2-4 ERB. (iii) 4-8 ERB. (iv) 8-16 ERB. (v) 16-32 ERB.
Filter-bank bandwidths are plotted in Fig. 2. In both panels
of this figure, frequency is plotted along the horizontal axis, and
the bandwidth of the filter-band containing that frequency is
plotted on the vertical axis. Shaded regions show multiples of
the approximate bandwidth of perceptual processing channels.
For example, shaded region i lies between one and two times
the Glasberg [16] estimate of the equivalent rectangular bandwidth (ERB) of auditory perceptual channels. Filter bands
plotted within region i have approximately the frequency resolution of the auditory system. Regions ii, iii, iv, and v correspond
to approximately 1/2, 1/4, 1/8, and 1/16 the perceptual resolution of the auditory system, respectively. Filter bands plotted
above region i have less than perceptual frequency resolution,
while filter bands plotted below region i have better than perceptual frequency resolution.
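For reference, the ERB values used in this comparison can be evaluated directly with the standard Glasberg and Moore formula, which is assumed in the sketch below; the example center frequency and bandwidth are illustrative, not the paper's bands.

```python
import numpy as np

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth of the auditory filter at f_hz,
    using the Glasberg and Moore formula assumed here (result in Hz)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

# How many ERBs wide is a filter band centered at 1 kHz with a
# one-third-octave bandwidth? (Illustrative numbers only.)
fc = 1000.0
bw = fc * (2 ** (1 / 6) - 2 ** (-1 / 6))   # one-third-octave bandwidth in Hz
print(bw / erb_hz(fc))                      # roughly 1.7 ERB
```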
The filter-banks were implemented so that the transfer-function roll-off at the edges of the pass-bands was constant across different frequency resolutions. To achieve this, the filter-bank for
a particular resolution was derived by combining adjacent bands
of an underlying, high-resolution filter-bank with 128 bands, approximately 20 bands per octave. For example, the magnitude of the transfer function of each filter in a 32-band filter-bank is the sum of the magnitudes of four adjacent filters in the underlying filter-bank. The underlying, high-resolution bands had
a Gaussian roll-off about the center frequency, and were separated by one standard deviation, peak-to-peak, in the cochlear
distance domain. Cochlear distance was computed according to
an adaptation of Greenwood's cochlear frequency map [17]. For different spectral resolutions, the filter-bank was in turn composed of 1, 2, 4, 8, 16, 32, or 64 frequency bands. This corresponds to filter-bank densities of approximately 0.2, 0.3, 0.6, 1.3, 2.5, 5.1, and 10.1 filters per octave. In the single-band condition, a single temporal envelope modulated a single noise carrier waveform, and the spectrum of the carrier matched the spectrum of the original sentence. The filter-bank with 32 bands had a frequency resolution closely matched to the equivalent rectangular bandwidth (ERB) of auditory perceptual channels [16], as shown in Fig. 2. Example spectrograms of the phrase "The boy went to the store" spoken with happy prosody and synthesized with filter-banks are shown in Fig. 4. Fourier analysis parameters are the same as in Fig. 3.

Fig. 3. Unmodified sentences. Spectrograms of the sentence "The boy went to the store" pronounced by a female neuropsychologist with each of five emotional intonations. A) Happy. B) Sad. C) Angry. D) Fearful. E) Neutral.
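A minimal sketch of band edges spaced equally in cochlear distance, as in the filter-bank construction just described. The standard human Greenwood parameters are assumed here rather than the adaptation used in the paper.

```python
import numpy as np

A, K, ALPHA = 165.4, 0.88, 2.1   # standard human Greenwood parameters (assumed)

def place_from_freq(f_hz):
    """Fractional cochlear place (0 = apex, 1 = base) for frequency f_hz."""
    return np.log10(f_hz / A + K) / ALPHA

def freq_from_place(x):
    """Greenwood map: frequency (Hz) at fractional cochlear place x."""
    return A * (10.0 ** (ALPHA * x) - K)

def band_edges(f_lo, f_hi, n_bands):
    """Band edges equally spaced in cochlear distance between f_lo and f_hi."""
    x = np.linspace(place_from_freq(f_lo), place_from_freq(f_hi), n_bands + 1)
    return freq_from_place(x)

print(np.round(band_edges(100.0, 8100.0, 4)))   # edges of a 4-band filter-bank
```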
2) Audio Processing for Physiological Testing: For physiological testing in rats, the same vocoder architecture was
adapted to generate an ensemble of frequency-shifted sounds.
Sounds were up-sampled to 97560 samples per second. The
frequency range from 0.1 to 44 kHz was divided into 256
frequency bands, approximately 29 bands per octave, much
greater than required to achieve normal emotional prosody
recognition. A total of 256 stimuli were generated by shifting
the correspondence between the frequency band from which
the envelope was obtained, and the frequency band of the carrier, as diagrammed in Fig. 1.

Fig. 4. Example of filter-bank processing. Spectrograms of the sentence "The boy went to the store" pronounced by a female neuropsychologist with happy intonation. Sentences were processed by a noise-excited vocoder with filter-bank filter bandwidths set to a constant proportion of cochlear distance. Filter-bank resolution: A) 32 bands. B) 16 bands. C) 8 bands. D) 4 bands. E) 2 bands. F) 1 band. A 64-band stimulus was included in the perceptual evaluation, but is not illustrated.

The power in each synthesized, frequency-shifted band was scaled to match the power in the original analysis band. Example spectrograms of the vocalized word "erase", vocoded without shift and offset by 96 steps and 192 steps, are illustrated in Fig. 5. An offset of 257 steps from 0 corresponds to the unshifted vocoded sound.
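Given per-band envelopes and carriers of the kind produced by the vocoder sketched earlier, generating the shifted ensemble amounts to rotating the envelope-to-carrier correspondence. The sketch below makes that concrete under simplifying assumptions (toy array sizes; envelope power used as a proxy for analysis-band power).

```python
import numpy as np

def shifted_stimulus(envelopes, carriers, shift):
    """Synthesize one frequency-shifted stimulus from precomputed bands.

    envelopes, carriers: (n_bands, n_samples) arrays; row k holds the
    temporal envelope extracted from analysis band k and the random-phase
    carrier confined to band k, respectively.
    shift: number of bands by which the envelope-to-carrier correspondence
    is rotated; shift = 0 gives the unshifted vocoded sound.
    """
    rolled = np.roll(envelopes, shift, axis=0)   # envelope of band k drives carrier k + shift
    bands = rolled * carriers
    # Scale each synthesized band toward the power of the analysis band its
    # envelope came from (envelope power is used as a simple proxy here).
    target = np.sum(rolled ** 2, axis=1, keepdims=True)
    actual = np.sum(bands ** 2, axis=1, keepdims=True) + 1e-12
    bands *= np.sqrt(target / actual)
    return bands.sum(axis=0)

# An ensemble of stimuli, one per shift (toy sizes; the paper uses 256 bands).
rng = np.random.default_rng(1)
env = np.abs(rng.standard_normal((64, 4096)))
car = rng.standard_normal((64, 4096))
ensemble = [shifted_stimulus(env, car, s) for s in range(env.shape[0])]
```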
B. Stimulus Presentation
1) Psychophysical Testing: A total of 12 normal-hearing observers, ages 18-30, were tested (six females, six males). Sentences from the Florida Affect Battery [5] (test 8-A, name the
emotional prosody) were presented by a portable compact disc
(CD) player with Sony MDR-V600 circumaural headphones in


TABLE I
SENTENCES AND THEIR EMOTIONAL PROSODY. THE SENTENCE AND
EMOTIONAL INTONATIONS WHICH WERE PRESENTED IN EACH BLOCK OF
TRIALS ARE DESCRIBED. EACH LETTER INDICATES THAT THE STIMULUS SET
INCLUDED THE SENTENCE PRONOUNCED WITH THE DESIGNATED PROSODY. A
REPEATED LETTER MEANS THAT THE SENTENCE WAS REPEATED WITH THAT
PROSODY. CHANCE PERFORMANCE IS ESTIMATED AS THE 1=2 , WHERE THE
SHANNON ENTROPY I IS DEFINED AS I =
p(response) log(p(response)).
H, HAPPY. S, SAD. A, ANGRY, F, FEARFUL. N, NEUTRAL

TABLE II
FUNDAMENTAL FREQUENCY OF SENTENCES. MEAN AND
STANDARD DEVIATION OF $F_0$ AND COEFFICIENT OF VARIATION
OF POWER FOR EACH EMOTIONAL PROSODY

Fig. 5. Voice transformation examples. Three examples of filter-bank based


time-frequency representations of frequency-shifted sounds. The word "erase" pronounced by a male speaker is shifted by (A) zero bands, (B) 96 bands, and
(C) 192 bands.

a quiet study room. The sentences are effectively neutral statements, but are pronounced with one of five emotional intonations: happy, sad, angry, fearful, or neutral (Table I). This neuropsychological assessment has been used extensively to measure emotional cognition in patients with stroke or other brain
injury. Comparison data from normal controls are available for
young adult, middle aged, early old, and older adult age groups.
The sentences were pronounced by a female neuropsychologist
with both clinical and research experience in emotional cognition. Although they can be recognized as expressing the intended emotions, the relationship between emotional speech expressed by actors and that which is naturally produced in emotional situations is poorly understood [18], [19].
Representative spectrograms of the phrase The boy went to
the store, spoken with each of the five emotional intonations,
are shown in Fig. 3. To compute the spectrograms, sentences
were sampled at 44100 samples/s, and the DFT of 2205 Hanning windowed samples was computed every 551 samples. The
resulting images were scaled in dB, and plotted with a common
gray scale. Table II shows the mean and standard deviation of $F_0$ for the voiced portions of all sentences with each emotional prosody. The coefficient of variation (standard deviation/mean) of power is shown separately for voiced and unvoiced portions. $F_0$ and power statistics were calculated using the Entropic/ESPS function get_f0 [20]. The happy and fearful sentences have relatively high $F_0$, while $F_0$ is more variable for the
happy and sad prosody sentences. These observations suggest
that frequency cues might be particularly important for identifying happy prosody. Voiced power is relatively more variable
for happy, angry, and fearful sentences, while unvoiced power
was relatively more variable for happy and fearful sentences.
Unvoiced power was relatively less variable for sad sentences.
These observations suggest that less variable power could be a
cue for sad prosody.
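The spectrogram parameters specified above for Fig. 3 map directly onto standard tooling; a minimal sketch follows, with the dB floor and the stand-in waveform chosen arbitrarily.

```python
import numpy as np
from scipy.signal import spectrogram, get_window

fs = 44100
x = np.random.randn(2 * fs)          # stand-in for a sentence waveform

# DFT of 2205 Hanning-windowed samples every 551 samples, scaled in dB,
# matching the parameters given for Fig. 3.
f, t, S = spectrogram(x, fs=fs, window=get_window('hann', 2205),
                      nperseg=2205, noverlap=2205 - 551, mode='magnitude')
S_db = 20 * np.log10(S + 1e-12)
print(S_db.shape)                    # (frequency bins, time frames)
```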
While listening to the first track of the CD, listeners identified
the emotional prosody in the original sentences from the Florida
Affect Battery [5] (control task), in which frequency resolution
was not modified. For this track only, an additional group of
12 listeners (six females, six males) was tested, for a total of
24 listeners. Successive tracks presented the same sentences, in
random order, processed to one of the various frequency resolutions. While listening to half of the remaining tracks, listeners
identified which sentence was spoken, based upon the words
in the sentence, without regard for the intonation. In the other
tracks, they identified the emotion expressed by the speaker,
without regard for the words in the sentence. Listeners adjusted
the volume to a comfortable level while listening to the recorded
task instructions, and could listen to the instructions more than
once if necessary.


TABLE III
STIMULUS PRESENTATION ORDER. STIMULUS PRESENTATION ORDER
WAS RANDOMLY COUNTERBALANCED FOR TASK AND SPECTRAL
RESOLUTION. THE UNMODIFIED CONTROL TASK WAS ALWAYS PRESENTED
FIRST. "FEW BANDS" REFERS TO 1, 2, 4, AND 8 BANDS, AND "MANY BANDS" REFERS TO 16, 32, AND 64 BANDS

Each block included all 20 sentences. The sentences and


emotional intonations for each listening task are summarized
in Table I. Listeners heard the unprocessed (control) sentences
once, and were asked to identify the emotional intonation, from
among the five possibilities, without regard for the words of the
sentence. Listeners then heard each of the processed sentences
twice. In one presentation, they were asked to identify the
emotional prosody, while in the other presentation, they were
asked to identify the sentence. Listeners received no feedback
indicating that any of their choices were correct or incorrect.
In each experiment, listeners were randomly assigned to one
of four groups, with sound presentation order counterbalanced
for task (sentence-identification versus emotional prosody) and
spectral resolution (more bands versus fewer bands). In this
way, the potential influence of learning through experience with
the materials was minimized. The testing order for the groups
is shown in Table III.
2) Physiological Testing: The test sounds were derived
from sequences of four words from the TIDIGITS [21] speech
corpus, with one second of silence between words. One stimulus set was derived from a sequence of four words, comprised
of the word four pronounced by one male and one female
speaker, followed by the word seven pronounced by the
same two speakers. The other stimulus set was derived from
the words enter, erase, and seven, all pronounced by the
male speaker, followed by the word seven pronounced by the
female speaker. Each of the 256 frequency-shifted sounds in
the stimulus set was presented once, in a pseudorandom order.
Studies were conducted in albino rats (350-400 g) according
to approved protocols. A surgical plane of anesthesia was
induced by a halothane/oxygen mixture. The airway was
stabilized by orotracheal intubation. After i.v. access was established in the tail vein, anesthesia was maintained with sodium
pentobarbital, titrated to eliminate reflexes without causing
undue cardiorespiratory suppression. The rat was mechanically

ventilated as heart rate and blood oxygenation were monitored.


The rat was mounted in a stereotaxic frame with hollow ear
bars to permit acoustic stimulation. Sounds were presented
via electrostatic drivers with a closed sound delivery system
[22]. Electrodes were stereotaxically lowered to the auditory
thalamus, and the location was confirmed by acutely recording
evoked neuronal responses to a search stimulus. The search
stimulus was comprised of 500-ms bursts of dynamic ripple
noise, Gaussian noise, and ICRA noise, with spectral characteristics adapted to the frequency range of hearing of the rat
[23]. Between the sound bursts, silence was maintained for 500
ms. In separate acute recording experiments where electrode
location was verified histologically, this procedure leads to
a bias toward recording from the ventral division of the medial
geniculate body (auditory thalamus). When auditory driven
responses were found, the electrode was held stationary while
the full sound-set was presented in a fixed, pseudorandom
order. Recordings were collected under computer control
(Brainware, TDT Technologies). The complete raw waveforms
were simultaneously streamed to a second computer and sorted
offline using a variant of Lewicki's Bayesian algorithm [24].
C. Analysis
1) Psychophysical Analysis: Percent correct on sentence
identification and emotional prosody recognition tasks was
tabulated for each frequency resolution. Emotional prosody
identification performance was compared to age-group norms
for the Florida Affect Battery [5]. Although conventional
two-alternative forced-choice signal detection theory does not
strictly apply, the hit rate and false-alarm rate for each emotion
were examined for signs of statistical choice bias. In any given
test condition, a statistical bias away from or towards choosing
a particular emotion may reflect that cues which support identification of that emotion are more or less ambiguous. Hit rate
for each emotion was estimated as the ratio of the number of
correct choices of that emotion to the number of instances of
that emotion in the test. False alarm rate was estimated as the
ratio of the number of incorrect choices of that emotion to the
number of instances of other emotions in the test. Logistic functions modified to range between chance performance and 100%
were fit to the percent correct for the total pool of listeners.
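The hit-rate and false-alarm-rate definitions used here reduce to simple tabulations over trials; a minimal sketch with illustrative (not measured) trial labels:

```python
import numpy as np

def hit_false_alarm(true_labels, chosen_labels, emotion):
    """Hit and false-alarm rates for one emotion, as defined in the text."""
    true_labels = np.asarray(true_labels)
    chosen_labels = np.asarray(chosen_labels)
    is_target = true_labels == emotion
    hit_rate = np.mean(chosen_labels[is_target] == emotion)
    false_alarm_rate = np.mean(chosen_labels[~is_target] == emotion)
    return hit_rate, false_alarm_rate

# Illustrative trial data (not the study's data).
truth = ["happy", "sad", "angry", "happy", "neutral", "fearful"]
resp  = ["happy", "sad", "angry", "sad",   "neutral", "sad"]
print(hit_false_alarm(truth, resp, "sad"))   # (1.0, 0.4)
```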
The sentences, their emotional intonation, and the chance
performance level for the emotional prosody recognition and
sentence recognition tasks are tabulated in Table I. Chance performance is estimated as $1/2^I$, where the Shannon entropy is defined as $I = -\sum p(\mathrm{response}) \log_2 p(\mathrm{response})$, and $p(\mathrm{response})$ is the number of trials for which a response is correct divided by the total number of trials. Listeners received
no instructions on how likely they were to encounter any of
the sentences. To obtain error bars for the estimated frequency
resolution at each percent correct, the modified logistic function
was fit to each individual listener's performance, and evaluated
at equally spaced values of percent correct, from 35% to 95%.
Horizontal error bars show the range of the middle half of the
individual fits at each percent correct. Vertical error bars show
the standard error of the pooled observations at each filter-bank
resolution.
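A sketch of the entropy-based chance estimate and the modified logistic fit described above. The percent-correct values below are illustrative placeholders, not the study's data, and the exact parameterization of the logistic is an assumption.

```python
import numpy as np
from scipy.optimize import curve_fit

def chance_level(correct_response_probs):
    """Chance performance 2**(-I), with I = -sum p*log2(p) over responses."""
    p = np.asarray(correct_response_probs, dtype=float)
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))
    return 2.0 ** (-entropy)

def modified_logistic(log2_bands, midpoint, slope, chance):
    """Logistic constrained to range between chance and 100% correct."""
    return chance + (1.0 - chance) / (1.0 + np.exp(-(log2_bands - midpoint) / slope))

print(chance_level([0.2] * 5))        # 0.2 for five equally likely responses

# Illustrative fit over filter-bank resolutions of 1 to 64 bands.
bands = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
pc = np.array([0.22, 0.25, 0.40, 0.65, 0.85, 0.92, 0.94])   # placeholder values
chance = 0.20                                               # five prosody choices
popt, _ = curve_fit(lambda x, m, s: modified_logistic(x, m, s, chance),
                    np.log2(bands), pc, p0=[3.0, 1.0])
print(popt)                                                 # midpoint and slope
```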


2) Physiological Analysis: In order to compare neuronal


responses to the auditory stimulus, a brain-based time-frequency representation of the stimulus ensemble was overlaid on a filter-bank based time-frequency representation of
the base stimulus. Acoustic frequency is thought to be one
important determinant of the neuronal response, through a
neuronal pathway originating at the cochlea, which performs
a nonlinear mechanical frequency analysis. For the analysis
presented here, the entire system from the ear to the recording
site is considered as if it acts as one filter in a biological
dynamic signal analyzer. Shifting the correspondence between
the frequency band of the carrier sounds, and the frequency
band from which the modulation envelopes are derived, has
the effect of shifting the position of the sound on the receptor
surface of the cochlea. The neuronal response to each stimulus
was used to generate a sequence of hash marks, with each mark
corresponding to one observed spike. The hashmark sequences
were overlaid on the filter-bank based representation, in order
of stimulus frequency shift. The overlay therefore provides a
direct comparison between the filter-bank representation and
the biological representation. The overlay plot depends upon
a single free parameter, the offset between filter-bank bands
and the stimulus frequency-band shift evoking the overlaid
responses. This is necessary because the frequency tuning of
the neuronal recording site under observation is not known
a priori. In order to choose the offset, the cross-correlation
between the observed neuronal response and the filter-bank
based representation is computed for each of the 256 possible
offsets, and the offset yielding the greatest cross-correlation is
chosen.
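The offset search described above can be written as a maximization of correlation over cyclic row shifts. The sketch below uses toy data and illustrative names, not the study's analysis code.

```python
import numpy as np

def best_band_offset(responses, envelopes):
    """Choose the offset aligning stimulus shifts with filter-bank bands.

    responses: (n_shifts, n_time) binned spike counts, one row per
        frequency-shifted stimulus.
    envelopes: (n_bands, n_time) filter-bank envelope representation of the
        base stimulus (n_bands == n_shifts).
    Returns the row offset maximizing the correlation between the overlaid
    response raster and the time-frequency representation.
    """
    n = responses.shape[0]
    r = responses - responses.mean()
    e = envelopes - envelopes.mean()
    scores = [np.sum(np.roll(r, k, axis=0) * e) for k in range(n)]
    return int(np.argmax(scores))

# Toy check: responses built as a rolled, noisy copy of the envelopes.
env = np.random.rand(256, 200)
resp = np.roll(env, 40, axis=0) + 0.1 * np.random.rand(256, 200)
print(best_band_offset(resp, env))   # recovers an offset of -40 mod 256, i.e. 216
```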
Neuronal responses to the frequency-shifted stimulus ensemble were used to estimate a spectro-temporal receptive
field (STRF). A STRF is a linear transfer function between
a time-frequency representation of a sound, and an evoked
neuronal response. This analysis employed the envelope
outputs from the analysis filter-bank as the time-frequency
representation. This is a natural choice for this experiment,
because each frequency band maps to a similar expanse along
the cochlea, and because it is directly related to the voice
transformation. It is, however, based upon a different frequency
scale from the traditional fast Fourier transform (FFT)-based
spectrogram. Computations were performed using STRFPAK
[25] (http://strfpak.berkeley.edu/index.html), a Matlab toolbox
for stimulus-response system identification in sensory systems.
It seeks to minimize the mean-squared error between the predicted and actual response. Because neuronal responses during
the silent portions of the auditory stimuli were sparse, silent
intervals were removed from the stimuli and evoked responses
before STRF estimation. The experimentally observed neuronal
responses were compared both to the predicted response, and
to the filter-bank output.
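As a rough stand-in for the STRFPAK estimation used here, a ridge-regularized least-squares STRF and its prediction can be sketched as follows. This is not the toolbox's algorithm, only the underlying linear model, and the names and toy data are illustrative.

```python
import numpy as np

def estimate_strf(stim_tf, response, n_lags, ridge=1.0):
    """Ridge-regularized linear STRF estimate.

    stim_tf: (n_bands, n_time) time-frequency representation (filter-bank
        envelopes); response: (n_time,) spike counts per time bin.
    Returns an (n_bands, n_lags) STRF mapping recent stimulus history to
    the predicted response.
    """
    n_bands, n_time = stim_tf.shape
    # Design matrix of lagged stimulus values.
    X = np.zeros((n_time, n_bands * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_bands:(lag + 1) * n_bands] = stim_tf[:, :n_time - lag].T
    y = response - response.mean()
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
    return w.reshape(n_lags, n_bands).T

def predict_response(strf, stim_tf):
    """Convolve the STRF with the stimulus to predict the response."""
    n_bands, n_lags = strf.shape
    n_time = stim_tf.shape[1]
    pred = np.zeros(n_time)
    for lag in range(n_lags):
        pred[lag:] += strf[:, lag] @ stim_tf[:, :n_time - lag]
    return pred

# Toy usage with random data (illustrative only).
env = np.random.rand(16, 1000)
spikes = np.random.poisson(1.0, 1000).astype(float)
strf = estimate_strf(env, spikes, n_lags=20)
pred = predict_response(strf, env)
```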
III. RESULTS
A. Emotional Prosody Recognition, Control Task
Observer performance in the control task (93.3% correct), in which the sentences were unmodified, was comparable to age-group norms (95.3% correct [5]).


TABLE IV
EMOTIONAL PROSODY RECOGNITION: CONTROL PERFORMANCE. EMOTIONAL
PROSODY HIT RATE AND FALSE ALARM RATE

Fig. 6. Sentence recognition. The vertical axis represents the overall percent of
correct choices for 12 observers. The horizontal axis represents the number of
frequency bands in reduced-spectral-resolution sentences. The smooth curve is a fit of a modified logistic function constrained to range between chance (23%) and 100% correct performance. Vertical error bars show the standard error of
the observations. Chance performance is denoted by a horizontal dashed line.
Shaded regions are approximations to the ERB regions shown in Fig. 2.

Across all 24 observers listening to four examples of each


prosody, angry was always recognized, while happy was incorrectly chosen only four times (see Table IV). For fearful, the
hit rate was relatively low, although the false alarm rate was
comparable to the false alarm rates for sad and neutral. Together, these are evidence of a statistical bias against choosing
fearful. Many listeners commented that it was more difficult to
recognize sad and fearful prosody.
B. Sentence Recognition
Consistent with previous reports [6], [17], listeners correctly
identified the sentences, based upon the words in the sentence,
with low spectral resolution. Sentences were clearly recognized
when the filter-bank had four or more bands, corresponding
to channel bandwidths of approximately 1.6 octaves (Fig. 6).
Shaded regions i, ii, and iii in this figure correspond to the same regions in Fig. 2. For example, region i in Fig. 6
corresponds to filter bandwidth between one and two times the
estimated ERB [16], approximating perceptual frequency bandwidth. Perceptual bandwidth is proportional to frequency. These
results are comparable to a body of previous results finding
that recognition of noise-vocoded speech requires from 4 to 16
bands, depending on the particulars of the listening task and
signal processing [6], [8], [26]-[28].
C. Emotional Prosody Recognition
Much greater frequency resolution was required to correctly
identify emotional prosody than to identify sentences. Recognizing some emotions required finer frequency resolution than
others (Fig. 8). For filter-bank processed sentences, performance reached the age-appropriate norm for emotional prosody identification at a resolution between 16 and 32 bands (Fig. 7), for which the filter
bandwidths approximate the bandwidth of auditory perceptual
channels (Fig. 2). The frequency resolution needed to attain a
given level of performance decreased by a factor of two when


TABLE V
EMOTIONAL PROSODY PERFORMANCE: HITS AND FALSE ALARMS. EMOTIONAL
HIT RATE (HIT) AND FALSE ALARM RATE (FA) FOR RECOGNITION OF
EACH EMOTIONAL PROSODY, FOR EACH FREQUENCY RESOLUTION. LETTER
SUPERSCRIPTS CORRESPOND TO POINTS MADE IN THE TEXT

Fig. 7. Emotional prosody recognition. The vertical axis represents the overall
percent of correct choices for 12 observers. Chance performance, 20%, is shown
by a dashed line. The sigmoidal curves display fits of a modified logistic function constrained to range between chance and 100% correct performance. The
smooth curve is based on all data items. Circles and vertical error bars show the
mean performance for all listeners, and the standard error of the means. Horizontal error bars show the range from the 25th percentile to the 75th percentile of
curves for individual observers. The dashed curve is based on restricted analysis
including items with happy, angry, and neutral prosody only. Shaded regions
are approximations to the ERB regions, as explained in Fig. 6.

Fig. 8. Emotional prosody recognition by emotional category. Each curve represents the overall percent of correct choices for 12 observers, considering only observations of sounds from a single emotional category. Chance performance,
20%, is shown by a dashed line. Smooth curves display fits of a modified logistic
function constrained to range between chance and 100% correct performance.
Shaded regions are approximations to the ERB regions, as explained in Fig. 6.
H, happy; S, sad; A, angry; F, fearful; N, neutral.

the analysis was restricted to items with happy, angry, and neutral prosody.
At the highest frequency resolutions, the pattern of prosody
recognition performance was similar to the performance pattern
with control sentences (Table V). With both unprocessed and vocoded speech, the most accurately identified emotional intonations were happy and angry (Table V). In all cases, neutral
intonation was recognized at an intermediate level. In all cases,
the false alarm rate for sad and fearful intonation is relatively
high, and sad prosody recognition performance was relatively
low. An interesting difference was that with unprocessed sentences, the lowest emotional prosody recognition performance
was observed with fearful prosody, while with vocoded speech the lowest performance was observed with sad speech (Table V).
Recognizing happy intonation depended strongly on frequency resolution (Fig. 8). Recognition of happy intonation rose abruptly from near chance to relatively accurate at four bands.
Considering the hit and false alarm rates together suggests a
statistical choice bias against identifying emotional prosody
as happy. At low frequency resolution (1 and 2 bands), the
hit rate for happy was less than chance, while the false-alarm

rate for happy is comparable to or less than the false alarm


rate for the other emotions. As frequency resolution increases
through the intermediate range, the hit rate for happy prosody
increases rapidly with the false alarm rate comparable to the
other prosodies. At high-frequency resolutions happy is both
recognized with high accuracy, and rarely chosen incorrectly.
Angry prosody is correctly identified at rates well above chance even at the lowest frequency resolutions (Table V). At moderate frequency resolutions (eight bands), the false alarm rate for angry is comparable to the false alarm rate for other emotions; this, coupled with a relatively high hit rate for angry, provides evidence for a statistical bias towards identifying emotional prosody as angry (Table V).
D. Neuronal Responses to Shifted Sounds
Fig. 9 shows the neural responses from a single recording
overlaid on the filter-bank based time-frequency representation of the vocal sequence, four four seven seven. Overall,
the neural response shows low spontaneous activity, robust responses to response to features such as onsets and
unvoiced-to-voiced transitions, and temporally-scattered excitatory responses to the vowels. Notably, there are wide ranges
of frequency shifts over which the phasic responses are insensitive to the frequency shifting, i.e., showing equal magnitude of
firing with high temporal precision relative to the onset of the
vocalization. An example of this is at the onset of the female
spoken four (Fig. 9). At other times, frequency shifting seemingly results in a dramatic change in the timing and magnitude
of the response. An example of this is the increased response


to "seven" spoken by the male speaker at 2.38 kHz and 4.9 s (Fig. 9).

Fig. 9. Neuronal response compared to time-frequency representation. This illustrates the response of one neuronal recording to an ensemble of frequency-shifted sounds, overlaid on a filter-bank based time-frequency representation of the sequence "four four seven seven". Each row of black hash marks represents the response to one of the frequency-shifted sounds, with hash marks placed at the times of action potential responses. The raster shows how a population of neurons with the same dynamical properties as the observed recording, but different frequency tuning, would respond to the original sound, if the population could be observed directly.
In addition to relating neuronal responses directly to the stimulus time-frequency representation, Figs. 10 and 11 relate two
example neuronal responses to a linear STRF-based prediction
of the response. In Figs. 10(a) and 11(a), neuronal responses are
overlaid on the time-frequency representation of the vocal sequence "enter erase seven seven", in the manner of Fig. 9.
Linear STRFs estimated from stimulus-response pairings are illustrated in Figs. 10(b) and 11(b). These STRFs (see Methods)
in turn can generate predictions of the neuronal response, as illustrated in Figs. 10(c) and 11(c). The actual neuronal responses
are overlaid on the predicted responses in Figs. 10(d) and 11(d).
Often there is a correspondence between the STRF prediction of a large magnitude response and the observation that the response is largely invariant to the stimulus frequency shift (e.g., Fig. 11(d), 1600 ms, shift 150 to 175), although there are instances where a large magnitude response is predicted and few or no neuronal responses are observed (e.g., Fig. 11(d), 2400 ms, shift 150 to 175).
The recorded responses differ from the large magnitude predictions in that the recorded responses tend to have greater temporal precision, as is visualized by the tight temporal clustering
of action potentials relative to the broad yellow (hot) areas (e.g.,
Fig. 10(d), 200 ms, neighborhood of shift 170). Finally, at the
upper and lower boundaries of the high magnitude predicted responses, the responses show an unpredicted increase in latency
relative to the other temporally-clustered responses.
Often, where a moderate neuronal response is predicted, a
neuronal response occurs, but there is considerable variability
in the match between the magnitude and duration of the prediction and the magnitude and duration of the observed response.
When little or no neuronal response is predicted, little or none
occurs. Most importantly, deviations from the prediction appear

to be structured rather than random, perhaps related to speech features.

Fig. 10. Example neuronal response compared to time-frequency representation and STRF prediction. This illustrates the response of one neuronal recording to an ensemble of frequency-shifted sounds derived from the vocal sequence "erase enter seven seven". A) Direct comparison of the response to the time-frequency representation of the stimulus, in the manner of Fig. 9. B) STRF estimated from the responses to the stimulus ensemble. C) STRF-based prediction of the response to the stimulus ensemble. D) Recorded response overlaid on the STRF-based prediction.
IV. DISCUSSION
This paper employs voice transformation in a novel mode to
circumvent some of the limitations of brain mapping in the study
of the central neuronal representation of speech and other natural sounds. The physiological finding is that differences between the observed neuronal response and its linear prediction
have nonrandom structure, which may be related to features of

the voice or speech. In the examples presented here, the unvoiced-to-voiced transition is associated with this effect.

Fig. 11. Another example neuronal response compared to time-frequency representation and STRF prediction. A second example recording, with the same layout as Fig. 10.

The main perceptual findings of this report are 1) in signals processed by a noise-excited vocoder, frequency resolution similar to auditory perceptual frequency resolution is both necessary and sufficient to support emotional prosody identification, and 2) it is possible for emotional prosody to be identified within normal performance limits when speech is transformed in a way that does not preserve temporal fine-structure. Also of note is that angry prosody is identified at levels well above chance even when a one-band filter-bank is employed, and happy prosody is identified almost perfectly with frequency resolution somewhat less than auditory perceptual frequency resolution. These observations suggest that although emotional prosody identification may be ultimately limited by frequency resolution, limited judgments about emotional prosody can be made on the basis of mainly temporal cues.

The overall quality of the voice transformation is sufficient to support physiological studies of the central auditory representation of speech phonetics and the emotional prosody in speech.

A. Psychophysical Task Difficulty

While the present study shows that accurate emotional


prosody identification requires auditory frequency resolution
approximating perceptual frequency resolution, an earlier
study [12] suggested that high audio fidelity is not required to
identify emotional prosody. These differences could be related
to task difficulty. Lakshminarayanan et al. assessed emotional
prosody identification using sentences which were degraded
in four ways: 1) vocoding to preserve pitch and the intensity
contour, while removing formants, 2) vocoding to preserve
pitch and the intensity contour, while replacing the phonetically
modulated formants with constant formants corresponding to
the vowel /ae/, 3) low-pass filtering (band = 80300 Hz), and 4)
high-pass filtering (band = 18302240 Hz). Listeners identified
tokens as happy, angry, or sad. They reported that emotional
prosody recognition was near normal for conditions 1) and 4),
substantially impaired for sad tokens only in condition 3), and
substantially impaired for condition 3) for all tokens. In our
experiment, happy, angry and neutral tokens are the most easily
identified. When our analysis was restricted to these emotional
prosodies the performance curve both became steeper, and
shifted slightly to the left (better performance). The percent of
correctly identified sentences would be expected to increase
further if listeners had only three choices, instead of five.
Two other studies, which employed a relatively complex
affective prosody task, measured emotional prosody identification as a function of stimulus quality. Smith and Faulkner
[7] evaluated identification of a mix of linguistic and affective
intonations in speech processed by a noise-excited vocoder
similar to that employed in our perceptual experiment. The
vocoder employed filter-banks with 1, 4, 8, and 16 bands, with
filter bandwidths chosen to occupy a constant proportion of
cochlear distance. In addition, the vocoder limited the envelope bandwidth to either 32 or 400 Hz. In an additional
manipulation, the envelope modulations in the band from
50-400 Hz were enhanced in order to increase the salience
of the fundamental frequency. Near normal recognition of
tokens expressing the affect "question expressing disbelief"
was achieved by vocoders employing 8 and 16 band filter
banks, regardless of envelope manipulation. Although some
of the other intonations were more often correctly recognized
than others, near normal recognition was not observed for
other tokens with any of the tested signal processing strategies.
Lieberman and Michaels [29] employed a POVO-type synthesizer to produce tokens with a constant or variable fundamental
frequency contour, and a natural or band-limited amplitude
envelope. The emotional intonation was not recognized at near
normal levels in any of the synthesized tokens. In these studies,
prosody identification performance using signal processed
stimuli never reached control-level performance [7], [29],
although the maximum audio fidelity employed was somewhat
less than obtained using our many-band noise-excited vocoders.

BEDENBAUGH et al.: PROSODY-PRESERVING VOICE TRANSFORMATION TO EVALUATE BRAIN REPRESENTATIONS OF SPEECH SOUNDS

The possibility remains that if the task in the present study had been more difficult, normal prosody identification might have
required higher frequency resolution, or might not have been
attainable with speech processed by a noise-excited vocoder.
In the present and in related studies [7], [12], [29], the
(simulated) emotional intonations were validated to be good
exemplars of the intended emotion by an age-matched group
of normal listeners [5]. The experimental tokens employed by
Lieberman and Michaels [29] were validated by the judgments
of a group of linguistically naive listeners. The experimental
tokens employed by Smith et al. [7] were validated by a group
of final-year speech and language therapy students. The tokens
employed by Lakshminarayanan et al. [12] were validated by
the judgments of a group of undergraduate students. Tokens in
which the emotional prosody is easily identified are appropriate
for such studies, because ease of identification allows perceptual judgments to be limited principally by acoustic quality, not
by the cognitive load of emotional processing. Even though the
baseline emotional prosody identification task is easy, it is still
more difficult than the baseline sentence identification task.
Emotional prosody identification clearly requires higher audio
fidelity than sentence identification, but the difference in task
difficulty limits our ability to quantify this difference.
B. Fine-Structure and Emotional Prosody Perception
Since the noise-excited vocoder eliminates temporal
fine-structure, these data show that fine-structure is not required for emotional prosody identification. Although similar
speech and emotional prosody recognition was obtained using
transformed sounds, the sounds were obviously distorted versions of the original [5]. Filter-banks with more bands than
necessary to achieve emotional prosody recognition generate
sounds with improved voice quality, and one has the impression
that it is easier to discriminate different speakers' voices.
C. Neuronal Responses to Shifted Sounds
Neuronal responses were compared directly to a filter-bank
based time-frequency representation, and to a linear prediction of the response to the stimulus ensemble. The stimulus-evoked responses occasionally showed higher response-time precision for large magnitude responses. The temporal spread of the response at these times was less than the duration over which a large magnitude response was predicted by the STRF.
Such precisely timed responses were relatively insensitive to
the particular member of the stimulus ensemble, and qualitatively could be described as having an all-or-none character. At
other instances during the response, the timing and magnitude
of the response varied more as the stimulus changed, and
qualitatively had the appearance of a transient modulation of
neuronal firing rate, such as is employed in conventional neural
network models.
A necessary assumption of this approach is that the auditory
system processes each frequency band in the same way. This is
a good starting assumption, as it has supported the implementation of many successful auditory processing algorithms, such as
those employed in auditory prostheses and audio data compression algorithms, and is supported by many psychophysical experiments. Despite this, two observations suggest that it may not


be a universally valid assumption. First, information transmitted in


the low acoustic frequencies may be processed differently from
information in the high acoustic frequencies. Auditory nerve
fibers tuned to low frequencies respond in phase with an acoustic
stimulus, whereas auditory nerve fibers tuned to high frequencies modulate their responses with acoustic intensity, but with
no specific phase relationship. Second, at least in humans, auditory processing systems interface to speech and language symbolic processing systems, and may depend on learning, which
may be specific to each frequency band.
An assumption of carrying out such experiments with animal
models is that different auditory systems process sounds in the
same way. This is a good starting assumption. Many animals
can learn to respond to vocal commands, and therefore have
a basic ability to process sounds in a framework compatible
with human vocal productions. Furthermore, trained animals respond to speech categorically, with categorical boundaries similar to those found in humans [30]. An additional advantage of
mammalian animal models for the studies such as these is that
animals have auditory systems organized according to a mammalian plan, but neutral with respect to language and experience-dependent phonetic plasticity. Comparisons of data sets
such as this, obtained by employing this approach in naive animals, in animals trained to recognize speech, and in humans, can potentially generate specific hypotheses about how learning
and experience specialize the auditory system to support skilled
perception of speech and language.
The example recordings were obtained from the lemniscal
division of the auditory thalamus (the ventral medial geniculate nucleus). This nucleus lies along the main ascending auditory pathway. We chose this location for the initial recordings
because we hypothesize that the auditory thalamus and cortex
are together involved in a transformation of the neuronal representation of sound from a representation that more directly
reflects the time-evolution of the acoustic spectrum to a representation that is in a format compatible with brain mechanisms
of symbol and sequence processing [1]. We hypothesize that the approach developed here will show that neurons in the thalamus and more central stations are modulated by acoustic features
which vary on the time scale of transitions within and between
phonemes (tens to hundreds of milliseconds). Examples of such
features are transitions from voiced to unvoiced vocalization,
shifts in formant frequencies, pauses in sound energy, and modulation of fundamental frequency. Association of errors in the
prediction of the STRF model with such features would highlight their relevance. A similar logic can be used to highlight a
potential contribution of experience dependent plasticity to auditory processing. Recordings obtained from brains experienced
in making judgments about sounds with a particular shift may
show a preponderance of deviations from the predictions of the
STRF model associated with that shift and neighboring shifts.
Zero shift is the most obvious shift to consider in this way, but
in principle such effects could be demonstrated in experimental
subjects trained to discriminate or recognize sounds of any particular shift.
An adaptive signal processing model may improve these predictions. The model behind the STRF estimate is that neuronal responses are generated by a linear system with added random noise. The observed error often takes the form of little or no response in instances where a response is predicted to occur, and such errors may be related to transitions and other features of the speech signal. Since the errors are not random, it is reasonable to explore adaptive or nonstationary models. There are many physiological mechanisms which could support such adaptive or sequence-dependent processing within the central nervous system, over a range of time scales [31]–[34]. Such models could lead to improvements in hearing aid and cochlear implant speech processors, and could prove critical for optimum performance of more central auditory prostheses.
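As a point of reference for that argument, the following sketch writes out the linear prediction assumed by the STRF model together with a deliberately simple adaptive correction (a normalized-LMS gain track). The adaptive step is a stand-in chosen for brevity, not the authors' method, and all function names and parameters are hypothetical.

```python
import numpy as np

def strf_predict(strf, spectrogram):
    """Linear STRF prediction: r_hat(t) = sum over f, tau of
    STRF(f, tau) * S(f, t - tau), with strf of shape (n_freq, n_lags)
    and spectrogram of shape (n_freq, n_time)."""
    n_freq, _ = strf.shape
    n_time = spectrogram.shape[1]
    r_hat = np.zeros(n_time)
    for f in range(n_freq):
        # full convolution truncated to the causal part
        r_hat += np.convolve(spectrogram[f], strf[f])[:n_time]
    return r_hat

def lms_gain_track(rate, r_hat, mu=0.01):
    """Track a slowly varying output gain g(t) with normalized LMS so that
    g(t) * r_hat(t) follows the observed rate; a simple stand-in for the
    adaptive or nonstationary models discussed in the text."""
    g = 1.0
    corrected = np.empty_like(r_hat)
    for t in range(len(r_hat)):
        corrected[t] = g * r_hat[t]
        err = rate[t] - corrected[t]
        g += mu * err * r_hat[t] / (1e-9 + r_hat[t] ** 2)
    return corrected
```

Comparing the residual of the fixed linear prediction with the residual after such a slowly adapting correction is one simple way to ask whether a nonstationary component is present in the response.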
ACKNOWLEDGMENT
The authors would like to thank C. Leonard, S. Nagarajan,
D. Poeppel, J. Sanchez, R. Shrivastav, and B. Wright for helpful
comments on an earlier version of this manuscript. They would
also like to thank M. Kvale for providing the spike sorting program. This study originated in the course of project NIDCD DC004523.
REFERENCES
[1] P. Bedenbaugh, "Auditory physiology: Cortical assistance for the auditory signals-to-symbols transformation," Current Biol., no. 4, pp. R127–R129, Feb. 2006.
[2] H. Lim, T. Lenarz, G. Joseph, R.-D. Battmer, J. Patrick, and M. Lenarz, "Effects of phase duration and pulse rate on loudness and pitch percepts in the first auditory midbrain implant patients: Comparison to cochlear implant and auditory brainstem implant results," Neuroscience, vol. 154, no. 1, pp. 370–380, Jun. 2008.
[3] S. S. Nagarajan, S. W. Cheung, P. Bedenbaugh, R. E. Beitel, C. E. Schreiner, and M. M. Merzenich, "Representation of spectral and temporal envelope of twitter vocalizations in common marmoset primary auditory cortex," J. Neurophysiol., vol. 87, pp. 1723–1737, 2002.
[4] S. W. Cheung, P. H. Bedenbaugh, S. S. Nagarajan, and C. E. Schreiner, "Functional organization of squirrel monkey primary auditory cortex: Responses to pure tones," J. Neurophysiol., vol. 85, pp. 1732–1749, 2001.
[5] D. Bowers, L. X. Blonder, and K. M. Heilman, Florida Affect Battery. Gainesville, FL: Univ. of Florida Brain Inst., Center for Neuropsychol. Studies, 1991.
[6] R. V. Shannon, F. G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid, "Speech recognition with primarily temporal cues," Science, vol. 270, no. 5234, pp. 303–304, Oct. 1995.
[7] M. Smith and A. Faulkner, "Effects of the number of speech-bands and envelope smoothing condition on the ability to identify intonational patterns through a simulated cochlear implant speech processor," Speech Hearing and Lang., Dept. of Phonetics and Linguist., Univ. College London, 2002, work in progress.
[8] T. Green, A. Faulkner, and S. Rosen, "Spectral and temporal cues to pitch in noise-excited vocoder simulations of continuous-interleaved-sampling cochlear implants," J. Acoust. Soc. Amer., vol. 112, no. 5, pp. 2155–2164, Nov. 2002.
[9] T. Green, A. Faulkner, and S. Rosen, "Enhancing temporal cues to voice pitch in continuous interleaved sampling cochlear implants," J. Acoust. Soc. Amer., vol. 116, no. 4, pp. 2298–2310, Oct. 2004.
[10] G. H. Monrad-Krohn, "Dysprosody or altered melody of language," Brain, vol. 70, pp. 405–423, 1947.
[11] K. R. Scherer, "Vocal affect expression: A review and a model for future research," Psychol. Bull., vol. 99, no. 2, pp. 143–165, 1986.
[12] K. Lakshminarayanan, D. B. Shalom, V. van Wassenhove, D. Orbelo, J. Houde, and D. Poeppel, "The effect of spectral manipulations on the identification of affective and linguistic prosody," Brain Lang., vol. 84, pp. 250–263, 2003.
[13] A. J. Oxenham and S. P. Bacon, "Cochlear compression: Perceptual measures and implications for normal and impaired hearing," Ear Hear., vol. 24, no. 5, pp. 352–366, Oct. 2003.
[14] S. Gatehouse and W. Noble, "The speech, spatial and qualities of hearing scale (SSQ)," Int. J. Audiol., vol. 43, no. 2, pp. 85–99, Feb. 2004.

[15] L. S. Chester, A. E. Holmes, and P. Bedenbaugh, "Emotional prosody recognition in hearing impaired listeners," in Proc. Amer. Acad. Audiol. 17th Annu. Conv. Expo, Apr. 1, 2005, Presentation PP306.
[16] B. R. Glasberg and B. C. J. Moore, "Derivation of auditory filter shapes from notched-noise data," Hear. Res., vol. 47, no. 1–2, pp. 103–138, 1990.
[17] Z. M. Smith, B. Delgutte, and A. J. Oxenham, "Chimaeric sounds reveal dichotomies in auditory perception," Nature, vol. 416, pp. 87–90, Mar. 2002.
[18] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach, "Emotional speech: Towards a new generation of databases," Speech Commun., vol. 40, no. 1–2, pp. 33–60, Apr. 2003.
[19] K. R. Scherer, "Vocal communication of emotion: A review of research paradigms," Speech Commun., vol. 40, no. 1–2, pp. 227–256, 2003.
[20] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis. New York: Elsevier, 1995.
[21] R. G. Leonard and G. R. Doddington, "A Speaker-Independent Connected-Digit Database," Tech. Rep., Texas Instruments Inc., Dallas, TX, 1982.
[22] W. G. Sokolich, "Closed sound delivery system," U.S. Patent 4,251,686, 1981.
[23] E. M. Martin, M. F. West, and P. H. Bedenbaugh, "Masking and scrambling in the auditory thalamus of awake rats by Gaussian and modulated noises," Proc. Nat. Acad. Sci., vol. 101, no. 41, pp. 14961–14965, Oct. 2004.
[24] M. S. Lewicki, "Bayesian modeling and classification of neural signals," Neural Comput., vol. 6, no. 5, pp. 1005–1030, 1994.
[25] F. E. Theunissen, S. V. David, N. C. Singh, A. Hsu, W. E. Vinje, and J. L. Gallant, "Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli," Network, vol. 12, no. 3, pp. 289–316, Aug. 2001.
[26] M. F. Dorman, P. C. Loizou, J. Fitzke, and Z. Tu, "The recognition of sentences in noise by normal-hearing listeners using simulations of cochlear-implant signal processors with 6–20 channels," J. Acoust. Soc. Amer., vol. 104, no. 6, pp. 3583–3585, Dec. 1998.
[27] P. C. Loizou, M. F. Dorman, and Z. Tu, "On the number of channels needed to understand speech," J. Acoust. Soc. Amer., vol. 106, no. 4, pt. 1, pp. 2097–2103, Oct. 1999.
[28] A. Faulkner, S. Rosen, and L. Wilkinson, "Effects of the number of channels and speech-to-noise ratio on rate of connected discourse tracking through a simulated cochlear implant speech processor," Ear Hear., vol. 22, pp. 431–438, Oct. 2001.
[29] P. Lieberman and S. B. Michaels, "Some aspects of fundamental frequency and envelope amplitude as related to the emotional content of speech," J. Acoust. Soc. Amer., vol. 34, no. 7, pp. 922–927, 1962.
[30] P. K. Kuhl and J. D. Miller, "Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli," J. Acoust. Soc. Amer., vol. 63, no. 3, pp. 905–917, Mar. 1978.
[31] H. L. Atwood and S. Karunanithi, "Diversification of synaptic strength: Presynaptic elements," Nature Rev. Neurosci., vol. 3, pp. 497–516, Jul. 2002.
[32] M. Blatow, A. Caputi, N. Burnashev, H. Monyer, and A. Rozov, "Ca2+ buffer saturation underlies paired pulse facilitation in calbindin-D28k-containing terminals," Neuron, vol. 38, pp. 79–88, Apr. 2003.
[33] D. V. Buonomano and M. M. Merzenich, "Net interaction between different forms of short-term synaptic plasticity and slow IPSPs in the hippocampus and auditory cortex," J. Neurophysiol., vol. 80, no. 4, pp. 1765–1774, Oct. 1998.
[34] D. V. Buonomano and M. M. Merzenich, "Cortical plasticity: From synapses to maps," Annu. Rev. Neurosci., vol. 21, pp. 149–186, 1998.

Purvis Bedenbaugh (M'07) received the B.S.E. degree in biomedical engineering from Duke University, Durham, NC, the M.S. degree in bioengineering from Clemson University, Clemson, SC, and the Ph.D. degree in bioengineering from the University of Pennsylvania, Philadelphia.
He is the Director of the biomedical engineering concentration within the newly ABET-accredited general engineering program at East Carolina University, Greenville, NC. He was a Postdoctoral Fellow at the Keck Center for Integrative Neuroscience and Department of Otolaryngology, University of California, San Francisco. Prior to joining the Department of Engineering faculty at East Carolina University, he served on the faculty of the Department of Neuroscience at the University of Florida College of Medicine. In addition to his academic appointment, he serves as Chief Technology Officer for Cranial Medical Systems, Inc.


Diana K. Sarko received the B.S. degree in neuroscience and behavioral biology from Emory University, Druid Hills, GA, and the Ph.D. degree in neuroscience from the University of Florida College of
Medicine, Gainesville.
She is currently a Postdoctoral Fellow in the
Department of Biology, Vanderbilt University,
Nashville, TN. She studies comparative neurobiology, cognition, and behavior, with particular focus
on neurobiology and its relationship to behavior and
cognition in unique species, such as the naked mole rat, star-nosed mole, and manatee.

Heidi L. Roth received the B.A. degree from Harvard College, Cambridge, MA, the M.A. degree in the history of neurosciences from Harvard Graduate School, and the M.D. degree from Harvard Medical School, Health Sciences and Technology (HST) Division.
She is a Neurologist on the faculty of the University of North Carolina at Chapel Hill. She completed a residency in neurology at the Harvard Longwood Program and a fellowship in Behavioral and Cognitive Neurology at the University of Florida. Her academic interests include memory and sleep, diagnosis and treatment of language disorders and aphasia, hemispheric asymmetries and sleep, and primary care and treatment of sleep disorders.


Eugene M. Martin received the B.A. degree in bio-psychology from Franklin and Marshall College, Lancaster, PA, and the Ph.D. degree in neuroscience from the University of Florida College of Medicine, Gainesville.
He is currently a Postdoctoral Associate in the Laboratory of Neurobiology and Behavior at Rockefeller University, New York. He is interested in how the presence of competing stimuli affects sensory encoding and in the neural mechanisms of behavioral arousal.
