Abstract: This study employs a voice transformation to overcome the limitations of brain mapping for studying brain representations of natural sounds such as speech. Brain mapping studies
of natural sound representations, which present a fixed sound to
many neurons with different acoustic frequency selectivity, are difficult to interpret because individual neurons exhibit considerable
unexplained variability in the dynamical aspects of their evoked
responses. This new approach samples how a single recording responds to an ensemble of sounds, instead of sampling an ensemble
of neuronal recordings. A noise-excited filter-bank analysis and resynthesis vocoder systematically shifts the frequency band occupied by sounds in the ensemble. The quality of the voice transformation is assessed by evaluating the number of bands the filter
bank must have to support emotional prosody identification. Perceptual data show that emotional prosody can be recognized within
normal limits if the bandwidth of filter-bank channels is less than
or equal to the bandwidth of perceptual auditory filters. Example
physiological data show that stationary linear transfer functions
cannot fully explain the responses of central auditory neurons to
speech sounds, and that deviations from model predictions are not
random. They may be related to acoustic or articulatory features
of speech.
Index Terms: Auditory system, bioelectric potentials, identification, nervous system, speech analysis, speech coding, speech intelligibility, speech processing.
I. INTRODUCTION
THIS report describes a voice transformation approach to
the study of how central auditory neurons respond to, represent, and encode speech sounds and other natural sounds. Understanding the central auditory representation of speech is of
increasing practical importance. First, enhancing central auditory representations is a valuable objective for auditory prosthesis. One role of the central auditory system is to transform a
representation of mainly acoustic features at the cochlea into
a representation that is compatible with brain symbolic processing systems [1]. Prostheses which enhance the interface to
Fig. 2. Filter-bank bandwidths. The bandwidth of the filters is proportional to the equivalent rectangular bandwidth (ERB) of auditory perceptual filters. The staircase curves show filter-bank bandwidth. Shaded regions demarcate powers of two times the ERB curve, beginning with the one-ERB curve at the lower edge of region i. (i) 1–2 ERB. (ii) 2–4 ERB. (iii) 4–8 ERB. (iv) 8–16 ERB. (v) 16–32 ERB.
below 15 Hz, but its effect on emotional prosody identification, speaker identification, or sentence identification was not
formally assessed. The bandwidth of the temporal envelopes influences the availability and specific form of temporal cues such
as stress pattern and the pitch contour. Because the sounds employed in this study were derived using a filter-bank
with filter bandwidths that are constant in the cochlear distance
domain, rather than constant in the linear frequency domain, the
available temporal cues vary with filter-band frequency.
Filter-bank bandwidths are plotted in Fig. 2. In both panels
of this figure, frequency is plotted along the horizontal axis, and
the bandwidth of the filter-band containing that frequency is
plotted on the vertical axis. Shaded regions show multiples of
the approximate bandwidth of perceptual processing channels.
For example, shaded region i lies between one and two times
the Glasberg [16] estimate of the equivalent rectangular bandwidth (ERB) of auditory perceptual channels. Filter bands
plotted within region i have approximately the frequency resolution of the auditory system. Regions ii, iii, iv, and v correspond
to approximately 1/2, 1/4, 1/8, and 1/16 the perceptual resolution of the auditory system, respectively. Filter bands plotted
above region i have poorer than perceptual frequency resolution, while filter bands plotted below region i have better than perceptual frequency resolution.
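For reference, the ERB estimate cited here has a simple closed form, commonly written as ERB(f) = 24.7(4.37 f/1000 + 1) with f in Hz. A minimal sketch of the curves bounding the shaded regions in Fig. 2, assuming [16] refers to this standard formulation (the function and variable names are illustrative, not the authors'):

```python
import numpy as np

def erb_hz(f_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter centered
    at f_hz, using the standard Glasberg-and-Moore-style formula."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

# Edges of the shaded regions in Fig. 2: powers of two times the ERB curve.
freqs = np.logspace(np.log10(100.0), np.log10(16000.0), 200)   # Hz
region_edges = {f"{2**k} x ERB": (2**k) * erb_hz(freqs) for k in range(6)}
```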
The filter-banks were implemented so that the transfer-function roll-off at the edges of the pass-bands was constant across different frequency resolutions. To achieve this, the filter-bank for a particular resolution was derived by combining adjacent bands of an underlying, high-resolution filter-bank with 128 bands, approximately 20 bands per octave. For example, the magnitude of the transfer function of each filter in a 32-band filter-bank is the sum of the magnitudes of four adjacent filters in the underlying filter-bank. The underlying, high-resolution bands had a Gaussian roll-off about the center frequency, and were separated by one standard deviation, peak to peak, in the cochlear distance domain. Cochlear distance was computed according to an adaptation of Greenwood's cochlear frequency map [17]. For different spectral resolutions, the filter-bank comprised 1, 2, 4, 8, 16, 32, or 64 frequency bands. This corresponds to filter-bank densities of approximately 0.2, 0.3, 0.6,
1.3, 2.5, 5.1, and 10.1 filters per octave. In the single-band condition, a single temporal envelope modulated a single noise carrier waveform, and the spectrum of the carrier matched the spectrum of the original sentence. The filter-bank with 32 bands had a frequency resolution closely matched to the equivalent rectangular bandwidth (ERB) of auditory perceptual channels [16], as shown in Fig. 2. Example spectrograms of the phrase "The boy went to the store," spoken with happy prosody and synthesized with the filter-banks, are shown in Fig. 4. Fourier analysis parameters are the same as in Fig. 3.
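To make the construction concrete, here is a minimal sketch of how a coarser filter-bank could be assembled from the underlying 128-band Gaussian filter-bank described above. The Greenwood map constants used below are the commonly cited human values and stand in for the paper's adapted map; the frequency range, helper names, and sampling of the frequency axis are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Greenwood-style map between frequency (Hz) and normalized cochlear place.
# A, a, k are the commonly used human constants; the paper used an adaptation
# of the map, so treat these values as placeholder assumptions.
A, a, k = 165.4, 2.1, 0.88
def place_from_freq(f):
    return np.log10(f / A + k) / a

def gaussian_bank(f_axis, n_under=128, f_lo=100.0, f_hi=16000.0):
    """Magnitude responses of the underlying high-resolution filter-bank:
    Gaussian roll-off about each center frequency, with centers spaced one
    standard deviation apart on the cochlear-place axis."""
    x_axis = place_from_freq(np.asarray(f_axis, dtype=float))
    centers = np.linspace(place_from_freq(f_lo), place_from_freq(f_hi), n_under)
    sigma = centers[1] - centers[0]          # peak-to-peak spacing = 1 s.d.
    return np.exp(-0.5 * ((x_axis[None, :] - centers[:, None]) / sigma) ** 2)

def coarse_bank(f_axis, n_bands):
    """A coarser filter-bank formed by summing the magnitudes of groups of
    adjacent underlying filters (e.g., four adjacent filters per band for a
    32-band bank), so the roll-off at band edges is the same at every
    resolution."""
    under = gaussian_bank(f_axis)
    group = under.shape[0] // n_bands        # n_bands must divide 128
    return under.reshape(n_bands, group, under.shape[1]).sum(axis=1)
```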
2) Audio Processing for Physiological Testing: For physiological testing in rats, the same vocoder architecture was
adapted to generate an ensemble of frequency-shifted sounds.
Sounds were up-sampled to 97560 samples per second. The
frequency range from 0.1 to 44 kHz was divided into 256
frequency bands, approximately 29 bands per octave, much
greater than required to achieve normal emotional prosody
recognition. A total of 256 stimuli were generated by shifting
the correspondence between the frequency band from which
the envelope was obtained, and the frequency band of the carrier, as diagrammed in Fig. 1. The power in each synthesized,
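A minimal sketch of how the frequency-shifted ensemble could be generated under the vocoder architecture described above. It assumes the temporal envelopes and band-limited noise carriers have already been computed for the 256 analysis bands, and that the envelope-to-carrier correspondence is rotated cyclically; the paper does not state whether the shift wraps around, so the rotation and all names here are illustrative.

```python
import numpy as np

def shifted_stimulus(envelopes, carriers, shift):
    """Resynthesize one member of the frequency-shifted stimulus ensemble.

    envelopes : (n_bands, n_samples) temporal envelopes from the analysis bank.
    carriers  : (n_bands, n_samples) band-limited noise carriers, one per band.
    shift     : number of bands by which the envelope-to-carrier
                correspondence is rotated (0 .. n_bands - 1).
    """
    shifted_env = np.roll(envelopes, shift, axis=0)  # envelope k drives band k+shift
    return (shifted_env * carriers).sum(axis=0)

# Hypothetical usage: one stimulus per shift, e.g. 256 shifts for 256 bands.
# ensemble = [shifted_stimulus(env, car, s) for s in range(256)]
```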
TABLE I
SENTENCES AND THEIR EMOTIONAL PROSODY. THE SENTENCE AND EMOTIONAL INTONATIONS WHICH WERE PRESENTED IN EACH BLOCK OF TRIALS ARE DESCRIBED. EACH LETTER INDICATES THAT THE STIMULUS SET INCLUDED THE SENTENCE PRONOUNCED WITH THE DESIGNATED PROSODY. A REPEATED LETTER MEANS THAT THE SENTENCE WAS REPEATED WITH THAT PROSODY. CHANCE PERFORMANCE IS ESTIMATED AS 1/2^I, WHERE THE SHANNON ENTROPY I IS DEFINED AS I = -Σ p(response) log2 p(response). H, HAPPY. S, SAD. A, ANGRY. F, FEARFUL. N, NEUTRAL
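The chance estimate in the caption above can be read as 2 raised to the negative Shannon entropy of the response distribution. A small worked sketch with hypothetical response probabilities:

```python
import numpy as np

def chance_from_entropy(p_response):
    """Chance performance estimated as 1/2**I, where I is the Shannon
    entropy (in bits) of the response distribution."""
    p = np.asarray(p_response, dtype=float)
    p = p[p > 0]
    I = -(p * np.log2(p)).sum()
    return 2.0 ** (-I)

# Hypothetical example: five equally likely responses -> I = log2(5), chance = 0.2
print(chance_from_entropy([0.2] * 5))   # 0.2
```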
TABLE II
FUNDAMENTAL FREQUENCY OF SENTENCES. MEAN AND STANDARD DEVIATION OF F0 AND COEFFICIENT OF VARIATION OF POWER FOR EACH EMOTIONAL PROSODY
a quiet study room. The sentences are effectively neutral statements, but are pronounced with one of five emotional intonations: happy, sad, angry, fearful, or neutral (Table I). This neuropsychological assessment has been used extensively to measure emotional cognition in patients with stroke or other brain
injury. Comparison data from normal controls are available for young adult, middle-aged, early old, and older adult age groups.
The sentences were pronounced by a female neuropsychologist
with both clinical and research experience in emotional cognition. Although they can be recognized as expressing the intended emotions, the relationship between emotional speech expressed by actors and that which is naturally produced in emotional situations is poorly understood [18], [19].
Representative spectrograms of the phrase "The boy went to the store," spoken with each of the five emotional intonations,
are shown in Fig. 3. To compute the spectrograms, sentences
were sampled at 44100 samples/s, and the DFT of 2205 Hanning windowed samples was computed every 551 samples. The
resulting images were scaled in dB, and plotted with a common
gray scale. Table II shows the mean and standard deviation of the F0 for the voiced portions of all sentences with each emotional prosody. The coefficient of variation (standard deviation/mean) of power is shown separately for voiced and unvoiced portions. F0 and power statistics were calculated using the Entropic/ESPS function get_f0 [20]. The happy and fearful sentences have relatively high F0, while F0 is more variable for the happy and sad prosody sentences. These observations suggest
that frequency cues might be particularly important for identifying happy prosody. Voiced power was relatively more variable for happy, angry, and fearful sentences, while unvoiced power was relatively more variable for happy and fearful sentences. Unvoiced power was relatively less variable for sad sentences.
These observations suggest that less variable power could be a
cue for sad prosody.
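A short sketch of the spectrogram and power-variability computations described above, using the stated parameters (44100 samples/s, 2205-sample Hanning windows computed every 551 samples, dB scaling; coefficient of variation = standard deviation/mean). The voiced/unvoiced segmentation and F0 tracking came from the ESPS get_f0 tool and are not reproduced here; the function names below are illustrative.

```python
import numpy as np

def spectrogram_db(x, win_len=2205, hop=551):
    """Hanning-windowed DFT magnitude computed every hop samples, in dB."""
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return 20.0 * np.log10(mag + 1e-12)

def coefficient_of_variation(power):
    """Standard deviation divided by mean, as used for the power statistics."""
    power = np.asarray(power, dtype=float)
    return power.std() / power.mean()
```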
While listening to the first track of the CD, listeners identified
the emotional prosody in the original sentences from the Florida
Affect Battery [5] (control task), in which frequency resolution
was not modified. For this track only, an additional group of
12 listeners (six females, six males) was tested, for a total of
24 listeners. Successive tracks presented the same sentences, in
random order, processed to one of the various frequency resolutions. While listening to half of the remaining tracks, listeners
identified which sentence was spoken, based upon the words
in the sentence, without regard for the intonation. In the other
tracks, they identified the emotion expressed by the speaker,
without regard for the words in the sentence. Listeners adjusted
the volume to a comfortable level while listening to the recorded
task instructions, and could listen to the instructions more than
once if necessary.
TABLE III
STIMULUS PRESENTATION ORDER. STIMULUS PRESENTATION ORDER
WAS RANDOMLY COUNTERBALANCED FOR TASK AND SPECTRAL
RESOLUTION. THE UNMODIFIED CONTROL TASK WAS ALWAYS PRESENTED
FIRST. "FEW BANDS" REFERS TO 1, 2, 4, AND 8 BANDS, AND "MANY BANDS" REFERS TO 16, 32, AND 64 BANDS
TABLE IV
EMOTIONAL PROSODY RECOGNITION: CONTROL PERFORMANCE. EMOTIONAL
PROSODY HIT RATE AND FALSE ALARM RATE
Fig. 6. Sentence recognition. The vertical axis represents the overall percent of
correct choices for 12 observers. The horizontal axis represents the number of
frequency bands in reduced spectral resolution sentences. The smooth curve is a fit of a modified logistic function constrained to range between chance (23%) and 100% correct performance. Vertical error bars show the standard error of
the observations. Chance performance is denoted by a horizontal dashed line.
Shaded regions are approximations to the ERB regions shown in Fig. 2.
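The paper does not give the exact form of its modified logistic function, but a logistic in log2(number of bands) whose asymptotes are pinned to chance and 100% correct conveys the idea; the parameterization and parameter names below are assumptions, not the authors' fit.

```python
import numpy as np

def constrained_logistic(n_bands, midpoint, slope, chance):
    """Percent correct vs. number of bands: a logistic in log2(bands) whose
    lower asymptote is chance performance and upper asymptote is 100%."""
    x = np.log2(np.asarray(n_bands, dtype=float))
    p = 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))
    return 100.0 * (chance + (1.0 - chance) * p)

# e.g., for sentence recognition with chance = 0.23:
# constrained_logistic([1, 2, 4, 8, 16, 32, 64], midpoint=1.5, slope=2.0, chance=0.23)
```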
TABLE V
EMOTIONAL PROSODY PERFORMANCE: HITS AND FALSE ALARMS. EMOTIONAL
HIT RATE (HIT) AND FALSE ALARM RATE (FA) FOR RECOGNITION OF
EACH EMOTIONAL PROSODY, FOR EACH FREQUENCY RESOLUTION. LETTER
SUPERSCRIPTS CORRESPOND TO POINTS MADE IN THE TEXT
Fig. 7. Emotional prosody recognition. The vertical axis represents the overall
percent of correct choices for 12 observers. Chance performance, 20%, is shown
by a dashed line. The sigmoidal curves display fits of a modified logistic function constrained to range between chance and 100% correct performance. The
smooth curve is based on all data items. Circles and vertical error bars show the
mean performance for all listeners, and the standard error of the means. Horizontal error bars show the range from the 25th percentile to the 75th percentile of
curves for individual observers. The dashed curve is based on restricted analysis
including items with happy, angry, and neutral prosody only. Shaded regions
are approximations to the ERB regions, as explained in Fig. 6.
Fig. 8. Emotional prosody recognition by emotional category. Each curve represents the overall percent of correct choices for 12 observers, considering only observations of sounds from a single emotional category. Chance performance,
20%, is shown by a dashed line. Smooth curves display fits of a modified logistic
function constrained to range between chance and 100% correct performance.
Shaded regions are approximations to the ERB regions, as explained in Fig. 6.
H, happy; S, sad; A, angry; F, fearful; N, neutral.
the analysis was restricted to items with happy, angry, and neutral prosody.
At the highest frequency resolutions, the pattern of prosody
recognition performance was similar to the performance pattern
with control sentences (Table V- ). With both unprocessed and
vocoded speech, the most accurately identified emotional intonations were happy and angry (Table V- ). In all cases, neutral
intonation was recognized at an intermediate level. In all cases,
the false alarm rate for sad and fearful intonation is relatively
high, and sad prosody recognition performance was relatively
low. An interesting difference was that with unprocessed sentences, the lowest emotional prosody recognition performance
was observed with fearful prosody, while with vocoded speech the
lowest performance was observed with sad speech (Table V- ).
Recognizing happy intonation depended strongly on frequency resolution (Fig. 8). Recognition of happy intonation rose abruptly from near chance to relatively accurate at four bands.
Considering the hit and false alarm rates together suggests a
statistical choice bias against identifying emotional prosody
as happy. At low frequency resolution (1 and 2 bands), the
hit rate for happy was less than chance, while the false-alarm
Fig. 9. Neuronal response compared to time-frequency representation. This illustrates the response of one neuronal recording to an ensemble of frequency-shifted sounds, overlaid on a filter-bank based time-frequency representation of the sequence "four four seven seven." Each row of black hash marks represents the response to one of the frequency-shifted sounds, with hash marks placed at the times of action potential responses. The raster shows how a population of neurons with the same dynamical properties as the observed recording, but different frequency tuning, would respond to the original sound, if the population could be observed directly.
Fig. 10. Example neuronal response compared to time-frequency representation and STRF prediction. This illustrates the response of one neuronal recording to an ensemble of frequency-shifted sounds derived from the vocal sequence "erase enter seven seven." A) Direct comparison of the response to the time-frequency representation of the stimulus, in the manner of Fig. 9. B) STRF estimated from the responses to the stimulus ensemble. C) STRF-based prediction of the response to the stimulus ensemble. D) Recorded response overlaid on the STRF-based prediction.
Fig. 11. Another example neuronal response compared to time-frequency representation and STRF prediction. A second example recording, with the same layout as Fig. 10.
the voice or speech. In the examples presented here, the unvoiced to voiced transition is associated with this effect.
The main perceptual findings of this report are 1) in signals
processed by a noise-excited vocoder, frequency resolution similar to auditory perceptual frequency resolution is both necessary and sufficient to support emotional prosody identification,
and 2) it is possible for emotional prosody to be identified within
normal performance limits when speech is transformed in a way
that does not preserve temporal fine-structure. Also of note is
that angry prosody is identified at levels well above chance even
when a one-band filter-bank is employed, and happy prosody is
identified almost perfectly with frequency resolution somewhat
less than auditory perceptual frequency resolution. These observations suggest that although emotional prosody identification may be ultimately limited by frequency resolution, limited
The possibility remains that if the task in the present study had been more difficult, normal prosody identification might have
required higher frequency resolution, or might not have been
attainable with speech processed by a noise-excited vocoder.
In the present and in related studies [7], [12], [29], the
(simulated) emotional intonations were validated to be good
exemplars of the intended emotion by age-matched groups
of normal listeners [5]. The experimental tokens employed by
Lieberman and Michaels [29] were validated by the judgments
of a group of linguistically naive listeners. The experimental
tokens employed by Smith et al. [7] were validated by a group
of final-year speech and language therapy students. The tokens
employed by Lakshminarayanan et al. [12] were validated by
the judgments of a group of undergraduate students. Tokens in
which the emotional prosody is easily identified are appropriate
for such studies, because ease of identification allows perceptual judgments to be limited principally by acoustic quality, not
by the cognitive load of emotional processing. Even though the
baseline emotional prosody identification task is easy, it is still
more difficult than the baseline sentence identification task.
Emotional prosody identification clearly requires higher audio
fidelity than sentence identification, but the difference in task
difficulty limits our ability to quantify this difference.
B. Fine-Structure and Emotional Prosody Perception
Since the noise-excited vocoder eliminates temporal
fine-structure, these data show that fine-structure is not required for emotional prosody identification. Although similar
speech and emotional prosody recognition was obtained using
transformed sounds, the sounds were obviously distorted versions of the original [5]. Filter-banks with more bands than
necessary to achieve emotional prosody recognition generate
sounds with improved voice quality, and one has the impression
that it is easier to discriminate different speakers' voices.
C. Neuronal Responses to Shifted Sounds
Neuronal responses were compared directly to a filter-bank
based timefrequency representation, and to a linear prediction
of the response to the stimulus ensemble. The stimulus-evoked responses occasionally showed higher response-time precision for large-magnitude responses. The temporal spread of the response at these times was less than the duration over which a large-magnitude response was predicted by the STRF.
Such precisely timed responses were relatively insensitive to
the particular member of the stimulus ensemble, and qualitatively could be described as having an all-or-none character. At
other instances during the response, the timing and magnitude
of the response varied more as the stimulus changed, and
qualitatively had the appearance of a transient modulation of
neuronal firing rate, such as is employed in conventional neural
network models.
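A minimal sketch of the stationary linear prediction the recorded responses were compared against: the STRF is applied to the time-frequency representation of the stimulus by weighting each frequency channel at each time lag and summing, followed by rectification. The STRF estimation procedure itself is not specified in this excerpt, so only the prediction step is illustrated, with assumed array shapes and names.

```python
import numpy as np

def strf_prediction(strf, spectrogram):
    """Linear prediction of a response from a spectro-temporal receptive field.

    strf        : (n_freq, n_lags) weights over frequency and time lag.
    spectrogram : (n_freq, n_time) time-frequency representation of the stimulus.
    Returns a (n_time,) predicted rate, half-wave rectified.
    """
    n_freq, n_lags = strf.shape
    n_time = spectrogram.shape[1]
    pred = np.zeros(n_time)
    for lag in range(n_lags):
        # contribution of the stimulus 'lag' samples in the past
        pred[lag:] += strf[:, lag] @ spectrogram[:, :n_time - lag]
    return np.maximum(pred, 0.0)
```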
A necessary assumption of this approach is that the auditory
system processes each frequency band in the same way. This is
a good starting assumption, as it has supported the implementation of many successful auditory processing algorithms, such as
those employed in auditory prostheses and audio data compression algorithms, and is supported by many psychophysical experiments. Despite this, two observations suggest that it may not
added random noise. The observed error often takes the form of
little or no response in instances where a response is predicted
to occur, and such errors may be related to transitions and other
features of the speech signal. Since errors are not random, it is
reasonable to explore adaptive or nonstationary models. There
are many physiological mechanisms which could support such
adaptive or sequence-dependent processing within the central
nervous system, over a range of time scales [31]–[34]. Such
models could lead to improvements in hearing aid and cochlear
implant speech processors, and could prove to be critical for optimum performance of more central auditory prostheses.
ACKNOWLEDGMENT
The authors would like to thank C. Leonard, S. Nagarajan,
D. Poeppel, J. Sanchez, R. Shrivastav, and B. Wright for helpful
comments on an earlier version of this manuscript. They would
also like to thank M. Kvale for providing the spike sorting program. This study originated in the course of project NIDCD DC004523.
REFERENCES
[1] P. Bedenbaugh, "Auditory physiology: Cortical assistance for the auditory signals-to-symbols transformation," Current Biol., no. 4, pp. R127–R129, Feb. 2006.
[2] H. Lim, T. Lenarz, G. Joseph, R.-D. Battmer, J. Patrick, and M. Lenarz, "Effects of phase duration and pulse rate on loudness and pitch percepts in the first auditory midbrain implant patients: Comparison to cochlear implant and auditory brainstem implant results," Neuroscience, vol. 154, no. 1, pp. 370–380, Jun. 2008.
[3] S. S. Nagarajan, S. W. Cheung, P. Bedenbaugh, R. E. Beitel, C. E. Schreiner, and M. M. Merzenich, "Representation of spectral and temporal envelope of twitter vocalizations in common marmoset primary auditory cortex," J. Neurophysiol., vol. 87, pp. 1723–1737, 2002.
[4] S. W. Cheung, P. H. Bedenbaugh, S. S. Nagarajan, and C. E. Schreiner, "Functional organization of squirrel monkey primary auditory cortex: Responses to pure tones," J. Neurophysiol., vol. 85, pp. 1732–1749, 2001.
[5] D. Bowers, L. X. Blonder, and K. M. Heilman, Florida Affect Battery. Gainesville, FL: Univ. of Florida Brain Inst., Center for Neuropsychol. Studies, 1991.
[6] R. V. Shannon, F. G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid, "Speech recognition with primarily temporal cues," Science, vol. 270, no. 5234, pp. 303–304, Oct. 1995.
[7] M. Smith and A. Faulkner, "Effects of the number of speech-bands and envelope smoothing condition on the ability to identify intonational patterns through a simulated cochlear implant speech processor," Speech Hearing and Lang., Dept. of Phonetics and Linguist., Univ. College London, 2002, work in progress.
[8] T. Green, A. Faulkner, and S. Rosen, "Spectral and temporal cues to pitch in noise-excited vocoder simulations of continuous-interleaved-sampling cochlear implants," J. Acoust. Soc. Amer., vol. 112, no. 5, pp. 2155–2164, Nov. 2002.
[9] T. Green, A. Faulkner, and S. Rosen, "Enhancing temporal cues to voice pitch in continuous interleaved sampling cochlear implants," J. Acoust. Soc. Amer., vol. 116, no. 4, pp. 2298–2310, Oct. 2004.
[10] G. H. Monrad-Krohn, "Dysprosody or altered melody of language," Brain, vol. 70, pp. 405–423, 1947.
[11] K. R. Scherer, "Vocal affect expression: A review and a model for future research," Psychol. Bull., vol. 99, no. 2, pp. 143–165, 1986.
[12] K. Lakshminarayanan, D. B. Shalom, V. van Wassenhove, D. Orbelo, J. Houde, and D. Poeppel, "The effect of spectral manipulations on the identification of affective and linguistic prosody," Brain Lang., vol. 84, pp. 250–263, 2003.
[13] A. J. Oxenham and S. P. Bacon, "Cochlear compression: Perceptual measures and implications for normal and impaired hearing," Ear Hear., vol. 24, no. 5, pp. 352–366, Oct. 2003.
[14] S. Gatehouse and W. Noble, "The speech, spatial and qualities of hearing scale (SSQ)," Int. J. Audiol., vol. 43, no. 2, pp. 85–99, Feb. 2004.
Diana K. Sarko received the B.S. degree in neuroscience and behavioral biology from Emory University, Druid Hills, GA, and the Ph.D. degree in neuroscience from the University of Florida College of
Medicine, Gainesville.
She is currently a Postdoctoral Fellow in the
Department of Biology, Vanderbilt University,
Nashville, TN. She studies comparative neurobiology, cognition, and behavior, with particular focus
on neurobiology and its relationship to behavior and
cognition in unique species, such as the naked mole-rat, star-nosed mole, and manatee.