
Auditory-Visual Speech Perception and Auditory-Visual Enhancement in Normal-Hearing Younger and Older Adults

Mitchell S. Sommers, Nancy Tye-Murray, and Brent Spehar
Department of Psychology (MSS), Washington University, St. Louis, Missouri; and Central Institute for the Deaf (N.T. and B.S.), St. Louis, Missouri.
Objective: The purpose of the present study was to
examine the effects of age on the ability to benefit
from combining auditory and visual speech infor-
mation, relative to listening or speechreading
alone. In addition, the study was designed to com-
pare visual enhancement (VE) and auditory en-
hancement (AE) for consonants, words, and sen-
tences in older and younger adults.
Design: Forty-four older adults and 38 younger
adults with clinically normal thresholds for fre-
quencies of 4 kHz and below were asked to identify
vowel-consonant-vowels (VCVs), words in a carrier
phrase, and semantically meaningful sentences in
auditory-only (A), visual-only (V), and auditory-vi-
sual (AV) conditions. All stimuli were presented in a
background of 20-talker babble, and signal-to-bab-
ble ratios were set individually for each participant
and each stimulus type to produce approximately
50% correct in the A condition.
Results: For all three types of stimuli, older and
younger adults obtained similar scores for the A
condition, indicating that the procedure for indi-
vidually adjusting signal-to-babble ratios was suc-
cessful at equating A scores for the two age groups.
Older adults, however, had significantly poorer per-
formance than younger adults in the AV and V
modalities. Analyses of both AE and VE indicated no
age differences in the ability to benefit from com-
bining auditory and visual speech signals after con-
trolling for age differences in the V condition. Cor-
relations between scores for the three types of
stimuli (consonants, words, and sentences) indi-
cated moderate correlations in the V condition but
small correlations for AV, AE, and VE.
Conclusions: Overall, the findings suggest that the
poorer performance of older adults in the AV con-
dition was a result of reduced speechreading abili-
ties rather than a consequence of impaired integra-
tion capacities. The pattern of correlations across
the three stimulus types indicates some overlap in
the mechanisms mediating AV perception of words
and sentences and that these mechanisms are
largely independent from those used for AV percep-
tion of consonants.
(Ear & Hearing 2005;26:263–275)
Considerable evidence is now available to suggest
that speech intelligibility improves when listeners can
both see and hear a talker, compared with listening
alone (Grant, Walden, & Seitz, 1998; Sumby & Pollack,
1954). Moreover, the benefits of combining auditory
and visual speech information increase with the diffi-
culty of auditory-only (A) perception (Sumby & Pol-
lack, 1954). This increased performance for auditory-
visual (AV) compared with A or visual-only (V)
presentations is at least partially a result of comple-
mentary information available in the auditory and
visual speech signals (Grant & Seitz, 1998; Grant et
al., 1998; Summerfield, 1987). Thus, for example, if a
listener is unable to perceive the acoustic cues for place
of articulation, accurate intelligibility can be main-
tained if the speaker is visible because speechreading
provides an additional opportunity to extract place
information.
Grant et al. (1998) proposed a conceptual frame-
work for understanding the improved performance
for AV presentations, compared with either unimodal format (A or V), in which both peripheral and central mechanisms contribute to an individual's
ability to benefit from combining auditory and vi-
sual speech information. In the initial step of the
model, peripheral sensory systems (audition and
vision) are responsible for extracting signal-related
segmental and suprasegmental phonetic cues inde-
pendently from the auditory and visual speech sig-
nals. These cues are then integrated and serve as
input to more central mechanisms that incorporate
semantic and syntactic information to arrive at
phonetic and lexical decisions. This model is partic-
ularly useful for comparing the benefits of combin-
ing auditory and visual speech information across
different populations because it highlights that dif-
ferential benefits can result from changes in periph-
eral, central, or a combination of peripheral and
central abilities.
One of the difficulties in comparing the benefits
obtained from combining auditory and visual speech
information across different populations, however,
is that performance for unimodal presentations is
often different across the populations of interest. In
the present study, for example, we wanted to inves-
tigate the effects of age on the ability to combine
auditory and visual speech information. However,
this effort is complicated by well-documented de-
clines in both A and V speech perception as a
function of age (CHABA, 1988; Dancer, Krain,
Thompson, Davis, & Glenn, 1994; Honneil, Dancer, & Gentry, 1991; Lyxell & Rönnberg, 1991; Middel-
weerd & Plomp, 1987; Shoop & Binnie, 1979). Thus,
evidence for age differences in the ability to benefit
from combining auditory and visual speech signals
could be attributed to age-related differences in
auditory sensitivity, age-related differences in
speechreading, age differences in integrating audi-
tory and visual speech information, or some combi-
nation of these factors.
Although a few studies comparing AV scores in
older and younger adults have been designed to
minimize the influence of unimodal performance
differences (Cienkowski & Carney, 2002; Walden et
al., 1993), other methodological concerns make their
findings difficult to interpret. In one investigation,
for example, Walden et al. (1993) compared differ-
ences between AV, V, and A performance for conso-
nant-vowels (CVs) and sentences in middle-aged (35
to 50 years of age) and older (65 to 80 years of age)
adults. For the sentences (but not for the CVs),
testing was conducted in the presence of speech-
shaped noise and noise levels were adjusted to
obtain approximately 40 to 50% correct in the A
condition. Visual enhancement for sentences, de-
fined as the difference between the AV and A condi-
tions, did not differ between the two groups. V scores
for sentences were significantly lower for the older
(16.7%) than for the middle-aged adults (34.4%),
preventing an analysis of age differences under
conditions of equivalent unimodal performance.
Moreover, scores in the AV condition for sentences
were near ceiling for both groups (93.8 and 92% for
the middle-aged and older adults, respectively),
making it difficult to interpret the null effects of age.
Cienkowski & Carney (2002) minimized the con-
tribution of presbycusis in an AV perception task by
comparing normal-hearing younger and older adults
on susceptibility to the McGurk effect (McGurk &
MacDonald, 1976). In the McGurk effect, partici-
pants are presented with discrepant auditory and
visual information and often perceive a fused re-
sponse that differs from both inputs. For example,
participants might be presented with an auditory
VCV containing a medial bilabial stop (e.g., /aba/)
while simultaneously viewing a face articulating a
VCV with a medial velar stop (e.g., /aga/). On a
certain percentage of trials, listeners will report
hearing a VCV with a medial alveolar place of
articulation (e.g., /ada/ or /aða/) that represents a
fusion of the auditory and visual inputs. Cien-
kowski & Carney found that the percentage of fused
responses did not differ between normal-hearing
older and younger adults and suggested that this
finding argued against age-related declines in audi-
tory-visual integration. The results of the Cien-
kowski & Carney study provide indirect evidence for
similar enhancement from combining auditory and
visual speech information as a function of age but
because their study was not designed to assess
enhancement specifically, they did not include mea-
sures of unimodal performance. Consequently, it is
not clear whether the comparable susceptibility to
the McGurk effect in the two age groups was asso-
ciated with equivalent A and V abilities.
One purpose of the present study, therefore, was
to examine the effects of age on the ability to benefit
from combining auditory and visual speech informa-
tion after minimizing differences in unimodal per-
formance. Toward this end, we tested normal-hear-
ing younger and older adults under conditions that
produced similar levels of A performance and that
also avoided ceiling effects in the AV condition. In
addition, we computed both visual enhancement
(VE, the benefit obtained from adding a visual signal
to an auditory stimulus) and auditory enhancement
(AE, the benefit obtained from adding an auditory
signal to a visual-only stimulus) as a means of
examining age differences in the ability to combine
auditory and visual speech information after nor-
malizing for any differences in unimodal perfor-
mance (see Methods section for additional details on
computing these two measures).
A second goal of the study was to investigate age
differences in AE and VE for consonants, words, and
sentences. In general, only small to moderate corre-
lations have been observed between VE for stimuli
differing in semantic context (e.g., nonsense sylla-
bles, isolated words, and meaningful sentences),
suggesting that the mechanisms mediating en-
hancement may differ as a function of the amount of
semantic or lexical information available in the
stimulus (Grant & Seitz, 1998; Grant et al., 1998).
Furthermore, to our knowledge, there have been no
systematic investigations of the relationship be-
tween AE or VE for speech stimuli differing in
semantic content, although a small number of inves-
tigations (Dekle, Fowler, & Funnell, 1992; Sams,
Manninen, Surakka, Helin, & Kättö, 1998) have
compared AV integration for different stimulus
types using the McGurk effect. In the study most
similar to the current investigation, Sams et al.
(1998) found considerable variability in participants' susceptibility to the McGurk effect in Finnish as a
function of semantic context (consonants, isolated
words, words in sentences). Thus, similar to results
with A identification (Pichora-Fuller, Schneider, &
Daneman, 1995; Sommers & Danielson, 1999) and
measures of auditory-visual integration with the
McGurk effect (Sams et al., 1998), the effect of age on VE and AE may differ as a function of stimulus type.
The current study was therefore designed to exam-
ine whether any observed age effects on VE and AE
would be modulated by the amount of lexical and
semantic information available.
METHODS
Participants
Thirty-eight younger adults (mean age = 20.1 years, SD = 2.1) and forty-four older adults (mean age = 70.2 years, SD = 6.8) served as participants. Younger adults
were all students at Washington University and
were recruited through posted advertisements.
Older adults were all community-dwelling residents
and were recruited through a database maintained
by the Aging and Development Program at Wash-
ington University. Testing required three 2.5-hour
sessions that were conducted on separate days. All
participants reported that English was their first
language and that they had never had any lipread-
ing training. Participants were paid $10/hour for
taking part in the experiments. All participants
were screened for CNS dysfunctions, using an exten-
sive medical history questionnaire that asked about
significant CNS events, including stroke, open or
closed head injury, concussions, any event in which
the participant was rendered unconscious, dizzi-
ness, and current medications. In addition, partici-
pants were asked about conditions for which they
were currently being treated. Any participant with a
history of CNS disorders, who was currently being
treated for a CNS condition or who was currently
taking drugs that affect CNS activity, was excluded.
Verbal abilities were assessed using the vocabulary
subtest of the Wechsler Adult Intelligence Scale,
with a maximum score of 70. Mean scores for older and younger adults were 55.3 (SD = 8.7) and 46.2 (SD = 4.2), respectively. An independent-samples t-test indicated that vocabulary scores were significantly higher for older than for younger adults, t(81) = 5.5, p < 0.001. Older participants were also
screened for dementia using the Mini-Mental State Examination (Folstein, Folstein, & McHugh, 1975). Partic-
ipants who scored below 24 (of 30) were excluded
from further testing.
Participants were also screened for vision and
hearing before testing. Participants whose visual acuity (corrected, if applicable), as assessed with a Snellen eye chart, was poorer than 20/40 were excluded from par-
ticipating to minimize the influence of reduced vi-
sual acuity on the ability to encode visual speech
information. Visual contrast sensitivity was mea-
sured using the Pelli-Robson contrast sensitivity
chart (Pelli, Robson, & Wilkins, 1988), and participants whose score was below 1.8 were also excluded
from further participation. Pure-tone air-conduction
thresholds were obtained for all participants at
octave frequencies from 250 to 4000 Hz, using a
portable audiometer (Beltone 110) and headphones
(TDH 39). Any participant whose threshold ex-
ceeded 20 dB HL (American National Standards
Institute, 1989) was excluded from participating.
Participants with asymmetric hearing losses, opera-
tionalized as greater than a 10-dB threshold differ-
ence between the two ears at any of the test frequen-
cies, were also excluded. Pure-tone averages (500, 1000, and 2000 Hz) for younger adults were 0.61 (SD = 4.3) and 0.67 (SD = 4.2) dB HL for the left and right ears, respectively. Corresponding values for older adults were 14.1 (SD = 6.6) and 13.2 (SD = 6.3) dB HL. A comparison of pure-tone averages revealed that thresholds were significantly greater for older than for younger adults (left ear: t(81) = 10.4, p < 0.001; right ear: t(81) = 10.1, p < 0.001). It
should also be noted that although we did not obtain
threshold measures at frequencies higher than 4000
Hz, based on previous studies of age-related hearing
loss (Morrell, Gordon-Salant, Pearson, Brant, &
Fozard, 1996), the older adults almost certainly had
clinically significant hearing losses (e.g., greater
than 20 dB HL) at frequencies above 4000 Hz. Thus,
despite meeting the criteria for normal hearing for
frequencies through 4 kHz, older adults had signif-
icantly poorer hearing than younger adults.
Stimuli and Procedures
Participants were presented with consonants,
words (in a carrier phrase), and sentences in A, V,
and AV conditions. All participants were first tested
on consonants, followed by words and then sen-
tences. Within each stimulus type, however, testing
in the A, V, and AV conditions was counterbalanced
such that approximately equal numbers of partici-
pants from each age group were tested in each
possible order of testing modality. All testing was
conducted in a double-walled sound attenuating
chamber (IAC, 4106). Stimuli were presented via a
PC (Dell 420) equipped with a Sound Blaster Live
audio card and Matrox (Millennium G400 Flex) 3D
video card. Auditory stimuli for the A and AV
conditions were presented binaurally in a multi-
talker background babble (see below for details on
babble level) over headphones (Sennheiser HD 265).
Signal level remained constant at 60 dB SPL for the
A and AV conditions, as measured in a 6-cc supra-
aural flat plate coupler using the A-weighting scale
(Brüel & Kjær 2231). Testing levels were evaluated
against calibrated levels before every test using an
in-line RMS meter. Visual stimuli for the V and AV
conditions were presented on a 17-inch touch screen
monitor with participants sitting approximately
0.5 m from the display. The video image completely
filled the 17-in monitor. Participants viewed the
head and neck of each talker as they articulated the
stimuli.
Setting the Background Babble Level
As noted, one goal of the present study was to
investigate age differences in VE and AE under
conditions in which A performance was similar
across groups and that did not produce ceiling level
scores in the AV condition. To accomplish this goal,
all testing was conducted in a multitalker back-
ground babble and signal-to-babble levels were set
individually for each participant and stimulus con-
dition (consonants, words, and sentences) so as to
produce approximately 50% correct in the A condi-
tion. For the words and sentences, the stimuli used
to establish the background babble level for a given
condition were the same type as those used in the
corresponding test condition (e.g., when words were
used as test items, the background babble level was
established using words). None of the stimuli used
for setting the background babble levels for words
and sentences were repeated in the test phase (see
below for details on stimuli used for setting back-
ground babble level with consonants). The multi-
talker babble was captured from the Iowa Audiovi-
sual Speech Perception Laserdisc (Tyler, Preece, &
Tye-Murray, 1986) using 16-bit digitization and a
sampling rate of 48 kHz.
Babble level was set independently for each par-
ticipant and each stimulus type using a modified
version of the procedure for establishing speech
reception thresholds (American Speech and Hearing
Association, 1988). Briefly, in the first phase of the procedure, a starting babble level was established by initially setting the signal-to-babble ratio to +20 dB. Appropriate stimuli (consonants, words, or sentences) were then presented as the babble level was increased in 10-dB steps until the participant's first incorrect response. The starting babble level for the remainder of the tracking was then set 10 dB below the level of the first incorrect response. In the next phase, babble level was incremented from that starting level in 2-dB steps (with two stimuli at each level) until the participant responded incorrectly on five of six trials. The babble level for testing was then established by subtracting the total number of correct responses from the starting level and adding a correction factor of 1.
Mean signal-to-babble ratios for the three stimulus types were as follows: consonants (younger, M = −9.7, SD = 2.1; older, M = −4.8, SD = 3.7); words (younger, M = −8.6, SD = 1.2; older, M = −6.5, SD = 1.5); sentences (younger, M = −7.5, SD = 0.9; older, M = −5.6, SD = 1.5). Independent-measures t-tests indicated that older adults were tested at significantly higher signal-to-babble ratios than younger adults for all three types of stimuli (p < 0.001 for all comparisons).
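To make the two-phase tracking rule concrete, the logic can be sketched in a few lines of code. The sketch below is a minimal illustration of the procedure as described above, not the software used in the study; present_trial is a hypothetical callback standing in for presenting one stimulus at a given babble level and scoring the response.

# Minimal sketch of the two-phase babble-tracking procedure, assuming
# a 60 dB SPL signal and a +20 dB signal-to-babble starting point (as
# in the text). `present_trial(babble_db)` is a hypothetical callback
# that plays one stimulus and returns True if the response was correct.

def set_babble_level(present_trial, signal_db=60.0, start_sb_db=20.0):
    # Phase 1: raise the babble in 10-dB steps until the first
    # incorrect response.
    babble_db = signal_db - start_sb_db
    while present_trial(babble_db):
        babble_db += 10.0

    # Phase 2 begins 10 dB below the level of the first error.
    start_db = babble_db - 10.0
    babble_db = start_db
    outcomes = []  # correctness of every phase-2 trial, in order
    while True:
        for _ in range(2):  # two stimuli at each level
            outcomes.append(bool(present_trial(babble_db)))
        if len(outcomes) >= 6 and outcomes[-6:].count(False) >= 5:
            break  # incorrect on five of the last six trials
        babble_db += 2.0  # next, harder, 2-dB step

    # Test level: starting level minus the total number of correct
    # responses, plus a 1-dB correction factor (ASHA-style computation).
    return start_db - sum(outcomes) + 1.0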
Consonants
Participants received 42 repetitions of 13 conso-
nants in an /iCi/ context. The consonants tested
were: /m/, /n/, /p/, /b/, //, /t/, /d/, /g/, /f/, /v/, /z/, /s/, and
/k/ (e.g., participants were presented with /imi/, /ini/,
etc.). The same male talker produced all of the
stimuli for testing consonants. The stimuli were
digitized from existing Laserdisc recordings of the
Iowa Consonant Test (Tyler et al., 1986) by connect-
ing the output of the Laserdisc player (Laservision
LD-V8000) into a commercially available PCI inter-
face card for digitization (Matrox RT2000). Acquisi-
tion was controlled by software (Adobe Premier).
Video capture was 24-bit, 720 × 480 in NTSC-
standard 4:3 aspect ratio and 29.97 frames per
second to best match the original analog media.
Audio was captured at 16 bits and a sampling rate of
48 kHz.
Consonant identification in the A, V, and AV
conditions was measured using a 13-item closed-set
test. Testing order was counterbalanced such that
each modality was presented first, second, and third
equally often to approximately the same number of
participants in each age group. To familiarize par-
ticipants with the test stimuli, the 13 test conso-
nants were first presented (in the /iCi/ context)
using the A condition without any background bab-
ble. Participants were instructed to respond by
pressing the appropriate response area on a touch-
screen monitor (ELO ETC-170C). Once participants
were able to identify all 13 consonants presented in
quiet correctly, the background babble was added
for all conditions (including V) at the level estab-
lished during pretesting (i.e., at the level yielding
signal-to-babble ratios designed to produce approx-
imately 50% correct A consonant performance). Par-
ticipants then received 42 presentations of each
consonant with presentation order determined pseu-
dorandomly for each participant. No feedback was
provided during testing.
Words
Stimulus materials for testing words were digi-
tized from analog recordings of the Children's Audiovisual Enhancement Test (CAVET) (Tye-Mur-
ray & Geers, 2001), using the same equipment and
procedures as for the consonants. The CAVET consists of three lists of 20 words and a practice list; each item consists of the carrier phrase "say the word" followed by a one- to three-syllable target word. Conse-
quently, scores for each of the individual conditions
(A, V, and AV) were based on a relatively small
number of presentations (20 in each condition) com-
pared with either the consonants or sentences. De-
spite this potential limitation, we elected to use the
CAVET for several reasons. First, the CAVET was
designed specifically to avoid floor and ceiling level
performance in the V condition. Second, each list is
considered equally difficult to speechread and uses
highly familiar words. The same female talker pro-
duced all stimuli. Participants were told the carrier
phrase and were instructed to identify the word after
the phrase. Lists were counterbalanced across the
three presentation modalities (A, V, and AV) such that
approximately equal numbers of older and younger
adults received each list in a given condition.
Before testing, participants were informed that
they would see, hear, or both see and hear a talker articulating the carrier phrase "say the word" followed
by a target word. Participants were told to say the
target word aloud and scoring was based on exact
phonetic matches (i.e., adding, deleting, or substi-
tuting a single phoneme counted as an incorrect
response). If participants were unsure of the target
word, they were encouraged to guess. Participants
received three practice trials in each condition (A, V,
and AV) before testing and none of the practice
words appeared during the actual testing. All test-
ing was performed using signal-to-babble levels es-
tablished for words during pretesting. Participants
received a total of 60 trials (20 each in the A, V, and
AV conditions) with modality test order counterbal-
anced across participants. Within a given test con-
dition (A, V, or AV) presentation order of the indi-
vidual stimuli was determined pseudorandomly.
Sentences
Sentences were digitized from Laserdisc record-
ings of the Iowa Sentence Test (Tyler et al., 1986),
using the same equipment and procedure as was
used for the consonants and words. One hundred
sentences (five lists of 20 sentences each) were
digitized. Two of the lists were used for practice
stimuli and the remaining three were used for
testing. Each sentence in a list was produced by a
different talker (half men, half women). Thus,
within a list participants saw and/or heard a new
talker on every trial. Lists were counterbalanced
across participants such that each of the three
lists was presented an equal number of times in
the A, V, and AV conditions. All testing was conducted
at signal-to-babble ratios established for sentences
during pretesting. After the sentence was presented,
participants were instructed to repeat as much of the
sentence as possible and were again encouraged to
guess if they were unsure of one or more words in the
sentence. Scoring was based on five to seven key words
presented in each sentence.
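As a concrete illustration of this keyword-based scoring rule, the toy function below credits a response with each key word it contains. It is a sketch for clarity (simple string matching on a hypothetical response), not the authors' scoring procedure.

def score_sentence(response, key_words):
    """Return the proportion of key words repeated in the response."""
    said = set(response.lower().split())
    hits = sum(1 for w in key_words if w.lower() in said)
    return hits / len(key_words)

# Example: 4 of 5 key words repeated -> 0.8
print(score_sentence("the girl dropped her red book",
                     ["girl", "dropped", "red", "book", "today"]))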
Calculating VE and AE
In the present study, VE was calculated relative to an individual's A performance (expressed as proportion correct) according to the following equation:

VE = (AV − A)/(1 − A)
This measure of VE has been used in several inves-
tigations of AV performance (Grant & Seitz, 1998;
Grant et al., 1998; Rabinowitz, Eddington, Del-
horne, & Cuneo, 1992) because it provides a method
for comparing VE across a wide range of A and V
scores. For example, an individual scoring 80% in A
presentations and 90% with AV presentations would
have the same enhancement score as an individual
scoring 20% in A and 60% in AV despite large
absolute differences in overall A and AV perfor-
mance. Furthermore, this equation avoids the bias
inherent in the simple difference score (AV − A), in
which higher values of A necessarily lead to lower
values of enhancement.
Although less common, we also calculated AE
according to the following equation:

AE = (AV − V)/(1 − V)
In the present study, AE is less likely to be affected
by differences in unimodal performance than the
more traditional measure of VE because it normal-
izes for any age differences in V performance (recall
that different signal-to-babble ratios were used to
minimize age differences in A performance).
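The two enhancement measures are straightforward to compute. The sketch below simply implements the two equations above (all scores as proportions correct) and reproduces the worked example from the VE discussion, in which two listeners with very different absolute scores obtain the same VE of 0.5.

def visual_enhancement(av, a):
    # VE = (AV - A) / (1 - A): AV benefit relative to the headroom above A.
    return (av - a) / (1.0 - a)

def auditory_enhancement(av, v):
    # AE = (AV - V) / (1 - V): AV benefit relative to the headroom above V.
    return (av - v) / (1.0 - v)

print(visual_enhancement(av=0.90, a=0.80))  # (0.9 - 0.8)/(1 - 0.8) = 0.5
print(visual_enhancement(av=0.60, a=0.20))  # (0.6 - 0.2)/(1 - 0.2) = 0.5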
To further minimize differences in unimodal per-
formance in comparing both VE and AE as a func-
tion of age, we used analysis of covariance (AN-
COVA) to control for age differences in A or V scores.
That is, when comparing AE as a function of age we
used A scores as a covariate to control for any
differences in A performance that were not elimi-
nated by using different signal-to-babble ratios.
Similarly, when comparing VE as a function of age
we used V scores as a covariate to control for any age
differences in V performance.
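For readers who want to reproduce this kind of analysis, one such ANCOVA can be run with statsmodels as sketched below. The data frame and column names are illustrative placeholders, not the study's data files.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy data: AE scores for two age groups, with A scores as the covariate.
df = pd.DataFrame({
    "age_group": ["young", "young", "young", "older", "older", "older"],
    "a_score":   [0.52, 0.49, 0.51, 0.50, 0.48, 0.53],
    "ae":        [0.45, 0.50, 0.48, 0.44, 0.47, 0.49],
})

# AE as a function of age group, controlling for residual A differences.
model = ols("ae ~ C(age_group) + a_score", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))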
RESULTS AND DISCUSSION
Performance in A, V, and AV Conditions
Figure 1 displays mean percent correct for older
and younger adults as a function of both stimulus
type and presentation modality. A three-way,
mixed-design analysis of variance was used to ex-
amine main effects and interactions for age, stimu-
lus type, and presentation modality. Age (younger
versus older) was treated as an independent-mea-
sures variable and stimulus type (consonants,
words, sentences) and presentation modality (audi-
tory, visual, and auditory-visual) were treated as
repeated-measures variables. Overall, performance
differed significantly across stimulus type [F(2, 164) = 10.8, p < 0.001]. Tukey honestly significant difference post hoc pairwise comparisons with a Bonferroni correction for multiple comparisons indicated significant differences between identification scores for all three types of stimuli, such that consonants (56.1%) were identified significantly better than words (51.1%) and words were identified significantly better than sentences (40%) (p < 0.001 for all comparisons). Significant differences were also observed for presentation modality [F(2, 164) = 12.3, p < 0.001], with scores for AV presentations (78.1%) significantly higher than for A (47.8%) and scores for A significantly higher than for V (21.9%) (p < 0.001 for all comparisons). Recall, however,
that A scores were established using signal-to-bab-
ble ratios designed to produce approximately 50%
correct in this condition. Therefore, although the
data indicate that the procedure for establishing appropriate signal-to-babble levels in A was successful (overall A performance was very close to
50%), the measures do not reflect true group differ-
ences in A performance. Finally, older adults scored significantly poorer overall (46.7%) than younger adults [51.6%; F(1, 82) = 16.2, p < 0.001].
Of particular interest to the present investigation
is that the three-way interaction of age × stimulus type × presentation modality was significant [F(4, 328) = 2.5, p < 0.05]. A series of Tukey honestly significant difference post hoc comparisons with a Bonferroni correction for multiple comparisons indicated no significant age differences in A performance for the three types of stimuli (p > 0.9 for all
comparisons). This finding provides additional evi-
dence that the procedure for equating A perfor-
mance levels was effective across age groups and
stimulus types. In the AV condition, however, older
adults exhibited significantly poorer identification
scores than younger participants for consonants and
words (p < 0.05 for both comparisons) but not for sentences. Older adults also exhibited significantly poorer V scores for both consonants (p < 0.001) and words (p < 0.05). The difference in V performance for sentences was not significant (p > 0.5),
but this finding should be interpreted cautiously
because absolute performance levels were relatively
low for both older and younger adults. Post hoc
comparisons of V performance as a function of stim-
ulus type for older and younger adults indicated that
younger adults had similar V scores for consonants
and words (p > 0.3) but significantly lower scores for sentences (p < 0.001 for the differences between consonants and sentences and between words and sentences). For older adults, V performance was significantly better for words than for consonants and significantly better for consonants than for sentences (p < 0.001 for all comparisons).

Fig. 1. Percent correct identification of consonants (left), words (middle), and sentences (right) for younger (filled bars) and older (open bars) adults. Presentation modality refers to visual-only (V), auditory-only (A), and auditory-visual (AV). Error bars indicate standard deviation.
Comparison of VE and AE
As noted, evidence for age differences in the
ability to extract visual speech information is a
critical consideration when evaluating age-related
changes in both VE and AE. In the current study,
VE was examined using three separate ANCOVAs
(one for each stimulus type). In all analyses, age
(younger, older) served as an independent-measures
variable and the appropriate V condition served as
the covariate (i.e., performance in the consonant V
condition served as a covariate when comparing
consonant VE for older and younger adults). The left
panel of Figure 2 displays the adjusted means for VE
from the ANCOVAs as a function of age and stimu-
lus type. None of the differences between older and
younger adults for VE reached statistical significance: for consonants, F(1, 81) = 2.1, p = 0.15; for words, F(1, 81) = 1.4, p = 0.23; for sentences, F(1, 81) < 1, p = 0.93.
Analyses paralleling those for VE were also con-
ducted to examine age differences in AE. Age again
served as an independent-measures variable and
the corresponding A scores (rather than V scores)
served as the covariates. The right panel of Figure 2
displays the adjusted means for AE as a function
of stimulus type and age. Separate ANCOVAs
were conducted to examine age differences in AE
for the three stimulus types, and none of the comparisons reached statistical significance (all F values less than 1).
Correlations Between Measures
Correlations Between V Performance for Consonants, Words, and Sentences
In addition to
examining age-related changes in AE and VE, the
present study was also designed to investigate the
relationship between performance in the V and AV
modalities as a function of stimulus type. Figure 3
displays the relationships between V performance
for consonants, words, and sentences as a function of
both age and stimulus type. Overall, correlations
were higher for younger than for older adults and
were stronger when correlating V for words and
sentences than for either consonants and words or
consonants and sentences. All correlations were significant at the 0.01 level except for the correlation between words and sentences for younger adults (p < 0.001). The correlation between consonants and words for older adults (p = 0.09) was not significant.
These findings suggest that V performance for the
three types of stimuli is mediated, at least in part,
by a set of common mechanisms that are a critical
component of speechreading and that operate independently of lexical or semantic constraints.
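For concreteness, correlations of this kind can be computed as sketched below. The arrays are placeholders standing in for per-participant V scores on two stimulus types, not the study's data.

import numpy as np
from scipy.stats import pearsonr

v_consonants = np.array([0.30, 0.42, 0.25, 0.38, 0.33, 0.41])
v_words      = np.array([0.28, 0.45, 0.22, 0.35, 0.30, 0.44])

r, p = pearsonr(v_consonants, v_words)  # Pearson product-moment r
print(f"r = {r:.2f}, p = {p:.3f}, shared variance = {r**2:.0%}")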
Correlations Between Measures of AV Performance
Figure 4 displays scatterplots and correla-
tion coefficients for AV performance as a function of
stimulus type and age group. In contrast to the
findings with V scores, only one correlation, that between AV performance for words and sentences in younger participants, was significant (p < 0.01). However, the magnitude of the correlation (r = 0.46) was relatively modest in that AV performance for words accounted for just slightly more than 20% of the variance (0.46² ≈ 0.21) in AV performance with sentences.

Fig. 2. Left, Adjusted means for visual enhancement for younger (filled bars) and older (open bars) adults as a function of stimulus type. Right, Same as in left panel except data are for auditory enhancement. Error bars indicate standard deviation.

For older adults, there was a positive
relationship between AV performance with words
and sentences but this did not reach statistical
significance. Correlations between consonants and
words and between consonants and sentences did
not reach significance (and were actually negative
for younger participants). One implication of these
findings is that AV performance for consonants is
mediated by mechanisms that are distinct from
those used to identify words and sentences. For
younger adults, the moderate correlation between
words and sentences indicates some overlap in the
mechanisms used to identify these two types of
stimuli when both auditory and visual speech infor-
mation is available. However, the relatively small
correlations obtained for older adults across all three stimulus types suggest that they may rely on
different mechanisms for AV perception of conso-
nants, words, and sentences.
Relationship Between Measures of VE and AE
Figures 5 and 6 display the scatterplots and
correlation coefficients for VE and AE between con-
sonants, words, and sentences as a function of age
group. Overall, the findings are similar to those
obtained with AV presentations in that the stron-
gest positive correlations obtained were between
words and sentences. For VE, the correlation be-
tween consonants and sentences was significant for older adults (p < 0.05) but not for younger adults, who, similar to the results for AV presentations, exhibited a nonsignificant (p = 0.27) negative correlation between VE for consonants and sentences.
For AE, the correlation between consonants and
sentences was not significant for older adults but
was significant and negative for younger adults.
Note that, similar to the significant correlations observed in the AV condition, the magnitudes of the correlations for both VE and AE were small to moderate, accounting for 8 to 14% of the
variance. Also similar to the AV results, older adults
demonstrated a small, but nonsignificant, positive
correlation for both VE and AE between consonants
and words but younger adults exhibited a small
negative relationship between these two stimulus
types. Taken together, these findings suggest that
the mechanisms mediating VE and AE for conso-
nants are largely distinct from those mediating VE
and AE for words and sentences, and this is true for both older and younger adults.

Fig. 3. Scatterplots and Pearson product-moment correlations between visual-only scores for consonants and words (left), consonants and sentences (middle), and words and sentences (right). Solid circles and clear triangles represent data for younger and older adults, respectively. The darker solid line shows the best-fitting regression line for the younger adults and the lighter line shows the best-fitting regression line for the older adults. Single (*), double (**), and triple (***) asterisks indicate significance levels of 0.05, 0.01, and 0.001, respectively.

The small but signif-
icant correlations between the two enhancement
measures (AE and VE) for words and sentences
suggest some overlap in the mechanisms underlying
enhancement for these two types of stimuli but also
indicate significant independence in the operations
mediating the benefits of combined auditory and
visual speech information for words and sentences.
Finally, the similar correlations for both VE and AE between words and sentences in younger and older adults are consistent with the use of similar abilities
in the two groups to improve word and sentence
perception when both auditory and visual speech
signals are available.
General Discussion
The present study was designed to investigate the
effects of age on the ability to benefit from combining
auditory and visual speech information. The find-
ings indicate that older and younger adults exhibit
comparable benefits, as indexed by measures of both
VE and AE, for all three types of stimuli tested
(consonants, words, and sentences). Moreover, older
adults were able to achieve similar enhancement
relative to either unimodal condition, despite signif-
icant reductions in V performance. Correlations be-
tween performance with the different stimulus types
indicated significant positive correlations between V
performance for consonants, words, and sentences,
with younger adults generally showing stronger
correlations than older adults. Correlations across
stimulus types for AV, VE, and AE, however, were
generally significant only for the relationship be-
tween words and sentences, with small (and some-
times negative) correlations between consonants
and words and between consonants and sentences.
Age and Speechreading Performance
The results for the V condition are in good agree-
ment with previous studies demonstrating age-re-
lated declines in speechreading (Dancer et al., 1994; Honneil et al., 1991; Lyxell & Rönnberg, 1991;
Middelweerd & Plomp, 1987; Shoop & Binnie, 1979).
Shoop & Binnie (1979), for example, reported differ-
ences of approximately 13% between younger (40 to 50 years) and older (over 71 years) adults on a CV speechreading test. This age difference is similar to the
difference of approximately 17% between younger
and older adults that was observed for the V condi-
tion with VCV stimuli in the present study. The V
sentence data from the present study are also in
good agreement with previous investigations of age-
related changes in speechreading. Dancer et al.
(1994) and Honneil et al. (1991), for instance, both
reported declines of 8 to 10% in V sentence intelli-
gibility for older, compared with younger adults.
These values compare favorably with the 6% differ-
ence in V sentence performance that was observed
between older and younger adults in the present
study. Thus, the findings from both the present
study and previous investigations suggest that older
adults are less able than younger participants to
encode visual speech information.
Fig. 4. Same as Figure 3, except the data are for auditory-visual presentations.
To our knowledge, the current study also repre-
sents the first demonstration of age-related declines
in V performance using older participants with clin-
ically normal hearing (at least through 4 kHz).
Viewed in conjunction with previous findings of
declines in V performance for hearing-impaired
older adults (Dancer et al., 1994; Honneil et al.,
1991), our results suggest a dissociation between
hearing abilities and speechreading, at least in older
adults. Thus, it appears that aging is associated
with declines in one or more capacities that are
critical for successful encoding of V speech informa-
tion and that these impairments are independent of
hearing status. One implication of these findings is
that if speechreading training is to be successful
with older adults, future research must be directed
at identifying and correcting the mechanisms re-
sponsible for the poorer V performance for this
population.
Another interesting aspect of the present V data
is the reduced performance for sentences, compared
with either consonants or words. This pattern of
identification scores is exactly the opposite of what is observed for A presentations, where identification of
words in sentences is generally higher than for
isolated words or CVs (Bernstein, Demorest, &
Tucker, 2000; Hutchinson, 1989; Nittrouer & Boo-
throyd, 1990; Sommers & Danielson, 1999). One
explanation for the differential pattern of findings
with A and V presentations is that the advantage for
sentences with A presentations is dependent on the
listener's ability to extract and use semantic and
syntactic information to increase the predictability
of target items (Miller et al., 1951; Sommers &
Danielson, 1999). In the case of V presentations,
however, extraction of semantic and syntactic cues
is more difficult because identification of individual
words is often not possible. The reduced availability
of semantic and syntactic information combined
with the increased processing demands of under-
standing sentences (i.e., sentences present more
information than either words or consonants) prob-
ably is the main reason that V performance for
sentences was the poorest of the three stimulus
types for both older and younger adults.
Fig. 5. Scatterplots and Pearson product-moment correlations between visual enhancement for consonants and words (left), consonants and sentences (middle), and words and sentences (right). Solid circles and clear triangles represent data for younger and older adults, respectively. The darker solid line shows the best-fitting regression line for the younger adults and the lighter line shows the best-fitting regression line for the older adults. Asterisks indicate significance at the 0.05 level.
Despite the relatively poor performance with V
presentations, significant correlations were ob-
tained between V performance with consonants,
words, and sentences, and the magnitude of these
correlations is similar to what has been reported
previously (Bernstein et al., 2000; Demorest, Bern-
stein, & DeHaven, 1996; Grant & Seitz, 1998).
Grant & Seitz (1998), for example, reported a corre-
lation coefficient of 0.49 between V scores for conso-
nants and sentences. This value is similar to the
correlations of 0.42 and 0.39 obtained for younger
and older adults, respectively, in the present study.
Bernstein et al. (2000) measured V performance for
phonemes, words, and sentences in normal-hearing
adults (18 to 45 years of age) and found significant
correlations between scores for phonemes and words
(0.39) and phonemes and sentences (0.43). Simi-
larly, Demorest et al. (1996) reported that V perfor-
mance for younger adult participants on nonsense
syllables correlated approximately 0.5 with V perfor-
mance on both words and sentences, with an even
stronger correlation (approximately 0.8) between V
performance with words and sentences. Again, the
magnitudes of these correlations are comparable to
those observed for younger adults in the V condition
of the present study. Thus, the picture that is
beginning to emerge from investigations of
speechreading using different types of stimulus ma-
terials is that there is some overlap in the mecha-
nisms used to extract V speech information from
consonants, words, and sentences and that this set
of core abilities is similar for both older and younger
adults.
Auditory and Visual Enhancement
The finding that older and younger adults exhibit
similar enhancement from combining auditory and
visual speech information replicates and extends
results from previous investigations examining the
effects of age on AV speech perception. To our
knowledge, the present investigation is the first to
examine age effects on auditory-visual enhancement
across three different types of stimuli after control-
ling for age-related changes in unimodal encoding.
Consistent with the absence of age differences in enhancement observed in the present study, Helfer (1997, 1998) reported that visual ben-
efit for words presented in nonsense sentences was
similar for younger normal-hearing adults and older
participants with mild to moderate hearing impair-
ments. These findings are particularly relevant to
the current investigation because Helfer used one of
the same relative measures of enhancement, VE, as
in the present study and older adults were tested at
a slightly higher signal-to-noise ratio than younger
adults. Taken together, the findings from Helfer (1998) and the present investigation suggest that
older and younger adults exhibit similar enhance-
ment from combining auditory and visual speech
signals over a relatively large range of A perfor-
mance. An important direction for future research
will be to examine whether the age equivalence in enhancement observed in the present study is maintained under more adverse listening conditions where A
scores are reduced and the need for integrating
auditory and visual speech signals is even greater.
Fig. 6. Same as in Figure 5, except data are for auditory enhancement.
The finding that older and younger adults exhib-
ited comparable enhancement from combining audi-
tory and visual speech information suggests that age
differences in AV performance reflect age-related
impairments in the ability to encode V speech sig-
nals (recall that A performance was manipulated to
be approximately equivalent in the two groups)
rather than a reduced ability to combine or integrate
information across the two modalities. Within the
model of AV perception proposed by Grant et al.
(1998), this hypothesis argues that when older and
younger adults are able to obtain similar amounts of
visual and auditory speech information, both groups
exhibit similar abilities to integrate the unimodal
percepts into a unified AV perception. That is, the
present findings suggest that any age differences in
AV perception do not result from age differences in
central integration capacities. Investigations are
currently underway in our laboratory to obtain more
direct measures of integration per se (i.e., indepen-
dent of unimodal encoding) with the working hy-
pothesis that older and younger adults will exhibit
similar integration efficiency.
Pattern of Correlations for VE and AE
In contrast to the significant correlations between
V performance for consonants, words, and sen-
tences, both VE and AE for these three types of
stimuli generally exhibited small, and in some cases,
even negative relationships. In interpreting these
results it is important to note that a number of
methodological differences between the conditions
may have contributed to the relatively low correla-
tions. Specifically, consonants were tested using a
closed-set format whereas words and sentences were
tested using an open-set format. In addition, a single
male talker produced the items for the consonant
test, a single female talker produced the items for
the word test, and both male and female talkers
were used to produce items for the sentence test.
Despite these methodological differences, however,
the findings are generally consistent with previous
measures of visual enhancement as a function of
stimulus type. Grant & Seitz (1998) also reported a
near-zero (0.06) correlation between visual enhance-
ment for consonants and sentences in a group of
hearing-impaired adults ranging in age from 41 to 76 years. This pattern of results is readily explained
within the model of AV perception proposed by
Grant et al. (1998). Specifically, the model proposes
that AV performance for consonants is determined
primarily by bottom-up capacities that serve to
extract appropriate linguistic cues from auditory
and visual speech signals. For words and sentences,
however, AV performance is determined by both
bottom-up extraction of auditory and visual cues
and by top-down lexical, syntactic, and semantic
processing. Thus, the small correlations between
enhancement (both AE and VE) for consonants and
enhancement for other types of stimuli may reflect
the increased importance of top-down processing
abilities for the more linguistically complex words
and sentences. Consistent with this proposal, the
strongest correlations for both AE and VE were
between words and sentences, suggesting some
overlap in the mechanisms mediating enhancement
for these two types of stimuli.
Finally, it is useful to consider the pattern of
correlations for both AE and VE as a function of
stimulus type in relation to the analogous correla-
tions for V and A. As noted, both the current study
and previous investigations have found moderate to
strong correlations between V performance for con-
sonants, words, and sentences. Similarly, Humes et
al. (1994) found that correlations between A perfor-
mance with the same three types of stimuli (conso-
nants, words, and sentences) ranged between 0.35
and 0.85. These findings suggest that the absence of
strong correlations for AE and VE across conso-
nants, words, and sentences is a consequence of the
mechanisms mediating integration rather than of
within-modality differences in the mechanisms un-
derlying identification of consonants, words and
sentences. An important goal for future research,
therefore, will be to identify the unique demands
imposed by integration and to specify why those
demands reduce or eliminate correlations between
enhancement for consonants, words, and sentences.
ACKNOWLEDGMENTS
Portions of these data were presented at the 144th meeting of the
Acoustical Society of America, Cancun, Mexico. This research was
supported by grant R01 AG 180294 from the National Institute
on Aging. The authors thank Arnold Heidbreder for technical
assistance.
Address for correspondence: Mitchell S. Sommers, Department of
Psychology, Washington University, Campus Box 1125, St. Louis,
MO 63130.
Received April 20, 2004; accepted December 12, 2004
REFERENCES
American National Standards Institute. (1989). Specification for audiometers (ANSI S3.6-1989). New York: ANSI.
American Speech and Hearing Association. (1988). Guidelines for determining threshold levels for speech. American Speech and Hearing Association, 85–89.
Bernstein, L. E., Demorest, M. E., & Tucker, P. E. (2000). Speech perception without hearing. Perception & Psychophysics, 62, 233–252.
CHABA, Committee on Hearing and Bioacoustics, Working Group on Speech Understanding and Aging. (1988). Speech understanding and aging. Journal of the Acoustical Society of America, 83, 859–895.
Cienkowski, K. M., & Carney, A. E. (2002). Auditory-visual speech perception and aging. Ear and Hearing, 23, 439–449.
Dancer, J., Krain, M., Thompson, C., Davis, P., & Glenn, J. (1994). A cross-sectional investigation of speechreading in adults: Effects of age, gender, practice and education. Volta Review, 96, 31–40.
Dekle, D. J., Fowler, C. A., & Funnell, M. G. (1992). Audiovisual integration in perception of real words. Perception & Psychophysics, 51, 355–362.
Demorest, M. E., Bernstein, L. E., & DeHaven, G. P. (1996). Generalizability of speechreading performance on nonsense syllables, words, and sentences: Subjects with normal hearing. Journal of Speech and Hearing Research, 39, 697–713.
Folstein, M. F., Folstein, S. E., & McHugh, P. R. (1975). Mini-mental state: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12, 189–198.
Grant, K. W., & Seitz, P. F. (1998). Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America, 104, 2438–2450.
Grant, K. W., Walden, B. E., & Seitz, P. F. (1998). Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America, 103, 2677–2690.
Helfer, K. S. (1997). Auditory and auditory-visual perception of clear and conversational speech. Journal of Speech, Language, and Hearing Research, 40, 432–443.
Helfer, K. S. (1998). Auditory and auditory-visual recognition of clear and conversational speech by older adults. Journal of the American Academy of Audiology, 9, 234–242.
Honneil, S., Dancer, J., & Gentry, B. (1991). Age and speechreading performance in relation to percent correct, eyeblinks, and written responses. Volta Review, May, 207–212.
Humes, L. E., Watson, B. U., Christensen, L. A., Cokely, C. G., Halling, D. C., & Lee, L. (1994). Factors associated with individual differences in clinical measures of speech recognition among the elderly. Journal of Speech and Hearing Research, 37, 465–474.
Hutchinson, K. M. (1989). Influence of sentence context on speech perception in younger and older adults. Journal of Gerontology, 44, 36–44.
Lyxell, B., & Rönnberg, J. (1991). Word discrimination and chronological age related to sentence-based speech-reading skill. British Journal of Audiology, 25, 3–10.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Middelweerd, M. J., & Plomp, R. (1987). The effect of speechreading on the speech-reception threshold of sentences in noise. Journal of the Acoustical Society of America, 82, 2145–2147.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test material. Journal of Experimental Psychology, 41, 329–335.
Morrell, C. H., Gordon-Salant, S., Pearson, J. D., Brant, L. J., & Fozard, J. L. (1996). Age- and gender-specific reference ranges for hearing level and longitudinal changes in hearing level. Journal of the Acoustical Society of America, 100, 1949–1967.
Nittrouer, S., & Boothroyd, A. (1990). Context effects in phoneme and word recognition by younger children and older adults. Journal of the Acoustical Society of America, 87, 2705–2715.
Pelli, D., Robson, J., & Wilkins, A. (1988). The design of a new letter chart for measuring contrast sensitivity. Clinical Vision Sciences, 2, 187–199.
Pichora-Fuller, M. K., Schneider, B. A., & Daneman, M. (1995). How younger and older adults listen to and remember speech in noise. Journal of the Acoustical Society of America, 97, 593–608.
Rabinowitz, W. M., Eddington, D. K., Delhorne, L. A., & Cuneo, P. A. (1992). Relations among different measures of speech reception in subjects using a cochlear implant. Journal of the Acoustical Society of America, 92, 1869–1881.
Sams, M., Manninen, P., Surakka, V., Helin, P., & Kättö, R. (1998). McGurk effect in Finnish syllables, isolated words and words in sentences: Effects of word meaning and sentence context. Speech Communication, 26, 75–87.
Shoop, C., & Binnie, C. A. (1979). The effects of age upon the visual perception of speech. Scandinavian Audiology, 8, 3–8.
Sommers, M. S., & Danielson, S. M. (1999). Inhibitory processes and spoken word recognition in younger and older adults: The interaction of lexical competition and semantic context. Psychology and Aging, 14, 458–472.
Sumby, W. H., & Pollack, I. (1954). Visual contributions to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by Eye: The Psychology of Lip-reading (pp. 3–51). Hillsdale, NJ: Lawrence Erlbaum Associates.
Tye-Murray, N., & Geers, A. (2001). Children's Audiovisual Enhancement Test. St. Louis, MO: Central Institute for the Deaf.
Tyler, R. D., Preece, J., & Tye-Murray, N. (1986). The Iowa laser videodisk tests. Iowa City, IA: University of Iowa Hospitals.
Walden, B. E., Busacco, D. A., & Montgomery, A. A. (1993). Benefit from visual cues in auditory-visual speech recognition by middle-aged and elderly persons. Journal of Speech and Hearing Research, 36, 431–436.
