

Int. J. Human-Computer Studies 64 (2006) 489–501


Methods for inclusion: Employing think aloud protocols in software usability studies with individuals who are deaf
Vera Louise Roberts, Deborah I. Fels
Adaptive Technology Resource Centre, University of Toronto, 130 St George Street, Toronto, Ont., Canada M5S 1A5
Ryerson University, 350 Victoria St., Toronto, Canada M5B 2K3
Received 7 July 2004; received in revised form 20 September 2005; accepted 7 November 2005
Available online 20 December 2005
Communicated by J. Jacko


Usability evaluation is an important step in the software and product design cycle. There are a number of methodologies, such as talk aloud protocol and cognitive walkthrough, that can be employed in usability evaluations. However, many of these methods are not designed to
include users with disabilities. Legislation and good design practice should provide incentives for researchers in this field to consider more
inclusive methodologies. We carried out two studies to explore the viability of collecting gestural protocols from sign language users who
are deaf using the think aloud protocol (TAP) method. Results of our studies support the viability of gestural TAP as a usability
evaluation method and provide additional evidence that the cognitive systems used to produce successful verbal protocols in people who
are hearing seem to work similarly in people who speak with gestures. The challenges for adapting the TAP method for gestural language
relate to how the data was collected and not to the data or its analysis.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: Usability; Usability evaluation methods; Deaf; Think aloud protocol; Gestural think aloud protocol

1. Introduction

While additions to software such as preferences settings, frequency-of-use menus and operating system accessibility options indicate that there is some movement in software design towards adherence to universal design principles, usability evaluation methods (UEMs) have not undergone a comparable shift. Consequently, a question emerges: how well do the current UEMs evaluate the usability of software developed for multiple user profiles?

One common methodology used to study software and/or product usability is talk aloud or think aloud protocol (TAP). This protocol requires the product user to speak aloud his/her thoughts while focusing on and manipulating the product in prescribed ways (Ericsson and Simon, 1984; Ericsson and Simon, 1998). The interaction between the user and the software is often videotaped so that the user's actions along with his/her verbalizations are recorded for later analysis.

Questions about inclusion arise from an examination of this popular and useful UEM: What if the intended user of a product does not speak through oral means? What equivalent UEM is available for individuals who, for example, communicate with a gestural language such as American Sign Language (ASL)? The think aloud method has not been widely used with individuals who are users of sign language, yet mandates for inclusive educational technologies require that research methods appropriate to this group be developed and tested. A gestural think aloud protocol (GTAP) that utilizes "self-sign" in the same way TAP utilizes "self-talk" may prove to be an important methodology in assessing the usability and quality of online educational materials for ASL users. Furthermore, there may be more to learn about expanding existing research protocols such as TAP to include gestural/visual languages. This research explores the use of TAP methodologies with ASL users and seeks to provide support for use of this inclusive UEM.

Corresponding author. Tel.: +1 416 978 4360; fax: +1 416 971 2629.
E-mail addresses: vera.roberts@utoronto.ca (V. Louise Roberts), dfels@ryerson.ca (D.I. Fels).

490 V. Louise Roberts, D.I. Fels / Int. J. Human-Computer Studies 64 (2006) 489–501

The theoretical underpinning applicable to the TAP method and to this research is that information held or recently held in short-term memory (STM) and working memory is accurately revealed through verbalizations. What is important for GTAP to be successful is that there is a trace of recent thought in working memory that may be articulated via sign language. Research with ASL participants (Bellugi et al., 1974; Campbell and Wright, 1990; MacSweeney, 1998; Wilson and Emmorey, 1998, 2003) provides support for STM or working memory processes in individuals who are deaf that are similar in function to those of individuals who have hearing. Also, these studies indicate that participants are able to access and articulate items in working memory. Thus, the use of gestural TAP with individuals who are deaf and sign language speakers is further supported.

There have been two other studies with participants who were deaf that have been characterized as think aloud studies: the reading comprehension studies of Andrews and Mason (1991) and Schirmer (2003). The former asked participants to read a cloze passage and, upon reaching the omitted word, to think their thoughts aloud. At the same time, the researchers engaged the participant in conversation and asked probing questions. This study was not a typical think aloud method as it was more likely to draw introspective as well as retrospective responses and did not follow the standard procedure where prompts to participants are limited to "keep talking" or "thoughts?" Schirmer (2003) asked participants to think aloud at natural breaks in the text, such as page breaks, rather than in a concurrent thought stream. While it is clear that TAP is a very useful tool in understanding reading strategies, there is no body of research that explicitly verifies that gestural TAP and verbal TAP are cognitively equivalent or produce the same results.

The think aloud method or protocol (TAP) was first used in research on problem-solving behaviour (Van Someren et al., 1994). Pioneered by Newell and Simon (1972) and refined by Ericsson and Simon (1984), TAP requires that participants speak their thoughts aloud as they complete an assigned task.

The TAP method is not associated with any one model of cognition or memory (Ericsson and Simon, 1984; Green, 1998) and may be best viewed as a method for testing hypotheses (Ericsson and Simon, 1984; Green, 1998). For example, the usability researcher tests hypotheses held by developers about how certain features will be used by the audience or how well the needs and expectations of the audience are met by the product. Furthermore, "one of the uses of think aloud protocol is to assist in forming hypotheses about areas where not much is known" (Wiedenbeck et al., 1989, p. 25). Hence, TAP is particularly useful for the iterative development cycle of software.

As an inclusive UEM, TAP has an advantage over structured elicitation techniques such as interviews and surveys: the think aloud method makes it easy for the subjects because they are allowed to use their own language (Van Someren et al., 1994, p. 26) and own words. In this way, the protocol method may be ideal for speakers of ASL, who will be able to articulate their thoughts in their first language using their known words, grammar and syntax. For most native ASL speakers, oral language and its written form is a second language in which they are less fluent (Schirmer, 2000). For this reason, written questionnaires and responses or paper-based surveys are not always an appropriate means for equitable user testing. Gestural TAP enables spontaneous user feedback about how a task is being accomplished without the additional and potentially interfering cognitive processes required for individuals to work in a language that is not their first language. Also, research has shown that the concurrent verbal protocols obtained using TAP methods are superior to retrospective protocols obtained after the task is completed (Nisbett and Wilson, 1977; Ericsson and Simon, 1984; Kuusela and Paul, 2000). Furthermore, in a study that compared four usability methods (logged data, questionnaires, interviews and verbal protocol analysis), TAP was shown to be "the most effective single method at highlighting usability problems" (Henderson et al., 1995, p. 426). Indeed, in some cases, when UEM methods were paired, the number of unique problems revealed by the paired UEMs still failed to identify as many problems as were identified through TAP alone (Henderson et al., 1995).

Signed languages are complete, natural languages that are distinct from spoken languages (Stokoe, 2001) and have a distinct phonology, morphology, syntax and vocabulary (Messing, 1999). Thus ASL has the same potential to provide rich protocols as do spoken languages. Although oral and gestural languages rely on different modes for communication, the underlying neural structures are actually very similar (MacSweeney, 1998). In hearing individuals, the left hemisphere of the brain is responsible for speech production and language comprehension (McNeill, 1992). Corina (1998) reviewed 16 studies of individuals who communicate with sign and have suffered brain lesions and found that the studies provided sufficient evidence to support the notion that the left hemisphere of the brain is also responsible for sign language comprehension and sign production.

Indeed, this notion that language and gesture rely on similar brain regions and processes is supported by research with individuals with aphasia (McNeill, 1992). Furthermore, Bates and Dick (2002), in their review of language and gesture research, found that "compelling links have been observed, involving specific aspects of gesture that precede or accompany each of the major language milestones from 6 to 30 months" (p. 293), and "that there is research to support links between gesture and language development in both typical and atypical populations" (p. 295). The researchers conclude that "the division of labor in the brain does not seem to break down neatly into language versus nonlanguage" (p. 305). Thus there is support for the notion that the collection of gestural protocols makes similar cognitive demands on the participant as the collection of verbal protocols.

1.1. Scope

This paper sets out a methodology for inclusive collection of verbal protocols and is not intended as further validation of the think aloud method for collecting verbal protocols. The goal is to develop a UEM that is inclusive. We will present the results of two studies designed to explore the use of gestural protocol as a usability method. The first study addresses the mismatch between the movement towards barrier-free designs and the UEMs available to test them. In our second study, we implemented GTAP in a usability evaluation of new methods for presenting ASL translations of online videos. Thus, we were looking for similar results across the two studies carried out and were not attempting to validate the method in each study. This research will advance our knowledge of how gestural and verbal protocols compare as well as further our understanding of how to carry out a gestural TAP procedure. TAPs are used often and are widely accepted as a standard UEM; validation of the proposed GTAP method is still needed. We carried out two studies to explore the use of gestural TAP as a viable method of user testing for deaf individuals and intend to show that existing TAP UEMs may be effectively and reasonably adapted for use with individuals who communicate with gestural languages such as ASL.

Fig. 1. Lab configuration for Solitaire Game Study.

2. Method

This research contained two components: (1) a simple game study and (2) an actual usability study carried out using the GTAP method. Each study was designed to examine different aspects of using the GTAP method. The first study enabled comparison between verbal and gestural TAPs and the latter enabled evaluation and testing of the GTAP method. This research addresses the following questions:

(1) What is the relationship of gestural language to outcomes of TAP? The main hypothesis to consider for this question is that there is a difference in TAP outcomes between ASL speakers and oral language speakers.
(2) What is the viability of gestural TAP as a usability evaluation method? Is it viable and, if not, how may it be made viable?
(3) What is the relationship of gestural language to the protocol data that is collected, and to how the data is collected?

2.1. Solitaire game study

2.1.1. Materials
For the game component, Microsoft Solitaire on a laptop computer was used. One video camera was used to record the screen and spoken English interpretation simultaneously (see Fig. 1 for schematic).

2.1.2. Subjects
Participants for each study were recruited at different times. Participants were permitted to participate in both studies. For the game study, there were nine individuals (3 male, 6 female; ages 26–45) who were deaf and fluent in ASL and nine hearing individuals (5 male, 4 female; ages 26–55). Education level ranged from some college to graduate degree for deaf participants and from high school to graduate level for hearing participants. All subjects were familiar with the Solitaire card game. In both groups, one individual was younger than 30 years, six individuals were 31–40 years and two individuals were older than 40 years.

2.1.3. Procedure
Task. Following training, individuals were asked to play Solitaire until their participation in the study reached ten minutes. The session closed with a short questionnaire and discussion.

Think aloud protocol. Participants in the game study were asked to speak/sign their thoughts at any time or at least before making a move/using the mouse. After short periods of silence (about 10 s), several moves without utterances, or indications of thought such as facial expressions or head nodding, the investigator would remind the participant to "keep talking."

2.1.4. Outcome measures and data analysis
The game study is a between-subjects design. Two groups of subjects, ASL speakers (ASL group) and English speakers (Oral group), played Microsoft Solitaire. The grouping factor is language. Protocols collected for the game study are categorized as belonging to one of the following categories: (1) responses to the cards (e.g., expression of need); (2) play (e.g., card value); (3) strategy (e.g., plan for best play); (4) general comments about Solitaire; (5) error (e.g., having made a mistake); (6) TAP (e.g., effect of TAP on play); (7) technical (comments about hardware); and (8) usability (e.g., comments about the game software environment).
The verbal protocols were transcribed from videotape, chunked into segments and then coded for analysis. The intraclass correlation statistic was used to determine the reliability of the coding strategy between two separate coders. The ASL was translated and recorded simultaneously during the study.
Participants were asked to complete a pre-study questionnaire. The Solitaire pre-study questionnaire asked some background information questions as well as questions about the participant's experience with Solitaire and skill level. The participant had the option to have any written material translated into ASL and to respond in ASL.

2.2. Law video usability study

Fig. 2. Lab configuration for Law Video Usability Study.

2.2.1. Materials
A software video player interface was developed and tested for the usability component. The interface allowed users to view an educational video about traffic court along with an ASL interpretation of the video. The original educational video was produced as a standard National Television Systems Committee (NTSC) video with traditional verbatim closed captions. The ASL interpretation was provided in two formats: an Acted ASL version, where actors in costume played the parts of the original actors, and a Standard ASL version, using a single translator for all of the parts. The Acted ASL version showed ASL actors in two separate video windows along with the original video. The translated version showed the translator in one separate window along with the original video.
Video controls in the interface allowed the user to play, pause, move forward or rewind the video. Viewing preferences could also be set by the user and included adjustment of video and interpretation window positions, proximity of the windows, relative size of the windows and use of borders around the windows. A short training video was provided to assist participants in learning the interface as well as in practicing GTAP. The educational video was provided by Canadian Learning Television and is part of a full social studies curriculum.
For the usability study, subjects were provided with a PC that included keyboard, mouse, speakers and 17-in colour monitor. Each PC was further equipped with key-stroke logging software. Two video cameras were used to record the signs made by the participant and by the investigator so that they could be translated at the close of the study. Please see Fig. 2 for schematic.

2.2.2. Subjects
For the usability study, there were 17 participants (9 female and 8 male) who were deaf and fluent in ASL and ranged in education level from high school to college/university. The Canadian Learning Television video on traffic court was aimed at the age and education level of the participants.

2.2.3. Procedure
Task. Individuals in the usability study were asked to watch a 10-min segment of a traffic court educational video. The first viewing of the video was followed by the comprehension test. Next, participants were allowed to adjust viewing preferences before viewing the same educational video segment but with a different ASL interpretation format from the first viewing. The session closed with a short questionnaire and discussion.

Think aloud protocol. Usability participants were asked to sign their thoughts about the task as they were performing it. After short periods of silence (about 10 s), or

indications of thought such as facial expressions or head nodding, the investigator would remind the participant to "keep talking."

2.2.4. Outcome measures and data analysis
The law video usability study is a within-subjects design. The experimental factors of interest in this study are viewing order and interpretation type (Acted ASL and Standard ASL). The law video data is aggregated into four overarching categories that relate to the specific research questions of the usability study: (1) interpretation, (2) format, (3) technical issues and (4) content. All of the analyses are based on these macro-categories. To understand the quality of the comments in these macro-categories, further sub-categories of positive and negative are used.
A counterbalanced order for the levels of presentation technique was established and participants received the order that was assigned to their subject number. Participants were assigned subject numbers consecutively and the order of subjects was simply determined by the order that arose from the subjects' availability for the study.
As in the game study, video data was transcribed and coded. The intraclass correlation statistic was used to verify the consistency and codability of the data. In this study, the ASL was translated after the study.
Participants were asked to complete a pre-study questionnaire. The usability pre-study questionnaire asked questions about the participant's experience with online video, computers and ASL interpretations of video material. The participant had the option to have any written material translated into ASL and to respond in ASL.

2.3. Statistical issues

Non-parametric tests were used to compare the means of the two groups in both studies as the number of subjects is considered small. A higher alpha level for small groups will increase the power of the test and improve the chance of making a correct decision about the null hypothesis, so the level for tests of both sets of data was set to α = .10 (Stevens, 1996, pp. 6 and 175). However, because the data is multivariate, the alpha level for univariate analysis will be calculated as .01 (roughly α/p, where p = the number of categories) in order to reduce the risk of a type I false rejection error (Stevens, 1996, p. 160). Power is particularly important in the Solitaire game study since the expectation is that there will be no difference between the two groups. It should be noted that no data is removed in order to increase power.

3. Results

3.1. Solitaire game study

A sample set of the transcripts for the protocols was divided into utterance "chunks" that represented a complete move, such as playing a card, and coded by two independent raters. Inter-rater reliability (ICC) of the chunking was high at 76% for 416 sample utterances, and ICC for the coding components was high, ranging from 68 to 100% for all categories. Chunking and coding tasks were then continued by only one of the raters.
The average number of comments for the play category was nearly double for the Oral group compared with the ASL group (53 and 28, respectively). However, a 1-min snapshot of the data at the third minute of each session shows no difference in the number of play comments. This 3rd minute was randomly selected to provide a normalized unit of time to eliminate differences in counts due to time elapsed during play, which varied from participant to participant. The mean number of comments for each category is shown in Tables 1 and 2.
The standard deviation for all category counts is high for both full-session and minute-three protocols.
Multivariate analysis of the dependent variable, language, showed no effect of group; however, the observed

Table 1
Descriptive group data (n = 9) for category counts for minute three of session protocol

Category            ASL: Mean  Count  SD      Oral: Mean  Count  SD
Play                4.33       39     4.33    4.33        39     2.74
Strategy            .22        2      .44     .33         3      .50
Cards               1.44       13     1.33    2.22        20     1.39
Solitaire           .22        2      .44     .22         2      .44
Error               .11        1      .33     .33         3      .50
TAP                 .11        1      .33     .11         1      .33
Distracted          .56        9      1.33    .00         0      .00
Technical           .11        1      .33     .00         0      .00
Usability           .00        0      .00     .00         0      .00
Conversation        .89        8      1.05    1.11        10     1.27
No. of utterances   7.44       76     3.00    7.11        78     2.37

Table 2
Descriptive group data (n = 9) for category counts for complete session

Category            ASL: Mean  SD      Oral: Mean  SD
Play                27.67      18.76   52.89       26.90
Strategy            1.67       1.94    2.78        2.22
Cards               10.56      6.06    14.67       8.76
Solitaire           1.11       1.36    2.22        1.56
Error               1.11       1.45    2.00        1.73
TAP                 1.67       1.66    .56         1.01
Distracted          2.56       5.66    .00         .00
Technical           .33        .50     .33         .50
Usability           .56        .88     .22         .44
Conversation        11.44      6.19    5.78        5.19
No. of utterances   53.56      16.45   72.56       27.12
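The inter-rater reliability figures in this research were computed with intraclass correlation coefficients. As a minimal sketch, a two-way, single-measure, consistency-type ICC(3,1) of the kind reported for the coding can be computed in pure Python as below; the ratings here are made up for illustration, not taken from the study data:

```python
def icc_3_1(ratings):
    """Two-way mixed, single-measure, consistency ICC(3,1):
    (MSR - MSE) / (MSR + (k - 1) * MSE), where rows are rated targets
    (e.g., utterance chunks) and columns are raters."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# two raters who agree up to a constant shift are perfectly consistent
assert icc_3_1([[1, 2], [2, 3], [3, 4]]) == 1.0
```

Because ICC(3,1) removes the rater main effect, a coder who is systematically one category "off" in a numeric sense still yields perfect consistency; agreement-type ICC forms would penalize that shift.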

power was low at .33. Wilcoxon t-tests were carried out for each of the categories for utterances made during the entire session and for utterances made during the 3rd-min slice only. The individual results are summarised in Tables 3 and 4. No significant difference (p < .01) in the number of comments produced in a category was found in either the whole-session or minute-three data.
Participants in both groups averaged 7 utterances during minute three of the verbal protocol. Category sums related to playing the game (play, strategy, cards, error) and categories not related to the game (solitaire, distracted, TAP, technical, usability, conversation) were aggregated and are shown in Table 5. The overall ratio of game comments to non-game comments was 4:1.
A Wilcoxon t-test for the two aggregate categories showed no significant difference between the groups.

Table 3
Wilcoxon t-test results for whole session data

Category            Test statistic   Asymptotic significance
Play                64               .06
Strategy            73.5             .28
Cards               75.5             .38
Solitaire           69.5             .14
Error               73.5             .26
TAP                 68.5             .11
Distracted          72               .07
Technical           85.5             1.00
Usability           79               .47
Conversation        63.5             .05
No. of utterances   68.5             .13

Table 4
Wilcoxon t-test results for minute three data

Category            Test statistic   Asymptotic significance
Play                80.5             .66
Strategy            82.5             .78
Cards               81               .61
Solitaire           85.5             1.00
Error               76.5             .27
TAP                 85.5             1.00
Distracted          76.5             .15
Technical           81               .32
Usability           85.5             1.00
Conversation        73               .25
No. of utterances   83.5             .86

Table 5
Aggregate category sums for minute three of verbal protocol

Group   Game total   Non-game total   Game average   Non-game average   Ratio
ASL     56           16               6.22           1.78               3.5:1
Oral    64           14               7.11           1.56               5.6:1
Total   120          30               13.33          3.33               4:1

3.2. Law video usability study

This task is a case study of the adapted TAP method in a usability study. The grouping factors in this study were interpretation treatment (acted and standard interpretation) and viewing order of the video interpretation treatment.
Due to technical difficulties, only 13 participants of the initial 17 have full data sets (26 ten-min video sessions) and only these full sets are included in the analysis. Only the verbal protocols for the law video viewing sessions are analysed because these sessions constitute the GTAP usability portion of the study.
An initial coding scheme of nine categories (see Table 6) was developed. Seven of the categories were aggregated into four categories: the ASL interpretation (1 and 2), format (3, 4 and 7), technical issues (5) and content (6) of the law video. Categories 8 and 9 stood alone. Comments were coded as positive or negative for each aggregate category and simply counted in categories 8 and 9. The ICC results for the four main categories, which include all of the sub-categories, are high and are shown in Table 7.
Table 8 shows the average divided category counts for the data for each interpretation type and for each viewing order. T-tests were used to determine whether viewing order and interpretation type affected the quantity of positive and negative comments. There were significantly more positive comments during the second session (t(22) = 2.08, p < .05). No effect of interpretation type on number of comments was found.
Multivariate analysis of the ten grouping variables was conducted to determine whether order or interpretation type had an effect on the number of comments made. A significant effect of order was found with alpha at the .10 level [F(10, 13) = 2.31, p < .1]. The observed power of the effect is moderate at .81.
The categories were also grouped by order only and by interpretation type only, and Wilcoxon tests were run. Table 9 shows the findings for the tests that are significant or approached significance.
Many issues were identified by the participants. For example, two participants identified seven different issues for the acted interpretation and nine different issues for the standard interpretation; their comments are shown in Table 10.
Each of the 26 ten-min video sessions yielded at least one coded comment, with the average number of comments per session being 8.4. Table 11 provides a summary of the number of comments in each category. Examples of comments made by participants are shown in Fig. 3.
No significant correlation between interpretation type or viewing order and total number of comments was found; however, correlations between these variables and three of the four coding categories were found (see Table 12).
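The rank-based group comparisons reported in this research can be sketched in a few lines. The helper below implements a Wilcoxon rank-sum test with midranks for ties and a two-sided normal-approximation p-value (no tie correction); the input counts are made up for illustration, not taken from the study data:

```python
import math

def rank_sum_test(group_a, group_b):
    """Wilcoxon rank-sum test: W is the sum of (mid)ranks of group_a in
    the pooled sample; the two-sided p-value uses a normal approximation
    without a tie correction (fine as a sketch, crude for heavy ties)."""
    pooled = list(group_a) + list(group_b)

    def midrank(v):
        less = sum(1 for x in pooled if x < v)
        ties = sum(1 for x in pooled if x == v)
        return less + (ties + 1) / 2

    n1, n2 = len(group_a), len(group_b)
    w = sum(midrank(v) for v in group_a)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w, p

w, p = rank_sum_test([1, 2, 3], [4, 5, 6])
# w is 6.0 (group_a holds the three smallest ranks); p is roughly .05
```

For small samples like the n = 9 groups here, exact-permutation p-values would be preferable to the normal approximation; the sketch keeps the simpler form for clarity.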

Table 6
Protocol analysis for usability study and sample comments

Category                                                      Sample comment
1. Quality of interpreter/actors presence and expression      +ve: Good facial expression.
                                                              −ve: The character looks mad but the interpreter is not showing it.
2. Interpretation quality (speed, novel signs)                +ve: The interpreter keeps up well.
                                                              −ve: That is the wrong sign.
3. Ease of using video interface (video control buttons)      +ve: The reverse button is useful.
                                                              −ve: How do I stop this thing?
4. Viewing preferences (position, border, size, proximity)    +ve: I like being able to move the ASL window.
                                                              −ve: There are too many choices.
5. Technical issues with video (synchronicity, visibility)    +ve: Good lighting.
                                                              −ve: The interpretation is not in synch with the video.
6. Content of video                                           +ve: It makes sense.
                                                              −ve: Boring.
7. Acted vs interpreted ASL content (comparison, busyness)
   Acted                                                      +ve: Easier to keep track with costumes.
                                                              −ve: Harder to follow two windows than one.
   Interpreted                                                +ve: It is less work with fewer windows to watch.
                                                              −ve: Confusing to track who is speaking.
8. Closed captioning requested                                You should have captions too.
9. Recommendation                                             They signed "situation." Details leading to the "situation" should be explained.
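The aggregation of these nine coding categories into macro-categories (1 and 2 → ASL interpretation; 3, 4 and 7 → format; 5 → technical; 6 → content; 8 and 9 counted on their own) amounts to a tally over coded comments. A minimal sketch, where the mapping follows the text but the helper name and sample data are ours:

```python
from collections import Counter

# macro-category for each coding category; 8 and 9 stand alone
MACRO = {1: "asl", 2: "asl",
         3: "format", 4: "format", 7: "format",
         5: "technical", 6: "content",
         8: "captioning", 9: "recommendation"}

def tally(coded_comments):
    """coded_comments: (category_number, sign) pairs with sign '+' or '-'.
    Signs are counted per macro-category; the stand-alone categories
    8 and 9 are simply counted, ignoring sign."""
    counts = Counter()
    for category, sign in coded_comments:
        macro = MACRO[category]
        if category in (8, 9):
            counts[macro] += 1
        else:
            counts[(macro, sign)] += 1
    return counts

# e.g. one positive comment on sign quality, two negative on the controls,
# and one captioning request
session = tally([(2, "+"), (3, "-"), (3, "-"), (8, "+")])
```

Tallies of this shape, split by interpretation type and viewing order, are what the mean counts in Table 8 summarize.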

Table 7
Mixed two-way, single-measure intraclass correlation coefficient statistics for ASL law video protocol coding categories, not split for positive and negative utterances

Category    ICC(3,1)   Confidence interval
ASL         .9077      .8545–.9420
Technical   .9261      .8829–.9538
Format      .9189      .8718–.9492
Content     .8622      .7858–.9127

4. Discussion

The analysis of the results of these two studies provides some insight into the use of GTAP as a usability testing method that includes people who are gestural or sign language users. Specifically, the Solitaire game study is designed to examine differences between TAP data collected from spoken language users and gestural language users. In answering the first research question posed in this paper regarding the relationship of gestural language to TAP outcomes, we begin to gain some understanding of the use of gestural TAP as a viable and inclusive usability method.
Analysing the usability data gathered in the law video usability study demonstrates the feasibility of gestural TAP in an actual usability study. Some of the interpretations of this data are presented here to illustrate the treatment of such data and answer our second research question regarding the viability of using the GTAP methods in usability testing. Finally, new insights gained from the use of GTAP inform our research question on gestural language and the method of collection and analysis of protocol data, and allow us to put forward recommendations for including gestural language users in usability studies.

4.1. The relationship of gestural language to TAP outcomes

As shown by the results of the Solitaire game study, the verbal protocol outcomes between ASL speakers and oral speakers seem very similar in a controlled and known task environment such as Solitaire.
While there were some difference trends between ASL and oral speakers in the verbal protocol outcomes in the Solitaire game study, there were no significant differences in any Wilcoxon t-test comparison of the means in any category. Therefore we cannot support rejection of the null hypothesis that there was no difference in verbal protocol outcomes between ASL and oral language speakers for this particular task. Even though the mode of communication for the two groups was quite different, no corresponding significant difference in the number and kind of comments was found.
The descriptive analysis shows that except for the play and distracted categories, the number of comments made by the two groups in each category is remarkably close. For the distracted category, one person in the ASL group was concerned about a work issue with a subordinate and contributed 17 out of the 23 comments. This one individual's contribution skewed the results to indicate a trend where there may not actually be one. Further subjects are required to determine the extent of any trend of ASL speakers making comments that are not related to the study task. The difference trend found in the play category is more interesting and its cause is less obvious.
Although the statistical analysis does not show any differences between ASL and oral speakers in this category,
Table 8
Mean and standard deviation for split category counts divided by interpretation type and viewing order

                                           ASL          Format       Technical    Content      Caption  Recommend
Viewing order  Video                       +     -      +     -      +     -      +     -
First   Interpreted (n = 8)   Mean      1.38  3.00    .00   .38    .00  1.50    .13   .75    .38     2.25
                              SD        1.51  3.38    .00  1.06    .00  1.41    .35  1.75    .74     5.20
        Acted (n = 5)         Mean      1.40  1.80    .00  2.40    .00  3.20    .20   .80    .00      .60
                              SD        2.19  1.48    .00  1.67    .00  2.86    .45  1.10    .00      .89
        Total (n = 13)        Mean      1.38  2.54    .00  1.15    .00  2.15    .15   .77    .23     1.62
                              SD        1.71  2.79    .00  1.63    .00  2.15    .38  1.48    .60     4.09
Second  Interpreted (n = 5)   Mean      1.40   .40   2.40  1.40    .00  1.80    .00   .00    .00      .40
                              SD        1.95   .55   2.19  1.34    .00  1.30    .00   .00    .00      .89
        Acted (n = 8)         Mean      2.62   .63   1.50  2.38   1.00  1.13    .13   .38    .00     1.88
                              SD        4.27   .92   1.77  2.07   1.07  1.89    .35   .52    .00     2.17
        Total (n = 13)        Mean      2.15   .54   1.85  2.00    .62  1.38    .08   .23    .00     1.31
                              SD        3.51   .78   1.91  1.83    .96  1.66    .28   .44    .00     1.89
Total   Interpreted (n = 13)  Mean      1.38  2.00    .92   .77    .00  1.62    .08   .46    .23     1.54
                              SD        1.61  2.92   1.75  1.24    .00  1.33    .28  1.39    .60     4.12
        Acted (n = 13)        Mean      2.15  1.08    .92  2.38    .62  1.92    .15   .54    .00     1.38
                              SD        3.56  1.26   1.55  1.85    .96  2.43    .38   .78    .00     1.85
        Total (n = 26)        Mean      1.77  1.54    .92  1.58    .31  1.77    .12   .50    .12     1.46
                              SD        2.73  2.25   1.62  1.75    .74  1.92    .33  1.10    .43     3.13
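The W values reported for these comparisons (Table 9) are Wilcoxon rank statistics. As an illustration only, not the analysis code used in the study, and with invented per-session comment counts, a rank-sum W for two groups can be computed with midranks for ties:

```python
def midranks(values):
    """1-based ranks for each value, averaging ranks across tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j over the run of values tied with the value at position i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_W(group_a, group_b):
    """Wilcoxon rank-sum W: sum of pooled midranks falling in group_a."""
    ranks = midranks(list(group_a) + list(group_b))
    return sum(ranks[: len(group_a)])


# Invented negative-ASL comment counts per session for two viewing orders
first_viewing = [4, 2, 5, 3]
second_viewing = [1, 0, 2, 1]
W = rank_sum_W(first_viewing, second_viewing)
```

The study's own tests were run on repeated measures, so the exact test variant and p-values differ; the sketch only shows where a rank-based W comes from.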

Table 9
Significant Wilcoxon test results for category means grouped by either viewing order or interpretation type

Grouping        Category            Test statistic   p
Viewing order   Negative ASL        W = 130.5        .016
                Positive format     W = 117a         .000
                Positive technical  W = 143          .015
Video           Negative format     W = 131.5        .017
                Positive technical  W = 143          .015

a Meets the more stringent alpha level (p < .01) applied to repeated-measures tests.

there is a noteworthy difference trend. Oral speakers seem to make more comments than ASL speakers in this category (mean of 52.9 comments for oral speakers compared with 27.7 comments for ASL speakers). However, the majority of these comments occurred at the beginning of the session, as the means for the minute-three sampling point, after the game play had normalized, are equal (4.33 comments for each group).

From a usability standpoint, the play category represents a kind of background "noise" category: a demonstration that the participant has some knowledge of how to use the type of interface, or of the details of what is contained in short-term memory. Similar kinds of utterances are found in the Tower of Hanoi protocols described by Ericsson and Simon (1984) and the reading comprehension coding scheme protocol examples developed by Green (1998), where some utterances are descriptions of moves or readings of statements and not goals or processes. Reduced utterances in this category may indicate that the participant is giving effort to expressing thoughts of greater complexity. In this particular study, ASL speakers seemed to have been engaged in more conversation comments, particularly with the translator. This is a common practice with translators during the beginning of a translation session in order to establish context, cultural preferences and rapport (Jones and Pullen, 1992).

Researchers have shown that the same regions of the brain are used for communication in the oral mode as in the gestural mode (McNeill, 1992; Corina, 1998; Green, 1998; MacSweeney, 1998). In addition, the literature also suggests that ASL, like English, is a fully developed language with grammar and syntax. The failure to find differences in number and kind of comments between the two groups is therefore not that surprising, especially given the ability of the signing participants to manipulate the mouse, use the game interface and sign their thoughts at the same time. Since physical formation of the utterance is the primary difference between oral and gestural language, it may have been expected that this physical difference between the two modes of communication would also cause a difference in the number of comments made by the two groups. Indeed, the trend in the play category data may reflect this difference; however, when controlled for session length differences, the total numbers of all category utterances for each group are within 1 point of each other. Essentially, the two groups made the same number of utterances. It is perhaps not surprising, then, that no other significant differences were found between these two groups.

One factor that could have caused these results was that too few subjects were used. However, several steps were taken to increase the power of the analysis. First, only
Table 10
Issues identified in two randomly selected protocols for each interpretation mode

Participant A
  Acted ASL issues identified:
    1. The interpreter is not clear here
    2. That sign is unfamiliar
    3. They are too fast, and both are speaking at the same time
    4. The camera is too far away
    5. This section is better with two interpreters
    6. This part is confusing
  Standard ASL issues identified:
    1. Too dark
    2. Interpreter is too slow
    3. Camera is too far from the interpreter
    4. This section is better with one interpreter

Participant B
  Acted ASL issues identified:
    1. The interpretation and movie are not synchronized
    2. Both are speaking at the same time
    3. The popping in is difficult to follow
    4. This is boring
  Standard ASL issues identified:
    1. Closed captioning would make this easier
    2. This is boring
    3. The popping in and out is awkward
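The variability that Table 10 illustrates, two participants surfacing largely different issues, can be quantified as overlap between issue sets. A minimal sketch with paraphrased, hypothetical issue labels (not the study's coded data):

```python
# Paraphrased issue labels from two hypothetical participant protocols
participant_a = {"interpreter unclear", "camera too far",
                 "section confusing", "too fast"}
participant_b = {"not synchronized", "popping in hard to follow",
                 "boring", "too fast"}

shared = participant_a & participant_b    # issues both participants raised
distinct = participant_a ^ participant_b  # issues only one participant raised
coverage = participant_a | participant_b  # total unique issues found
```

Low overlap with high coverage is exactly the property that makes additional participants productive in a usability evaluation.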

Table 11
Count of coded comments for each category

Category                                                         Acted  Interpreted
Quality of interpreter/actors presence and expression     +ve      19        9
                                                          -ve       4        8
Interpretation quality (speed, novel signs)               +ve       9        9
                                                          -ve      10       18
Ease of using video interface (video control buttons)     +ve       0        0
                                                          -ve       0        0
Viewing preferences (position, border, size, proximity)   +ve       0        0
                                                          -ve       2        3
Technical issues with video (synchronicity, visibility)   +ve       8        0
                                                          -ve      25       21
Content of video                                          +ve       2        1
                                                          -ve       7        6
Acted vs. interpreted ASL content (comparison, busyness)
  Acted                                                   +ve       1       11
                                                          -ve       9        2
  Interpreted                                             +ve      11        1
                                                          -ve      20        5
Closed captioning requested                                         0        3
Recommendations                                                    18       20
Total                                                             145      117
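Counts such as those in Table 11 rest on a categorization step whose consistency the paper reports as good inter-rater reliability. One standard way to check such agreement is Cohen's kappa; this sketch uses invented category labels, not the study's actual coder data:

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa: agreement between two coders beyond chance."""
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    c1, c2 = Counter(coder1), Counter(coder2)
    # chance agreement from each coder's marginal category frequencies
    expected = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels for eight comments independently coded by two raters
coder1 = ["technical", "content", "technical", "ASL",
          "format", "technical", "ASL", "content"]
coder2 = ["technical", "content", "format", "ASL",
          "format", "technical", "ASL", "content"]
kappa = cohens_kappa(coder1, coder2)
```

Values near 1 indicate that the categorization process, and hence counts like those above, can be treated as reliable.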

They should have deaf actors do the whole thing
Boring/long
They should have put window on bottom like close-captioning
I liked the straight caption better than the acted one
Dark background makes it hard to read the interpreter
I lose concentration trying to look up at the interpreter
I would like to change the border colour.
It would be good to be able to set the span of the frames to own setting [not pre-determined setting].
Why does the actor disappear? They should just freeze the frame.
The person is still talking, why aren't they signing?

Fig. 3. Sample participant comments from law video usability sessions.

individuals who knew how to play Solitaire and had played the computer version before were selected, to reduce the possibility that any comments were related to learning how to play the game, as learning was not one of the factors measured in this study. Also, all deaf participants were culturally deaf and not hard of hearing or hearing impaired. Furthermore, all deaf participants used ASL for communication. Next, the alpha level was adjusted so that the probability of having a type II error decreased. These considerations ensure a powerful test given the small sample size. Indeed, the analysis indicates no difference between the groups despite statistically optimizing the chances of finding a difference.

In the Solitaire study there may be a linguistic preference for certain general types of comments (e.g., it is easier for ASL speakers to converse with the translator than produce game-type comments). To examine this, two overarching categories are considered in the protocol analysis: game comments and non-game comments. The game category is composed of those coding categories that specifically refer to an aspect of playing the Solitaire game such as play,
Table 12
Correlation matrix for split categories by order and interpretation

                                  ASL            Format         Technical
                                  +      -       +      -       +      -
Order           r-value          .14    .45*    .58*   .25     .43*   .20
                Sig. (2-tailed)  .48    .02     .00    .22     .03    .32
Interpretation  r-value          .14    .21     .00    .47*    .43*   .08
                Sig. (2-tailed)  .48    .32    1.0     .02     .03    .69

n = 26.
*Correlation is significant at the .05 level (2-tailed).
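The r-values in Table 12 are correlations between an experimental factor (order or interpretation, which can be coded 0/1) and per-session category counts. As a sketch with invented counts (the study's actual r-values appear in the table), Pearson's r can be computed directly:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Viewing order coded 0 = first, 1 = second, against invented
# positive-technical comment counts per session
order = [0, 0, 0, 1, 1, 1]
pos_technical = [0, 0, 0, 1, 2, 0]
r = pearson_r(order, pos_technical)
```

A positive r here would mirror the table's finding that positive technical comments were associated with second viewings.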

strategy, cards and error. The non-game category is composed of the coding categories that relate to the situation, such as solitaire, distracted, TAP, technical, usability and conversation. When controlled for session length differences, on average, oral language participants made 1.11 more game comments and only a fraction fewer non-game comments per session than did gestural language participants. However, these differences are not significant. Thus, it is likely that both groups were able to concentrate on the game equally well while carrying out the think aloud method. Gesture does not seem to impede the ability to play the game or the ability to produce protocols.

4.2. The viability of GTAP for usability evaluation

We found that using GTAP in a usability study context, the law video usability study, produced usability results that are viable and useful. This study was a particularly interesting testing ground for GTAP as an inclusive UEM because it involved the complex management and presentation of multiple video windows (one video without ASL and a second video with the ASL interpretation) as well as a user interface for customization of the presentation. The study involved presentation of a novel ASL interpretation format (acted interpretation) as well as introduction to new viewing preferences choices. The video interpretation and customization formats were specifically designed for individuals who communicate with ASL. The following analysis provides an interpretation of the GTAP usability data collected for this study.

The experimental factors of interest for the law video usability study were viewing order and interpretation type. The coding categories for the law video usability study arose from the questions being asked by developers and researchers. The greatest concern for the researchers was the participants' response to the Acted ASL interpretation as well as to the available viewing preferences (size of video boxes, position of boxes, proximity of boxes and outline of boxes). Developers were further concerned with the viewer interface controls/buttons. Potential confounds for participant responses were identified as production/technical issues such as lighting, and video content that may not interest all participants. The inter-rater reliability for all categories is good, indicating that the categorization process and the resulting data are reliable.

In the law video usability study, there were a total of 262 comments generated by 17 participants (or approximately 15 comments per participant) in all categories and sub-categories. The majority of comments appeared in the negative sub-category of technical issues (25 for the Acted and 21 for the Standard). The fewest comments were made in the positive sub-category of the Content of Video category (2 for Acted and 1 for Standard). There were three sub-categories where no comments were generated for either the Acted or Standard groups (positive Viewing Preferences and negative and positive Ease of Using Video Interface). As demonstrated in Table 10, participants were able to identify a large number of issues with the interpretation modes. For example, from the randomly selected protocols of just two participants, seven different issues in the acted interpretation and nine different issues in the standard interpretation were identified. These two participants identified the same issue only once. The protocols of these two participants exemplify the variability between participant comments and the way that this variability benefits the usability process.

Differences in interpretation type and viewing order factors were found in four categories (these differences approached significance at the .01 level with a Wilcoxon test). Correlation coefficients seen in Table 12 also show these relationships. For viewing order, negative ASL and positive Technical approached significance (p = .016 and .015, respectively), and for video, negative Format and positive Technical approached significance (p = .017 and .015, respectively). The results shown in Table 8 indicate that for the positive technical category, all comments were made during the second viewing but only when the Acted ASL video was played (there were no positive technical comments for the Standard video version in either order, nor were there positive technical comments for the Acted ASL video when it was the first video played). The technical category definition included anything that might be controlled in production, such as synchronizing acted interpretation to the main video, lighting, interpreter position and costumes/clothing. The Acted ASL video violated more user expectations with regards to costumes, interpreter
position and synchronicity. Indeed, the producers of the Acted ASL video deliberately broke the standard presentation style of ASL interpretation video in order to explore new methods of presenting interpretation of existing videos and their possible benefit to viewers. When viewers were in a position to compare the acted interpretation to a standard interpretation, they made both positive and negative comments about the technical aspects of the acted video. During the first viewing, when direct comparison between the acted interpretation and a standard interpretation was not available, no positive technical comments were made about the acted interpretation.

Another possible cause of this result is that there is a difference in the production quality of the two different versions. The Acted ASL version may have been perceived as higher quality than the standard interpretation and therefore drew more favourable comments than it would have without the comparison. Interestingly, at no time were positive technical comments made for the standard interpretation. However, this lack of favourable technical comments is not necessarily indicative of a low-quality production video. Rather, it may be that the standard interpretation met the normal expectations of the participants, as it is the traditional style of presenting ASL interpreted video material. Thus, the standard interpretation video did little to surprise the viewer and little to elicit positive comments regarding production.

It is unlikely that a practice effect resulting from repetition of the video dialogue caused an increase in positive technical comments, since this effect was not shown for both video categories. This discrepancy indicates that comparison between the two videos is a likely contributor to the finding that positive technical comments were made only during Acted ASL videos shown after a Standard ASL video.

Participants produced more negative ASL comments during the first video session than the second. This difference may be attributed to practice effects. Both videos have the same script, so a viewer would be more likely to have a better understanding of the interpretation in the second session, when seeing the script signed for the second time. This "practice" of the script means that the participant would be less likely to notice, or have focus drawn to, aspects of the interpretation that would otherwise be difficult to understand.

The objective for using TAP in a software usability study was to gather rich data that shed light on how an individual is responding to and using the user interface. The concurrent nature of the data helped the researcher to make connections between specific actions or aspects of the interface and thoughts voiced by the participant. In the usability study reported here, participants reported their thoughts with an average frequency that approached 1/min (.84/min). In addition, each verbal protocol yielded at least one coded comment, and the average number of comments per session was 8.4. Participants were only required to watch the video segments and were not instructed to make any specific use of the user interface during the playing of the video. Standards for an optimum number of comments per session do not exist for the TAP method, and establishing a standard is beyond the scope of this study. However, it seemed reasonable that the number of comments produced would vary from task to task, and the objective of our study was not to compare the frequency of comments between tasks but to show that comments would be generated. For the law video sessions, the production of 8.4 coded comments per session was substantial given the more passive role of the participant.

4.3. The relationship between gestural language type and method of data collection

The results of our studies also seem to indicate that the TAP method requires very little adaptation to allow for the collection of gestural protocols. The greatest modification to the conventional TAP protocol is the addition of an interpreter (i.e., there are now two observers/experimenters involved). However, these modifications are relatively minor and do not cause much disruption to the flow of the protocol as long as the interpreter is adequately briefed and prepared.

In the Solitaire game study the interpreter provided real-time translation during the study. This translation was recorded along with the participant's actions as they occurred during the study. In the usability study, a more conventional usability approach was taken to data collection and recording; the participant actions and comments were recorded together, and translation/transcription occurred after the study was completed. We found that having simultaneous translation during the actual study simplified the analysis process because it did not involve a two-step process. However, simultaneous translation has the potential to introduce slight time delays, because the translator must wait until enough has been said by the signer to correctly translate the syntax and grammatical structures. This may cause difficulties in studies where accurate time data for the beginning and ending of thoughts are required.

4.3.1. Preliminary guidelines for using GTAP
There are some simple steps that we used and would recommend to optimize collection of a gestural protocol. First, it is useful to explain the rationale of the TAP method to the interpreter. This step enables the interpreter to better assist the researcher in obtaining the desired results, since interpretation to ASL from English (and vice versa) is not a direct one-to-one mapping.

Second, it is important to discuss the handling of speaking prompts with the interpreter and participant before the data collection begins. The interpreter should understand the importance of using brief, neutral cues such as "thoughts?" and "keep talking" when prompting a participant who has fallen silent. Another way to handle prompting is to tell the participant that they will be
physically tapped when they fall silent for too long. In this way, the researcher may be positioned behind the participant, an ideal location from which to operate the camera, to observe the user and screen without being a visual distraction for the participant, and to prompt the participant with a shoulder tap when necessary.

Third, the researcher should consider the importance of text interactions with the participant and remove them whenever possible. For example, it is customary to have hearing participants read a passage aloud as a warm-up for thinking aloud. For deaf participants, it may make more sense to practice signing thoughts while performing a simple task on or off the computer. Participants may also be asked to sign rote phrases such as counting or the alphabet. Fourth, some deaf individuals have residual hearing; thus, when an interface has an audio component, speakers and sound should be on. Finally, when taping the actual gestures of the participants, have a lab coat available for participants to wear. On video, and even in person, the ease of viewing signs is hampered by patterned tops. A covering such as a lab coat will create a neutral background, making the signs more visible to the interpreter and anyone viewing the video of the gestural protocol. Understanding and attending to these variables, and how they affect the gestural protocol, are keys to using the method optimally.

Gestural TAP is laid on the foundation of research that supports TAP as a valid and reliable UEM. The collection of gestural protocols requires only minimal change to the TAP method: an interpreter and some consideration of hand requirements during the task. Equipment, coding and analysis requirements are virtually unchanged when concurrent interpretation of the utterances is collected. This minor change means that the method may be readily applied by field practitioners, particularly as replications of this research further support the similarity between spoken and gestural protocols.

Gestural TAP enables inclusive use of TAP, the most effective UEM available (Corina and McBurney, 2001). These methods are important to enable developers to meet inclusive technology mandates and to help foster an environment of universal design. There is ample evidence that retrofitting environments, whether they are physical or digital, to accommodate individuals with special needs is more costly than building inclusive environments in the first place. Developers, however, cannot effectively meet the needs of disabled users unless they include these users in their usability evaluations. This GTAP research and research on other inclusive UEMs are urgently needed so that developers and usability engineers are equipped with methods that have been tested and refined.

This study began with the research question: are the outcomes of ASL speakers comparable to English speakers' outcomes? Certainly the failure to find significant differences between the groups on most of the variables lends support to the idea that oral and gestural language verbal protocols are similar, especially given the steps taken to increase statistical power and precision.

Acknowledgements

Funding for this project, Creating Barrier-Free, Broadband Learning Environments Project #34, is provided by the E-Learning program of CANARIE. The authors wish to acknowledge gratefully Earl Woodruff and Richard Volpe of the University of Toronto for their assistance with this project, the Canadian Hearing Society for assistance in recruiting participants, and all of the participants who gave up their valuable time to participate in our studies.

References

Andrews, J.F., Mason, J.M., 1991. Strategy usage among deaf and hearing readers. Exceptional Children 57, 536–545.
Bates, E., Dick, F., 2002. Language, gesture, and the developing brain (Special issue: Converging method approach to the study of developmental science). Developmental Psychobiology 40 (3), 293–310.
Bellugi, U., Klima, E.S., Siple, P., 1974. Remembering in signs. Cognition 3 (2), 93–125.
Campbell, R., Wright, H., 1990. Deafness and immediate memory for pictures: dissociations between "inner speech" and the "inner ear"? Journal of Experimental Child Psychology 50 (2), 259–286.
Corina, D.P., 1998. Sign language aphasia. In: Coppens, P. (Ed.), Aphasia in Atypical Populations. Erlbaum, Hillsdale, NJ, pp. 261–309.
Corina, D.P., McBurney, S.L., 2001. The neural representation of language in users of American Sign Language. Journal of Communication Disorders 34 (6), 455–471.
Ericsson, K.A., Simon, H.A., 1984. Protocol Analysis: Verbal Reports as Data. MIT Press, Cambridge, MA.
Ericsson, K.A., Simon, H.A., 1998. How to study thinking in everyday life: contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, & Activity 5 (3), 178–186.
Green, A., 1998. Verbal Protocol Analysis in Language Testing Research: a Handbook. Cambridge University Press, Cambridge, UK.
Henderson, R., Podd, J., Smith, M., Varela-Alvarez, H., 1995. An examination of four user-based software evaluation methods. Interacting with Computers 7 (4), 412–432.
Jones, L., Pullen, G., 1992. Cultural differences: deaf and hearing researchers working together. Disability & Society 7 (2), 189–196.
Kuusela, H., Paul, P., 2000. A comparison of concurrent and retrospective verbal protocol analysis. American Journal of Psychology 113 (3).
MacSweeney, M., 1998. Cognition and deafness. In: Gregory, S. (Ed.), Issues in Deaf Education. D. Fulton Publishers, London.
McNeill, D., 1992. Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
Messing, L.S., 1999. An introduction to signed languages. In: Messing, L.S., Campbell, R. (Eds.), Gesture, Speech and Sign. Oxford University Press, Oxford, UK.
Newell, A., Simon, H.A., 1972. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ.
Nisbett, R.E., Wilson, T.D., 1977. Telling more than we can know: verbal reports on mental processes. Psychological Review 84 (3), 231–259.
Schirmer, B.R., 2000. Language and Literacy Development in Children Who are Deaf, second ed. Allyn and Bacon, Boston.
Schirmer, B.R., 2003. Using verbal protocols to identify the reading strategies of students who are deaf. Journal of Deaf Studies & Deaf Education 8 (2), 157–170.
Stevens, J., 1996. Applied Multivariate Statistics for the Social Sciences, third ed. Lawrence Erlbaum Associates, Mahwah, NJ.
Stokoe, W.C., 2001. The study and use of sign language. Sign Language Studies 1 (4), 369–406.
Van Someren, M.W., Barnard, Y.F., Sandberg, J., 1994. The Think Aloud Method: a Practical Guide to Modelling Cognitive Processes. Academic Press, London, San Diego.
Wiedenbeck, S., Lampert, R., Scholtz, J., 1989. Using protocol analysis to study the user interface. Bulletin of the American Society for Information Science June/July, 25–26.
Wilson, M., Emmorey, K., 1998. A "word length effect" for sign language: further evidence for the role of language in structuring working memory. Memory & Cognition 26 (3), 584–590.
Wilson, M., Emmorey, K., 2003. The effect of irrelevant visual input on working memory for sign language. Journal of Deaf Studies & Deaf Education 8 (2), 97–103.