
CogInfoCom 2012 - 3rd IEEE International Conference on Cognitive Infocommunications - December 2-5, 2012, Kosice, Slovakia

Investigating the Use of Non-verbal Cues in Human-Robot Interaction with a Nao Robot
JingGuang Han*, Nick Campbell*, Kristiina Jokinen** and Graham Wilcock**
* Trinity College Dublin / Speech Communications Laboratory, Dublin, Ireland
** University of Helsinki, Helsinki, Finland
* hanf, nick@tcd.ie
** Kristiina.Jokinen, Graham.Wilcock@helsinki.fi

Abstract: This paper discusses a new method to investigate the use of non-verbal cues in human-robot interaction with the Nao platform, which is built from a number of sensors, controllers and programming interfaces. Using this platform, a set of pilot experiments was carried out with 12 users. A multimodal corpus was recorded using several cameras and microphones placed on and around the robot. People were asked to interact freely with the Nao robot, after some instructions on how to use the commands. A set of specific questions was asked for feedback and evaluation. Preliminary results show that non-verbal cues aid human-robot interaction; furthermore, we found that people were more likely to interact with a robot that is capable of utilizing non-verbal channels for understanding and communication.

Index Terms: human-robot interaction, non-verbal communication, image processing

I. INTRODUCTION

A. Background

With recent developments in hardware and software technologies, robotic devices are becoming more ubiquitous in the daily life of a range of people, from assisted-living devices such as robotic wheelchairs [1] (see Figure 1) to self-driving cars and smart personal electronic assistants. Information communication plays a major part in these applications. However, they are mostly task oriented: given non-interactive and restricted commands, robotic devices can perform a programmed action accordingly. While this is suitable for some applications, ideally the communication should be more natural and cognitive, with users giving feedback and suggestions on a range of context-relevant topics. A number of different research strands have examined multi-modal approaches to perceiving and delivering information in human-computer and human-robot cognitive and interactive communication, such as speech recognition and synthesis, object detection and tracking, linguistics and phonetics [2]. Recent research has suggested that over 60% of all the information delivered during communication passes through non-verbal channels [3]. Cognitive non-verbal channels are therefore becoming an increasingly important aspect of communication studies, especially in human-robot interaction. Humans can easily express and recognize emotional information, which is primarily conveyed through non-verbal channels such as facial expressions, hand gestures and body movements. Additionally, conversational timing, such as starting or stopping a turn or changing topics, is something that humans handle very well but that machines find difficult. It remains a challenge for a cognitive robot to understand and recognize these natural, cognitive signals in its communication and interaction with a human. This paper discusses a novel method of using non-verbal cognitive information in human-robot communication with the Nao robot platform [4], see Figure 2.

Figure 1. The SENA living-assist robotic wheelchair

Figure 2. The Nao robot platform and the attached sensors and controllers

Figure 3. The Herme platform in the Science Gallery

Figure 4. The Nao robot in our experiments interacting with people

B. The Nao Robot Platform


In the past decade, a lot of work has gone into making human-robot interaction more natural and the robot more socially and contextually aware, for example iBug [5], SEMAINE [6], Greta [7] and Max [8]. Campbell and Han investigated how to use non-verbal channels to perceive and deliver cognitive information with the Herme platform [9]. During its three-month exhibition in the Science Gallery of Trinity College Dublin, the researchers found strong evidence of the usefulness of non-verbal cognitive information in human-robot interaction (see Figure 3). Research has also indicated that vision-based channels are becoming an important component of these interactions [9][10][11].
As part of the eNTERFACE 2012 workshop in Metz, France [12], we carried out a set of experiments to examine the use of non-verbal cues in order to enhance the user experience in human-robot interaction with the Nao robot platform. We focused particularly on the channels that allow the robot to sense and express social and contextual elements of interaction. As shown in Figure 2, the Nao robot supports multiple sensors and controllers: two cameras attached to the head and jaw, two sonar sensors on the chest, a number of movement motors on the neck, hands and feet, three colour LEDs in red, green and blue on the eyes, and tactile sensors on the head and feet. As part of the development process, we used only the featured modules and programming interfaces provided by the Nao platform. The WikiTalk system [12], which supports open-domain conversations using Wikipedia as a knowledge source, was used as a basis for the interaction [13]. This greatly enhanced Nao's interaction capabilities and allowed it to maintain long conversations with the participants [14][15]. During the workshop, we examined a number of different non-verbal communication channels and methods, which will be discussed in detail in the following sections. At the end of the workshop we also evaluated the Nao WikiTalk system with users from the workshop, and thus collected approximately 130 GB of multimodal recordings and 12 participant questionnaires. The questionnaire asked about each user's overall feelings and feedback about the interaction, and according to the results of the experiments, we found that interlocutors are more willing to interact with a robot that is capable of understanding and delivering messages through non-verbal communication channels. We also found that not all the non-verbal modules of the Nao robot worked well or could be useful in the interactions. Figure 4 shows a screenshot taken from one of the video recordings of the Nao robot interacting with people.
In this paper we discuss the different non-verbal communication channels and methods used with the Nao WikiTalk. Section II describes the different experiments with a variety of communication channels in detail: Face Detection and People Tracking (Section A), Head Nodding and Shaking Detection (Section B), Conversational Triggers and Sonar Sensors (Section C), and Methods of Interrupting the Conversation with Tactile Sensors and Object Recognition (Section D). We also describe a small experiment to explore and measure the best distance range in human-robot interaction (Section E). Conclusions are drawn in Section III.
II. EXPERIMENTS AND RESULTS

A. Face Detection and People Tracking


The human face conveys a lot of information in human-to-human interaction. People tend to look at and follow each other's faces and try to understand feelings and emotions during a conversation. Face detection can be run as a real-time process, with an interval of less than a second between adjacent frames [16]. It is an efficient way to locate the interlocutor in active human-robot interaction: it can be used as a trigger for a conversation and to locate the target the robot should face. It is also normally the first step in a series of further sensing and detection procedures. Once the face position is found, the body detection area, better known as the region of interest (ROI), is much smaller than the general scene, which significantly reduces computation time. The initial step can then be followed by further processing, such as gesture recognition or facial expression detection, more quickly and efficiently [16].
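To make the computational saving concrete, the short sketch below (plain Python with NumPy, offered only as an illustration and not as code from the Nao platform) crops a region of interest around a detected face so that later processing stages operate only on the sub-image; the margin factor and bounding-box format are arbitrary assumptions.

    # Crop a region of interest (ROI) around a detected face; illustrative only.
    import numpy as np

    def face_roi(frame, face_box, margin=0.5):
        """frame: HxWx3 image array; face_box: (x, y, w, h) in pixels."""
        x, y, w, h = face_box
        pad_w, pad_h = int(w * margin), int(h * margin)
        top, left = max(0, y - pad_h), max(0, x - pad_w)
        bottom = min(frame.shape[0], y + h + pad_h)
        right = min(frame.shape[1], x + w + pad_w)
        return frame[top:bottom, left:right]   # later stages see only this sub-image

    frame = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy 640x480 frame
    roi = face_roi(frame, (300, 200, 80, 80))
    print(roi.shape)   # much smaller than the full frame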
The Nao platform has a pre-built module that performs face detection using the Viola-Jones algorithm [17]. This face detection algorithm is based on Haar-like features and can efficiently and rapidly recognize objects in the video stream of either the robot's head or jaw camera. Once it finds a face, the module writes a set of numbers representing the face location in 3D space to the robot's internal memory.


Figure 5. Conversational triggers flow chart (states: waiting for conversation, distance detection using sonar sensors, sound direction detection, face detection and tracking, start conversation, interaction, end conversation)

Figure 6. Object recognition training phase of the Nao robot

The face detection module then passes the recommended horizontal and vertical angles to the neck motor in order to make the head turn to "look at" the detected face. By checking the face position and adjusting its head every half second, the robot can track the interlocutor during a conversation.
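As a rough illustration of how such a tracking loop can be scripted, the following Python sketch assumes the NAOqi Python SDK, a placeholder robot address, and the documented layout of the "FaceDetected" memory key; the gain of 0.5 and the sign convention for the pitch offset are assumptions rather than the exact code used in our experiments.

    # Minimal face-tracking loop (assumptions: NAOqi Python SDK, placeholder IP,
    # "FaceDetected" layout [timestamp, [[[0, alpha, beta, w, h], ...], ...]]).
    import time
    from naoqi import ALProxy

    NAO_IP, NAO_PORT = "192.168.1.10", 9559      # placeholder address

    memory = ALProxy("ALMemory", NAO_IP, NAO_PORT)
    motion = ALProxy("ALMotion", NAO_IP, NAO_PORT)
    faces = ALProxy("ALFaceDetection", NAO_IP, NAO_PORT)

    faces.subscribe("FaceTracker", 500, 0.0)     # write detection results every 500 ms
    motion.setStiffnesses("Head", 1.0)

    try:
        while True:
            data = memory.getData("FaceDetected")
            if data and len(data) >= 2 and data[1]:
                alpha, beta = data[1][0][0][1], data[1][0][0][2]   # angular offsets of the face
                yaw = motion.getAngles("HeadYaw", True)[0]
                pitch = motion.getAngles("HeadPitch", True)[0]
                # move the head a fraction of the way towards the face
                # (the sign of beta may need flipping depending on the camera convention)
                motion.setAngles(["HeadYaw", "HeadPitch"],
                                 [yaw + 0.5 * alpha, pitch - 0.5 * beta], 0.15)
            time.sleep(0.5)                      # check twice per second
    finally:
        faces.unsubscribe("FaceTracker")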
This feature has one disadvantage. When face detection is combined with other modules that send commands to the same motor before it has completed its previous task, such as a request to nod the robot's head, the head movement appears jerky due to the conflicting signals. After some exploration, we found a way to overcome this problem by deploying the conflicting modules in separate threads using multi-threaded programming in Python [18]. A module that might send conflicting commands is unsubscribed, leaving only one module running at a time, and re-subscribed once the motor has completed its current task.
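A minimal sketch of this workaround, again assuming the NAOqi Python SDK and a placeholder robot address, is shown below; the subscriber and gesture names are illustrative rather than the exact identifiers used in our system.

    # Avoid conflicting head commands by pausing face tracking during a nod.
    import threading
    from naoqi import ALProxy

    NAO_IP, NAO_PORT = "192.168.1.10", 9559      # placeholder address
    motion = ALProxy("ALMotion", NAO_IP, NAO_PORT)
    faces = ALProxy("ALFaceDetection", NAO_IP, NAO_PORT)
    head_lock = threading.Lock()                 # only one module may drive the head

    def nod_head():
        """Perform a nod while face tracking is unsubscribed."""
        with head_lock:
            faces.unsubscribe("FaceTracker")     # stop the tracker sending head commands
            # blocking head movement: pitch down, then back up, over one second
            motion.angleInterpolation("HeadPitch", [0.3, 0.0], [0.5, 1.0], True)
            faces.subscribe("FaceTracker", 500, 0.0)   # resume tracking afterwards

    # run the nod in its own thread so the main dialogue loop is not blocked
    threading.Thread(target=nod_head).start()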

B. Head Nodding and Shaking Detection

People naturally move their heads when they speak face to face. Research suggests that head movement not only tells us whether a person agrees or disagrees with the subject, by nodding or shaking the head, but that the movement pattern also has linguistic and phonetic meaning which can help us better understand the communication [2][19]. Considering this, head movement tracking is very useful in human-robot interaction as well. It can be used to roughly measure the engagement and agreement of an interlocutor [20], and also as a mechanism to send commands to the robot during the conversation, for example to interrupt a long monologue. As mentioned earlier, the face detection module writes the face position parameters to the robot's internal memory. By checking and comparing the coordinates in adjacent frames, we can measure the movement of the head vertically (nodding) and horizontally (shaking).
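The comparison itself needs nothing robot-specific; the toy Python classifier below, with illustrative thresholds that are not tuned values from our experiments, labels a short buffer of face-centre coordinates from adjacent frames as a nod, a shake, or neither.

    # Toy nod/shake classifier over buffered face-centre positions from adjacent frames.
    def classify_head_movement(positions, threshold=0.05):
        """positions: list of (x, y) face-centre coordinates (e.g. camera angles)."""
        if len(positions) < 2:
            return "none"
        xs = [p[0] for p in positions]
        ys = [p[1] for p in positions]
        dx = max(xs) - min(xs)          # horizontal span of the movement
        dy = max(ys) - min(ys)          # vertical span of the movement
        if dy > threshold and dy > dx:
            return "nod"                # mostly vertical motion
        if dx > threshold and dx > dy:
            return "shake"              # mostly horizontal motion
        return "none"

    # mostly vertical oscillation of the face centre is labelled as a nod
    print(classify_head_movement([(0.01, 0.00), (0.01, 0.08), (0.02, 0.01), (0.01, 0.09)]))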

C. Conversational Triggers and Sonar Sensors

The Nao robot has two chest sonar sensors which can be used to detect the distance to the closest object in front of it. Once an object is found, the sonar module writes two distance parameters to the robot's internal memory, representing the measurements from its left and right sonar sensors. By comparing the changes across adjacent frames, we can tell when an object or a person is present.

The frame rate was set to half a second, which means that the program checks the distance parameters every half second. In our experimental setup the distance from a static wall was about 2.4 meters. If there was no object between the robot and the wall, both sonar sensors read around 2.4 meters. Once a reading fell below a threshold value of 2 meters, we knew there was an object in the current frame. When an object is detected and stays for 5 consecutive frames, i.e. the checked parameters are below the threshold value for more than 5 frames, we know that potentially there is a person trying to talk to the robot. We then used the face detector to find and locate the person and started the conversation. The distance data were recorded and written to a file, from which we can measure the distance between the human and the Nao robot and see how it varies during the interaction. This will be discussed further in Section E.

Speech direction detection is another approach that can be used as the trigger of a conversation. When a sound whose intensity is over the background noise level is detected from the direction in front of the robot, it suggests that someone may want to start a conversation with the robot. The Nao robot has built-in programming interfaces which use the microphone array to measure the intensity of speech and to estimate the direction of the sound. Similarly to the face tracking module, the sound direction module continuously writes the direction data to the robot's internal memory, which can be accessed through the programming interfaces in the same manner. A sound intensity threshold was set to distinguish speech from the background noise. Figure 5 illustrates the process flow of the conversational triggers.
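The sketch below shows one way to script the sonar trigger, assuming the NAOqi Python SDK, a placeholder robot address and the standard ALMemory keys for the ultrasound sensors; the 2-meter threshold, the half-second frame rate and the five-frame requirement follow the description above.

    # Sonar-based conversational trigger (assumed NAOqi SDK and ultrasound memory keys).
    import time
    from naoqi import ALProxy

    NAO_IP, NAO_PORT = "192.168.1.10", 9559          # placeholder address
    THRESHOLD_M, NEEDED_FRAMES, PERIOD_S = 2.0, 5, 0.5

    memory = ALProxy("ALMemory", NAO_IP, NAO_PORT)
    sonar = ALProxy("ALSonar", NAO_IP, NAO_PORT)
    sonar.subscribe("ConversationTrigger")           # start the ultrasound sensors

    def wait_for_person():
        """Return once an object stays closer than the threshold for 5 consecutive frames."""
        consecutive = 0
        while consecutive < NEEDED_FRAMES:
            left = memory.getData("Device/SubDeviceList/US/Left/Sensor/Value")
            right = memory.getData("Device/SubDeviceList/US/Right/Sensor/Value")
            consecutive = consecutive + 1 if min(left, right) < THRESHOLD_M else 0
            time.sleep(PERIOD_S)                     # one "frame" every half second

    wait_for_person()
    print("Potential interlocutor detected - start face detection and the conversation")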

D. Methods of Interrupting the Conversation with Tactile Sensors and Object Recognition

Given that WikiTalk returns a long paragraph of content as a response to the user's search question, it is important that the user can interrupt the robot's long monologue. Thus, for the robot, detecting the correct timing of interruptions is one of the most crucial factors in the interaction.

The Nao robot platform provides three major sets of detection sensors which can be used for interruption detection: video cameras, tactile sensors and sonar sensors. We explored and experimented with three different interruption methods accordingly.

The Nao Python API (application programming interface) has a static object learning and recognition module which can be used to detect a pre-learnt object with the robot's cameras. The first method we tried was to use a palm gesture in front of the Nao robot as a sign of interruption. Once the Nao robot detects the gesture, it asks whether the user would like to stop the presentation from Wikipedia.

In order to make the robot learn the palm gesture, we took a static image of the gesture from a frame of its head camera video stream and marked the palm contour manually with continuous dots on its corners. Figure 6 illustrates this method. After the outlined object image was sent to the detection database, the Nao robot was able to detect similar objects in its video stream. However, after some experiments we found that the detection accuracy depends heavily on the lighting conditions and background colour. When we tested the detection in a different environment with darker lighting or a lighter background colour, the accuracy dropped sharply. We concluded that this method could not be generalized and was not robust enough for interruption detection.
The second method we examined was to use the sonar sensors to assist the gesture recognition. When the palm gesture is within a range of 0.5 meters, the sonar sensors trigger the interruption of the conversation. However, this conflicted with the hand gestures of the Nao robot itself: when it performed hand gestures such as welcoming the interlocutor, the detector could trigger an interruption by mistakenly interpreting Nao's own hand as the user's hand. Similarly to the method we used for resolving the conflict between face tracking and the head motors, we used multi-threaded programming to unsubscribe the sonar detector while a hand gesture was being performed.
The Nao robot also has a few very sensitive tactile points around its body. These sensors give us a third way to interrupt the monologue. When the interlocutor wants to switch to a new topic, he or she can simply touch the robot on its head. In fact, this mechanism was used in the final Nao evaluations to allow the user to interrupt Nao's speaking.
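A minimal sketch of the tactile interruption check is given below, assuming the NAOqi Python SDK, a placeholder robot address and the standard head tactile memory keys; simple polling is used here for clarity, although subscribing to the corresponding ALMemory events would work equally well.

    # Interrupt a long monologue when a head tactile sensor is touched (assumed keys).
    import time
    from naoqi import ALProxy

    NAO_IP, NAO_PORT = "192.168.1.10", 9559      # placeholder address
    memory = ALProxy("ALMemory", NAO_IP, NAO_PORT)
    tts = ALProxy("ALTextToSpeech", NAO_IP, NAO_PORT)

    HEAD_KEYS = ["FrontTactilTouched", "MiddleTactilTouched", "RearTactilTouched"]

    def head_touched():
        """True as soon as any head tactile sensor reports a touch."""
        return any(memory.getData(key) == 1.0 for key in HEAD_KEYS)

    # poll during a long Wikipedia monologue; on a touch, stop and offer a new topic
    while not head_touched():
        time.sleep(0.1)
    tts.stopAll()                                # abort the current utterance
    tts.say("OK, let's switch to a new topic.")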

E. Exploring and Measuring the Best Distance Range in Human-Robot Interaction

Space and social distance are important factors in human communication. For instance, they can affect personal and business relations [21], and they are also important in multiparty conversations, where the participants tend to form spatial patterns, or F-formations [22]. Knowing the best communication distance and distance range in human-robot interaction can help in many ways, e.g. in deciding how much room should be left for the interlocutor and what parameters to use for the robot's initial position. We tested this empirically by asking people to find the most convenient position to talk to the robot. The participants could move freely during the interaction with the robot, and the distance changes were recorded using the sonar sensors. From these data we know that, in our setup, the most convenient distance for human-Nao interaction ranged between 0.77 m and 1.12 m, with most distance readings around 0.90 m.
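Summarizing the logged distances requires nothing more than a few lines of Python; the sketch below assumes, purely for illustration, a log file with one sonar reading in meters per line.

    # Summarize logged sonar distances (assumed format: one distance in meters per line).
    def summarize_distances(path):
        with open(path) as log:
            values = sorted(float(line) for line in log if line.strip())
        return min(values), values[len(values) // 2], max(values)

    low, typical, high = summarize_distances("sonar_distances.log")
    print("range: %.2f m to %.2f m, typical: %.2f m" % (low, high, typical))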

III. DISCUSSION AND CONCLUSION

From the user evaluation studies we conducted in the workshop, we know that people liked interacting with the Nao WikiTalk robot, and that they especially found its non-verbal gesturing engaging [23][15]. Since people use a wide range of non-verbal communication channels in their natural communication, it is understandable that they also prefer to interact with a robot that utilizes these same channels. It is thus important to explore what kinds of cues humans use in their communication, and how technology can support human-robot interaction in fully deploying these possibilities.

In this paper we have studied the Nao platform and discussed in detail the different methods and technologies that it provides for non-verbal human-robot interaction. The interaction system was built on the basis of WikiTalk, which allows the user to interact with the Nao robot and get information from Wikipedia. In this context, we studied Nao's built-in technologies for detecting faces and tracking people, as well as for detecting the user's head movements such as nods and shakes. We also explored the use of sonar sensors and speech direction detection as conversational triggers that enable the robot to infer whether there are users close by who may want to start talking to it. Finally, we investigated different methods for interrupting the conversation, using tactile sensors and an object recognition method. We found that not all the non-verbal modules of the Nao robot can be used in the interaction setup; in particular, the object recognition module depends heavily on the environment and cannot be generalized to various real-world situations in a robust way. Concerning the best distance range in human-robot interaction, we tested this empirically and found that in our setup the best communication distance for human-Nao interaction is about 0.9 meters.

Future work will focus mainly on how to combine the data from multiple modules in order to make higher-level inferences about human-robot interaction. The work will also involve exploring and improving non-verbal technologies that enable more natural and robust interactions between humans and robots.

ACKNOWLEDGMENT

The work was carried out in Metz, France, during the eNTERFACE 2012 workshop. We would like to thank the organizing committee and Supélec for providing us with the opportunity to work together. We also thank the other researchers in the project, Adam Csapo, Jonathan Grizou, Emer Gilmartin, Raveesh Meena and Dimitra Anastasiou, who worked on other aspects of the system such as dialogue management and the use of gestures with speech. The travel costs of the first author were funded by the FASTNET project of Science Foundation Ireland, which we also gratefully acknowledge.
REFERENCES
[1] MAPIR Group. http://mapir.isa.uma.es/mapir/.
[2] K. G. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. Vatikiotis-Bateson, "Visual prosody and speech intelligibility: head movement improves auditory speech perception," Psychological Science, vol. 15, no. 2, pp. 133-137, 2004.
[3] I. N. Engleberg, Communication Principles and Strategies. My Communication Kit Series, p. 133, 2006.
[4] The Nao Robot Platform. http://www.aldebaran-robotics.com/en/.
[5] ibug. http://ibug.doc.ic.ac.uk/.
[6] SEMAINE Project. http://www.semaine-project.eu/.
[7] Greta, Embodied Conversational Agent. http://perso.telecom-paristech.fr/.
[8] Max. http://cycling74.com/products/max/.
[9] J. G. Han, J. Dalton, B. Vaughan, C. Oertel, C. Dougherty, C. De Looze, and N. Campbell, "Collecting multi-modal data of human-robot interaction," in Cognitive Infocommunications (CogInfoCom), 2011 2nd International Conference on, 2011, pp. 1-4.
[10] W. H. Allen, "Audio-visual communication research," The Journal of Educational Research, vol. 49, no. 5, pp. 321-330, 1956.
[11] A. Jaimes and N. Sebe, "Multimodal human-computer interaction: A survey," Computer Vision and Image Understanding, vol. 108, no. 1, pp. 116-134, 2007.
[12] K. Jokinen and G. Wilcock, "Constructive Interaction for Talking about Interesting Topics," in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, 2012.
[13] G. Wilcock and K. Jokinen, "Adding speech to a robotics simulator," in Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, 2011, pp. 375-380.
[14] K. Jokinen and G. Wilcock, "Emergent verbal behaviour in human-robot interaction," in Cognitive Infocommunications (CogInfoCom), 2011 2nd International Conference on, 2011, pp. 1-4.
[15] A. Csapo, E. Gilmartin, J. Grizou, J. Han, R. Meena, D. Anastasiou, K. Jokinen, and G. Wilcock, "Multimodal Conversational Interaction with a Humanoid Robot," in Cognitive Infocommunications (CogInfoCom), 2012 3rd International Conference on, 2012, pp. 1-6.
[16] L. Brethes, P. Menezes, F. Lerasle, and J. Hayet, "Face tracking and hand gesture recognition for human-robot interaction," in Robotics and Automation, 2004. Proceedings. ICRA '04. 2004 IEEE International Conference on, 2004, vol. 2, pp. 1901-1906.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[18] Python Programming Language. http://python.org/.
[19] M. Boholm and J. Allwood, "Repeated head movements, their function and relation to speech," in Proceedings of the Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, LREC, 2010.
[20] K. Jokinen and G. Wilcock, "Multimodal signals and holistic interaction structuring," in Proceedings of the 24th International Conference on Computational Linguistics, 2012.
[21] E. T. Hall, The Hidden Dimension, vol. 6. Doubleday, 1966.
[22] A. Kendon, "Spacing and orientation in co-present interaction," in Development of Multimodal Interfaces: Active Listening and Synchrony, pp. 1-15, 2010.
[23] R. Meena, K. Jokinen, and G. Wilcock, "Integration of gestures and speech in human-robot interaction," in Cognitive Infocommunications (CogInfoCom), 2012 3rd International Conference on, 2012, pp. 1-6.
