
Grounding Words in Perception and Action:

A Robotics Perspective
Kadir Firat Uyanik
kadir@ceng.metu.edu.tr

Cogs 534: Cognition, Perception and Action


offered by Dr. Annette Hohenberger

KOVAN Research Lab

19.12.2011

Outline
1. Introduction
2. Words and physical world
3. Words and perceptual categories
4. Words and context dependency
5. Word learning from audio-visual inputs
6. Grounding verbs in action
7. Grounding nouns in perception and action
8. Grounding concepts through social interactions
9. Proposed framework
10. Conclusions


Introduction


Words and physical world


According to results from cognitive science and neuroscience (especially embodied cognition) studies, language develops in parallel with the interactive actions that we generate:

Lakoff [3], and Gallese and Lakoff [4]: metaphors are grounded
Zwaan and Taylor [5]: the role of action in language comprehension
Hauk et al. [6]: the effect of heard verbs on motor activations
Kaschak et al. [7]: the effect of heard motion sentences on sentence comprehension
Chambers et al. [8]: the effect of bodily capabilities on the grammatical analysis of a heard sentence

For a detailed review, see Glenberg [9].

Language should be grounded in something that is not symbolic; the sensory, action, and emotion systems of our bodies provide that grounding (Harnad's symbol merry-go-round argument [10]).

Words and physical world (cont'd)

Thus, the meaning of "round" is grounded in the visual features of exemplars, "push" in motor control structures, and "heavy" in haptic features; or they are grounded in combinations/interrelations of all these features.


Words and perceptual categories

Most language grounding systems are only capable of labeling clusters of similar inputs, typically in three steps (sketched below):
1. Convert continuous sensory input into discrete feature vectors,
2. Cluster similar feature vectors,
3. Label the clusters according to a linguistic convention.
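A minimal sketch of this pipeline (illustrative only: the feature values and label names are made up, and scikit-learn's KMeans stands in for whatever clustering a real system uses):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 1. Pretend these are discrete feature vectors extracted from raw
#    sensory input, e.g. mean HSV color of segmented image regions.
features = rng.random((100, 3))

# 2. Cluster similar feature vectors.
kmeans = KMeans(n_clusters=5, n_init=10).fit(features)

# 3. Attach a word to each cluster by convention
#    (labels are hypothetical and fixed, hence context-insensitive).
labels = {0: "red", 1: "green", 2: "blue", 3: "yellow", 4: "white"}

def name(feature_vector):
    cluster = int(kmeans.predict(feature_vector.reshape(1, -1))[0])
    return labels[cluster]
```

Because the cluster-to-label map is fixed, the same feature vector always receives the same word, which is exactly the context insensitivity criticized next.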

Usually, these systems are not context-aware, and fixed category models cannot capture context-sensitive details (e.g., Mojsilovic's color associator [16]).

"Red" wine and "black" wine: red and black refer to the same object in two different linguistic conventions.


Words and context dependency

Gärdenfors [17] proposed a model showing the relation between context-independent color prototypes and wine colors. The model also explains why "red" and "white" cannot be used interchangeably, while "red" and "black" can refer to the same wine color in different linguistic conventions.

"Red" wine and "black" wine: red and black refer to the same object in two different linguistic conventions.


Words and context dependency (cont'd)

Regier [18] showed that simple words such as "above" or "near" may correspond to rather implicit features of the environment. He found two main features that model the "above" spatial relationship in close agreement with human judgments. However, models like Gärdenfors's and Regier's are insensitive to functional context.

Different levels of "aboveness": considering the statement "the circle is above the block", the concept of above becomes less and less comfortable from left to right.
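As a loose illustration only (Regier's actual model is a trained connectionist system; the geometric features and weights below are assumptions, not his), a toy scorer for "above" might combine two angular features:

```python
import numpy as np

def aboveness(circle, block_center, block_nearest_point):
    """Toy score in [0, 1] for 'the circle is above the block'.

    Hypothetical features: the angular deviation from vertical of
    (a) the vector from the block's center of mass to the circle and
    (b) the vector from the block point nearest the circle.
    Both are 0 when the circle is directly overhead.
    """
    def deviation(origin):
        v = np.asarray(circle, float) - np.asarray(origin, float)
        return abs(np.arctan2(v[0], v[1]))  # 0 rad = straight up

    avg = 0.5 * deviation(block_center) + 0.5 * deviation(block_nearest_point)
    return max(0.0, 1.0 - avg / np.pi)

print(aboveness((0, 2), (0, 0), (0, 1)))      # directly above: 1.0
print(aboveness((2, 0.5), (0, 0), (1, 0.5)))  # off to the side: lower score
```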


Word learning from audio-visual inputs


CELL (Cross-channel Early Lexical Learning)

Roy, D. and Pentland, A. (2002). Learning words from sights and sounds: a computational model. Cognitive Science 26, 113–146.


Word learning from audio-visual inputs (cont'd)

CELL assumes that the object-of-interest is already given. Yu, Ballard, and Aslin developed a system that processes spoken input paired with visual images of multiple objects, combined with the speaker's eye-gaze direction.

Yu, C., Ballard, D.H., and Aslin, R.N. (2005). The role of embodied intention in early lexical acquisition. Cognitive Science 29(6), 961–1005.


Grounding verbs in action


Siskind's schemas

In Siskind's perceptually grounded model, the semantics of basic verbs are modeled using temporal schemas that define expected sequences of force-dynamic interactions between objects.
E.g., hand picks up block: table-supports-block, hand-contacts-block, hand-attached-block, hand-supports-block.

The schemas do not specify time durations, enabling the model to classify observations across varying timescales. Higher-level actions are defined in terms of these lower-level schemas; thus "move" is defined as the ordered sequence of schemas corresponding to "pick up" followed by "put down".
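A minimal reconstruction of this matching idea (my sketch under the description above, not Siskind's code; the put-down sequence is assumed to mirror the pick-up example):

```python
# A verb is an ordered sequence of force-dynamic states.
PICK_UP = ["table-supports-block", "hand-contacts-block",
           "hand-attached-block", "hand-supports-block"]
PUT_DOWN = list(reversed(PICK_UP))  # assumed mirror of pick-up

# Higher-level verbs compose lower-level schemas:
MOVE = PICK_UP + PUT_DOWN

def matches(schema, observed):
    """True if the schema's states occur in order in the observation.

    Durations are ignored: runs of the same state collapse to one
    step, so the same schema matches fast and slow executions alike.
    """
    collapsed = [s for i, s in enumerate(observed)
                 if i == 0 or s != observed[i - 1]]
    it = iter(collapsed)
    return all(state in it for state in schema)  # subsequence check

# A slow pick-up (states repeated over many frames) still matches:
frames = (["table-supports-block"] * 10 + ["hand-contacts-block"] * 3 +
          ["hand-attached-block"] * 5 + ["hand-supports-block"] * 8)
assert matches(PICK_UP, frames)
```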

Grounding verbs in action (cont'd)


Bailey et al.'s x-schemas

How could we distinguish "push" from "shove" using Siskind's schemas?


It is not possible, since we need action parameters, or at least timing information, to differentiate such actions.

Bailey et al. addressed this issue by developing a system that learns verb semantics in terms of action control structures, called x-schemas, which control sequences of movements of a simulated manipulator arm. A verb is defined by its associated x-schema and control parameters.
The verbs "pick up" and "put down" are distinguished by the structure of their associated x-schemas; "push" and "shove" are distinguished by different force or velocity control parameters applied to a structurally identical x-schema.
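An illustrative encoding of this split (not the original x-schema formalism; the schema ids and parameter values are made up):

```python
from dataclasses import dataclass, field

@dataclass
class Verb:
    x_schema: str                                # id of the motor control structure
    params: dict = field(default_factory=dict)   # force/velocity settings

# Different structures distinguish 'pick up' from 'put down':
pick_up  = Verb(x_schema="grasp-lift")
put_down = Verb(x_schema="lower-release")

# The same structure with different parameters distinguishes
# 'push' from 'shove' -- exactly what Siskind's schemas cannot do:
push  = Verb(x_schema="slide", params={"force": 0.2, "velocity": 0.1})
shove = Verb(x_schema="slide", params={"force": 0.9, "velocity": 0.8})
```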


Grounding nouns in perception and action


Roy's framework

Verbs: sensory-motor control programs similar to x-schemas. Adjectives: sensory expectations relative to specific actions. E.g.,
"red": not simply a color category, but a color category linked to the motor program for directing active gaze towards an object. "Heavy": haptic expectations associated with lifting actions.

Locations are encoded in terms of body-relative coordinates. Objects: bundles of properties tied to a particular location, along with encodings of motor affordances for affecting the bundle's future location. E.g., "ball" subsumes both the meaning of "round" (one of its expected properties, along with color, size, etc.) and all of the actions that may affect the ball.
Roy, D. (2005). Semiotic schemas: a framework for grounding language in action and perception. Artificial Intelligence 167(1–2), 170–205.
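A sketch of such an object representation (my reading of the description above, not Roy's actual data structures; all field names and values are made up):

```python
# An object as a bundle of expected properties at a body-relative
# location, plus the motor affordances that can change that location.
ball = {
    "location": (0.4, -0.1, 0.0),      # body-relative coordinates
    "expected_properties": {
        "shape": "round",              # grounds the adjective 'round'
        "color": "red",                # tied to directing active gaze at it
        "weight": "light",             # haptic expectation when lifted
    },
    "affordances": ["lift", "push", "roll"],  # actions affecting the bundle
}
```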


Grounding nouns in perception and action (cont'd)


Affordances framework [1]

Revised definition: an affordance is an acquired relation between the <(entity, behavior)> tuple of an agent and an <effect>, such that applying the <behavior> on the <entity> generates that <effect> [2].

[Diagram: the agent applies a <behavior> to an <entity> in the environment, which generates an <effect>; the acquired relation is (<effect>, <(entity, behavior)>).]

Grounding nouns in perception and action (cont'd)


Affordances framework: Overview
Entity: various perceptual features sensed through distinct sensors.
Effect: changes in the features representing the object-of-interest.
Behavior: the id of a pre-coded action.
Affordance: the nested <(entity, behavior), effect> relation between these three properties.
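A minimal encoding of this relation (a sketch under the definition in [2]; the names, and computing the effect as a feature difference, are my assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Affordance:
    entity: tuple       # perceptual feature vector of the entity
    behavior_id: str    # id of a pre-coded behavior, e.g. "lift"
    effect: tuple       # change in the entity's features after acting

def observe(features_before, behavior_id, features_after):
    """Acquire one affordance instance by acting and recording the change."""
    effect = tuple(a - b for b, a in zip(features_before, features_after))
    return Affordance(tuple(features_before), behavior_id, effect)
```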

Grounding nouns in perception and action (cont'd)


Affordances framework: How to say <do this> to <that thing>?
Let's try "lift the cup"!


Grounding nouns in perception and action (cont'd)


Affordances framework: How to say <do this> to <that thing>?

Verb (<do this>): getting the action done does not actually depend on the way the action is applied; it is more about the effect generated on <that thing>.
Therefore, instead of representing verbs with behaviors, represent them with effect clusters.
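A sketch of this idea (an assumed pipeline; the synthetic effect vectors, the use of scikit-learn's KMeans, and the verb names are all illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Effect vectors recorded from past interactions, e.g. change in the
# object's (x, y, z, visibility) features -- three synthetic groups:
effects = np.vstack([
    rng.normal([0, 0, 0.3, 0], 0.02, (20, 4)),   # object went up
    rng.normal([0.3, 0, 0, 0], 0.02, (20, 4)),   # object slid away
    rng.normal([0, 0, 0, -1], 0.02, (20, 4)),    # object disappeared
])
kmeans = KMeans(n_clusters=3, n_init=10).fit(effects)

# A tutor names each discovered cluster once (hypothetical labels):
verb_of = {int(kmeans.predict([[0, 0, 0.3, 0]])[0]): "lift",
           int(kmeans.predict([[0.3, 0, 0, 0]])[0]): "push",
           int(kmeans.predict([[0, 0, 0, -1]])[0]): "vanish"}

def name_effect(effect):
    """Map an observed effect onto a verb via its nearest cluster."""
    return verb_of[int(kmeans.predict(np.atleast_2d(effect))[0])]
```

The verb label thus depends only on what happened to the object, not on which motor program produced it.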


Grounding nouns in perception and action (cont'd)


Affordances framework: How to say <do this> to <that thing>?

Noun (<that thing>): a robot can learn which features of an object do not change as it applies various actions to that object.
These stable features are good indicators of the object itself, of what it actually is. Therefore, the stable features can be used to refer to the object as <that thing>, while the variable features can be used to predict what is going to happen (the <effect>) if the robot performs <do this>.
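A minimal sketch of this stable/variable split (an assumed method; the variance threshold is arbitrary):

```python
import numpy as np

def split_features(observations, threshold=0.01):
    """Split features by how much they vary across interactions.

    observations: (n_interactions, n_features) array of the same
    object's feature vectors recorded after applying various behaviors.
    """
    variance = observations.var(axis=0)
    stable = np.flatnonzero(variance <= threshold)    # candidates to name the object
    variable = np.flatnonzero(variance > threshold)   # candidates to predict effects
    return stable, variable
```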

Grounding concepts through social interactions


Human-Robot Interaction

Left: a scene from R.U.R. (1921), showing three robots [11]. Right: a scene from Sayonara (2010) [12].

Grounding concepts through social interactions


Human-Robot Interaction

Human-Robot Interaction (HRI) is a field of study dedicated to understanding, designing, and evaluating robotic systems for use by or with humans [13]. HRI is highly interdisciplinary, requiring collaboration among groups from cognitive science, linguistics, psychology, engineering, mathematics, computer science, etc. Unfortunately, robots are still far from being able to interact with humans in a smooth, natural way (Breazeal [14], Fong [15]).


Proposed Framework

A common framework for interaction: learning affordances either by acting directly in the environment, by observing others act, or even by acting collaboratively, and understanding what is meant to be done, and on what!
Verbs identify the action (in fact, the effect); nouns identify the entity to apply the action upon.

Assumptions:
The robot's action repertoire is pre-coded; the object-of-interest is available to the robot.


Proposed Framework
Experimental Setup: Overview


Proposed Framework
Experimental Setup: Tabletop 3D object segmentation & identification


Proposed Framework
Experimental Setup: Tabletop 2D object segmentation & identification


Proposed Framework
Experimental Setup: Tactile Sense


Proposed Framework
Experimental Setup: Experiment
iCub: "Please tell me what to do!"
Human: "iCub, reach object one."
iCub: "I'm perceiving..."
iCub: "I guess object one is going to be reached."
iCub: "Please give me object one."
iCub: "Please tell me what happened."
Human: "iCub, object one is reached."
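This exchange suggests a simple command loop. The sketch below is hypothetical structure only; every method on `robot` is an assumed interface, not the actual iCub software:

```python
def interaction_step(robot):
    command = robot.ask("Please tell me what to do!")    # e.g. "reach object one"
    verb, noun = robot.parse(command)                    # effect-cluster verb, entity noun
    entity = robot.perceive(noun)                        # locate the named entity
    robot.say(f"I guess {noun} is going to be {verb}ed") # predicted effect
    robot.execute(verb, entity)
    outcome = robot.ask("Please tell me what happened")  # tutor labels the effect
    robot.update_affordances(entity, verb, outcome)      # acquire the new relation
```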


Proposed Framework
Experimental Setup: Preliminary Results


Conclusion

Our purpose is:
To enable the emergence of verbs and nouns from the robot's interactions with the environment,
To enable the emergence of the same concepts through observing others or interacting with a human collaboratively.

In the end, our robot is supposed to be able to interact with a human partner in a reasonable way to accomplish a given task, and to learn from demonstration how to get it done.


References
[1] Gibson, J.J. (1977). The theory of affordances. In R. Shaw and J. Bransford (Eds.), Perceiving, Acting, and Knowing. ISBN 0-470-99014-7.
[2] Sahin, E., Cakmak, M., Dogar, M.R., Ugur, E., and Ucoluk, G. (2007). To afford or not to afford: a new formalization of affordances toward affordance-based robot control. Adaptive Behavior, 447–472.
[3] Lakoff, G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. Chicago, IL: University of Chicago Press.
[4] Gallese, V. and Lakoff, G. (2005). The brain's concepts: the role of the sensory-motor system in conceptual knowledge. Cognitive Neuropsychology 22, 455–479.
[5] Zwaan, R.A. and Taylor, L.J. (2006). Seeing, acting, understanding: motor resonance in language comprehension. Journal of Experimental Psychology: General 135, 1–11.
[6] Hauk, O., Johnsrude, I., and Pulvermüller, F. (2004). Somatotopic representation of action words in human motor and premotor cortex. Neuron 41, 301–307.
[7] Kaschak, M.P., Madden, C.J., Therriault, D.J., Yaxley, R.H., Aveyard, M., et al. (2005). Perception of motion affects language processing. Cognition 94, B79–B89.
[8] Chambers, C.G., Tanenhaus, M.K., and Magnuson, J.S. (2004). Actions and affordances in syntactic ambiguity resolution. Journal of Experimental Psychology: Learning, Memory, and Cognition 30, 687–696.
[9] Glenberg, A.M. (2010). Embodiment as a unifying perspective for psychology. Wiley Interdisciplinary Reviews: Cognitive Science 1(4).
[10] Harnad, S. (1990). The symbol grounding problem. Physica D 42, 335–346.
[11] http://www.umich.edu/~engb415/literature/pontee/RUR/RURsmry.html
[12] http://www.seinendan.org/en/special/2011/europe/
[13] Goodrich, M.A. and Schultz, A.C. (2007). Human-robot interaction: a survey. Foundations and Trends in Human-Computer Interaction 1(3), 203–275.
[14] Breazeal, C. (2003). Toward sociable robots. Robotics and Autonomous Systems 42(3-4), 167–175.
[15] Fong, T., Nourbakhsh, I., and Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and Autonomous Systems.
[16] Mojsilovic, A. (2005). A computational model for color naming and describing color composition of images. IEEE Transactions on Image Processing 14, 690–699.
[17] Gärdenfors, P. (2000). Conceptual Spaces: The Geometry of Thought. MIT Press.
[18] Regier, T. (1996). The Human Semantic Potential. MIT Press.

Thank you for listening

