A Robotics Perspective
Kadir Firat Uyanik
kadir@ceng.metu.edu.tr
KOVAN Research Lab
19.12.2011
Outline
1. Introduction
2. Words and physical world
3. Words and perceptual categories
4. Words and context dependency
5. Word learning from audio-visual inputs
6. Grounding verbs in action
7. Grounding nouns in perception and action
8. Grounding concepts through social interactions
9. Proposed Framework
10. Conclusions
Introduction
Thus, the meaning of "round" is grounded in the visual features of exemplars, "push" in motor control structures, and "heavy" in haptic features; or they are grounded in combinations/interrelations of all these features.
Most language grounding systems are only capable of labeling similar clusters:
1. Convert continuous sensory input into discrete feature vectors,
2. Cluster similar feature vectors,
3. Label the clusters according to a linguistic convention.
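The three-step pipeline above can be sketched as follows. This is a minimal illustration, not any cited system: the feature extraction (a mean over readings), the fixed 1-D centroids, and the label table are all my assumptions.

```python
# Minimal sketch of: discretize -> cluster -> label.

def extract_feature(sample):
    """Step 1: collapse a continuous sensory sample into a feature
    (here: the mean of a list of sensor readings)."""
    return sum(sample) / len(sample)

def cluster(features, centroids):
    """Step 2: assign each feature to the nearest of some fixed centroids."""
    return [min(range(len(centroids)), key=lambda i: abs(f - centroids[i]))
            for f in features]

def label(assignments, convention):
    """Step 3: map cluster indices to words of a linguistic convention."""
    return [convention[a] for a in assignments]

readings = [[0.1, 0.2], [0.9, 1.1], [0.15, 0.05]]
feats = [extract_feature(r) for r in readings]
words = label(cluster(feats, centroids=[0.1, 1.0]),
              convention={0: "light", 1: "heavy"})
print(words)  # -> ['light', 'heavy', 'light']
```

The category models (the centroids) are fixed up front, which is exactly why such systems miss context-sensitive usage, as discussed next.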
Usually these systems are not context aware, and fixed category models cannot capture context-sensitive details (Mojsilovic's color associator [16]).
Red wine and black wine: "red" and "black" refer to the same object in two different linguistic conventions.
Gardenfors [17] proposed a model in which the relation between context-independent color prototypes and wine colors is shown. This model also explains why "red" and "white" cannot be used interchangeably, but "red" and "black" can be used to refer to the same wine color in different linguistic conventions.
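The prototype-in-a-conceptual-space idea can be sketched as nearest-prototype word choice within a context. The RGB prototype coordinates and the wine color below are illustrative assumptions, not Gardenfors' actual model.

```python
# Color words as prototype points; word choice = nearest prototype
# among the candidates a given linguistic convention allows.

PROTOTYPES = {"red": (200, 30, 30), "black": (10, 10, 10), "white": (245, 245, 245)}

def dist(a, b):
    """Euclidean distance in the (assumed) RGB conceptual space."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(color, candidates):
    return min(candidates, key=lambda w: dist(color, PROTOTYPES[w]))

dark_wine = (110, 15, 25)  # a very dark red
print(nearest(dark_wine, ["red", "black"]))  # -> red
```

The point of the slide falls out of the geometry: the dark wine color is comparably close to both the "red" and "black" prototypes (so either word can be conventionalized for it), while "white" is far away and can never be used for it.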
Regier [18] showed that simple words such as "above" or "near" may correspond to rather implicit features of the environment. He found two main features that model the "above" spatial relationship in close agreement with human judgments. However, models like Gardenfors' and Regier's are insensitive to functional contexts.
Different levels of aboveness: the statement "the circle is above the block" becomes less and less comfortable from left to right.
Roy, D. and Pentland, A. (2002) Learning words from sights and sounds: A computational model. Cogn. Sci. 26, 113-146.
CELL assumes that the object-of-interest is available. Yu, Ballard and Aslin developed a system that processes spoken input paired with visual images of multiple objects, combined with the speaker's eye-gaze direction.
Yu, C., Ballard, D.H. and Aslin, R.N. (2005) The role of embodied intention in early lexical acquisition. Cogn. Sci. 29(6), 961-1005.
In Siskind's perceptually grounded model, the semantics of basic verbs are modeled using temporal schemas that define expected sequences of force-dynamic interactions between objects.
E.g., hand picks up block: table-supports-block → hand-contacts-block → hand-attached-block → hand-supports-block.
Time durations are not specified by the schemas, enabling the model to classify observations across varying timescales. Higher-level actions are defined in terms of these lower-level schemas; thus "move" is defined as the ordered sequence of schemas corresponding to "pick up" followed by "put down".
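Duration-free schema matching can be sketched as subsequence matching over a run-length-collapsed state stream. This is a minimal illustration of the idea, not Siskind's actual implementation; the state names follow the "pick up" example above.

```python
# A schema is an ordered list of force-dynamic states. An observation
# matches if the schema occurs as a subsequence of the observed state
# stream once consecutive repeats (i.e. durations) are collapsed.

PICK_UP = ["table-supports-block", "hand-contacts-block",
           "hand-attached-block", "hand-supports-block"]

def collapse(stream):
    """Drop consecutive repeats so timing/duration is ignored."""
    out = []
    for s in stream:
        if not out or out[-1] != s:
            out.append(s)
    return out

def matches(schema, stream):
    """Schema states must appear in order within the collapsed stream."""
    it = iter(collapse(stream))
    return all(state in it for state in schema)

# The same schema matches however long each state persists:
observed = (["table-supports-block"] * 5 + ["hand-contacts-block"] * 2 +
            ["hand-attached-block"] * 3 + ["hand-supports-block"] * 4)
print(matches(PICK_UP, observed))  # -> True
```

A higher-level "move" would then just be the concatenation of the "pick up" schema with a "put down" schema, matched the same way.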
Bailey et al. addressed this issue by developing a system that learns verb semantics in terms of action control structures, called x-schemas, which control sequences of movements of a simulated manipulator arm. A verb is defined by its associated x-schema and control parameters.
The verbs "pick up" and "put down" are distinguished by the structure of their associated x-schemas; "push" and "shove" are distinguished by different force or velocity control parameters applied to the structurally identical x-schema.
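The verb-as-(structure, parameters) idea can be sketched with a small data structure. The motor primitives and parameter values below are illustrative assumptions, not Bailey et al.'s actual x-schema formalism.

```python
from dataclasses import dataclass

@dataclass
class XSchema:
    """A verb = a motor-program structure plus control parameters."""
    steps: tuple          # ordered motor primitives (the structure)
    force: float = 1.0    # control parameters modulating execution
    velocity: float = 1.0

SLIDE = ("reach", "contact", "move-along-surface", "release")

# 'push' and 'shove' share the structure and differ only in parameters:
push  = XSchema(steps=SLIDE, force=1.0, velocity=0.3)
shove = XSchema(steps=SLIDE, force=3.0, velocity=1.5)

# 'pick up' differs in the structure itself:
pick_up = XSchema(steps=("reach", "contact", "grasp", "lift"))

print(push.steps == shove.steps)    # -> True  (same structure)
print(push.force == shove.force)    # -> False (distinguished by parameters)
```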
Verbs: sensory-motor control programs similar to x-schemas. Adjectives: sensory expectations relative to specific actions. E.g.:
"red": not simply a color category, but a color category linked to the motor program for directing active gaze towards an object. "heavy": haptic expectations associated with lifting actions.
Locations are encoded in terms of body-relative coordinates. Objects: bundles of properties tied to a particular location, along with encodings of motor affordances for affecting the future location of the bundle. E.g., "ball" subsumes both the meaning of "round" (one of its expected properties, along with color, size, etc.) and all of the actions that may affect the ball.
Roy, D. (2005) Semiotic schemas: A framework for grounding language in action and perception. Artificial Intelligence 167(1-2), 170-205.
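The "object as a bundle of expected properties plus motor affordances" idea can be sketched as a small record type. The field names and the contents of the "ball" example are my assumptions, chosen to mirror the text above.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectSchema:
    """An object: expected properties tied to a body-relative location,
    plus the motor affordances that can affect that location."""
    location: tuple                                   # body-relative coordinates
    properties: dict = field(default_factory=dict)    # expected sensory properties
    affordances: list = field(default_factory=list)   # actions affecting the bundle

ball = ObjectSchema(
    location=(0.4, -0.1, 0.0),
    properties={"shape": "round", "color": "red", "size": "small"},
    affordances=["grasp", "roll", "kick"],
)

# 'ball' subsumes 'round' (an expected property) and the actions affecting it:
print(ball.properties["shape"])    # -> round
print("roll" in ball.affordances)  # -> True
```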
Revised Definition: An affordance is an acquired relation between the <(entity, behavior)> tuple of an agent and the <effect> it produces, such that applying the <behavior> on the <entity> generates that <effect> [2].
[Figure: the agent applies a <behavior> to an <entity> in the environment, generating an <effect>; the affordance is the acquired relation (<effect>, <(entity, behavior)>).]
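The formalization in [2] can be sketched as storing (<effect>, <(entity, behavior)>) relations acquired from interaction and querying them for prediction. The entity, behavior, and effect names are illustrative assumptions.

```python
# Affordances as acquired (effect, (entity, behavior)) relations.

affordances = set()

def record_interaction(entity, behavior, effect):
    """After applying `behavior` on `entity` and perceiving `effect`,
    store the acquired relation."""
    affordances.add((effect, (entity, behavior)))

def predict_effects(entity, behavior):
    """Query acquired relations: what effects has this (entity, behavior)
    tuple generated before?"""
    return {e for (e, eb) in affordances if eb == (entity, behavior)}

record_interaction("can", "push", "rolled")
record_interaction("box", "push", "slid")
print(predict_effects("can", "push"))  # -> {'rolled'}
```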
Verb (<do this>): getting <this> action done doesn't actually depend on the way the action is applied; it is more about the effect generated on <that thing>.
Instead of representing verbs with behaviors, represent them with effect clusters.
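Verbs-as-effect-clusters can be sketched as nearest-cluster labeling of an observed effect. The 1-D effect feature (object displacement) and the cluster centers are illustrative assumptions.

```python
# Each verb names a cluster of effects; here an effect is reduced to a
# single feature (how far the object moved), and each verb's cluster is
# represented by its center.

VERB_EFFECT_CENTERS = {"push": 0.2, "shove": 0.8, "hold": 0.0}

def verb_for_effect(displacement):
    """Label the observed effect with the verb of the nearest effect cluster."""
    return min(VERB_EFFECT_CENTERS,
               key=lambda v: abs(displacement - VERB_EFFECT_CENTERS[v]))

# Two very different behaviors (an arm sweep vs. a finger poke) that
# produce the same displacement get the same verb label -- the verb is
# grounded in the effect, not in the way the action was applied:
print(verb_for_effect(0.25))  # -> push
print(verb_for_effect(0.75))  # -> shove
```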
Noun (<that thing>): a robot can learn which features of an object do not change as it applies various actions to that object.
These stable features are good indicators of the object itself, of what it actually is. Therefore, the stable features can be used to refer to the object as <that thing>, while the variable features can be used to predict what is going to happen (the <effect>) if the robot realizes <do this>.
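The stable/variable split can be sketched as a variance test over repeated interactions. The feature names, values, and the variance threshold are illustrative assumptions.

```python
# Split object features into stable ones (candidates for naming the object,
# i.e. the noun) and variable ones (useful for predicting the effect).

def split_features(observations, threshold=1e-3):
    """observations: list of {feature: value} dicts, one per interaction."""
    stable, variable = {}, set()
    for f in observations[0]:
        values = [o[f] for o in observations]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        if var > threshold:
            variable.add(f)
        else:
            stable[f] = mean
    return stable, variable

# Pushing the same can three times: shape stays, position changes.
obs = [{"roundness": 0.9, "position": 0.0},
       {"roundness": 0.9, "position": 0.4},
       {"roundness": 0.9, "position": 0.9}]
stable, variable = split_features(obs)
print(sorted(stable))   # -> ['roundness']  (a handle for <that thing>)
print(sorted(variable)) # -> ['position']   (predicts the <effect>)
```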
Left: a scene from R.U.R. (1921), showing three robots [11]. Right: a scene from Sayonara (2010) [12].
Human-Robot Interaction (HRI) is a field of study dedicated to understanding, designing, and evaluating robotic systems for use by or with humans [13]. HRI is a highly interdisciplinary field that requires collaboration between groups from cognitive science, linguistics, psychology, engineering, mathematics, computer science, etc. Unfortunately, robots are still far from being able to interact with humans in a smooth, natural way (Breazeal [14], Fong [15]).
Proposed Framework
A common framework for interaction: learning affordances by directly acting in the environment, by observing others acting, or even by acting collaboratively. Understanding what is meant to be done, and on what!
Verbs identify the action (in fact, the effect); nouns identify the entity to apply the action upon.
Assumptions:
1. The robot's action repertoire is pre-coded,
2. The object-of-interest is available to the robot.
Proposed Framework
Experimental Setup: Overview
Proposed Framework
Experimental Setup: Tabletop 3D object segmentation & identification
Proposed Framework
Experimental Setup: Tabletop 2D object segmentation & identification
Proposed Framework
Experimental Setup: Tactile Sense
Proposed Framework
Experimental Setup: Experiment
iCub: Please tell me what to do!
Proposed Framework
Experimental Setup: Preliminary Results
Conclusion
Our purpose is
1. To enable the emergence of verbs and nouns from the robot's interactions with the environment,
2. To enable the emergence of the same concepts through observation of others or through collaborative interaction with a human.
In the end, our robot should be able to interact with a human partner in a reasonable way to accomplish a given task, and to learn from demonstration how to get it done.
References
[1] Gibson, J.J. (1977) The Theory of Affordances. In: Shaw, R. and Bransford, J. (eds.) Perceiving, Acting, and Knowing. ISBN 0-470-99014-7.
[2] Sahin, E., Cakmak, M., Dogar, M.R., Ugur, E. and Ucoluk, G. (2007) To Afford or Not to Afford: A New Formalization of Affordances Toward Affordance-Based Robot Control. Adaptive Behavior, 447-472.
[3] Lakoff, G. (1987) Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. Chicago, IL: University of Chicago Press.
[4] Gallese, V. and Lakoff, G. (2005) The brain's concepts: the role of the sensory-motor system in conceptual knowledge. Cogn. Neuropsychol. 22, 455-479.
[5] Zwaan, R.A. and Taylor, L.J. (2006) Seeing, acting, understanding: motor resonance in language comprehension. J. Exp. Psychol. Gen. 135, 1-11.
[6] Hauk, O., Johnsrude, I. and Pulvermüller, F. (2004) Somatotopic representation of action words in human motor and premotor cortex. Neuron 41, 301-307.
[7] Kaschak, M.P., Madden, C.J., Therriault, D.J., Yaxley, R.H., Aveyard, M., et al. (2005) Perception of motion affects language processing. Cognition 94, B79-B89.
[8] Chambers, C.G., Tanenhaus, M.K. and Magnuson, J.S. (2004) Actions and affordances in syntactic ambiguity resolution. J. Exp. Psychol. Learn. Mem. Cogn. 30, 687-696.
[9] Glenberg, A.M. (2010) Embodiment as a unifying perspective for psychology. Wiley Interdisciplinary Reviews: Cognitive Science 1(4).
[10] Harnad, S. (1990) The symbol grounding problem. Physica D 42, 335-346.
[11] http://www.umich.edu/~engb415/literature/pontee/RUR/RURsmry.html
[12] http://www.seinendan.org/en/special/2011/europe/
[13] Goodrich, M.A. and Schultz, A.C. (2007) Human-Robot Interaction: A Survey. Foundations and Trends in Human-Computer Interaction 1(3), 203-275.
[14] Breazeal, C. (2003) Toward sociable robots. Robotics and Autonomous Systems 42(3-4), 167-175.
[15] Fong, T., Nourbakhsh, I. and Dautenhahn, K. (2003) A survey of socially interactive robots. Robotics and Autonomous Systems. Elsevier.
[16] Mojsilovic, A. (2005) A computational model for color naming and describing color composition of images. IEEE Trans. Image Process. 14, 690-699.
[17] Gardenfors, P. (2000) Conceptual Spaces: The Geometry of Thought. MIT Press.
[18] Regier, T. (1996) The Human Semantic Potential. MIT Press.