ABSTRACT

Title of dissertation: MANIPULATION ACTION UNDERSTANDING FOR OBSERVATION AND EXECUTION
Yezhou Yang, Doctor of Philosophy, 2015
Dissertation directed by: Professor Yiannis Aloimonos, Department of Computer Science

Modern intelligent agents will need to learn the actions that humans perform. They will need to recognize these actions when they see them, and they will need to perform these actions themselves. We propose a cognitive system that interprets human manipulation actions from perceptual information (image and depth data) and consists of perceptual modules and reasoning modules that interact with each other. The contributions of this work are given along two core problems at the heart of action understanding: (a) the grounding of relevant information about actions in perception (the perception-action integration problem), and (b) the organization of perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions are represented with the Manipulation Action Context-Free Grammar (MACFG), a syntactic grammar with associated parsing algorithms, which organizes actions as a sequence of sub-events. Each sub-event is described by the hand (as well as the grasp type), the movements (actions), and the objects and tools involved, and the relevant information about these quantities is obtained from biologically inspired perception modules. These modules track the hands and objects, and recognize the hand grasp, the actions, the segmentation, and the action consequences. Furthermore, a probabilistic semantic parsing framework based on CCG (Combinatory Categorial Grammar) theory is adopted to model the semantic meaning of human manipulation actions.

Additionally, the lesson from the findings on mirror neurons is that the two processes of interpreting visually observed actions and generating actions should share the same underlying cognitive process. Recent studies have shown that grammatical structures underlie the representation of manipulation actions, which are used both to understand and to execute these actions. By analogy, understanding manipulation actions is like understanding language, while executing them is like generating language. Experiments on two tasks, (1) a robot observing people performing manipulation actions, and (2) a robot then executing manipulation actions accordingly, are presented to validate the formalism. The technical parts of this thesis are devoted to the experimental setting of task (1), while task (2) is given as a live demonstration.

MANIPULATION ACTION UNDERSTANDING FOR OBSERVATION AND EXECUTION
by Yezhou Yang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2015

Advisory Committee:
Professor Yiannis Aloimonos, Chair/Advisor
Dr. Cornelia Fermüller, Co-Advisor
Professor John Baras
Professor Hal Daumé III
Professor Don Perlis

© Copyright by Yezhou Yang, 2015

Preface

John McCarthy, who coined the term Artificial Intelligence back in 1955 [1], defined it as the "science and engineering of making intelligent machines", in which an intelligent agent is a system that perceives its environment and takes actions, such as manipulating objects, that maximize its chances of success. After 60 years of advancement, we are now in the year 2015, and AI research has become highly technical and specialized.
The subfields have grown so divided that it seems almost impossible to find connections between them. The study of AI has lost its unity, and many in the field take hold of one or a few aspects of the original pursuit to cultivate on their own. The situation resembles the senses of hearing, sight, smell, and taste, each of which has a specific function but cannot be interchanged with the others. The school of computational perception, including computer vision and speech recognition, focuses on the aspect "...perceives its environment...". The school of symbolic AI focuses on symbolic reasoning to "maximize its chances of success", while the school of statistical or machine learning focuses on "maximize its chances of success" through methods based on probability and mathematical optimization. The school of robotics, on the other hand, focuses on creating or building "an intelligent agent", either physical or virtual, that "takes actions, such as manipulating objects". The school of cognitive systems, including parts of humanoid and human-machine interaction research, goes beyond "making intelligent machines" and also studies human beings through the methodology of reverse engineering. Even within a specific school of methods, such as deep learning, researchers are usually divided into subgroups. Some care more about the empirical results on specific applications, the performance without; others care more about the method's innate relation to biological and physical systems, the principle within. Such severe division and the long tradition of conducting vertical research make it extremely difficult to form a whole picture of AI and to pursue the ultimate goal: principle within and performance without.

I started my journey in the land of AI by creating soccer-playing robots (RoboCup), a specialized system aimed at a specific application, when I was an undergraduate student. Later I was fortunate enough to begin my research in Computer Vision and Computational Linguistics at the University of Maryland Computer Vision Lab, working on the topic of combining vision and language. This initiated my horizontal viewpoint of AI. With the involvement in the European Union cognitive systems project, collaborations with symbolic AI researchers, and the deployment of humanoid robots (the Baxter robot) in the lab, I set my mind on conducting a horizontal PhD thesis instead of the usual vertical one. At the beginning it seemed overwhelmingly difficult; after I set the focus on human manipulation actions, it started to become feasible. In this thesis, we present research conducted in the fields of Computer Vision, Computational Linguistics, Robotics and even Common-sense Reasoning, all surrounding the central theme: from understanding to executing manipulation actions for intelligent agents. I want to say that the horizontal study presented in this thesis is by no means close to the ultimate unity of AI. It may advance through continuing practice, and this is, hopefully, the future work of my research career.

Dedication

To my beloved wife and parents.

Acknowledgments

I would like to thank: Dr. Yiannis Aloimonos, for his continuous guidance and support; Dr. Cornelia Fermüller, for calibrating my study with extreme patience; Dr. John Baras, Dr. Hal Daumé III, Dr. Don Perlis and Dr. Chitta Baral, for their time and expertise in improving my work; Dr. Yi Li, Dr. Ching Teo, Dr. Xiaodong Yu and Dr.
Douglas Summers-Stay, for numerous discussions and debates; the robotic visual learner team, Konstantinos Zampogiannis, Yi Zhang, Yuchen Zhou and Michael Stevens, for not abandoning me because of my moodiness; my fellow peers, Aleksandrs Ecins, Austin Myers, Anupam Guha, Ren Mao, Fang Wang, Somak Aditya and others, for the joy of working together; the Telluride workshop folks, Dr. Francisco Barranco, Dr. Michael Pfeiffer, Dr. Ryad Benosman, Dr. Andreas Andreou, Dr. Tobi Delbruck and many others, for rocking me out of the comfort zone of my subfield; the Poeticon++ project collaborators, Dr. Katerina Pastra, Dr. Giulio Sandini, Dr. Luciano Fadiga, Dr. Vadim Tikhanoff and many others, for passing on their passion for cognitive robots; my Qualcomm Innovation Fellowship mentors, Dr. Ashwin Sampath and Snehesh Shrestha, for giving me a chance to glimpse the commercial high-tech world; the SPAR Workshop team, Dr. Eren Aksoy, Dr. Neil Dantam, Dr. Karinne Ramirez Amaro and Dr. Tamim Asfour, for sharing insight and inspiration; my uncle, Dr. Jun Ye, for influencing almost every aspect of my life; and ultimately my lifelong project partner and wife, Dr. Yang Wen, for building resilience together with me under stress and sharing happiness with me afterwards.

Table of Contents

List of Tables
List of Figures
1 Introduction
1.1 Problem Statement and Motivation
1.2 Related Work
1.3 Contributions of the thesis
1.3.1 The perception aspect
1.3.2 The reasoning aspect
1.3.3 The execution aspect
1.4 Outline of the Thesis: The Road Map
2 "Grasp it the First Time": A Modern Perspective on Grasp Type for Manipulation Actions
2.1 Introduction
2.2 Related Work
2.3 Our Approach
2.3.1 Human Grasp Types
2.3.2 CNN for Grasp Type Recognition
2.3.3 Human Action Intention
2.3.4 From Grasp Type to Action Intention
2.3.5 Grasp Type Evolution
2.3.6 Finer segment action using grasp type evolution
2.4 Experiments
2.4.1 Grasp Type Recognition in Static Images
2.4.2 Inference of Action Intention from Grasp Type
2.4.3 Manipulation Action Fine Level Segmentation using Grasp Type Evolution
3 "Get Your Act Together": Language-Guided Manipulation Action Recognition and Scene Understanding
3.1 Language-Guided Action Recognition
3.1.1 Related Work
3.1.2 Our Approach
3.1.2.1 Language as a predictor of actions
3.1.2.2 Active tool detection strategy
3.1.2.3 Action features
3.1.3 Using Language to guide recognition
3.1.3.1 Unsupervised learning of a joint tool-action model
3.1.3.2 Supervised action classification
3.1.4 Experiments
3.1.5 The UMD Sushi-Making Dataset
3.1.5.1 Baseline: Vision-only Recognition
3.1.5.2 Adding Language
3.1.5.3 Comparison with state-of-the-art action features
3.1.5.4 Discussion: the effects of adding language
3.2 Language Guided Scene Understanding for Robots
3.2.1 Related Work
3.2.2 Our Approach
3.2.2.1 Image Dataset
3.2.3 Object and Scene Detections from Images
3.2.3.1 Corpus-Guided Predictions
3.2.3.2 Determining T* using HMM inference
3.2.3.3 Sentence Generation
3.2.4 Experiments
3.2.4.1 Sentence Generation Results
3.2.4.2 Discussion
4 "Can't Make an Omelette without Breaking Eggs": Detection of Manipulation Action Consequences
4.1 Introduction
4.2 Why Consequences and Fundamental Types
4.3 Visual Semantic Graph (VSG)
4.4 Active Segmentation and Tracking
4.4.1 The Attention Field
4.4.2 Color Distribution Model
4.4.3 Weights of the Tracked Point Set
4.4.4 Weighted Graph Cut
4.4.5 Active Tracking
4.4.6 Incorporating Depth and Optical Flow
4.5 Experiments
4.5.1 Deformation and Division
4.5.2 The MAC 1.0 Dataset
4.5.3 Consequence Detection on MAC 1.0
4.5.4 Video Classification on MAC 1.0
5 The Syntax: A Syntactical Grammar for Understanding Human Manipulation Actions
5.1 Introduction
5.2 Related Work
5.3 A Cognitive System For Understanding Human Manipulation Actions
5.3.1 A Context-Free Manipulation Action Grammar
5.3.2 Cognitive MACFG Parsing Algorithms
5.3.3 Attention Mechanism with the Torque Operator
5.3.4 Hand Tracking, Grasp Classification and Action Recognition
5.3.5 Object Monitoring and Recognition
5.3.6 Detection of Manipulation Action Consequences
5.4 Experiments
6 The Semantics: Learning Manipulation Action Semantics through Probabilistic Combinatory Categorial Grammar Parsing
6.1 Introduction
6.2 Related Works
6.3 A CCG Framework for Manipulation Actions
6.3.1 Manipulation Action Semantics
6.3.2 Combinatory Categorial Grammar
6.3.3 Functional application
6.4 Learning Model and Semantic Parsing
6.4.1 Learning Approach
6.5 Experiments
6.5.1 Manipulation Action (MANIAC) Dataset
6.5.2 Training Corpus
6.5.3 Learned Lexicon
6.5.4 Deducing Semantics
6.5.5 Reasoning Beyond Observations
7 Procedural Learning: Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web
7.1 Introduction
7.2 Related Works
7.3 Our Approach
7.3.1 CNN based visual recognition
7.3.1.1 Convolutional Neural Network
7.3.1.2 Grasping Type Recognition
7.3.1.3 Object Recognition and Corpus Guided Action Prediction
7.3.2 From Recognitions to Action Trees
7.3.2.1 Manipulation Action Grammar
7.3.2.2 Parsing and tree generation
7.4 Experiments
7.4.1 Dataset and experimental settings
7.4.2 Grasping Type and Object Recognition
7.4.3 Visual Sentence Parsing and Commands Generation for Robots
7.4.4 Discussion
8 Concluding Remarks and Future Work
8.1 Concluding Remarks
8.2 Future Work
8.3 Final Remarks
Bibliography

List of Tables

2.1 Precision (P) and Recall (R) for each grasp type category and overall accuracy. (A): HoG+BoW+SVM; (B): HoG+BoW+RF; (C): CNN.
2.2 Precision (P) and Recall (R) for each intention category and overall accuracy. GL: Grasp type Label; GT: Grasp Type belief distribution.
3.1 Classification accuracy: STIP versus our approach.
3.2 The set of objects, actions (first 20), scenes and preposition classes considered.
3.3 Samples of synonyms for 3 object classes.
3.4 Sentence generation evaluation results with human gold standard.
Human R1 scores are averaged over the 5 sentences using a leave-one-out procedure. Values in bold are the top scores.
5.1 A Manipulation Action Context-Free Grammar.
5.2 "Hands", "Objects" and "Actions" involved in the experiments.
6.1 Example annotations from the training corpus, one per manipulation action category.
7.1 The list of the grasping types.
7.2 The list of the objects considered in our system.
7.3 A Probabilistic Extension of the Manipulation Action Context-Free Grammar.
7.4 Incorrect entities learned are marked in red.

List of Figures

1.1 How general vision, purposive vision and industrial vision fit together, and where manipulation action observation is located in the problem space.
1.2 The Road Map of the thesis.
2.1 (a) Rest or Extension on the handlebar vs. (b) firmly power cylindrical grasping the handlebar.
2.2 (a) Power Hook grasp of a knife vs. (b) Precision Lumbrical grasp of a knife. (c) A natural reaction when seeing scene (b) is to open the hand to receive the knife.
2.3 Sample outputs. PoC: Power Cylindrical; PoS: Power Spherical; PoH: Power Hook; PrP: Precision Pinch; PrT: Precision Tripod; PrL: Precision Lumbrical; RoE: Rest or Extension.
2.4 The grasp types considered. Grasps which cannot be categorized into the six types here are considered as "Rest or Extension" (no grasping performed).
2.5 Human action intention categories.
2.6 Inference of human action intention from grasp type recognition.
2.7 Grasp type evolution (right hand) in a manipulation action.
2.8 Category pairwise confusion matrix for grasp type classification.
2.9 Examples of correct and false classification. PoC: Power Cylindrical; PoS: Power Spherical; PoH: Power Hook; PrP: Precision Pinch; PrT: Precision Tripod; PrL: Precision Lumbrical; RoE: Rest or Extension.
2.10 A clear action intention vs. an ambiguous one.
2.11 Correct examples of predicting action intention.
2.12 Failure cases of predicting action intention. The label at the bottom denotes the human labeling.
2.13 Left and right hand grasp type recognition along the timeline and video segmentation results compared with ground truth segments.
2.14 1st row: sample hand localization on the first frame using [2]. 2nd to 5th row: two sample sequences of hand patches extracted using meanshift tracking [3].
3.1 (Top) Ambiguities in action recognition: similar trajectories for different actions. Tools considered in isolation can only suggest possible actions. (Below) Language can predict, given the tool and action trajectories, the most likely action label.
3.2 Key components of the approach: (a) Training the language model from a large text corpus. (b) Detected tools are queried against the language model. (c) The language model returns a prediction of the action. (d) Action features are compared and beliefs updated.
3.3 Enlarging the word class to contain synonyms yields more reasonable counts: cup only connects weakly with drink. By clustering other closely related words together, their combined counts increase the desired association between cup and drink.
3.4 Gigaword co-occurrence matrix of tools and predicted actions.
3.5 (Best viewed in color) Overview of the tool detection strategy: (1) Optical flow is first computed from the input video frames. (2) We train a CRF segmentation model based on optical flow + skin color. (3) Guided by the flow computations, we segment out hand-like regions (and remove faces if necessary) to obtain the hand regions that are moving (the active hand that is holding the tool). (4) The active hand region is where the tool is localized. Using the PLS detector (5), we compute a detection score for the presence of a tool.
3.6 Detected hand trajectories. x and y coordinates are denoted as red and blue curves respectively.
3.7 (Best viewed in color) (Left) Unsupervised EM: accuracy at each iteration. (Right) Scatterplots of action label assignments at selected iterations. We see that with each iteration, the assignment label clusters approach the ground truth label (boxed in red). Note that we used PCA to reduce the action feature dimensions to 2 for visualization.
3.8 (a) Unsupervised recognition accuracy: no language (K-Means) versus language (EM). (b) Classification accuracy: no language versus language. All reported results have variances within ±0.5%.
3.9 Some predicted actions and tools using EM. The wrong prediction (in red and italicized) of the sprinkle action is due to a high co-occurrence with bowl in PL(V|N).
3.10 The processes involved in describing a scene.
3.11 Illustration of various perceptual challenges for sentence generation for images. (a) Different images with semantically the same content. (b) Pose relates ambiguously to actions in real images.
3.12 Overview of our approach. (a) Detect objects and scenes from the input image. (b) Estimate the optimal sentence structure quadruplet T*. (c) Generate a sentence from T*.
3.13 Samples of images with corresponding annotations from the UIUC scene description dataset.
3.14 (a) [Top] The part-based object detector from [4]. [Bottom] The graphical model representation of an object, e.g. a bike. (b) Examples of GIST gradients: (left) an outdoor scene vs. (right) an indoor scene [5].
3.15 (a) Selecting the ROOT verb ride from the dependency parse reveals its subject woman and direct object bicycle. (b) Selecting the head noun (PMOD) as the scene street reveals ADV as the preposition on.
3.16 Example of how ranked log-likelihood values (in descending order) suggest a possible T: (a) λnvn for n1 = person, n2 = bus predicts v = ride. (b) λns and λvs for n = bus, v = ride then jointly predict s = street, and finally (c) λps with s = street predicts p = on.
3.17 The HMM used for optimizing T. The relevant transition and emission probabilities are also shown. See text for more details.
3.18 Four test images (left) and results. (Right, upper): sentence structure T* predicted using Viterbi; (right, lower): generated sentences. Words marked in red are considered to be incorrect predictions.
4.1 Graphical illustration of the changes for Conditions (1-6).
4.2 Flow chart of the proposed active segmentation and tracking method for object monitoring.
4.3 Upper: (1) sampling and filtering of tracked points; (2) weighted graph cut. Lower: segmentation with different initial fixations. Green cross: initial fixation.
4.4 (a) Incorporating optical flow into segmentation. (b) Incorporating optical flow into tracking.
4.5 (a) Deformation invariance: upper: state-of-the-art appearance-based tracking [6]; middle: tracking without updating the target model; lower: updating the target model. (b) Division invariance: synthetic cell division sequence.
4.6 "Division" detection on the "cut cucumber" sequence. Upper row: original sequence with segmentation and tracking; middle and lower right: VSG representations; lower left: division consequence detection.
4.7 "Assemble" detection on the "make sandwich 1" sequence. 1st row: original sequence with segmentation and tracking; 2nd row: VSG representation; 3rd row: distance between each pair of segments (red line: bread and cheese, magenta line: bread and meat, blue line: cheese and meat); 4th row: assemble consequence detection.
4.8 "Deformation" detection on the "close book 1" sequence. 1st row: original sequence with segmentation and tracking; 2nd row: VSG representation; 3rd row: appearance description (here a color histogram) of each segment; 4th row: measurement of appearance change; 5th row: deformation consequence detection.
4.9 ROC curve of each sequence by category: (a) TRANSFER, (b) DEFORM, (c) DIVIDE, and (d) ASSEMBLE.
4.10 Video classification performance comparison.
5.1 Overview of the manipulation action understanding system, including feedback loops within and between some of the modules. The feedback is denoted by the dotted arrow.
5.2 The (a) construction and (b) destruction operations. Fine dashed lines are newly added connections, crosses are node deletions, and fine dotted lines are connections to be deleted.
5.3 Here the system observes a typical manipulation action example, "Cut an eggplant", and builds a sequence of six action trees.
5.4 (a) Torque for images, (b) a sample input frame, and (c) torque operator response. Crosses are the pixels with top extreme torque values that serve as the potential fixation points.
5.5 (a) Bones of the human hand. (b) Arches of the hand: (1) one of the oblique arches; (2) one of the longitudinal arches of the digits; (3) transverse metacarpal arch; (4) transverse carpal arch. Source: [7].
5.6 (a) One example of fully articulated hand model tracking, (b) a 3-D illustration of the tracked model, and (c-d) examples of grasp type recognition for both hands.
5.7 The second row shows the hand tracking and object monitoring. The third row shows the object recognition result, where each segmentation is labelled with an object name and a bounding box in a different color. The fourth and fifth rows depict the hand speed profile and the Euclidean distances between hands and objects. The sixth row shows the consequence detection.
5.8 The tree structures generated from the "Make a Sandwich" sequence. Figure 5.7 depicts the corresponding visual processing. Since our system detected six triplets temporally from this sequence, it produced a set of six trees. The order of the six trees is from left to right.
5.9 Experiments.
6.1 A CCG based semantic parsing framework for manipulation actions.
6.2 Example of a conventional tree structure.
6.3 System output on complex chained manipulation testing sequence one. The segmentation output and detected triplets are from [8].
6.4 System output on the 18th complex chained manipulation testing sequence. The segmentation output and detected triplets are from [8].
7.1 The integrated system reported in this work.
7.2 Confusion matrices. Left: grasping type; right: object.
7.3 Experiments.
8.1 Images retrieved from 3 verbal search terms: ride, sit, fly.
8.2 A hallucination process of contour completion (paint stone sequence in MAC 1.0). Left: original segments; middle: contour hallucination with second-order polynomial fitting (green lines); right: final hallucinated contour.

Chapter 1: Introduction

1.1 Problem Statement and Motivation

Intelligent agents, robots, and cognitive systems interacting with humans need to be able to interpret human actions. Here we are concerned with manipulation actions, that is, actions performed by agents (humans or robots) on objects, which result in some physical change of the objects. The sensory-motor bridge connecting observation and execution is essential, and a great amount of attention in AI, Robotics, as well as Neurophysiology has been devoted to understanding it. Experiments conducted on primates have discovered that certain neurons, the so-called mirror neurons, fire during both observation and execution of identical manipulation tasks ([9, 10]). This suggests that the same process is involved in both the observation and execution of actions. From a functionalist point of view, such a process should first build up a semantic structure from observations, reasoning over action goals, and the decomposition of the same structure should then occur when the intelligent agent executes tasks. Thus, in this thesis, we study three aspects of such a cognitive system, namely 1) perception of various aspects of manipulation actions, such as action consequences, grasp types, etc.; 2) reasoning within a grammatical framework to model manipulation actions in both a syntactic and a semantic way; and 3) execution of manipulation tasks on a humanoid platform (the Baxter research robot).
We mainly focus on the first two aspects, and provide a preliminary study and results for the third.

1.2 Related Work

The topic of analyzing human manipulation actions for robotic applications has been studied from multiple perspectives in the last decade. Due to its innately interdisciplinary nature, related studies span the fields of Computer Vision, Robotics, Computational Linguistics and Cognitive Systems. At the visual signal processing level, [11-13] proposed statistical methods to model the relationship between the different entities involved in manipulation actions for better visual grounding. [14, 15] proposed the semantic event chain (SEC) as a middle-level representation to model and learn the segment-wise semantic relationship transitions from spatio-temporal video segmentation. In the field of Robotics, [16] studied the transfer of manipulation skills to robots through a semantic representation obtained from observing human activities. [17] first discussed a Chomskyan grammar for understanding complex actions as a theoretical concept, and [18] provided an implementation of such a grammar using only objects as perceptual input. [19] also modeled robot imitation learning with probabilistic activity grammars. In the field of cognitive systems, [20] proposed an integration of information from natural-language statements with the simultaneous resolution of both visual and linguistic ambiguity for robot manipulation.

1.3 Contributions of the thesis

1.3.1 The perception aspect

The input to the systems for interpreting manipulation actions is perceptual data, specifically sequences of images and depth maps. Therefore, a crucial part of our system is the vision processes, which obtain atomic symbols from perceptual data.

Figure 1.1: How general vision, purposive vision and industrial vision fit together, and where manipulation action observation is located in the problem space.

The proposed formalism for the modules that visually interpret manipulation actions lies in the category of purposive vision, due to the inherent nature of the problem itself. Problems such as object recognition and action recognition are considered general problems in the sense that their goal is a complete grounding of the perceived space. For example, a general object recognition problem asks to return all possible object areas from a dynamic scene. The object recognition problem that we consider under the manipulation action setting is specific, in the sense that we only focus on the objects that are used, under manipulation, or touched by human hands. Similarly, the action recognition problem that we consider under the manipulation action setting is also specific, because we only care about those actions that lead to a physical change of the object, and therefore sometimes not even about the actions themselves but only about their consequences. However, the assumptions we make about manipulation actions are general ones, such as "an object can only move or change when one or more effectors interact with it", because the world indeed satisfies them. Figure 1.1, adapted from [21], shows how general vision, purposive vision and industrial vision fit together, and where the problem of visually interpreting manipulation actions is located in the problem space.

Hands are the main driving force in human manipulation actions.
First of all, during the observation of manipulation actions, human beings pay attention to the hands and the area surrounding them, indicating that considerable computational power is devoted to the hands. Secondly, the changes of the hands' grasp types mark a segmentation of the action into semantically meaningful finer components. From the viewpoint of processing perception data, the grasp type contains information about the action itself, and it can be used for prediction or as a feature for recognition. In this thesis, we investigate two different ways of tracking and analyzing hands: 1) a model-based approach for the controlled lab setting, in which a state-of-the-art markerless hand tracking system is used to obtain fine-grained skeleton models of both hands; using this data, the manner in which the human grasps the objects is classified into primitive categories; and 2) a feature-based approach for the unconstrained input setting, in which a convolutional neural network framework is used to classify a patch of the area around each hand into grasp types. Further, we show that the grasp type can also be used to infer human action intention at a higher level.

The second crucial part of manipulation actions concerns objects. Here we denote as objects both the tools and the objects under manipulation. In this thesis, we also investigate two ways to recognize objects: 1) an attention-driven object monitoring process for the controlled lab environment: to obtain objects, first a contour-based attention mechanism locates the object; next, the manipulated object is monitored using a process that combines stochastic tracking with active segmentation; then, the segmented objects are recognized; finally, with the aid of the monitoring process, the effect of the action on the object is checked and classified into four types of "consequences" (which are used in the description of the action); and 2) a feature-based approach for the unconstrained input setting: we train general object detectors from labeled training data and associate candidate object patches with the left or right hand, depending on which has the smaller Euclidean distance.

The third part of manipulation actions is the action itself. Unlike the general action recognition problem, we consider those actions that are performed by hands, segmented by grasp type evolution, and characterized by the tool and object under manipulation. In this thesis, we again investigate two different action recognition settings: 1) a trajectory-based approach for controlled lab data, which makes use of the trajectories of both hands as well as the recognized tool and objects; and 2) an inference approach for unconstrained scenes, which uses a large natural language corpus to predict the most likely action given the detected subject and patient. We further show that a natural language description can be generated using hidden Markov modeling over the detected visual components, for both static scenes and manipulation action clips.

1.3.2 The reasoning aspect

Beyond the visual modules, our formalism for describing manipulation actions uses a structure similar to natural language. What do we gain from this formal description of action? This is equivalent to asking what one gains from a formal description of language. Chomsky's contribution to language was its formal description through his generative and transformational grammar [22].
This revolutionized language research, opened up new roads for its computational analysis, and provided researchers with common, generative language structures and syntactic operations on which language analysis tools were built. A grammar for action would likewise contribute a common framework for the syntax and semantics of action, so that basic tools for action understanding can be built. Such tools would allow researchers to build on them when developing action interpretation systems, without having to start development from scratch.

The vision processes produce a set of symbols: the "Subject", "Action" and "Object" triplets, which serve as input to the reasoning module. However, since perceptual events alone do not suffice, how do we determine the beginning and end of action segments, and how do we combine the individual segments into longer segments corresponding to a manipulation action? An essential component in the description of manipulations is the underlying goal. The goal of a manipulation action is the physical change induced on the object. To accomplish it, the hands need to perform a sequence of sub-actions on the object. A sub-action refers to a single movement in which a hand grasps or releases the object, or changes grasp type during a movement. Centered around this idea, we developed a grammatical formalism for parsing and interpreting manipulation action sequences, and investigated the vision modules needed to obtain from videos the symbolic information used in the grammatical structure. We studied both the syntactic and the semantic modeling of manipulation actions.

The syntactic part of our reasoning module is the Manipulation Action Context-Free Grammar (MACFG). This grammar comes with a set of generative rules and a set of parsing algorithms. The parsing algorithms have two main operations, "construction" and "destruction", and they dynamically parse a sequence of tree (or forest) structures made up from the symbols provided by the vision module. The sequence of semantic tree structures can then be used by the cognitive system to perform reasoning and prediction. We investigated the use of the action grammar to parse both controlled lab data with depth sensing and unconstrained online instructional videos without depth information. Following the well-known Penn Treebank format, we created a manipulation action tree bank for a set of manipulation actions that can serve as a knowledge base for robot execution.

The semantic part of our reasoning module is the Manipulation Action Combinatory Categorial Grammar (MACCG). Here we present an approach for learning the semantic meaning of manipulation actions through a probabilistic semantic parsing framework based on CCG theory. The advantage of our approach is twofold: 1) learning semantic representations from annotations helps an intelligent agent to automatically enrich its own knowledge about actions; 2) the logic representation of the action can be used to infer the object-wise consequences after a certain manipulation, and can also be used to plan a set of actions to reach a certain action goal. Equipped with the λ representation of manipulation actions, our system is able to reason beyond observation and deduce "hidden" action consequences.
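To make this concrete, the following schematic example shows the kind of CCG lexical entries and λ-calculus semantics meant here. It is an illustrative sketch in the spirit of Chapter 6, not a verbatim entry from the learned lexicon; the category symbols and predicate names are chosen for exposition only.

    Knife    := N : knife
    Cucumber := N : cucumber
    Cut      := (AP\N)/N : λx.λy. cut(y, x) ∧ divided(x)

    Cut Cucumber           =>  AP\N : λy. cut(y, cucumber) ∧ divided(cucumber)   (forward application)
    Knife (Cut Cucumber)   =>  AP   : cut(knife, cucumber) ∧ divided(cucumber)   (backward application)

Parsing the observed triplet ("Knife", "Cut", "Cucumber") in this way yields not only the action predicate cut(knife, cucumber) but also the consequence divided(cucumber), which is exactly the kind of "hidden" consequence the system can deduce without observing it directly.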
1.3.3 The execution aspect

At the end of the thesis, we conduct experiments on a research humanoid platform (the Baxter robot) to have it observe a human performing manipulation actions, extract the action representations for reasoning, and then perform the task using its own set of faculties. We adopt widely used humanoid control techniques such as visual servoing and dynamic movement primitives to control the humanoid. We briefly introduce our implementation and show the performance of the robot.

1.4 Outline of the Thesis: The Road Map

The rest of the thesis starts with a study (Chapter 2) on grasp type recognition for manipulation actions [23]. It is followed by a study (Chapter 3) on language-guided manipulation action recognition and scene understanding [24, 25]. After action recognition, the study (Chapter 4) continues with checking and monitoring action consequences [26]. In Chapter 5 and Chapter 6, we report the learning of the syntax [27] and the semantics [28] of manipulation actions, respectively. At the end, in Chapter 7, we present a system for learning procedural manipulation knowledge from online instructional videos for robot execution [29]. Fig. 1.2 depicts the road map.

Figure 1.2: The Road Map of the thesis.

Chapter 2: "Grasp it the First Time": A Modern Perspective on Grasp Type for Manipulation Actions

2.1 Introduction

The grasp type contains fine-grained information about human action. Consider the two scenes in Fig. 2.1 from the VOC challenge. Current computer vision systems can easily detect that there is one bicycle and one cyclist (a human being) in each image. Through human pose estimation, the system can further confirm that both cyclists are riding their bikes. But humans can tell that the cyclist on the left is not literally "riding" the bicycle, since his hands are posed in a "Rest or Extension" grasp next to the handlebar, while the cyclist on the right is racing, because his hands firmly hold the handlebar with a "Power Cylindrical" grasp. In other words, the recognition of grasp type is essential for a more detailed analysis of human action, beyond the processes of current state-of-the-art vision systems.

Figure 2.1: (a) Rest or Extension on the handlebar vs. (b) firmly power cylindrical grasping the handlebar.

Moreover, recognizing the grasp type can help an intelligent system predict human action intention. Consider an intelligent agent looking at the two scenes in Fig. 2.2(a) and (b). Current state-of-the-art computer vision techniques can accurately recognize many visual aspects of both scenes, such as the fact that there must be a human being standing in the outdoor garden scene, with a knife in his/her hand. However, we human beings will react dramatically differently when experiencing the two scenes, because of our ability to recognize immediately the different ways the person is handling the knife, i.e., the grasp type. We can effectively infer the activity the man is likely about to perform based on his way of grasping the knife. After seeing scene Fig. 2.2(a), we might believe that this man is going to cut something hard, or might even be malicious, since he is "Power Hook" grasping the knife. After seeing scene Fig. 2.2(b), we may react with a movement to receive the knife (shown in Fig. 2.2(c)), since the man is "Precision Lumbrical" grasping the knife, indicating a passing action. From this example we can see that the grasp type is a strong cue for inferring human action intention.
These are two examples demonstrating how important it is for us to be able to recognize grasp types.

Figure 2.2: (a) Power Hook grasp of a knife vs. (b) Precision Lumbrical grasp of a knife. (c) A natural reaction when seeing scene (b) is to open the hand to receive the knife.

The grasp type is an essential component in the characterization of human manipulation actions ([30]). From the viewpoint of processing videos, the grasp contains information about the action itself, and it can be used for prediction or as a feature for recognition. It also contains information about the beginning and end of action segments, and thus it can be used to segment videos in time. If we are to perform the action with an intelligent agent, such as a humanoid robot, the grasp is one crucial primitive action ([31]). Knowledge about how to grasp the object is necessary so the robot can arrange its effectors. For example, consider a humanoid with one parallel gripper and one vacuum gripper. When a power grasp is desired, the robot should select the vacuum gripper for a stable grasp, but when a precision grasp is desired, the parallel gripper is a better choice. Thus, knowing the grasp type provides information to plan the configuration of the robot's effectors, or even the type of effector to use ([29]).

Here we present a study centered around human grasp type recognition and its applications in computer vision. The goal of this research is to provide intelligent systems with the capability to recognize the human grasp type in unconstrained static or dynamic scenes. To be specific, our system takes in an unconstrained image patch around the human hand, and outputs which category of grasp type is used (examples are shown in Fig. 2.3). In the rest of the chapter, we show that this capability 1) is very useful for predicting human action intention and 2) helps to further understand human action by introducing a finer layer of granularity. Further experiments on two publicly available datasets empirically support that we can 1) infer human action intention in static scenes and 2) segment videos of human manipulation actions into finer segments based on the grasp type evolution. Additionally, we provide a labeled grasp type image dataset and a human intention dataset for further research.

Figure 2.3: Sample outputs. PoC: Power Cylindrical; PoS: Power Spherical; PoH: Power Hook; PrP: Precision Pinch; PrT: Precision Tripod; PrL: Precision Lumbrical; RoE: Rest or Extension.

2.2 Related Work

Human hand related: One way to recognize the grasp type is through model-based hand detection and tracking [32]. Based on the estimated articulated hand model, a set of biologically plausible features, such as the arches formed by the fingers [33], were used to infer the grasp type involved [30]. These approaches normally use RGB-D data and require a calibration phase, which is not applicable or is too fragile for real-world situations. A lot of research has also been devoted to hand pose or gesture recognition, with promising experimental results [34, 35]. The goal of these works is to recognize poses such as "POINT", "STOP" or "YES" and "NO", without considering the interaction with objects. When it comes to recognizing grasp type from unconstrained visual input, our system inevitably has to deal with the additional challenges introduced by the interaction with unknown objects.
Later in the chapter we will show that the large variation in the scenery does not allow traditional feature extraction and learning mechanisms to work robustly on publicly available hand patch testing beds.

The robotics community has been studying perception and control problems of grasping for decades [36]. Recently, several learning-based systems were reported that infer contact points or how to grasp an object from its appearance [37, 38]. However, the desired grasping type can be different for the same target object when used for different action goals. The acquisition of grasp information from natural static or dynamic scenes is still considered very difficult because of the large variation in appearance and the occlusions of the hand by objects during manipulation.

Vision beyond appearance: The very small number of works in computer vision that aim to reason beyond appearance models are also related to this chapter. [39] proposed that, beyond state-of-the-art computer vision techniques, we could possibly infer implicit information (such as functional objects) from video, which they call "Dark Matter" and "Dark Energy". [26] used stochastic tracking and graph-cut based segmentation to infer manipulation consequences beyond appearance. [40] used a ranking SVM to predict the persuasive motivation (or the intention) of the photographer who captured an image. More recently, [41] seeks to infer the motivation of the person in the image by mining knowledge stored in a large corpus using natural language processing techniques. Different from these fairly general investigations into reasoning beyond appearance, this chapter seeks to infer human action intention from a unique and specific point of view: the grasp type.

Convolutional neural networks: The recent development of deep neural network based approaches has revolutionized visual recognition research. Different from the traditional hand-crafted features [42, 43], a multi-layer neural network architecture efficiently captures sophisticated hierarchies describing the raw data [44], and has shown superior performance on standard object recognition benchmarks [45, 46] while utilizing minimal domain knowledge. The work presented in this chapter shows that, with the recent developments of deep neural networks, we can learn a model that recognizes grasp types from unconstrained visual inputs robustly. We believe we are among the first to apply deep learning to grasp type recognition.

2.3 Our Approach

First, we briefly summarize the basic concepts of Convolutional Neural Networks (CNN), and then we present our implementations for grasp type recognition, human action intention prediction, and fine-level manipulation action segmentation using the change of grasp type over time.

2.3.1 Human Grasp Types

A number of grasping taxonomies have been proposed in several areas of research, including robotics, developmental medicine, and biomechanics, each focusing on different aspects of action. In a recent survey, Feix et al. [47] reported 45 grasp types in the literature, of which only 33 were found valid. In this work, we use a categorization into seven grasp types. First we distinguish, according to the most commonly used classification (based on functionality), between power and precision grasps [48]. Power grasping is used when the object needs to be held firmly in order to apply force, such as "grasping a knife to cut"; precision grasping is used in order to do fine-grained actions that require accuracy, such as "pinching a needle".
We then further distinguish among the power grasps whether they are cylindrical, spherical, or hook, and similarly we distinguish the precision grasps into pinch, tripodal and lumbrical. Additionally, we also consider a Rest or Extension position (no grasping performed). Fig. 2.4 illustrates the grasp categories.

Figure 2.4: The grasp types considered. Grasps which cannot be categorized into the six types here are considered as "Rest or Extension" (no grasping performed).

Humans, when looking at a photograph, can more or less tell what kind of grasp the person in the picture is using. The question becomes: using current state-of-the-art computer vision techniques, can we develop a system that learns the pattern from human-labeled data and recognizes the grasp type from a patch around each hand? In the following section, we present our take and show that a grasp type recognition model with decent robustness can be learned using Convolutional Neural Network (CNN) techniques.

2.3.2 CNN for Grasp Type Recognition

A Convolutional Neural Network (CNN) is a multilayer learning framework, which may consist of an input layer, a few convolutional layers and an output layer. The goal of a CNN is to learn a hierarchy of feature representations. Response maps in each layer are convolved with a number of filters and further down-sampled by pooling operations. These pooling operations aggregate values in a smaller region by down-sampling functions including max, min, and average sampling. In this work we adopt the softmax loss function, which is given by:

L(t, y) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{C} t_{nk} \log\!\left( \frac{e^{y_k^n}}{\sum_{m=1}^{C} e^{y_m^n}} \right)   (2.1)

where t_{nk} is the n-th training example's k-th ground truth output, and y_k^n is the value of the k-th output layer unit in response to the n-th input training sample. N is the number of training samples, and since we consider 7 grasp type categories, C = 7. The learning in a CNN is based on Stochastic Gradient Descent (SGD), which includes two main operations: forward and back propagation. The learning rate is dynamically lowered as training progresses. Please refer to [49] for details.

We used a five-layer CNN (including the input layer and one fully-connected perception layer for the regression output). The first convolutional layer has 32 filters of size 5 × 5 with max pooling, the second convolutional layer has 32 filters of size 5 × 5 with average pooling, and the third convolutional layer has 64 filters of size 5 × 5 with average pooling. A convolutional layer convolves its input with a bank of filters, then applies a point-wise non-linearity and a max or average pooling operation. The final fully-connected perception layer has 7 regression outputs; it applies linear filters to its input and then a point-wise non-linearity. Our system considers 7 grasp type classes.

For testing, we pass each target hand patch to the trained CNN model and obtain an output of size 7 × 1: P_GraspType. In the action intention and segmentation experiments, we use the classification of both hands to obtain P_GraspType1 for the left hand and P_GraspType2 for the right hand, respectively.
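The following sketch shows how a network of this shape could be configured in a modern framework. It is illustrative only: Keras is assumed here purely for exposition (the thesis used the GPU-based CNN implementation of [54]), and details such as the number of input channels, padding, strides, and optimizer settings are assumptions, not values taken from the thesis.

# Illustrative sketch: a small CNN mirroring the architecture described above
# (32/32/64 filters of size 5x5 with max/average/average pooling and a 7-way
# softmax output). Input size follows the 64x64 training patches; three color
# channels and the SGD hyper-parameters are assumptions.
from tensorflow.keras import layers, models, optimizers

def build_grasp_cnn(num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(64, 64, 3)),                    # resized hand patch
        layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),              # max pooling
        layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        layers.AveragePooling2D(pool_size=(2, 2)),          # average pooling
        layers.Conv2D(64, (5, 5), padding="same", activation="relu"),
        layers.AveragePooling2D(pool_size=(2, 2)),          # average pooling
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),    # P_GraspType (7 x 1)
    ])
    # SGD training with the softmax (categorical cross-entropy) loss of Eq. (2.1).
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_grasp_cnn()
# p_grasp_type = model.predict(hand_patches)   # one 7-dim belief per patch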
To obtain a fully automatic fine-level manipulation segmentation approach, we need to localize the input hand patches in videos and then recognize their grasp types using the CNN. We use the hand detection method of [2] to detect the hands in the first frame, and then apply a meanshift-based tracking method [3] to both hands to continuously extract the image patch around each hand.

2.3.3 Human Action Intention

Our ability to interpret other people's actions hinges crucially on predicting their intentionality. Even 18-month-old infants behave altruistically when they observe an adult accidentally dropping a marker on the floor out of his reach, and they can predict his intention to pick up the marker [50]. From the point of view of machine learning for intelligent systems and human-robot collaboration, a direct mapping of action signals is problematic due to the differences in the embodiment of humans and robots. One solution is for the robot to predict the intent of the observed human activity and implement the same intention using its own sensorimotor apparatus [51].

Previous studies showed that several key factors affect the grasp type [52]. One crucial deciding factor for the selection of the grasp type is the intended activity. We choose here a categorization into three human action intentions, closely related to the functional classification discussed above (Fig. 2.4). The first category reflects the intention to apply force onto the physical world, for example "cut down a tree with an ax", and we refer to it as "Force-oriented". The second category reflects fine-grained activity where sensitivity and dexterity are needed, such as "tie shoelaces", and we refer to it as "Skill-oriented". The third category has no intention of a specific action, such as "showcasing and posing", and we call it "Casual". Fig. 2.5 illustrates the action intention categories by showing one typical example of each. We should note that the three categories "Force-oriented", "Skill-oriented" and "Casual" are closely related to the three functional categories "power", "precision", and "rest", respectively (Fig. 2.4). We used a different labeling because we encounter a larger variety of hand poses in the static images used for intention classification than in the videos of human manipulation activities used for functional categorization.

We investigate the causal relation between human grasp type and action intention by training a classifier that uses the grasp types of both hands as input and the category of action intention as output. As shown next, our experiment demonstrates a strong link. We want to point out that a finer categorization is certainly possible. For example, the "Force-oriented" intention can be further divided into subclasses such as "Selfish" or "Altruistic" and so on. However, such a classification would require additional dynamic observations. Here we show that from the grasp type in a single image a classification into basic intentions (shown in Fig. 2.5) is possible.

Figure 2.5: Human action intention categories.

2.3.4 From Grasp Type to Action Intention

Our hypothesis is that the grasp type is a strong indicator of human action intention. In order to validate this, we train an additional classifier layer. The procedure is as follows. For each training image, we first pass the target hand patches (left hand and right hand, if present) of the main character in the image to the trained CNN model, and we obtain two belief distributions, P_GraspType1 and P_GraspType2. We concatenate these two distributions and use them as our feature vector for training. We train a support vector machine (SVM) classifier f, which
We train a support vector machine (SVM) classifier f, which takes as input the grasp type belief distributions and derives as output an action intention distribution P_{Int} of size 3 × 1:

P_{Int} = f(P_{GraspType1}, P_{GraspType2} | \theta),   (2.2)

where \theta are the model parameters learned from labeled pairs. Fig. 2.6 shows a diagram of the approach.

Figure 2.6: Inference of human action intention from grasp type recognition.

We need to point out that in the human action intention recognition we use belief distributions instead of final class labels of the two hands as input feature vectors. Thus, a certain category of grasp type does not directly indicate a certain action intention in our model. A further experiment using the detected grasp type labels of both hands (the grasp type with the highest belief score) to infer action intention achieves a slightly worse performance, which confirms our claim here.
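A minimal sketch of this grasp-to-intention layer (Eq. 2.2), assuming scikit-learn and pre-computed CNN belief distributions; the linear kernel follows the experimental section, while the variable names and probability calibration are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Each training example: concatenation of the two 7-dim grasp belief
# distributions (left hand, right hand) -> 14-dim feature vector,
# labeled with one of 3 intentions (force-oriented, skill-oriented, casual).
X_train = np.random.rand(200, 14)          # placeholder features
y_train = np.random.randint(0, 3, 200)     # placeholder intention labels

f = SVC(kernel="linear", probability=True, C=1.0)
f.fit(X_train, y_train)

def predict_intention(p_grasp_left, p_grasp_right):
    """Return the 3x1 intention distribution P_Int of Eq. (2.2)."""
    x = np.concatenate([p_grasp_left, p_grasp_right]).reshape(1, -1)
    return f.predict_proba(x)[0]
```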
2.3.5 Grasp Type Evolution

In manipulation actions involving tools and objects, the details of the small sub-actions contain rich semantics, which current computer vision methods do not consider. Consider a typical kitchen action, as shown in Fig. 2.7. In most approaches the whole sequence would be denoted as "sprinkle the steak", and the whole segment would be considered an atomic part for recognition or analysis. However, within this roughly 15 second long action, there are several finer segments. The gentleman first "Pinch" grasps the salt to sprinkle the beef, then he "Extends" to point at the oil bottle, and later he "Power Spherical" grasps a pepper bottle to further sprinkle black pepper onto the beef. Here we can see that the dynamic changes of grasp type characterize the start and end of these finer actions.

Figure 2.7: Grasp type evolution (right hand) in a manipulation action.

In order to see if grasp type evolution actually can help with a finer segmentation of manipulation actions, we first recognize the grasp type of both hands, frame by frame, and then output a segmentation at the points in time when any of the hands has a change in grasp type. We design a third experiment on a public cooking video dataset from Youtube for validation.

2.3.6 Finer segmentation of actions using grasp type evolution

We adopt a straightforward approach. Let us denote the sets of grasp types along the time-line of an action of length M as G_l = \{G_l^1, G_l^2, \dots, G_l^M\} for the left hand and G_r = \{G_r^1, G_r^2, \dots, G_r^M\} for the right hand. Assuming that during a manipulation action the grasp type evolves gradually, we first apply a one-dimensional mode filter to smooth temporally: each grasp type detection at time t is replaced by its most common neighbor in the window [t − δ/2, t + δ/2], where δ is the window size. Then, whenever at a time instance t ∈ [1, M] we have G_l^t ≠ G_l^{t+1} or G_r^t ≠ G_r^{t+1}, our system outputs one segment boundary at t, denoted as S_t. The set of S_t yields a finer segmentation of the manipulation action clip.
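A minimal sketch of this segmentation rule, assuming per-frame grasp-type labels are already available for both hands (the window size δ is a free parameter, as in the text):

```python
from collections import Counter

def mode_filter(labels, delta=15):
    """1-D mode filter: replace each label by the most common label
    in the window [t - delta/2, t + delta/2]."""
    half = delta // 2
    out = []
    for t in range(len(labels)):
        window = labels[max(0, t - half): t + half + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return out

def segment_boundaries(left_labels, right_labels, delta=15):
    """Return the frames S_t at which either hand changes grasp type
    (Sec. 2.3.6), after temporal smoothing."""
    gl, gr = mode_filter(left_labels, delta), mode_filter(right_labels, delta)
    return [t for t in range(len(gl) - 1)
            if gl[t] != gl[t + 1] or gr[t] != gr[t + 1]]
```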
2.4 Experiments

The theoretical framework we presented suggests three hypotheses that deserve empirical tests: (a) the CNN based grasp type recognition module can robustly classify input hand patches into the correct categories; (b) hand grasp type is a reliable cognitive feature to infer human action intention; (c) the evolution of hand grasp types is useful for fine-grain segmentation of human manipulation actions.

To test the three hypotheses empirically, we need to define a set of performance variables and how they relate to our predicted results. The first hypothesis relates to visual recognition, and we can empirically test it by comparing the detected grasp type labels with the ground truth ones using the precision and recall metrics. We further compare the method with traditional hand-crafted feature based approaches to show the advantage of our approach. The second hypothesis relates to the inference of human action intention, and we can also empirically test it by comparing the predicted action intentions with the ground truth ones on a testing set. The third hypothesis relates to manipulation action segmentation, and we can test it by comparing the computed key segment frames with the ground-truth ones.

We used two publicly available datasets: (1) the Oxford hand dataset [2] and (2) an unconstrained cooking video dataset (YouCook) [53].

2.4.1 Grasp Type Recognition in Static Images

Dataset and Experimental protocol

The Oxford hand dataset is a comprehensive dataset of hand images collected from various public image data set sources, with a total of 13050 annotated hand instances. Hand patches larger than a fixed area (a bounding box of 1500 sq. pixels) were considered sufficiently 'large' and were used for evaluation. This way we obtained 4672 hand patches from the training set and 660 hand patches from the testing set (VOC images). We then further augmented the dataset with new annotations. We categorized each patch into one of the seven classes by considering its functionality given the image context and its appearance, following Fig. 2.4. We followed the training and testing protocol from the dataset. For training the grasping type, the image patches were resized to 64 × 64 pixels. The training set contains 4672 image patches and was labeled with the seven grasping types. We used a GPU based CNN implementation [54] to train the neural network, following the structure described above.

We compared our approach with traditional hand-crafted feature based approaches. One was the histogram of oriented gradients (HoG) + Bag of Words (BoW) + SVM classification, the other HoG + BoW + Random Forest. The number of orientations we selected for HoG was 32, and the number of dictionary entries for BoW was 100. The parameters for the baseline methods were tuned to give the best performance.

Experimental results

We achieved an average of 59% classification accuracy using the CNN based method. Table 2.1 shows the performance metrics of each grasp type category and the overall performance in comparison to the baseline algorithms. It can be seen that the CNN based approach has a decent advantage.

Table 2.1: Precision (P) and Recall (R) for each grasp type category and overall accuracy. (A): HoG+BoW+SVM; (B): HoG+BoW+RF; (C): CNN.

Method   PoC (P/R)   PoS (P/R)   PoH (P/R)   PrP (P/R)   PrT (P/R)   PrL (P/R)   RoE (P/R)   Overall Accu
(A)      .44/.46     0/NaN       0/NaN       0/0         0/NaN       0/NaN       .81/.41     .42
(B)      .50/.40     0/NaN       0/NaN       .03/.17     0/0         0/0         .62/.36     .36
(C)      .59/.60     .38/.62     .38/.58     .62/.60     .56/.66     .36/.40     .69/.56     .59

To provide a full picture of our CNN based classification model, we also show the confusion matrix in Fig. 2.8. Our system mainly confused "Power Cylindrical Grasp" with "Rest or Extension". We believe that this is mostly because the fingers form natural curves when resting, which makes the hand look very similar to a cylindrical grasp with a large diameter. Also, our model does not perform well on the "Precision Lumbrical" grasp due to the relatively small amount of training samples in this category.

Figure 2.8: Category pairwise confusion matrix for grasp type classification.

Fig. 2.9 shows some correct grasp type predictions (denoted by black boxes), and some failure examples (denoted by red and blue bounding boxes). Blue boxes denote a correct prediction of the underlying high-level grasp type in either the "Power" or "Precision" category, but incorrect recognition in the finer categories. Red boxes denote a confusion between "Power" and "Precision" grasps. Intuitively, the blue marked errors should be penalized less than the red marked ones.

Figure 2.9: Examples of correct and false classification. PoC: Power Cylindrical; PoS: Power Spherical; PoH: Power Hook; PrP: Precision Pinch; PrT: Precision Tripod; PrL: Precision Lumbrical; RoE: Rest or Extension.

2.4.2 Inference of Action Intention from Grasp Type

Dataset and Experimental protocol

A subset of 200 images from the Oxford hand dataset serves as the testing bed for action intention classification. Since not every image in the test set contains an action intention that falls into one of the three major categories described above, the subset was selected with the following rules: (1) at least one hand of the main character can be seen in the image and (2) the main character has a clear action intention. For example, we can infer that the character in Fig. 2.10(a) is going to perform a skill-oriented action that requires accuracy, while this is not clear for the character in Fig. 2.10(b) (pull the rope with force or just posing casually?).

Figure 2.10: A clear action intention vs. an ambiguous one.

We labeled the 200 images into the three major action intention categories and used them as ground truth. The grasp type CNN model was used to extract a 14 dimensional belief distribution as the grasp type feature (due to data from both hands of the main character). A 5-fold cross validation protocol was adopted and we trained each fold using a linear SVM classifier.

Experimental results

We achieved an average 65% prediction accuracy. Table 2.2 reports precision and recall metrics for each category of action intention. We also ran the same experiment using grasp type labels instead of belief distributions (GL+SVM). We can see that it achieves slightly worse performance than using belief distributions.

Table 2.2: Precision (P) and Recall (R) for each intention category and overall accuracy. GL: Grasp type Label; GT: Grasp Type belief distribution.

Method    F-O (P/R)   S-O (P/R)   C (P/R)   Overall Accu
GL+SVM    .54/.35     .73/.59     .80/.89   .63
GT+SVM    .61/.35     .82/.71     .82/.83   .65

Fig. 2.11 shows some interesting correct cases, and Fig. 2.12 shows several failure predictions. We believe that the failure cases are mostly due to wrong grasp type recognition inherited from the previous section. Because of the small number of pairs with ground truth, we were not able to train, for comparison, a converging CNN model that would predict action intention directly from hand patches.

Figure 2.11: Correct examples of predicting action intention.

Figure 2.12: Failure cases of predicting action intention. The label at the bottom denotes the human labeling.

2.4.3 Manipulation Action Fine Level Segmentation using Grasp Type Evolution

In this section we want to demonstrate that the change of grasp type is a good feature for fine grain level manipulation action temporal segmentation.

Dataset and Experimental protocol

Cooking is an activity, requiring a variety of grasp types, that intelligent agents most likely need to learn.
We conducted our experiments on a publicly available cooking video dataset collected from the world wide web and fully labeled, called the Youtube cooking dataset (YouCook) [53]. The data was prepared from open-source Youtube cooking videos with a third-person view. These features make it a good empirical testing bed for our third hypothesis. We conducted the experiment using the following protocols: (1) 8 video clips, which contain at least two fine grain activities, were reserved for testing; (2) all other video frames were used for training; (3) we randomly reserved 10% of the training 31 Figure 2.13: Left and right hand grasp type recognition along timeline and video segmentation results compared with ground truth segments. data as validation set for training the CNNs. For training the grasp type recognition model, we extended the dataset by annotating image patches containing hands in the training frames. The image patches were resized to 64× 64 pixels. The training set contains image patches that was labeled with the seven grasp types. We used the same CNN implementation [54] to train the neural network, following the same structures described above. Action Fine Level Segmentation For each testing clip, we first picked the top two hand proposals using [2] in the first frame, and then we applied a meanshift algorithm based tracking method [3] on both hands to continuously extract an image patch around each hand (Fig. 2.14). The image patches were further resized to 64×64 and pipelined to the trained CNN 32 Figure 2.14: 1st row: sample hand localization on first frame using [2]. 2nd to 5th row: two sample sequences of hand patches extracted using meanshift tracking [3]. model. We then labeled each hand with the grasp type of highest belief score in each frame. After applying a one dimensional mode filtering for temporal smoothing, we computed the grasp type evolution for each hand and segmented whenever one hand changes grasp type, as described in Sec. 2.3.6. Fig. 2.13 shows two examples of intermediate grasp type recognition for the two hands and the detected segmentation. A key frame is considered correct, when a ground truth key frame lies within 10 frames around it. In the first example, the subject’s right hand at the beginning holds the tofu using an Extension grasp, and then she cuts the tofu with a Pinch grasp holding the blade. Then using a precision Tripod grasp she separates one piece of tofu from the rest, and at the end using a 33 Lumbrical grasp she further cuts the smaller piece of tofu. Using the grasp type evolution, our system can successfully detect two key frames out of the three ground truth ones. In the second video, the gentleman using a Cylindrical grasp whisks the bowl at the beginning. Then his left hand extends to reach a small cup, and then using a Hook grasp he holds the cup. After that, his right hand extends to reach a spatula and at the end his right hand scoops food out of the small cup using a Cylindrical grasp. Using the grasp type evolution, our system can successfully detect three key frames out of the four ground truth ones. In the 8 test clips, there are 18 ground truth segmentation key frames, and 14 of them are successfully detected, which yields a recall of 78%. Among the 20 detected segmentation key frames, 16 are correct, which yields a precision of 80%. 
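The fine-level segmentation pipeline evaluated above can be summarized in the following sketch, which assumes an initial hand bounding box from a hand detector such as [2], OpenCV for mean-shift tracking [3], and the grasp-type CNN and segment_boundaries routine sketched earlier; driving mean shift with a hue-histogram back-projection is a common choice and an assumption here, not a detail taken from the thesis:

```python
import cv2
import numpy as np

term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

def track_and_classify(frames, init_box, classify_patch):
    """Track one hand with mean shift and classify its grasp type per frame.
    classify_patch: function mapping a 64x64 BGR patch to a grasp label."""
    x, y, w, h = init_box
    hsv0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv0[y:y+h, x:x+w]], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    labels, window = [], (x, y, w, h)
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.meanShift(backproj, window, term_crit)
        x, y, w, h = window
        patch = cv2.resize(frame[y:y+h, x:x+w], (64, 64))
        labels.append(classify_patch(patch))
    return labels

# left_labels  = track_and_classify(frames, left_hand_box,  classify_patch)
# right_labels = track_and_classify(frames, right_hand_box, classify_patch)
# key_frames   = segment_boundaries(left_labels, right_labels, delta=15)
```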
34 Chapter 3: “Get Your Act Together”: Language-Guided Manipula- tion Action Recognition and Scene Understanding 3.1 Language-Guided Action Recognition Humans display an uncanny ability to perceive the world that far surpasses current vision only based systems both in terms of precision and accuracy. This is largely due to the vast amount of high-level knowledge that humans have acquired over their lives that enables humans to infer and recognize complex visual inputs. In the same way, service and personal robots of the future must be endowed with such knowledge in order to interact and reason with humans and their environments effectively. This knowledge can be encoded in various forms, of which language is clearly the most predominant. Language manifests itself in the form of text which is also extremely accessible from various large research corpora. The ability to reason is an essential faculty for service and personal robots. A typical scenario is when the robot needs to understand an action or task which the human performs for the purpose of learning. Such Learning from Demonstration (LfD) [55] paradigm is gaining popularity in robotics as shown by the recently 35 Figure 3.1: (Top) Ambiguities in action recognition: similar trajectories for different actions. Tools considered in isolation can only suggest possible actions. (Below) Language can predict, given the tool and action trajectories, the most likely action label. concluded Learning by Demonstration Challenge at AAAI 20111. There are two interconnected components in LfD. The first component recognizes what actions occurred and the second component creates an internal representation so that the action can be recreated by the robot. Unlike the LfD challenge where the robot is supposed to recreate the one task it was taught (the second component), we address a variant of the first component where the robot labels the correct action associated with a set of unlabeled action data (with repetitions) performed by different human teachers (or actors). Posing our problem in terms of unlabeled data over different actors is also closer to reality as the robot needs to learn how to generalize the action so that it can discover, in the second component, an optimal representation to recreate this action on its own. 1http://www.lfd-challenge.org/ 36 In this chapter, we consider the task of recognizing human activities from videos sequences – specifically, activities that involve hand-tools. Action or activity recognition has remained one of the most difficult problems in computer vision. The main reason is that detections of key objects that define the action in video – tools, objects, hands and humans are still unreliable even using current state of the art object detectors [4,56]. Descriptions based on tracking trajectories of local features such as STIP [57] and modeling velocity as suggested in [58], are strongly viewpoint dependent and may be confused when similar movements are used for different actions, e.g. drinking from a cup vs. peeling. Both actions involve large up-down hand movements. The main challenge of action recognition is the ambiguity when these two components: objects and trajectories, are considered in isolation. From Fig. 3.1(top), detecting a cup or action trajectory in isolation can only suggest some likely actions. A more reliable prediction can be achieved when we combine both object detection and trajectories. 
A missing component in the above approach is how we can combine object detection with actions in a logical and intuitive way. To do this, we propose to model language as a resource of prior knowledge that essentially encodes how actions and objects (tools) are related. This language model allows us to weigh the video detections so that the objects and action features that best explain the video are eventually selected (Fig. 3.1(below)). 37 3.1.1 Related Work Action recognition research spans a long history. Comprehensive reviews of recent state of the art can be found in [59–61]. Most of the focus was on studying human actions that were characterized by movement and change of posture, such as walking, running, jumping etc. Many studies represent actions by modeling body poses, body joints and body parts [62]. Depending on the extent of the features used, the literature distinguishes between local and global action models. The former use spatio-temporal interest points and descriptors based on intensity, gradients and flow within spatio-temporal cuboids centered on these interest points [57,63]. The latter compute descriptors over the whole video frame or an extracted human skeleton or silhouette. For example, [64] used histograms of optical flow and Gorelick et al. [65] and Yilmaz et al. [66] represented human activities by 3-D space-time shapes. Another class of approaches model the evolution of actions over time. For example Bissacco et al. [67] used joint-angle as well as joint trajectories from motion-capture data and features extracted from silhouettes to represent action profiles. Chaudhry et al. [68] employed non linear dynamical systems to model the temporal evolution of optical flow histograms. Our approach is more closely related to the use of language for object detection and image annotation. With advances on textual processing and detection, several works recently focused on using sources of data readily available “in the wild” to analyze static images. The seminal work of Duygulu et al. [69] showed how nouns can provide constraints that improve image segmentation. Gupta et al. [70] (and 38 references herein) added prepositions to enforce spatial constraints in recognizing objects from segmented images. Berg et al. [71] processed news captions to discover names associated with faces in the images, and Jie et al. [72] extended this work to associate poses detected from images with the verbs in the captions. Some studies also considered dynamic scenes. [73] studied the aligning of screen plays and videos, [74] learned and recognized simple human movement actions in movies, and [75] studied how to automatically label videos using a compositional model based on AND-OR-graphs that was trained on the highly structured domain of baseball videos . These recent works had shown that exploiting co-occurring text information from scripts and captions aids in the visual labeling task. This chapter takes this further by using generic text obtained from the English Gigaword corpus [76], which is a large corpus of English newswire text from which we learn a language model. As we will show, using general NLP tools, we still can derive interesting relationships to guide the visual task of action recognition. 3.1.2 Our Approach The input is a set of |M | videos, M = {md}, d ∈ {1, 2, · · · , |M |} containing some actions, with each video containing exactly one unique action. The |V | actions are drawn from the set V = {vj}, j ∈ {1, 2, · · · , |V |}. 
This means that we have assumed that the task of segmenting actions from a long video sequence has been done. In addition, we assume that every action must have an actor that uses a particular hand-tool. The same tool can be used in multiple actions. The |N| tools come from the set N = {n_i}, i ∈ {1, 2, ..., |N|}. All labels (both actions and tools) must be known in advance, which is a requirement for learning the appropriate language model. A summary of the approach is shown in Fig. 3.2.

Figure 3.2: Key components of the approach: (a) Training the language model from a large text corpus. (b) Detected tools are queried into the language model. (c) The language model returns a prediction of the action. (d) Action features are compared and beliefs updated.

The high level overview of the approach is as follows (see Fig. 3.2): 1) We first detect potential tools from the input video. 2) For each identified tool, we query a trained language model to determine the most likely verbs (actions) associated with the tool. 3) We then confirm the predicted action using the action features obtained from the video to update our confidence in the current action label of the video. This process is repeated until our belief over the action labels is maximized over all tools and action features. Note that our approach is symmetric; that is, we could have started off with action features and queried the language model in exactly the same way. Our choice of starting with objects is based on the fact that object detectors are better researched and are generally more accurate than action detectors.

For the purpose of this discussion, we shall assume the most general case where we only have unlabeled video data. This means that the best we can do is to perform some form of clustering to discover automatically the best action label that describes each cluster. Intuitively, we want to learn the "meaning" of actions by grounding them to visual representations obtained from video data. Hence, if we knew this grounding, we could assign action labels to the videos; on the other hand, if we knew the action labels of the video data, we could estimate this grounding. This leads naturally to an iterative Expectation-Maximization (EM) formulation where we attempt to determine the optimal grounding parameters that will assign action labels to videos with the highest probability.

More formally, our goal is to label each video with its most likely action, along with the tool that is associated with the action. That is, we want to maximize the likelihood:

L(D; A) = \mathbb{E}_{P(A)}[L(D|A)] = \mathbb{E}_{P(A)}[\log P(F_M, P_I(\cdot), P_L(\cdot) | A)]   (3.1)

where A is the current (binary) action label assignment of the videos (see eq. (3.3)). D is the data computed from the video, which consists of: 1) the language model P_L(\cdot) that predicts an action given the detected tool (sec. 3.1.2.1), 2) the tool detection model P_I(\cdot) (sec. 3.1.2.2), and 3) the action features F_M associated with the video (sec. 3.1.2.3). We describe how these three data components are computed in the following paragraphs and detail how we optimize eq. (3.1) using EM in sec. 3.1.3.1.

3.1.2.1 Language as a predictor of actions

The key component of our approach is the language model that predicts the most likely verb (action) associated with a noun (tool), trained from a large text corpus. We view the Gigaword Corpus as a large text resource that contains the information we need to make correct predictions of actions given the tools detected in the video. We do this by training a language model P_L(v_j, n_i) that returns the maximum likelihood estimate of an action v_j given the tool n_i. This can be done by counting the number of times v_j co-occurs with n_i in a sentence:

P_L(v_j | n_i) = \frac{\#(v_j, n_i)}{\#(n_i)}   (3.2)

As many English words share common meanings, a simple count of the action words (verbs) defined in V is likely to grossly underestimate the relationship between the tool and the action it is associated with. For example, in the Gigaword Corpus, the number of times drink co-occurs with cup, where the actual words are used, is not significantly larger than for pick and cup. The reason is that cup can mean a normal drinking cup or a trophy cup. In order to ensure that P_L captures the correct sense of the words (nouns, verbs), we use WordNet to determine the synonyms and hyponyms of the tools and actions considered. As illustrated in Fig. 3.3, extending cup to include drinking_glass, wine_glass, mug captures the expected action drink in a larger part of the corpus.

Figure 3.3: Enlarging the word class to contain synonyms yields more reasonable counts: cup only connects weakly with drink. By clustering other closely related words together, their combined counts increase the desired association between cup and drink.

We then recompute P_L using these enlarged word classes to capture more meaningful relationships between the co-occurring words. Fig. 3.4 shows the |N| × |V| co-occurrence matrix of likelihood scores over all the tools and actions considered in the experiments, denoted as P_L(V|N). For most of the tool classes, the predicted actions are correct (large values along the diagonal): for example, peeler predicts peeling with high probability (0.94) and shaker predicts sprinkling at 0.66. This shows that for tools that have a unique use, our approach can predict the expected action easily. However, there are many co-occurrences which we could not anticipate. For example, using synonyms of cup makes it more selective towards drinking (0.17), but it is sprinkling that has the highest prediction score (0.29). Investigating further reveals that sprinkling has some synonyms, such as drizzle, moisten, splash and splosh, whose uses are also close to cup. Other mis-selected tool-action pairs are also due to confusion at the synonym/hyponym level. We also notice that more general actions such as picking have a more uniform distribution across the tools, which is expected. In the same way, the tool mat is also very general in its use, such that it displays no significant selectivity towards any of the actions. Despite this simplistic model, most of the entries in P_L make sense – and it properly reflects the innate complexity of language. As will be shown in sec. 3.1.4, although the priors from language are weak, they are still helpful for the task of action recognition.

Figure 3.4: Gigaword co-occurrence matrix of tools and predicted actions.
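A minimal sketch of how Eq. (3.2) can be estimated from raw sentences, assuming the corpus has already been tokenized and stemmed, and using NLTK's WordNet interface for the synonym/hyponym expansion (the expansion depth and all variable names are illustrative):

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn

def word_class(word, depth=1):
    """Expand a tool/action word with WordNet synonyms and (one level of)
    hyponyms, e.g. cup -> {cup, drinking_glass, mug, ...}."""
    words = {word}
    for syn in wn.synsets(word):
        words.update(l.name() for l in syn.lemmas())
        if depth > 0:
            for hypo in syn.hyponyms():
                words.update(l.name() for l in hypo.lemmas())
    return words

def train_PL(sentences, tools, verbs):
    """P_L(v|n) = #(v,n) / #(n), with counts over enlarged word classes."""
    tool_sets = {n: word_class(n) for n in tools}
    verb_sets = {v: word_class(v) for v in verbs}
    n_count = defaultdict(float)
    nv_count = defaultdict(float)
    for tokens in sentences:                       # tokens: list of stems
        toks = set(tokens)
        for n in tools:
            if toks & tool_sets[n]:
                n_count[n] += 1
                for v in verbs:
                    if toks & verb_sets[v]:
                        nv_count[(n, v)] += 1
    return {(n, v): nv_count[(n, v)] / n_count[n]
            for n in tools for v in verbs if n_count[n] > 0}
```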
3.1.2.2 Active tool detection strategy

We pursue the following active strategy for detecting, and subsequently recognizing, relevant tools in the video, as illustrated in Fig. 3.5. First, a trained person detector [4] is used to determine the location of the human actor in the video frame. The location of the face is also detected using [77]. Optical flow is then computed [78] and we focus on the human regions which have the highest flow, indicating the potential locations of the hands.
We then apply a variant of a CRF-based color segmentation [79] using a trained skin color+flow model to segment the hand-like regions which are moving. This is justified by the fact that the moving hand is in contact with the tool that we want to identify. In some cases, the face may be detected (since it may be moving) but they are removed using the face detector re- sults. We then apply a trained Partial Least Squares (PLS) object detector similar to [56] near the detected active hand region that returns a detection score at each video frame. Averaging out the detection yields PI(ni|md), the probability that a tool ni ∈ N exists given the video md. We denote PI(N |M) as the set of detection scores (essentially the likelihood) over all tools in N and all videos in M . This active approach has two important benefits. By focusing our processing only on the relevant regions of the video frame, we dramatically reduce the chance that the tool detector will misfire. At the same time, by detecting the hand locations, we obtain immediately the action trajectory, which is used to describe the action as shown in the next section. 45 Figure 3.5: (Best viewed in color) Overview of the tool detection strategy: (1) Optical flow is first computed from the input video frames. (2) We train a CRF segmentation model based on optical flow + skin color. (3) Guided by the flow computations, we segment out hand-like regions (and removed faces if necessary) to obtain the hand regions that are moving (the active hand that is holding the tool). (4) The active hand region is where the tool is localized. Using the PLS detector (5), we compute a detection score for the presence of a tool. 3.1.2.3 Action features Tracking the hand regions in the video provides us with two sets of (left and right) hand trajectories as shown in Fig. 3.6. We then construct for every video a feature vector Fd that encodes the hand trajectories. Fd encodes the frequency and 46 velocity components. Frequency is encoded by using the first 4 real coefficients of the Fourier transform in both the x and y directions, fx, fy, which gives a 16-dim vector over both hands. Velocity is encoded by averaging the difference in hand positions between two adjacent frames 〈δx〉, 〈δy〉 which gives a 4-dim vector. These features are then combined to yield a 20-dim vector Fd. Figure 3.6: Detected hand trajectories. x and y coordinates are denoted as red and blue curves respectively. We denote FM as the set of of action features Fd over all videos in M . 3.1.3 Using Language to guide recognition In this section, we formalize our EM approach to learn a joint tool-action model that assigns the most likely action label associated with a set of unlabeled video. We first derive from eq. (3.1) an expression that allows EM to estimate the parameters of this model, followed by details of the Expectation and Maximization steps. We then show how to use the learned model to perform testing. Finally, we consider the case where we have labeled data in which we formulate a simpler supervised approach. 47 3.1.3.1 Unsupervised learning of a joint tool-action model We first define the latent assignment variable A. To simplify our notations, we will use subscripts to denote tools i = ni, actions j = vj and videos d = md. For each i ∈ N , j ∈ V , d ∈ M , Aijd indicates whether an action j is performed using tool i during video clip d. 
A_{ijd} = 1 if action j is performed using tool i during video clip d, and A_{ijd} = 0 otherwise,   (3.3)

and A is a 3D indicator matrix (a 3D array) over all tools, actions and videos. Denoting the parameters of the model as C = {C_j}, which specifies the grounding of each action j, we seek to determine from eq. (3.1) the maximum likelihood parameter:

C^* = \arg\max_C \sum_A L(D, A | C)   (3.4)

where

L(D, A | C) = \log P(D, A | C) = \log P(A | D, C) P(D | C)   (3.5)

with the data D comprised of the tool detection likelihoods P_I(N|M), the tool-action likelihoods P_L(V|N) and the action features F_M under the current model parameters C. Geometrically, we can view C as the superset of the |V| action label centers that defines our current grounding of each action j in the action feature space.

Using these centers, we can write the assignment probability for each video d, tool i and action j, P(A_{ijd} | D, C), as:

P(A_{ijd} = 1 | D, C) = P_I(i|d) P_L(j|i) Pen(d|j)   (3.6)

where Pen(d|j) is an exemplar-based likelihood function defined between the action feature F_d associated with video d and the current model parameter C_j for action j:

Pen(d|j) = \frac{1}{Z} \exp\left(-\|F_d - C_j\|^2\right)   (3.7)

where Z is a normalization factor. What eq. (3.7) encodes is the penalty that we score against the assignment when there is a large mismatch between F_d and C_j, the cluster center of action j. Rewriting eq. (3.6) over all videos M, tools N and actions V, we have:

P(A = 1 | D, C) = P_I(N|M) P_L(V|N) Pen(F_M|C)   (3.8)

where we use the set variables to represent the full data and assignment model parameters. In the derivation that follows, we will abbreviate P(A = 1 | D, C) as P(A = 1), with P(A = 0) = 1 − P(A = 1). We detail the Expectation and Maximization steps in the following sections.

Expectation step: We compute the expectation of the latent variable A, denoted by W, according to the probability distribution of A given our current model parameters C and the data (P_I, P_L, and F_M):

W = \mathbb{E}_{P(A)}[A] = P(A = 1) \times 1 + (1 − P(A = 1)) \times 0 = P(A = 1)   (3.9)

According to eq. (3.6), the expectation of A is:

W = P(A = 1) \propto P_I(N|M) P_L(V|N) Pen(F_M|C)   (3.10)

Specifically, for each i ∈ N, j ∈ V, d ∈ M:

W_{ijd} \propto P_I(i|d) P_L(j|i) Pen(d|j)   (3.11)

Here, W is a |N| × |V| × |M| matrix. Note that the constant of proportionality does not matter because it cancels out in the Maximization step.

Maximization step: The maximization step seeks to find the updated parameters Ĉ that maximize eq. (3.5) with respect to P(A):

Ĉ = \arg\max_C \mathbb{E}_{P(A)}[\log P(A | D, C) P(D | C)]   (3.12)

where D = {P_I, P_L, F_M}. EM replaces P(A) with its expectation W. As A, P_I and P_L are independent of the model parameters C, we can simplify eq. (3.12) to:

Ĉ = \arg\max_C P(F_M | C) = \arg\max_C \left( -\sum_{i,j,d} W_{ijd} \|F_d − C_j\|^2 \right)   (3.13)

where we have replaced P(F_M|C) using eq. (3.7), since the relationship between F_M and C is the penalty function Pen(F_M|C). This enables us to define a target maximization function F(C) = \sum_{i,j,d} W_{ijd} \|F_d − C_j\|^2.

According to the Karush-Kuhn-Tucker conditions, we can solve the maximization problem from the following constraint, for each j ∈ V:

\frac{\partial F}{\partial C_j} = −2 \sum_{i \in N, d \in M} W_{ijd} (F_d − C_j) = 0   (3.14)

Thus, for each j ∈ V, we have:

Ĉ_j = \frac{\sum_{i \in N, d \in M} W_{ijd} F_d}{\sum_{i \in N, d \in M} W_{ijd}}   (3.15)

We then update C = Ĉ within each iteration until convergence.
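A minimal NumPy sketch of this EM loop (Eqs. 3.9–3.15), assuming the tool detection scores P_I(N|M), the language model P_L(V|N) and the action features F_M have already been computed; the initialization and convergence test are illustrative choices:

```python
import numpy as np

def em_tool_action(P_I, P_L, F, n_iter=50):
    """P_I: |M| x |N| tool detection likelihoods, P_L: |N| x |V| verb
    likelihoods, F: |M| x dim action features. Returns action centers C
    (|V| x dim) and the expected assignments W (|N| x |V| x |M|)."""
    M, N = P_I.shape
    V = P_L.shape[1]
    rng = np.random.default_rng(0)
    C = F[rng.choice(M, size=V)]                 # random initial centers
    for _ in range(n_iter):
        # E-step: W_ijd ~ P_I(i|d) * P_L(j|i) * Pen(d|j)   (Eq. 3.11)
        pen = np.exp(-((F[:, None, :] - C[None, :, :]) ** 2).sum(-1))  # |M| x |V|
        W = P_I.T[:, None, :] * P_L[:, :, None] * pen.T[None, :, :]
        # M-step: weighted mean of action features per action j   (Eq. 3.15)
        w_j = W.sum(axis=0)                      # |V| x |M|
        C_new = (w_j @ F) / (w_j.sum(axis=1, keepdims=True) + 1e-12)
        if np.allclose(C_new, C, atol=1e-6):
            break
        C = C_new
    return C, W
```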
Testing the learned model: The learned model C^* can be used to classify new videos from a held-out testing set. Denoting the input test video as m_t, we predict its most likely action label v_t^* by:

v_t^* = \arg\max_{j \in V} \sum_{i \in N} P_I(i|m_t) P_L(j|i) Pen(F_t|C_j^*)   (3.16)

where F_t is the action feature extracted from m_t and C_j^* is the j-th action center from the learned model.

3.1.3.2 Supervised action classification

If we have labeled video data of actions, a supervised approach is the most straightforward. Every video m_d is represented by a set of features \mathcal{F}_d that combines F_d (action features), P_I (tool detection) and P_L in the following manner:

\mathcal{F}_d = [F_d; P_I(N|m_d); P_I(N|m_d) \times P_L(V|N)] = [F_d; P_I(N|m_d); P_L(V|m_d)]   (3.17)

where ; denotes a concatenation of the feature vectors. This yields a (20 + |N| + |V|)-dimensional vector. What eq. (3.17) means is that for every video m_d, we concatenate its F_d together with P_I(N|m_d), the distribution over all |N| tools, and with the verb prediction P_L(V|m_d) obtained from the text corpus. Given labeled training videos from all possible actions in V, we can proceed to train discriminative classifiers (SVM, Bayes Net and Naive Bayes) to predict the action in a testing video.

3.1.4 Experiments

We performed a series of experiments using a new dataset of human actions performed with hand-tools to show quantitatively how language aids action recognition. We first describe the dataset, and then report recognition results for both the unsupervised and the supervised scenarios.

3.1.5 The UMD Sushi-Making Dataset

The UMD Sushi-Making Dataset (http://www.umiacs.umd.edu/research/POETICON/umd_sushi/) consists of 12 actions, performed by 4 actors using 10 different kitchen tools, for the purpose of preparing sushi. This results in 48 video sequences, each of around 1000 frames (30 seconds long). Other well known datasets such as the KTH, Weizmann or Human-EVA datasets [65, 80, 81] do not involve hand-tools. The human-object interaction dataset by Gupta et al. [82] has only 4 objects. The dataset by Messing et al. [58] has only 4 actions with tool use. The CMU Kitchen Dataset [83] has many tool interactions for 18 subjects making 5 recipes, but many of the actions are blocked from view due to the placement of the 4 static cameras, and the head mounted camera gives a limited and shaky top-down view which cannot be processed easily.

Our Sushi-Making dataset provides a clear view of the actions performed with the tools, and it simulates an active robotic agent observing the human actor performing the action in a realistic environment (a kitchen) with a real task (making sushi) that is made up of several actions the robot must identify. See Fig. 3.5 for an example. The 12 actions are: cleaning, cutting, drinking, flipping, peeling, picking (up), pouring, pressing, sprinkling, stirring, tossing, turning. The tools used are: tissue, knife, cup, rolling-mat, fruit-peeler, water-pitcher, spoon, shaker, spatula, mixing-bowl. As was discussed in sec. 3.1.2.1, some of the actions such as picking or flipping are extremely general and are easily confused. We made this choice to ensure that the language prediction P_L is not perfect and to show that our approach works even under noisy data.

3.1.5.1 Baseline: Vision-only Recognition

As a baseline, we perform experiments without the language component, that is, P_L in eq. (3.17) is not considered as part of \mathcal{F}_d. Two experiments are performed: 1) clustering using K-means and 2) supervised classification.

K-means clustering results: For the case of unlabeled videos, we performed K-means clustering with k = 12.
We used 36 videos in this experiment. As labels, we took the majority ground truth labels from each cluster to be the predicted labels of the cluster. We then counted the number of videos that are correctly placed in the right cluster to derive a measure of accuracy, which is reported in Fig. 3.8(a). Supervised classification results: From the 48 videos from the UMD Sushi- 53 Making dataset, we used 36 videos from 3 actors to train a degree 3 polynomial SVM classifier for the 12 actions. We set the cost parameter to 1.0 with a tolerance termi- nation value at 0.001. These parameters were chosen from a separate development set of 8 videos. The remaining 12 videos were then used for testing. 4-fold cross validation was performed and the classification accuracy is reported. In addition, we trained a Naive Bayes (NB) classifier and a Bayes Net (BN) classifier over the training data. The BN is initialized as a NB with at most 1 parent. We then apply a simple estimator to estimate the conditional probability tables. All classifiers are tested using WEKA [84]. We summarize the results in Fig. 3.8(b). 3.1.5.2 Adding Language In this section, we performed experiments with the aid of the language com- ponent PL. Three separate experiments are performed: 1) Unsupervised EM, 2) Semi-Supervised EM where we initialized the model parameters C with 12 known labels and 3) Supervised classification using trained SVM, Bayes Net and Naive Bayes classifiers. 36 videos were used for training the joint tool-action model using EM and 12 videos were held out for testing. For the supervised part, the same parameters as the baseline were used. We report our unsupervised and supervised results in Figs. 3.8(a) and 3.8(b) respectively. More detailed results (confusion ma- trices for each action) can be found online3. In addition, we show in Fig. 3.7 the improving recognition accuracy of the trained model at each iteration. The evolu- tion of the action labels versus the ground truth is also presented. Testing on the 3http://www.umiacs.umd.edu/research/POETICON/umd_sushi/res_ICRA2012 54 Figure 3.7: (Best viewed in color) (Left) Unsupervised EM: accuracy at each iter- ation. (Right) Scatterplots of action label assignments at selected iterations. We see that with each iteration, the assignment label clusters approaches the ground truth label (boxed in red). Note that we used PCA to reduce the action feature dimensions to 2 for visualization. held out video set using the trained model yields a recognition accuracy of 83.33%. 3.1.5.3 Comparison with state of art action features We compared our approach with a bag-of-words (BoW) representation built upon state of the art STIP [57] features clustered using K-means with k = 50. We trained three classifiers: SVM, Naive Bayes and Bayesian Net using the same proce- dure and parameters as the baseline and we summarize the results in Table 3.1. The BoW representation using STIP achieves a maximum classification rate of 77.08% with trained SVM classifiers. Our approach which uses comparatively simpler video features: Fourier coefficients + velocity eq. (3.17) outperforms the state of the art significantly. This gain is possible due to the addition of language prediction in the 55 Feature Method Accuracy STIP+Bag of Words Naive Bayes 56.25% Bayes Net 75% SVM 77.08% Action Features+Language Naive Bayes 66.67% Bayes Net 85.41% SVM 91.67% Action Features+Language Unsupervised EM 77.78% Semi-supervised EM 91.67% Table 3.1: Classification accuracy: STIP versus our approach. 
action feature. 3.1.5.4 Discussion: the effects of adding language Comparing the unsupervised recognition results, K-means clustering on the action features alone (with PL) achieves only 69.44% recognition rate. The cluster- ing accuracy, with the addition of PL and using the EM formulation described in sec. 3.1.3.1 achieves 77.78% with random initialization of the model C. We show fur- ther that with correctly initialized parameters from 12 labeled videos, is enough to increase the accuracy to 91.67%, which is just as good as the SVM classifier (which is supervised). This result shows that once again, even with no or limited labeled data, our proposed EM formulation is able to leverage the predictive power of PL to find the optimal action and its corresponding tool that best explains the video. 56 Fig. 3.9 shows some video frames with the predicted action and corresponding tool using EM. For the classification results in Fig. 3.8(b) using the three classifiers, we clearly see that the addition of PL increases the classification accuracy, with the most dramatic increase when SVM or Naive Bayes are used: from 87.5% to 91.67% (SVM) and 62.5% to 66.67% when language is added. This shows that even with a simple model, PL is able to provide additional discriminatory features which improve the classification. The most important result is that these features are estimated directly from a generic text corpus and the method is not limited to a particular domain such as cooking. This fact alone highlights the strength of language in aiding action classification. Besides improving action recognition accuracy in both supervised and unsu- pervised scenarios, another key observation from our results is that language is com- plementary in aiding many vision related tasks where the use of high-level knowledge is required. Previous works described in sec. 3.1.1 have shown that language (in a restricted sense) can be used to simplify ill-posed image problems like segmentation and annotation. We showed here that the difficult problem of recognizing actions involves high-level knowledge as well. This is because of the strong relationship between the actions and the tools that were used to perform these actions. Instead of learning from a huge amount of training image data on how tools correlate with actions, we showed that it is possible to obtain such information directly from a text corpus. Such a text corpus, although noisy, is much easier to obtain than an- notated image data; and we showed that with the right EM formulation, the noisy 57 predictions from PL provides us appreciable gains in recognition rates on unlabeled video. Figure 3.8: (a) Unsupervised recognition accuracy: no language (K-Means) ver- sus language (EM). (b) Classification accuracy: no language versus language. All reported results have variances within ±0.5%. 3.2 Language Guided Scene Understanding for Robots What happens when you see a picture? The most natural thing would be to describe it using words : using speech or text. This description of an image is the output of an extremely complex process that involves: 1) perception in the Visual space, 2) grounding to World Knowledge in the Language Space and 3) speech/text production (see Fig. 3.10). Each of these components are challenging in their own right and are still considered open problems in the vision and linguistics fields. In this chapter, we introduce a computational framework that attempts to integrate these components together. 
Our hypothesis is based on the assumption that nat- ural images accurately reflect common everyday scenarios which are captured in 58 Figure 3.9: Some predicted action and tools using EM. The wrong prediction (in red and italicized) of the sprinkle action is due to a high co-occurrence with bowl in PL(V |N). Figure 3.10: The processes involved for describing a scene. language. For example, knowing that boats usually occur over water will enable us to constrain the possible scenes a boat can occur and exclude highly unlikely ones – street, highway. It also enables us to predict likely actions (Verbs) given the 59 current object detections in the image: detecting a dog with a person will likely induce walk rather than swim, jump, fly. Key to our approach is the use of a large generic corpus such as the English Gigaword [76] as the semantic grounding to predict and correct the initial and often noisy visual detections of an image to produce a reasonable sentence that succinctly describes the image. Figure 3.11: Illustration of various perceptual challenges for sentence generation for images. (a) Different images with semantically the same content. (b) Pose relates ambiguously to actions in real images. In order to get an idea of the difficulty of this task, it is important to first define what makes up a description of an image. Based on our observations of annotated image data (see Fig. 3.13), a descriptive sentence for an image must contain at minimum: 1) the important objects (Nouns) that participate in the image, 60 2) Some description of the actions (Verbs) associated with these objects, 3) the scene where this image was taken and 4) the preposition that relates the objects to the scene. That is, a quadruplet of T = {n, v, s, p} (Noun-Verb-Scene-Preposition) that represents the core sentence structure. Generating a sentence from this quadruplet is obviously a simplification from state of the art generation work, but as we will show in the experimental results (sec. 3.2.4), it is sufficient to describe images. The key challenge is that detecting objects, actions and scenes directly from images is often noisy and unreliable. We illustrate this using example images from the Pascal-Visual Object Classes (VOC) 2008 challenge [85]. First, Fig. 3.11(a) shows the variability of images in their raw image representations: pixels, edges and local features. This makes it difficult for state of the art object detectors [4,56] to reliably detect important objects in the scene: boat, humans and water – average precision scores reported in [4] manages around 42% for humans and only 11% for boat over a dataset of almost 5000 images in 20 object categories. Yet, these images are semantically similar in terms of their high level description. Second, cognitive studies [86, 87] have proposed that inferring the action from static images (known as an “implied action”) is often achieved by detecting the pose of humans in the image: the position of the limbs with respect to one another, under the assumption that a unique pose occurs for a unique action. Clearly, this assumption is weak as 1) similar actions may be represented by different poses due to the inherent dynamic nature of the action itself: e.g. walking a dog and 2) different actions may have the same pose: e.g. walking a dog versus running (Fig. 3.11(b)). The missing component here is whether the key object (dog) under interaction is considered. 
61 Recent works [88, 89] that used poses for recognition of actions achieved 70% and 61% accuracy respectively under extremely limited testing conditions with only 5-6 action classes each. Finally, state of the art scene detectors [5, 90] need to have enough representative training examples of scenes from pre-defined scene classes for a classification to be successful – with a reported average precision of 83.7% tested over a dataset of 2600 images. Addressing all these visual challenges is clearly a formidable task which is be- yond the scope of this chapter. Our focus instead is to show that with the addition of language to ground the noisy initial visual detections, we are able to improve the quality of the generated sentence as a faithful description of the image. In partic- ular, we show that it is possible to avoid predicting actions directly from images – which is still unreliable – and to use the corpus instead to guide our predictions. Our proposed strategy is also generic, that is, we make no prior assumptions on the image domain considered. While other works (sec. 3.2.1) depend on strong annotations between images and text to ground their predictions (and to remove wrong sentences), we show that a large generic corpus is also able to provide the same grounding over larger domains of images. It represents a relatively new style of learning: distant supervision [91, 92]. Here, we do not require “labeled” data containing images and captions but only separate data from each side. Another contribution is a computationally feasible way via dynamic programming to deter- mine the most likely quadruplet T ∗ = {n∗, v∗, s∗, p∗} that describes the image for generating possible sentences. 62 3.2.1 Related Work Recently, several works from the Computer Vision domain have attempted to use language to aid image scene understanding. [93] used predefined production rules to describe actions in videos. [71] processed news captions to discover names associated with faces in the images, and [72] extended this work to associate poses detected from images with the verbs in the captions. Both approaches use annotated examples from a limited news caption corpus to learn a joint image-text model so that one can annotate new unknown images with textual information easily. Neither of these works have been tested on complex everyday images where the large variations of objects and poses makes it nearly impossible to learn a more general model. In addition, no attempt was made to generate a descriptive sentence from the learned model. The work of [94] attempts to “generate” sentences by first learning from a set of human annotated examples, and producing the same sentence if both images and sentence share common properties in terms of their triplets: (Nouns-Verbs-Scenes). No attempt was made to generate novel sentences from images beyond what has been annotated by humans. [95] has recently introduced a framework for parsing images/videos to textual description that requires significant annotated data, a requirement that our proposed approach avoids. Natural language generation (NLG) is a long-standing problem. Classic ap- proaches [96] are based on three steps: selection, planning and realization. A com- mon challenge in generation problems is the question of: what is the input? Re- cently, approaches for generation have focused on formal specification inputs, such 63 as the output of theorem provers [97] or databases [98]. Most of the effort in those approaches has focused on selection and realization. 
We address a tangential problem that has not received much attention in the generation literature: how to deal with noisy inputs. In our case, the inputs themselves are often uncertain (due to misrecognitions by the object/scene detectors) and the content selection and realization need to take this uncertainty into account.

3.2.2 Our Approach

Our approach is summarized in Fig. 3.12. The input is a test image on which we detect objects and scenes using trained detection algorithms [4, 5]. To keep the framework computationally tractable, we limit the elements of the quadruplet (Nouns-Verbs-Scenes-Prepositions) to come from finite sets of object classes N, action classes V, scene classes S and preposition classes P that are commonly encountered. They are summarized in Table 3.2. In addition, the sentence that is generated for each image is limited to at most two objects occurring in a unique scene.

Figure 3.12: Overview of our approach. (a) Detect objects and scenes from the input image. (b) Estimate the optimal sentence structure quadruplet T^*. (c) Generate a sentence from T^*.

Table 3.2: The set of objects, actions (first 20), scenes and preposition classes considered.

Objects n ∈ N: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor
Actions v ∈ V: sit, stand, park, ride, hold, wear, pose, fly, lie, lay, smile, live, walk, graze, drive, play, eat, cover, train, close, ...
Scenes s ∈ S: airport, field, highway, lake, room, sky, street, track
Prepositions p ∈ P: in, at, above, around, behind, below, beside, between, before, to, under, on

Denoting the current test image as I, the initial visual processing first detects objects n ∈ N and scenes s ∈ S using these detectors to compute Pr(n|I) and Pr(s|I), the probabilities that object n and scene s exist in I. From the observation that an action can often be predicted by its key objects N_k = {n_1, n_2, ..., n_i}, n_i ∈ N that participate in the action, we use a trained language model L_m to estimate Pr(v|N_k). L_m is also used to compute Pr(s|n, v), the scene predicted from the corpus given the object and verb, and Pr(p|s), the predicted preposition given the scene. This process is repeated over all n, v, s, p, where we use a modified HMM inference scheme to determine the most likely quadruplet T^* = {n^*, v^*, s^*, p^*} that makes up the core sentence structure. Using the contents and structure of T^*, an appropriate sentence is then generated that describes the image. In the following sections, we first introduce the image dataset used for testing, followed by details of how these components are derived.

3.2.2.1 Image Dataset

We use the UIUC Pascal Sentence dataset, first introduced in [94] and available online (http://vision.cs.uiuc.edu/pascal-sentences/). It contains 1000 images taken from a subset of the Pascal-VOC 2008 challenge image dataset, hand annotated with sentences that describe the image by paid human annotators using Amazon Mechanical Turk. Fig. 3.13 shows some sample images with their annotations. There are 5 annotations per image, and each annotation is usually short – around 10 words long. We randomly selected 900 images (4500 sentences) as the learning corpus to construct the verb and scene sets {V, S} as described in sec. 3.2.3.1, and kept the remaining 100 images for testing and evaluation.

Figure 3.13: Samples of images with corresponding annotations from the UIUC scene description dataset.

3.2.3 Object and Scene Detections from Images

Figure 3.14: (a) [Top] The part based object detector from [4]. [Bottom] The graphical model representation of an object, e.g. a bike. (b) Examples of GIST gradients: (left) an outdoor scene vs. (right) an indoor scene [5].

We use the Pascal-VOC 2008 trained object detectors [99] of the 20 common everyday object classes that are defined in N. Each of the detectors is essentially an SVM classifier trained on a large number of the objects' image representations from a large variety of sources. Although 20 classes may seem small, their presence in many natural images (e.g. humans, cars and plants) makes them particularly important for our task, since humans tend to describe these common objects as well. As object representation, the part-based descriptor of [4] is used. This representation decomposes any object, e.g. a cow, into its constituent parts – head, torso, legs – which are shared by other objects in a hierarchical manner. At each level, image gradient orientations are computed. The relationship between the parts is modeled probabilistically using graphical models, where parts are the nodes and the edges are the conditional probabilities that relate their spatial compatibility (Fig. 3.14(a)). For example, in a cow, the probability of finding the torso near the head is higher than finding the legs near the head. The intuition of this model lies in the assumption that objects can be deformed but the relative position of their constituent parts should remain the same.

We convert the object detection scores to probabilities using Platt's method [100], which is numerically more stable, to obtain Pr(n|I). The parameters of Platt's method are obtained by estimating the number of positives and negatives from the UIUC annotated dataset, from which we determine the appropriate probabilistic threshold, which gives us approximately 50% recall and precision.

For detecting the scenes defined in S, we use the GIST-based scene descriptor of [5]. GIST computes the windowed 2D Gabor filter responses of an input image. The responses of the Gabor filters (4 scales and 6 orientations) encode the texture gradients that describe the local properties of the image. Averaging these responses over larger spatial regions gives us a set of global image properties. These high dimensional responses are then reprojected to a low dimensional space via PCA, where the number of principal components is obtained empirically from training scenes. This representation forms the GIST descriptor of an image (Fig. 3.14(b)), which is used to train a set of SVM classifiers for each scene class in S. Again, Pr(s|I) is computed from the SVM scores using [100]. The set of common scenes defined in S is learned from the UIUC annotated data (sec. 3.2.3.1).
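Both detector outputs are thus mapped to probabilities with Platt's method, i.e. a sigmoid fitted to the raw SVM scores. A minimal sketch, assuming held-out scores with binary labels are available; fitting the sigmoid with a logistic regression on the scores is a standard shortcut used here purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Fit Pr(object present | score) = sigmoid(A*score + B) from held-out
    detector scores and binary ground-truth labels."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# Example: calibrate one object detector, then compute Pr(n|I) on new images.
platt_n = fit_platt(scores=[-2.1, -0.3, 0.4, 1.7], labels=[0, 0, 1, 1])
pr_n_given_I = platt_n([0.9, -1.2])   # calibrated probabilities
```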
3.2.3.1 Corpus-Guided Predictions

Figure 3.15: (a) Selecting the ROOT verb from the dependency parse, "ride", reveals its subject "woman" and direct object "bicycle". (b) Selecting the head noun (PMOD) as the scene "street" reveals ADV as the preposition "on".

Predicting Verbs: The key component of our approach is the trained language model L_m that predicts the most likely verb v associated with the objects N_k detected in the image.
Since it is possible that different verbs may be associated with varying number of object arguments, we limit ourselves to verbs that take on at most two objects (or more specifically two noun phrase arguments) as a simpli- fying assumption: Nk = {n1, n2} where n2 can be NULL. That is, n1 and n2 are the subject and direct objects associated with v ∈ V . Using this assumption, we can construct the set of verbs, V . To do this, we use human labeled descriptions of the training images from the UIUC Pascal-VOC dataset (sec. 3.2.2.1) as a learning corpus that allows us to determine the appropriate target verb set that is amenable to our problem. We first apply the CLEAR parser [101] to obtain a dependency parse of these annotations, which also performs stemming of all the verbs and nouns in the sentence. Next, we process all the parses to select verbs which are marked as ROOT and check the existence of a subject (DEP) and direct object (PMOD, 69 OBJ) that are linked to the ROOT verb (see Fig. 3.15(a)). Finally, after removing common “stop” verbs such as {is, are, be} we rank these verbs in terms of their occurrences and select the top 50 verbs which accounts for 87.5% of the sentences in the UIUC dataset to be in V . Object class n ∈ N Synonyms, 〈n〉 bus autobus charabanc double-decker jitney motorbus motorcoach omnibus passenger-vehicle schoolbus trolleybus streetcar ... chair highchair chaise daybed throne rocker armchair wheelchair seat ladder-back lawn-chair fauteuil ... bicycle bike wheel cycle velocipede tandem mountain-bike ... Table 3.3: Samples of synonyms for 3 object classes. Next, we need to explain how n1 and n2 are selected from the 20 object classes defined previously in N . Just as the 20 object classes are defined visually over several different kinds of specific objects, we expand n1 and n2 in their textual descriptions using synonyms. For example, the object class n1=aeroplane should include the synonyms {plane, jet, fighter jet, aircraft}, denoted as 〈n1〉. To do this, we expand each object class using their corresponding WordNet synsets up to at most three hyponymns levels. Example synonyms for some of the classes are summarized in Table 3.3. We can now compute from the Gigaword corpus [76] the probability that a 70 verb exists given the detected nouns, Pr(v|n1, n2). We do this by computing the log- likelihood ratio [102] , λnvn, of trigrams (〈n1〉 , v, 〈n2〉), computed from each sentence in the English Gigaword corpus [76]. This is done by extracting only the words in the corpus that are defined in N and V (including their synonyms). This forms a reduced corpus sequence from which we obtain our target trigrams. For example, the sentence: the large brown dog chases a small young cat around the messy room, forcing the cat to run away towards its owner. will be reduced to the stemmed sequence dog chase cat cat run owner5 from which we obtain the target trigram relationships: {dog chase cat}, {cat run owner} as these trigrams respect the (n1, v, n2) ordering. The log-likelihood ra- tios, λnvn, computed for all possible (〈n1〉 , v, 〈n2〉) are then normalized to obtain Pr(v|n1, n2). An example of ranked λnvn in Fig. 3.16(a) shows that λnvn predicts v that makes sense: with the most likely predictions near the top of the list. Predicting Scenes: Just as an action is strongly related to the objects that participate in it, a scene can be predicted from the objects and verbs that occur in the image. 
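Before moving on to scene prediction, the corpus-reduction step just described for the verbs can be sketched in a few lines of Python. This is only an illustration of the trigram extraction; the log-likelihood-ratio scoring of [102] over the resulting trigram counts and the WordNet synonym expansion are not reproduced here, and the word sets in the example are hypothetical stand-ins for N and V.

def target_trigrams(tokens, nouns, verbs):
    """Reduce a (stemmed) corpus sentence to its candidate (n1, v, n2) trigrams.

    tokens : stemmed words of one corpus sentence
    nouns  : the object classes in N together with their synonyms <n>
    verbs  : the target verb set V
    """
    # Keep only the words that belong to the target vocabulary.
    reduced = [w for w in tokens if w in nouns or w in verbs]
    # Emit every consecutive triple that respects the (n1, v, n2) ordering.
    return [(a, b, c) for a, b, c in zip(reduced, reduced[1:], reduced[2:])
            if a in nouns and b in verbs and c in nouns]

# The stemmed example sentence from above:
sent = ("the large brown dog chase a small young cat around the messy room "
        "force the cat to run away towards its owner").split()
print(target_trigrams(sent, nouns={"dog", "cat", "owner"}, verbs={"chase", "run"}))
# -> [('dog', 'chase', 'cat'), ('cat', 'run', 'owner')]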
For example, detecting Nk = {boat, person} with v = {row} would predict the scene s = {coast}, since boats usually occur in water regions. To learn this relationship from the corpus, we first use the UIUC dataset to discover the common scenes that should be included in S. We applied the CLEAR dependency parser [101] to the UIUC data and extracted all the head nouns (PMOD) in the PP phrases for this purpose, and excluded those nouns with prepositions (marked as ADV) such as {with, of}, which do not co-occur with scenes in general (see Fig. 3.15(b)). We then ranked the remaining scenes in terms of their frequency to select the top 8 scenes used in S. To improve recall and generalization, we expand each of the 8 scene classes using their WordNet synsets 〈s〉 (up to a maximum of three hyponym levels).

Similar to the procedure for predicting the verbs described above, we compute the log-likelihood ratios of the ordered bigrams {n, 〈s〉} and {v, 〈s〉}, λns and λvs, by reducing the corpus sentences to the target nouns, verbs and scenes defined in N, V and S. The probabilities Pr(s|n) and Pr(s|v) are then obtained by normalizing λns and λvs. Under the assumption that the priors Pr(n) and Pr(v) are independent, and applying Bayes' rule, we can compute the probability that a scene co-occurs with the object and action, Pr(s|n, v), by:

Pr(s|n, v) = \frac{Pr(n, v|s)\, Pr(s)}{Pr(n, v)} = \frac{Pr(n|s)\, Pr(v|s)\, Pr(s)}{Pr(n)\, Pr(v)} \propto Pr(s|n) \times Pr(s|v)     (3.18)

where the constant of proportionality is justified under the assumption that Pr(s) is equiprobable for all s. (3.18) is computed for all nouns in Nk. As shown in Fig. 3.16(b), we are able to predict scenes that plausibly co-occur with the detected nouns and verbs.

Predicting Prepositions: It is straightforward to predict the appropriate prepositions associated with a given scene. When we construct S from the UIUC annotated data, we simply collect and rank all the associated prepositions (ADV) in the PP phrases of the dependency parses. We then select the top 12 prepositions used to define P. Using P, we then compute the log-likelihood ratio of the ordered bigrams {p, 〈s〉} for prepositions that co-locate with the scene synonyms over the corpus. Normalizing λps yields Pr(p|s), the probability that a preposition co-locates with a scene. Examples of ranked λps are shown in Fig. 3.16(c). Again, we see that reasonable predictions of p can be found.

Figure 3.16: Example of how ranked log-likelihood values (in descending order) suggest a possible T: (a) λnvn for n1 = person, n2 = bus predicts v = ride. (b) λns and λvs for n = bus, v = ride then jointly predict s = street and finally (c) λps with s = street predicts p = on.

5 Stemming is done using [101].

3.2.3.2 Determining T* using HMM inference

Given the computed conditional probabilities Pr(n|I) and Pr(s|I), which are observations from an input test image, together with the parameters of the trained language model Lm: Pr(v|n1, n2), Pr(s|n, v) and Pr(p|s), we seek the most likely sentence structure T* by:

T^* = \arg\max_{n,v,s,p} Pr(T|n, v, s, p) = \arg\max_{n,v,s,p} \{ Pr(n_1|I)\, Pr(n_2|I)\, Pr(s|I) \times Pr(v|n_1, n_2)\, Pr(s|n, v)\, Pr(p|s) \}     (3.19)

where the last equality holds by assuming independence between the visual detections and the corpus predictions. Obviously, a brute-force approach that tries all possible combinations to maximize eq. (3.19) is not feasible due to the large number of possible combinations: (20 × 21 × 8) × (50 × 20 × 20) × (8 × 20 × 50) × (12 × 8) ≈ 5 × 10^13. A better solution is needed.
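The remainder of this subsection poses the optimization as Viterbi decoding over a small chain of hidden states; as a preview, the decoder itself amounts to only a few lines. The following Python fragment is an illustrative sketch, not the system's implementation: the layers, emission table and transition table below are hypothetical stand-ins for the trained detectors and for the corpus statistics Pr(v|n1, n2), Pr(s|n, v) and Pr(p|s).

def viterbi_chain(layers, emit, trans):
    """Best path through a chain of state layers (here NN -> NV -> S -> P).

    layers : list of lists of states, one list per layer
    emit   : dict (layer_index, state) -> emission probability (default 1.0)
    trans  : dict (previous_state, state) -> transition probability (default 0.0)
    """
    # scores[state] = (best score of any path ending in state, back-pointer)
    scores = {s: (emit.get((0, s), 1.0), None) for s in layers[0]}
    history = [scores]
    for i in range(1, len(layers)):
        nxt = {}
        for s in layers[i]:
            e = emit.get((i, s), 1.0)
            best_prev, best_val = None, 0.0
            for p, (val, _) in history[-1].items():
                cand = val * trans.get((p, s), 0.0) * e
                if cand > best_val:
                    best_prev, best_val = p, cand
            nxt[s] = (best_val, best_prev)
        history.append(nxt)
    # Backtrack from the best final state.
    path = [max(history[-1], key=lambda s: history[-1][s][0])]
    for i in range(len(history) - 1, 0, -1):
        path.append(history[i][path[-1]][1])
    return list(reversed(path))

# Toy illustration (all numbers are made up): NN -> NV -> S -> P
layers = [[("person", "bus")], [("person", "bus", "ride")], ["street", "field"], ["on", "in"]]
emit = {(0, ("person", "bus")): 0.8 * 0.6, (2, "street"): 0.7, (2, "field"): 0.2}
trans = {(("person", "bus"), ("person", "bus", "ride")): 0.5,
         (("person", "bus", "ride"), "street"): 0.6,
         (("person", "bus", "ride"), "field"): 0.1,
         ("street", "on"): 0.7, ("street", "in"): 0.2,
         ("field", "on"): 0.3, ("field", "in"): 0.5}
print(viterbi_chain(layers, emit, trans))
# -> [('person', 'bus'), ('person', 'bus', 'ride'), 'street', 'on']

With the layers instantiated from Table 3.2 and the tables filled from the detections and the language model, the returned path corresponds directly to T*.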
Figure 3.17: The HMM used for optimizing T. The relevant transition and emission probabilities are also shown. See text for more details.

Our proposed strategy is to pose the optimization of T as a dynamic programming problem, akin to a Hidden Markov Model (HMM), where the hidden states are related to the (simplified) sentence structure we seek, T = {n1, n2, s, v, p}, and the emissions are related to the observed detections {n1, n2, s} in the image, if they exist. To simplify our notation, as we are concerned with object pairs, we will write NN for the hidden states over all n1, n2 pairs and nn for the corresponding emissions (detections), and all object+verb pairs are written as hidden states NV. The hidden states are therefore denoted as {NN, NV, S, P}, with values taken from their respective word classes in Table 3.2. The emission states are {nn, s} with binary values: 1 if the detections occur and 0 otherwise. The full HMM is summarized in Fig. 3.17. The rationale for using an HMM is that we can reuse all previous computation of the probabilities at each level to compute the required probabilities at the current level.

From START, we assume all object pair detections are equiprobable: Pr(NN|START) = 1/(|N| · (|N| + 1)), where we have added an additional NULL value for objects (at most 1). At each NN, the HMM emits a detection from the image, and by independence we have Pr(nn|NN) = Pr(n1|I) Pr(n2|I). After NN, the HMM transits to the corresponding verb at state NV with Pr(NV|NN) = Pr(v|n1, n2) obtained from the corpus statistics6. As no action detections are performed on the image, NV has no emissions. The HMM then transits from NV to S with Pr(S|NV) = Pr(s|n, v) computed from the corpus, which emits the scene detection score from the image: Pr(s|S) = Pr(s|I). From S, the HMM transits to P with Pr(P|S) = Pr(p|s) before reaching the END state.

Comparing the HMM with eq. (3.19), one can see that all the corpus and detection probabilities are accounted for in the transition and emission probabilities, respectively. Optimizing T is then equivalent to finding the best (most likely) path through the HMM given the image observations using the Viterbi algorithm, which can be done in O(10^5) time and is significantly faster than the naive approach. We show in Fig. 3.18 (right-upper) examples of the top Viterbi paths that produce T* for four test images7.

6 Each verb, v, in NV will have 2 entries with the same value, one for each noun.
7 Complete results are available at http://www.umiacs.umd.edu/~yzyang/sentence_generateOut.html

Figure 3.18: Four test images (left) and results. (Right-upper): Sentence structure T* predicted using Viterbi and (Right-lower): Generated sentences. Words marked in red are considered to be incorrect predictions.

Note that the proposed HMM is suitable for generating sentences that contain the core components defined in T, which produces a sentence of the form NP-VP-PP; we will show in sec. 3.2.4 that this is sufficient for the task of generating sentences that describe images. For more complex sentences with more components, such as adjectives or adverbs, the HMM can easily be extended with similar computations derived from the corpus.

3.2.3.3 Sentence Generation

Given the selected sentence structure T = {n1, n2, v, s, p}, we generate sentences using the following strategy for each component: 1) We add in appropriate determiners and cardinals: the, an, a, CARD, based on the content of n1, n2 and s.
For example, if n1 = n2, we will use CARD=two and modify the nouns to be in the plural form. When several possible choices are available, a random choice is made that depends on the object detection scores: the is preferred when we are confident of the detections, while an, a is preferred otherwise. 2) We predict the most likely preposition to insert between the verbs and nouns, learned from the Gigaword corpus via Pr(p|v, n) during sentence generation. For example, our method will pick the preposition at between the verb sit and the noun table. 3) The verb v is converted to a form that agrees in number with the nouns detected. The present gerund form is preferred, such as eating, drinking, walking, as it conveys that an action is being performed in the image. 4) The sentence structure is therefore of the form NP-VP-PP, with variations when only one object or multiple detections of the same object are detected. A special case is when no objects are detected (below the predefined threshold). In this case no verbs can be predicted either, and we simply generate a sentence that describes the scene only, e.g. This is a coast, This is a field. Such sentences account for 20% of the entire UIUC testing dataset and are scored lower in our evaluation metrics (sec. 3.2.4.1), since they do not fully describe the image content in terms of the objects and actions. Some examples of sentences generated using this strategy are shown in Fig. 3.18.

3.2.4 Experiments

We performed several experiments to evaluate our proposed approach. The different metrics used for evaluation and comparison are also presented, followed by a discussion of the experimental results.

3.2.4.1 Sentence Generation Results

Three experiments are performed to evaluate the effectiveness of our approach. As a baseline, we simply generated T* directly from images without using the corpus. There are two variants of this baseline, with which we seek to determine whether listing all objects in the image is crucial for scene description. Tb1 is a baseline that uses all possible objects and the scene detected, Tb1 = {n1, n2, · · · , nm, s}, and the sentence will be of the form {Object 1, object 2 and object 3 are IN the scene.}, where we simply selected IN as the only admissible preposition. For the second baseline, Tb2, we limit the number of objects to just any two, Tb2 = {n1, n2, s}, and the sentence generated will be of the form {Object 1 and object 2 are IN the scene}. In the second experiment, we applied the HMM strategy described above but made all transition probabilities equiprobable, removing the effects of the corpus and producing a sentence structure which we denote as T*eq. The third experiment produces the full T* with transition probabilities learned from the corpus. All experiments were performed on the 100 unseen testing images from the UIUC dataset, and we used only the most likely (top) sentence generated for all evaluation.

We use two evaluation metrics as a measure of the accuracy of the generated sentences: 1) ROUGE-1 [103] precision scores and 2) Relevance and Readability of the generated sentences. ROUGE-1 is a recall based metric that is commonly used to measure the effectiveness of text summarization. In this work, the short descriptive sentence of an image can be viewed as summarizing the image content, and ROUGE-1 is able to capture how well this sentence can describe the image by comparing it with the human annotated ground truth of the UIUC dataset.
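ROUGE-1 reduces to clipped unigram overlap between a generated sentence and the reference annotations. The following sketch shows the core computation only; the official toolkit [103] adds stemming and further normalization that are omitted here, and the example sentences are invented.

from collections import Counter

def rouge_1(candidate, references):
    """Clipped unigram overlap between a candidate sentence and reference sentences.

    Returns the highest unigram precision and recall achieved against any single
    reference.  Tokenization here is a bare split(); the real evaluation also
    applies stemming and other normalization.
    """
    cand = Counter(candidate.lower().split())
    best_p, best_r = 0.0, 0.0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        overlap = sum(min(c, ref_counts[w]) for w, c in cand.items())
        best_p = max(best_p, overlap / max(sum(cand.values()), 1))
        best_r = max(best_r, overlap / max(sum(ref_counts.values()), 1))
    return best_p, best_r

# e.g. rouge_1("a person is riding a bicycle on the street",
#              ["a woman rides a bicycle down the street",
#               "a person on a bike on a city street"])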
Due to the short sentences generated, we did not consider other ROUGE metrics (ROUGE-2, ROUGE-SU4); these capture fluency, which is not an issue here.

Experiment               R1 (length)    Relevance       Readability
Baseline 1, T*b1         0.35 (8.2)     2.84 ± 1.40     3.64 ± 1.20
Baseline 2, T*b2         0.39 (6.8)     2.14 ± 1.13     3.94 ± 0.91
HMM no corpus, T*eq      0.42 (6.5)     2.44 ± 1.25     3.88 ± 1.18
Full HMM, T*             0.44 (6.9)     2.51 ± 1.30     4.10 ± 1.03
Human Annotation         0.68 (10.1)    4.91 ± 0.29     4.77 ± 0.42

Table 3.4: Sentence generation evaluation results with human gold standard. Human R1 scores are averaged over the 5 sentences using a leave-one-out procedure. Values in bold are the top scores.

A main shortcoming of using ROUGE-1 is that the generated sentences are compared only to a finite set of human labeled ground truth, which obviously does not capture all possible sentences that one can generate. In other words, ROUGE-1 does not take into account the fact that sentence generation is innately a creative process, and a better recall metric is to ask humans to judge these sentences. The second evaluation metric, Relevance and Readability, is therefore proposed as an empirical measure of how much the sentence 1) conveys the image content (relevance) in terms of the objects, actions and scene predicted, and 2) is grammatically correct (readability). We engaged the services of Amazon Mechanical Turk (AMT) workers to judge the generated sentences on a discrete scale ranging from 1–5 (low relevance/readability to high relevance/readability). The averaged ROUGE-1 scores, R1, and mean sentence lengths, together with the Relevance+Readability scores for all experiments, are summarized in Table 3.4. For comparison, we also asked the AMT workers to judge the ground truth sentences as well.

3.2.4.2 Discussion

The results reported in Table 3.4 reveal both the strengths and some shortcomings of the approach, which we briefly discuss here. Firstly, the R1 scores indicate that, from a purely summarization (unigram-overlap) point of view, the proposed approach of using the HMM to predict T* achieves the best results compared to all other approaches, with R1 = 0.44. This means that our sentences are the closest in agreement with the human annotated ground truth, correctly predicting the sentence structure components. In addition, sentences generated by T* are also succinct, with an average length of 6.9 words per sentence. However, we are still some way off the human gold standard, since we do not predict other parts of speech such as adjectives and adverbs. Given this fact, our proposed approach's performance is comparable to other state-of-the-art summarization work in the literature [104].

Next, we consider the Relevance+Readability metrics based on human judges. Interestingly, the first baseline, T*b1, is considered the most relevant description of the image and the least readable at the same time. This is most likely because this recall-oriented strategy will almost certainly describe some of the objects, while the lack of any verb description, together with longer sentences averaging 8.2 words, makes it less readable. It is also possible that humans tend to penalize irrelevant objects less than missing objects, and further evaluations are necessary to confirm this. Since T*b2 is limited to two objects just like the proposed HMM, it is a more suitable baseline for comparison. Clearly, the results show that adding the HMM to predict the optimal sentence structure increases the relevance of the produced sentence.
Finally, in terms of readability, T ∗ generates the most readable sentences, and this is achieved by leveraging on the corpus to guide our predictions of the most reasonable nouns, verbs, scenes and prepositions that agree with the detections in the image. 81 Chapter 4: “Can’t Make an Omelette without Breaking Eggs”: De- tection of Manipulation Action Consequences 4.1 Introduction Visual recognition is the process through which intelligent agents associate a visual observation to a concept from their memory. In most cases, the concept either corresponds to a term in natural language, or an explicit definition in natural language. Most research in Computer Vision has focused on two concepts: objects and actions; humans, faces and scenes can be regarded as special cases of objects. Object and action recognition are indeed crucial since they are the fundamental building blocks for an intelligent agent to semantically understand its observations. When it comes to understanding actions of manipulation, the movement of the body (especially the hands) is not a very good characteristic feature. There is great variability in the way humans carry out such actions. It has been realized that such actions are better described by involving a number of quantities. Besides the motion trajectories, the objects involved, the hand pose, and the spatial relations between the body and the objects under influence, provide information about the action. In this work we want to bring attention to another concept, the action consequence. 82 It describes the transformation of the object during the manipulation. For example during a CUT or a SPLIT action an object is divided into segments, during a GLUE or a MERGE action two objects are combined into one, etc. The recognition and understanding of human manipulation actions recently has attracted the attention of Computer Vision and Robotics researchers because of their critical role in human behavior analysis. Moreover, they naturally relate to both, the movement involved in the action and the objects. However, so far researchers have not considered that the most crucial cue in describing manipulation actions is actually not the movement nor the specific object under influence, but the object centric action consequence. We can come up with examples, where two actions involve the same tool and same object under influence, and the motions of the hands are similar, for example in “cutting a piece of meat” vs. “poking a hole into the meat”. Their consequences are different. In such cases, the action consequence is the key in differentiating the actions. Thus, to fully understand manipulation actions, the intelligent system should be able to determine the object centric consequences. Few researchers have addressed the problem of action consequences due to the difficulties involved. The main challenge comes from the monitoring process, which calls for the ability to continuously check the topological and appearance changes of the object-under-manipulation. Previous studies of visual tracking have considered challenging situations, such as non-rigid objects [3], adaptive appearance model [105], and tracking of multiple objects with occlusions [106], but none can deal with the difficulties involved in detecting the possible changes on objects during 83 manipulation. In this chapter, for the first time, a system is implemented to conquer these difficulties and eventually achieve robust action consequence detection. 
4.2 Why Consequences and Fundamental Types Recognizing human actions has been an active research area in Computer Vision [107]. Several excellent surveys on the topic of visual recognition are available ( [108], [109]). Most work on visual action analysis has been devoted to the study of movement and change of posture, such as walking, running etc. The dominant approaches to the recognition of single actions compute as descriptors statistics of spatio-temporal interest points ( [110], [111]) and flow in video volumes, or represent short actions by stacks of silhouettes ( [112], [113]). Approaches to more complex, longer actions employ parametric approaches, such as Hidden Markov Models [114], Linear Dynamical Systems [115] or Non-linear Dynamical Systems [116], which are defined on extracted features. There are a few recent studies on human manipulation actions ( [117], [11], [118]), but they do not consider action consequences for the interpretation of manipulation actions. Works like [119] emphasize the role of object perception in action or pose recognition, but they focus on object labels, not object- centric consequences. How do humans understand, recognize, and even replicate manipulation ac- tions? Psychological studies on human manners ( [120] etc.) have pointed out the importance of manipulation action consequences for both understanding human cog- nition and intelligent system research. Actions, by their very nature, are goal 84 oriented. When we perform an action, we always have a goal in mind, and the goal affects the action. Similarly, when we try to recognize an action, we also keep a goal in mind. The close relation between the movement during the action and goal is reflected also in language. For example, the word “CUT” denotes both the action in which hands move up and down or in and out with sharp bladed tools, and the consequence of the action, namely that the object is separated. Very often, we can recognize an action purely by the goal satisfaction, and even neglect the motion or the tools used. For example, we may observe a human carry out movement with a knife, that is ”up and down”, but if the object remains as one whole, we won’t draw the conclusion that a “CUT” action has been performed. Only when the goal of the recognition process, here “DIVIDE”, is detected, the goal satisfaction is reached and a “CUT” action is confirmed. An intelligent system should have the ability to detect the consequences of manipulation actions, in order to check the goal of actions. In addition, experiments conducted in neuronscience [9] show that a monkey’s mirror neuron system fires when a hand/object interaction is observed, and it will not fire when a similar movement is observed without hand/object interaction. Re- cent experiments [10] further showed that the mirror neuron regions responding to the sight of actions responded more during the observation of goal-directed actions than similar movements not directed at goals. These evidences support the idea of goal matching, as well as the crucial role of action consequence in the understanding of manipulation actions. Taking an object-centric point of view, manipulation actions can be classified 85 into six categories according how the object is transformed during the manipulation, or in other words what consequence the action has on the object. These categories are: DIVIDE, ASSEMBLE, CREATE, CONSUME, TRANSFER, and DEFORM. 
Each of these categories is denoted by a term that has a clear semantic meaning in natural language, given as follows:

• DIVIDE: one object breaks into two objects, or two attached objects break the attachment;
• ASSEMBLE: two objects merge into one object, or two objects build an attachment between them;
• CREATE: an object is brought to, or emerges in, the visual space;
• CONSUME: an object disappears from the visual space;
• TRANSFER: an object is moved from one location to another location;
• DEFORM: an object has an appearance change.

To describe these action categories we need a formalism. We use the visual semantic graph (VSG), inspired by the work of Aksoy et al. [14]. This formalism takes as input computed object segments, their spatial relationship, and their temporal relationship over consecutive frames. To provide the symbols for the VSG, an active monitoring process (discussed in sec. 4.4) is required for the purpose of (1) tracking the object to obtain temporal correspondence, and (2) segmenting the object to obtain its topological structure and appearance model. This active monitoring (consisting of segmentation and tracking) is related to studies on active segmentation [121] and stochastic tracking ( [122] etc.).

4.3 Visual Semantic Graph (VSG)

To define object-centric action consequences, a graph representation is used. Every frame in the video is described by a Visual Semantic Graph (VSG), which is represented by an undirected graph G(V, E, P). The vertex set V represents the set of semantically meaningful segments, and the edge set E represents the spatial relations between any two segments. Two segments are connected when they share parts of their borders, or when one of the segments is contained in the other. If two nodes v1, v2 ∈ V are connected, E(v1, v2) = 1; otherwise, E(v1, v2) = 0. In addition, every node v ∈ V is associated with a set of properties P(v) that describes the attributes of the segment. This set of properties provides additional information to discriminate the different categories, and in principle many properties are possible. Here we use location, shape, and color.

We need to compute the changes of the object over time. In our formulation this is expressed as the change in the VSG. At any time instance t, we consider two consecutive VSGs: the VSG at time t−1, denoted as Ga(Va, Ea, Pa), and the VSG at time t, denoted as Gz(Vz, Ez, Pz). We then define the following four consequences, where → denotes temporal correspondence between two vertices and ↛ denotes no correspondence:

• DIVIDE: {∃ v1 ∈ Va; v2, v3 ∈ Vz | v1 → v2, v1 → v3} or {∃ v1, v2 ∈ Va; v3, v4 ∈ Vz | Ea(v1, v2) = 1, Ez(v3, v4) = 0, v1 → v3, v2 → v4}   Condition (1)
• ASSEMBLE: {∃ v1, v2 ∈ Va; v3 ∈ Vz | v1 → v3, v2 → v3} or {∃ v1, v2 ∈ Va; v3, v4 ∈ Vz | Ea(v1, v2) = 0, Ez(v3, v4) = 1, v1 → v3, v2 → v4}   Condition (2)
• CREATE: {∀ v ∈ Va; ∃ v1 ∈ Vz | v ↛ v1}   Condition (3)
• CONSUME: {∀ v ∈ Vz; ∃ v1 ∈ Va | v ↛ v1}   Condition (4)

While the above actions can be defined purely on the basis of topological changes, there are no such changes for TRANSFER and DEFORM. Therefore, we have to define them through changes in property. In the following definitions, P^L represents properties of location, and P^S represents properties of appearance (shape, color, etc.).

• TRANSFER: {∃ v1 ∈ Va; v2 ∈ Vz | P^L_a(v1) ≠ P^L_z(v2)}   Condition (5)
• DEFORM: {∃ v1 ∈ Va; v2 ∈ Vz | P^S_a(v1) ≠ P^S_z(v2)}   Condition (6)

Figure 4.1: Graphical illustration of the changes for Condition (1-6).
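Conditions (1)-(6) translate almost directly into code once the monitoring process of sec. 4.4 has supplied segments and cross-frame correspondences. The following Python sketch illustrates the logic only; the graph encoding, the correspondence set and the two change predicates (standing in for the P^L and P^S comparisons) are assumed inputs, and their thresholds are application choices rather than part of the formalism.

def detect_consequence(Ga, Gz, corr, loc_changed, app_changed):
    """Classify the change between two consecutive VSGs Ga (time t-1) and Gz (time t).

    Ga, Gz : dicts with keys 'V' (set of segment ids) and 'E' (set of frozenset edges)
    corr   : set of (va, vz) pairs, the temporal correspondences from the monitor
    loc_changed(va, vz), app_changed(va, vz) : predicates comparing the location
        property P^L and the appearance property P^S of corresponding segments
    """
    succ = {va: {z for a, z in corr if a == va} for va in Ga['V']}
    pred = {vz: {a for a, z in corr if z == vz} for vz in Gz['V']}

    def connected(G, u, v):
        return frozenset((u, v)) in G['E']

    pairs = [(a1, a2, z1, z2) for (a1, z1) in corr for (a2, z2) in corr
             if a1 != a2 and z1 != z2]

    divide = any(len(s) >= 2 for s in succ.values()) or any(
        connected(Ga, a1, a2) and not connected(Gz, z1, z2)
        for a1, a2, z1, z2 in pairs)                                  # Condition (1)
    assemble = any(len(p) >= 2 for p in pred.values()) or any(
        not connected(Ga, a1, a2) and connected(Gz, z1, z2)
        for a1, a2, z1, z2 in pairs)                                  # Condition (2)
    create = any(not pred[vz] for vz in Gz['V'])                      # Condition (3)
    consume = any(not succ[va] for va in Ga['V'])                     # Condition (4)

    # Topological changes are checked first (sec. 4.3); only if none is found
    # do we fall back to the property-based Conditions (5) and (6).
    for name, flag in (('DIVIDE', divide), ('ASSEMBLE', assemble),
                       ('CREATE', create), ('CONSUME', consume)):
        if flag:
            return name
    if any(loc_changed(a, z) for a, z in corr):
        return 'TRANSFER'                                             # Condition (5)
    if any(app_changed(a, z) for a, z in corr):
        return 'DEFORM'                                               # Condition (6)
    return None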
A graphical illustration for Condition (1-6) is shown in Fig. 4.1. Sec. 4.4 describes how we find the primitives used in the graph. A new active segmentation and tracking method is introduced to 1) find correspondences (→) between Va and Vz; 2) monitor location property PL and appearance property P S in the VSG. 88 The procedure for computing action consequences, first decides on whether there is a topological change between Ga and Gz. If yes, the system checks whether Condition (1) to Condition (4) are fulfilled and returns the corresponding con- sequence. If no, the system then checks whether Condition (5) or Condition (6) is fulfilled. If both of them are not met, no consequence is detected. 4.4 Active Segmentation and Tracking Previously, researchers have treated segmentation and tracking as two differ- ent problems. Here we propose a new method combining the two tasks to obtain the information necessary to monitor the objects under influence. Our methods com- bines stochastic tracking [122] with a fixation based active segmentation [121]. The tracking module provides a number of tracked points. The locations of these points are used to define an area of interest and a fixation point for the segmentation, and the color in their immediate surroundings are used in the data term of the segmen- tation module. The segmentation module segments the object, and based on the segmentation, updates the appearance model for the tracker. Fig 4.2 illustrates the method over time, which is a dynamically closed-loop process. We next describe the attention based segmentation (sec. 4.4.1 - 4.4.4), and then the segmentation guided tracking (sec. 4.4.5). The proposed method meets two challenging requirements, necessary to detect action consequences: 1) the system is able to track and segment objects when the shape or color (appearance) changes; 2) the system is also able to track and segment 89 Figure 4.2: Flow chart of the proposed active segmentation and tracking method for object monitoring. objects when they are divided into pieces. Experiments in sec. 4.5.1 show that our method can handle these requirements, while systems implementing independently tracking and segmentation cannot. 4.4.1 The Attention Field The idea underlying our approach is, that first a process of visual attention se- lects an area of interest. Segmentation then is considered the process that separates the area selected by visual attention from background by finding closed contours that best separate the regions. The minimization uses a color model for the data term and edges in the regularization term. To achieve a minimization that is very robust to the length of the boundary, edges are weighted with their distance from the fixation center. 90 Visual attention, the process of driving an agent’s attention to a certain area, is based on both bottom-up processes defined on low level visual features, and top- down processes influenced by the agent’s previous experience [123]. Inspired by the work of Yang et al. [124], instead of using a single fixation point in the active segmentation [121], here we use a weighted sample set S = {(s(n), pi(n))|n = 1 . . . N} to represent the attention field around the fixation point (N = 500 in practice). Each sample consists of an element s from the set of tracked points and a corresponding discrete weight pi where ∑N n=1 pi (n) = 1. Generally, any appearance model can be used to represent the local visual information around each point. 
We choose to use a color histogram with a dynamic sampling area defined by an ellipse. To compute the color distribution, every point is represented by an ellipse, s = {x, y, ẋ, ẏ, Hx, Hy, Ḣx, Ḣy}, where x and y denote the location, ẋ and ẏ the motion, Hx, Hy the lengths of the half axes, and Ḣx, Ḣy the changes in the axes.

4.4.2 Color Distribution Model

To make the color model invariant to various textures or patterns, a color distribution model is used. A function h(x_i) is defined to create a color histogram; it assigns one of the m bins to a given color at location x_i. To make the algorithm less sensitive to lighting conditions, the HSV color space is used with less sensitivity in the V channel (8 × 8 × 4 bins). The color distribution for each fixation point s^{(n)} is computed as

p_{s^{(n)}}(u) = \gamma \sum_{i=1}^{I} k(\|y - x_i\|)\, \delta[h(x_i) - u],     (4.1)

where u = 1 . . . m, δ(·) is the Kronecker delta function, and γ is the normalization term γ = 1 / \sum_{i=1}^{I} k(\|y - x_i\|). k(·) is a weighting function designed from the intuition that not all pixels in the sampling region are equally important for describing the color model. Specifically, pixels that are farther away from the point are assigned smaller weights,

k(r) = \begin{cases} 1 - r^2 & \text{if } r < a \\ 0 & \text{otherwise,} \end{cases}

where the parameter a is used to adapt the size of the region, and r is the distance from the fixation point. By applying the weighting function, we increase the robustness of the color distribution by weakening the influence of boundary pixels, which possibly belong to the background or are occluded.

4.4.3 Weights of the Tracked Point Set

In the following weighted graph cut approach, every sample is weighted by comparing its color distribution with that of the fixation point. Initially a fixation point is selected; later the fixation point is computed as the center of the tracked point set. Let us call the distribution at the fixation point q, and the histogram of the nth tracked point p_{s^{(n)}}. In assigning weights π^{(n)} to the tracked points we want to favor points whose color distribution is similar to that of the fixation point. We use the Bhattacharyya coefficient ρ[p, q] = \sum_{u=1}^{m} \sqrt{p(u)\, q(u)}, with m the number of bins, to weigh points by a Gaussian with variance σ (σ = 0.2 in practice), and define π^{(n)} as

\pi^{(n)} = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{d^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1 - \rho[p_{s^{(n)}}, q]}{2\sigma^2}}.     (4.2)

4.4.4 Weighted Graph Cut

The segmentation is formulated as a minimization that is solved using graph cut. The unary terms are defined on the tracked points on the basis of their color, and the binary terms are defined on all points on the basis of edge information. To obtain the edge information, in each frame we compute a probabilistic edge map I_E using the Canny edge detector. Consider every pixel x ∈ X in this edge map as a node in a graph. Denoting the set of all the edges connecting neighboring nodes in the graph as Ω, and using the label set l = {0, 1} to indicate whether a pixel x is "inside" (l_x = 0) or "outside" (l_x = 1), we need to find a labeling f(X) ↦ l that minimizes the energy function

Q(f) = \sum_{x \in X} U_x(l_x) + \lambda \sum_{(x,y) \in \Omega} V_{x,y}\, \delta(l_x, l_y).     (4.3)

V_{x,y} is the cost of assigning different labels to neighboring pixels x and y, which we define as V_{x,y} = e^{-\eta I_{E,xy}} + k, with

\delta(l_x, l_y) = \begin{cases} 1 & \text{if } l_x \neq l_y \\ 0 & \text{otherwise,} \end{cases}

λ = 1, η = 1000, k = 10^{-16}, and I_{E,xy} = (I_E(x)/R_x + I_E(y)/R_y)/2, where R_x, R_y are the Euclidean distances between x, y and the center of the tracked point set S_t. We use them as weights to make the segmentation robust to the length of the contours.
U_x(l_x) is the cost of assigning label l_x to pixel x. In our system, we have a set of points S_t, and for each sample s^{(n)} there is a weight π^{(n)}. The weight itself indicates the likelihood that the area around that fixation point belongs to the "inside" of the object. It is then straightforward to assign weights to the pixels s^{(n)}, which are the tracked points, as follows:

U_x(l_x) = \begin{cases} N \pi^{(n)} & \text{if } l_x = 1 \\ 0 & \text{otherwise.} \end{cases}

We assume that pixels on the boundary of a frame are "outside" of the object, and assign to them a large weight W = 10^{10}:

U_x(l_x) = \begin{cases} W & \text{if } l_x = 0 \\ 0 & \text{otherwise.} \end{cases}

Using this formulation, we run a graph cut algorithm [125] on each frame. Fig. 4.3(a) illustrates the procedure on a texture-rich natural image from the Berkeley segmentation dataset [126].

Figure 4.3: Upper: (1) Tracked point sampling and filtering; (2) Weighted graph cut. Lower: Segmentation with different initial fixations. Green cross: initial fixation.

Two critical limitations of the previous active segmentation method [121] in practice are: 1) the segmentation performance varies largely under different initial fixation points; 2) the segmentation performance is also strongly affected by texture edges, which often leads to a segmentation of object parts. Fig. 4.3(b) shows that our proposed segmentation method is robust to the choice of initial fixation point, and only weakly affected by texture edges.

4.4.5 Active Tracking

At the very beginning of the monitoring process, a Gaussian sampling with mean at the initial fixation point and variances σx, σy is used to generate the initial point set S0. When a new frame comes in, the point set is propagated through a stochastic tracking paradigm:

s_t = A s_{t-1} + w_{t-1},     (4.4)

where A denotes the deterministic and w_{t-1} the stochastic component. In our implementation, we have considered a first-order model for A, which assumes that the object is moving with constant velocity. The reader is referred to [127] for details. The complete algorithm is given in Algorithm 1.

4.4.6 Incorporating Depth and Optical Flow

It is easy to extend our algorithm to incorporate depth (for example from Kinect) or image motion flow information. Depth information can be used in a straightforward way during two crucial steps. 1) As described in sec. 4.4.2, we can add depth information as another channel in the distribution model. In preliminary experiments we used 8 bins for the depth, to obtain in RGBD space a model with 8 × 8 × 4 × 8 bins. 2) Depth can be used to achieve cleaner edge maps, I_E, in the segmentation step (sec. 4.4.4).

Algorithm 1 Active tracking and segmentation
Require: Given the tracked point set S_{t-1} and the target model q_{t-1}, perform the following steps:
1. SELECT N samples from the set S_{t-1} with probability π^{(n)}_{t-1}. Fixation points with a high weight may be chosen several times, leading to identical copies, while others with relatively low weights may not be chosen at all. Denote the resulting set as S′_{t-1}.
2. PROPAGATE each sample from S′_{t-1} by the linear stochastic model of eq. (4.4). Denote the new set as S_t.
3. OBSERVE the color distributions for each sample of S_t using eq. (4.1). Weigh each sample using eq. (4.2).
4. SEGMENTATION using the weighted sample set: apply the weighted graph cut algorithm described in sec. 4.4.4 and obtain the segmented object area M.
5. UPDATE the target distribution q_{t-1} with the area M to obtain the new target distribution q_t.
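Algorithm 1 is essentially a CONDENSATION-style particle loop wrapped around the weighted graph cut. The Python sketch below covers steps 1-3 only (SELECT, PROPAGATE, OBSERVE); the histogram function, the transition matrix A and the noise level are assumed inputs, and the segmentation and model-update steps are deliberately left out.

import numpy as np

def bhattacharyya(p, q):
    # rho[p, q] = sum_u sqrt(p(u) q(u)) over histogram bins; eq. (4.2) uses d^2 = 1 - rho.
    return np.sum(np.sqrt(p * q))

def track_step(samples, weights, q_model, A, noise_std, histogram, sigma=0.2):
    """One iteration of Algorithm 1, steps 1-3.

    samples   : (N, d) array of ellipse states (x, y, x_dot, y_dot, Hx, Hy, ...)
    weights   : (N,) normalized weights pi^(n)
    q_model   : target color histogram q_{t-1}
    A         : (d, d) first-order (constant-velocity) transition matrix, eq. (4.4)
    histogram : callable mapping a sample state to its kernel-weighted HSV
                histogram p_{s^(n)} of eq. (4.1)
    """
    N = len(samples)
    # 1. SELECT: resample states in proportion to their weights.
    idx = np.random.choice(N, size=N, p=weights)
    selected = samples[idx]
    # 2. PROPAGATE: s_t = A s_{t-1} + w_{t-1}  (eq. 4.4).
    propagated = selected @ A.T + np.random.randn(*selected.shape) * noise_std
    # 3. OBSERVE: weight each sample by its similarity to the target model, eq. (4.2).
    d2 = np.array([1.0 - bhattacharyya(histogram(s), q_model) for s in propagated])
    new_w = np.exp(-d2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    new_w /= new_w.sum()
    # Steps 4-5 (weighted graph cut and target-model update) consume these
    # weights and the segmented region M, and are omitted here.
    return propagated, new_w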
Optical flow can be incorporated to provide cues for the system to predict the movement of edges to be used for the segmentation step in the next iteration, and the movement of the points in the tracking step. We performed some experiments using the optical flow estimation method proposed by Brox [128] and the improved implementation by Liu [129]. Optical flow was used in the segmentation by first predicting the contour of the object in the next frame, and then fusing it with the next frame’s edge map. Fig. 4.4(a) shows an example of an edge map improved by optical flow. Optical flow was incorporated into tracking by replacing the first order velocity components for each tracked point in matrix A (eq. 4.4) by its flow component. Fig. 4.4(b) shows that the optical flow drives the tracked points to move along the flow vectors into the next frame. (a) (b) Figure 4.4: (a): Incorporating optical flow into segmentation. (b): Incorporating optical flow into tracking. 97 4.5 Experiments 4.5.1 Deformation and Division To show that our method can segment challenging cases, we first demonstrate its performance for the case of deforming and dividing objects. Fig. 4.5(a) shows results for a sequence with the main object deforming, and Fig. 4.5(b) for a syn- thetic sequence with the main object dividing. The ability to handle deformations comes from the updating of the target model using the segmentation of previous frames. The ability to handle division comes from the tracked point set that is used to represent the attention field (sec. 4.4.1), which guides the weighted graph cut algorithm (sec. 4.4.4). 4.5.2 The MAC 1.0 Dataset To quantitatively test our method, we collected a dataset of several RGB+Depth image sequences of humans performing different manipulation actions. In addition, several sequences from other publicly available datasets ( [14], [130] and [131]) were included to increase the variability and make it more challenging. Since the two action consequences CREATE and CONSUME (sec.4.2) relate to the existence of the object and would require a higher level attention mechanism, which is out of this chapter’s scope, we did not consider them. For the other four consequences, six sequences were collected each to make the first Manipulation Action Consequence 98 (a) (b) Figure 4.5: (a): Deformation Invariance: upper: state-of-the-art appearance based tracking [6]; middle: tracking without updating target model; lower: updating target model. (b): Division Invariance: synthetic cell division sequence. 99 (MAC 1.0) dataset.1. 4.5.3 Consequence Detection on MAC 1.0 We first evaluated the method’s ability in detecting the various consequences. Consequences happen in an event based manner. In our description, a consequence is detected using the VSG graph at a point in time, if between two consecutive image frames one of the conditions listed in sec. 4.3 is met. For example, a consequence is detected for the case of DIVIDE, when one segment becomes two segments in the next frame (Fig. 4.6), or for the case of DEFORM, when one appearance model changes to another (Fig. 4.8). We obtained ground truth by asking people not familiar with the purpose to label the sequences in MAC 1.0. Fig. 4.6, 4.7, 4.8 show typical example active segmentation and tracking, the VSG graph, and the corresponding measures used to identify the different action consequences, as well as the final detection result along the time-line are illustrated. 
Specifically, for DIVIDE we monitor the change in the number of segments, for ASSEMBLE we monitor the minimum Euclidean distance between the contours of segments, for DEFORM we monitor the change of appearance (color histogram and shape context [132]) of the segment, and for TRANSFER we monitor the velocity of the object. Each of the measurements is normalized to the range of [0, 1] for the ROC analysis. The detection is evaluated, by counting the correct detections over the sequence. For example, for the case of DIVIDE, at any point in time we have either the detection, “not divided” or “divided”. For the case of ASSEMBLE 1The dataset is available at www.umiacs.umd.edu/~yzyang. 100 , we have either the detection “two parts assembled” or “nothing assembled”, and for DEFORM, we have either “deformed” or “nor deformed”. The ROC curves obtained are shown in Fig. 4.9. The graphs indicate that our method is able to correctly detect most of the consequences. Several failures point out the limitations of our method as well. For example, for the PAPER-JHU sequence the method has errors in detecting DIVIDE, because the part that was cut out, connects visually with the rest of the chapter. For the CLOSE-BOTTLE sequence our method fails for ASSEMBLE because the small bottle cap is occluded by the hand. However, our method detects that an ASSEMBLE event happened after the hand move away. Figure 4.6: “Division” detection on “cut cucumber” sequence. Upper row: Original sequence with segmentation and tracking; Middle and lower right: VSG representa- tions; Lower left: Division consequences detection. 101 Figure 4.7: “Assemble” detection on “make sandwich 1” sequence; 1st row: Orig- inal sequence with segmentation and tracking; 2nd row: VSG representation; 3rd row: Distance between each two segments (red line: bread and cheese, magenta line: bread and meat, blue line: cheese and meat; 4th row: Assemble consequence detection. Figure 4.8: “Deformation” detection on “close book 1” sequence; 1st row: Original sequence with segmentation and tracking; 2nd row: VSG representation; 3rd row: appearance description (here color histogram) of each segment; 4th row: measure- ment of appearance change; 5th row: Deformation consequence detection. 102 (a) (b) (c) (d) Figure 4.9: ROC curve of each sequence by categories: (a) TRANSFER, (b) DE- FORM, (c) DIVIDE, and (d) ASSEMBLE. 4.5.4 Video Classification on MAC 1.0 We also evaluated our method on the problem of classification, although the problem of consequence detection is quite different from the problem of video classi- fication. We compared our method with the state-of-the-art STIP + Bag of Words + classification (SVM or Naive Bayes). The STIP features for each video in the MAC 103 1.0 dataset were computed using the method described in [110]. For classification we used a bag of words + SVM and a Naive Bayes method. The dictionary codebook size was 1000, and a polynomial kernel SVM with a leave-one-out cross validation setting was used. Fig. 4.10 shows that our method dramatically outperforms the typical STIP + Bag of Words + SVM and the Naive Bayes learning methods. How- ever, this does not come as a surprise. The different video sequences in an action consequence class contain different objects and different actions and thus different visual features, and are therefore not well suited for standard classification. 
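Since each monitored measurement is normalized to [0, 1], the ROC analysis itself is a simple threshold sweep. The sketch below assumes per-frame scores and binary ground-truth labels as inputs; the actual per-category measurements (segment-count change, contour distance, appearance change, velocity) come from the monitoring process described above.

import numpy as np

def roc_points(scores, labels, num_thresholds=100):
    """ROC points from per-frame consequence measurements.

    scores : normalized measurements in [0, 1], one value per frame
    labels : binary ground-truth array (1 where annotators marked the consequence)
    Returns arrays of false-positive rates and true-positive rates.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    fpr, tpr = [], []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        detected = scores >= t
        tp = np.sum(detected & labels)
        fp = np.sum(detected & ~labels)
        tpr.append(tp / max(labels.sum(), 1))
        fpr.append(fp / max((~labels).sum(), 1))
    return np.array(fpr), np.array(tpr)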
On the other hand, our method has been specifically designed for the detection of manipu- lation action consequences all the way from low-level signal processing through the mid-level semantic representation to high-level reasoning. Moreover, different from a learning based method, it does not rely on training data. After all, the method stems from the insight of manipulation action consequences. Figure 4.10: Video classification performance comparison. 104 Chapter 5: The Syntax: A Syntactical Grammar for Understanding Human Manipulation Actions 5.1 Introduction Cognitive systems that interact with humans must be able to interpret actions. Here we are concerned with manipulation actions. These are actions performed by agents (humans or robots) on objects, which result in some physical change of the object. There has been much work recently on action recognition, with most stud- ies considering short lived actions, where the beginning and end of the sequence is defined. Most efforts have focused on two problems of great interest to the study of perception: the recognition of movements and the recognition of associated objects. However, the more complex an action, the less reliable individual perceptual events are for the characterization of actions. Thus, the problem of interpreting manipula- tion actions involves many more challenges than simply recognizing movements and objects, due to the many ways that humans can perform them. Since perceptual events do not suffice, how do we determine the beginning and end of action segments, and how do we combine the individual segments into longer ones corresponding to a manipulation action? An essential component in the 105 description of manipulations is the underlying goal. The goal of a manipulation action is the physical change induced on the object. To accomplish it, the hands must perform a sequence of sub actions on the object, such as when the hand grasps or releases the object, or when the hand changes the grasp type during a movement. Centered around this idea, we develop a grammatical formalism for parsing and interpreting action sequences, and we also develop vision modules to obtain from dynamic imagery the symbolic information used in the grammatical structure. Our formalism for describing manipulation actions uses a structure similar to natural language. What do we gain from this formal description of action? This is equal to asking what one gains from a formal description of language. Chomsky’s contribution to language was the formal description of language through his gen- erative and transformational grammar [22]. This revolutionized language research, opened up new roads for its computational analysis and provided researchers with common, generative language structures and syntactic operations on which lan- guage analysis tools were built. Similarly, a grammar for action provides a common framework of the syntax and semantics of action, so that basic tools for action un- derstanding can be built. Researchers can then use these tools when developing action interpretation systems, without having to start from scratch. The input into our system for interpreting manipulation actions is perceptual data, specifically sequences of images and depth maps. Therefore, a crucial part of our system is the vision process, which obtains atomic symbols from perceptual data. In Section 5.3, we introduce an integrated vision system with attention, seg- mentation, hand tracking, grasp classification, and action recognition. 
The vision 106 processes produce a set of symbols: the “Subject”, “Action”, and “Object” triplets, which serve as input to the reasoning module. At the core of our reasoning module is the manipulation action context-free grammar (MACFG). This grammar comes with a set of generative rules and a set of parsing algorithms. The parsing algorithms use two main operations – “construction” and “destruction” – to dynamically parse a sequence of tree (or forest) structures made up from the symbols provided by the vision module. The sequence of semantic tree structures could then be used by the cognitive system to perform reasoning and prediction. Figure 5.1 shows the flow chart of our cognitive system. Figure 5.1: Overview of the manipulation action understanding system, including feedback loops within and between some of the modules. The feedback is denoted by the dotted arrow. 5.2 Related Work The problem of human activity recognition and understanding has attracted considerable interest in computer vision in recent years. Both visual recognition 107 methods and nonvisual methods using motion capture systems [107,133] have been used. [108] [109], and [134] provide surveys of the former. There are many ap- plications for this work in areas such as human computer interaction, biometrics, and video surveillance. Most visual recognition methods learn the visual signature of each action from many spatio-temporal feature points (e.g, [110, 111, 135, 136]). Work has focused on recognizing single human actions like walking, jumping, or run- ning ( [113, 137]). Approaches to more complex actions have employed parametric models such as hidden Markov models [114] to learn the transitions between image frames (e.g, [14,115,116,138]). The problem of understanding manipulation actions is also of great interest in robotics, which focuses on execution. Much work has been devoted to learning from demonstration [139], such as the problem of a robot with hands learning to manipulate objects by mapping the trajectory observed for people performing the action to the robot body. These approaches have emphasized signal to signal map- ping and lack the ability to generalize. More recently, within the domain of robot control research, [140] have used temporal logic for hybrid controller design, and later [141] suggested a grammatical formal system to represent and verify robot control policies. [142] and [143] created a library of manipulation actions through semantic object-action relations obtained from visual observation. There have also been many syntactic approaches to human activity recogni- tion that use the concept of context-free grammars, because they provide a sound theoretical basis for modeling structured processes. [144] used a grammar to recog- nize disassembly tasks that contain hand manipulations. [145] used the context-free 108 grammar formalism to recognize composite human activities and multi-person in- teractions. It was a two-level hierarchical approach in which the lower level was composed of hidden Markov models and Bayesian networks while the higher-level interactions were modeled by CFGs. More recent methods have used stochastic grammars to deal with errors from low-level processes such as tracking [146, 147]. This work showed that grammar-based approaches can be practical in activity recog- nition systems and provided insight for understanding human manipulation actions. However, as mentioned, thinking about manipulation actions solely from the view- point of recognition has obvious limitations. 
In this work, we adopt principles from CFG-based activity recognition systems, with extensions to a minimalist grammar that accommodates not only the hierarchical structure of human activity, but also human-tool-object interactions. This approach lets the system serve as the core parsing engine for manipulation action interpretation. [148] suggested that a minimalist generative structure, similar to the one in human language, also exists for action understanding. [17] introduced a minimalist grammar of action, which defines the set of terminals, features, non-terminals and production rules for the grammar in the sensorimotor domain. However, this was a purely theoretical description. The first implementation used only objects as sensory symbols [18]. Then [31] proposed a minimalist set of atomic symbols to describe the movements in manipulation actions. In the field of natural language understanding, which traces back to the 1960s, [149] proposed the Conceptual Dependency the- ory [149] to represent content inferred from natural language input. In this theory, a sentence is represented as a series of diagrams representing both mental and phys- 109 ical actions that involve agents and objects. These actions are built from a set of primitive acts, which include atomic symbols like GRASP and MOVE. [150] have also discussed the relationship between languages and motion control. Here we extend the minimalist action grammar of [17] to dynamically parse the observations by providing a set of operations based on a set of context-free grammar rules. Then we provide a set of biologically inspired visual processes that compute from the low-level signals the symbols used as input to the grammar in the form of (Subject, Action, Object). By integrating the perception modules with the reasoning module, we obtain a cognitive system for human manipulation action understanding. 5.3 A Cognitive System For Understanding Human Manipulation Actions In this section, we first describe the Manipulation Action Context-Free Gram- mar and the parsing algorithms based on it. Then we discuss the vision methods: the attention mechanism, the hand tracking and action recognition, the object mon- itoring and recognition, and the action consequence classification. 5.3.1 A Context-Free Manipulation Action Grammar Our system includes vision modules that generate a sequence of “Subject” “Action” “Patient” triplets from the visual data, a reasoning module that takes in the sequence of triplets and builds them into semantic tree structures. The binary 110 tree structure represents the parsing trees, in which leaf nodes are observations from the vision modules and the non-leaf nodes are non-terminal symbols. At any given stage of the process, the representation may have multiple tree structures. For implementation reasons, we use a DUMMY root node to combine multiple trees. Extracting semantic trees from observing manipulation actions is the target of the cognitive system. Before introducing the parsing algorithms, we first introduce the core of our reasoning module: the Manipulation Action Context-Free Grammar (MACFG). In formal language theory, a context-free grammar is a formal grammar in which every production rule is of the form V → w, where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). 
The basic recursive structure of natural languages, the way in which clauses nest inside other clauses, and the way in which lists of adjectives and adverbs are followed by nouns and verbs, is described exactly by a context-free grammar. Similarly for manipulation actions, every complex activity is built of smaller blocks. Using linguistics notation, a block consists of a “Subject”, “Action” and “Patient” triplet. Here a “Subject” can be either a hand or an object, and the same holds for the “Patient”. Furthermore, a complex activity also has a basic recursive structure, and can be decomposed into simpler actions. For example, the typical manipulation activity “sawing a plank” is described by the top-level triplet “handsaw saw plank”, and has two lower-level triplets (which come before the top-level action in time), namely “hand grasp saw” and “hand grasp plank”. Intuitively, the process of observing and interpreting manipulation actions is syntactically quite similar to 111 Table 5.1: A Manipulation Action Context-Free Grammar. AP → A O | A HP (1) HP → H AP | HP AP (2) H → h A → a O → o (3) natural language understanding. Thus, the Manipulation Grammar (Table 5.1) is presented to parse manipulation actions. The nonterminals H, A, and O represent the hand, the action and the object (the tools and objects under manipulation) respectively, and the terminals h, a, and o are the observations. AP stands for Action Phrase and HP for Hand Phrase. They are proposed here as XPs following the X-bar theory, which is used to construct the logical form of the semantic structure [151]. The design of this grammar is motivated by three observations: (i) Hands are the main driving force in manipulation actions, so a specialized nonterminal symbol H is used for their representation; (ii) An Action (A) can be applied to an Object (O) directly or to a Hand Phrase (HP ), which in turn contains an Object (O), as encoded in Rule (1), which builds up an Action Phrase (AP ); (iii) An Action Phrase (AP ) can be combined either with the Hand (H) or a Hand Phrase, as encoded in rule (2), which recursively builds up the Hand Phrase. The rules discussed in Table 5.1 form the syntactic rules of the grammar used in the parsing algorithms. 112 5.3.2 Cognitive MACFG Parsing Algorithms Our aim for this project is not only to provide a grammar for representing manipulation actions, but also to develop a set of operations that can automatically parse (create or dissolve) the semantic tree structures. This is crucial for practical purposes, since parsing a manipulation action is inherently an on-line process. The observations are obtained in a temporal sequence. Thus, the parsing algorithm for the grammar should be able to dynamically update the tree structures. At any point, the current leaves of the semantic forest structures represent the actions and objects involved so far. When a new triplet of (“Subject”, “Action”, “Patient”) arrives, the parser updates the tree using the construction or destruction routine. Theoretically, the non-regular context-free language defined in Table 5.1 can be recognized by a non-deterministic pushdown automaton. However, different from language input, the perception input is naturally a temporal sequence of observa- tions. Thus, instead of simply building a non-deterministic pushdown automaton, it requires a special set of parsing operations. Our parsing algorithm differentiates between constructive and destructive ac- tions. 
Constructive actions are the movements that start with the hand (or a tool) coming into contact with an object and usually result in a certain physical change on the object (a consequence), e.g., "Grasp", "Cut", or "Saw". Destructive actions are movements at the end of physical-change-inducing actions, when the hand (or tool) separates from the object; examples are "Ungrasp" or "FinishedCut". A constructive action may or may not have a corresponding destructive action, but every destructive action must have a corresponding constructive action; otherwise the parsing algorithm detects an error. In order to facilitate the action recognition, a look-up table that stores the constructive-destructive action pairs is used. This knowledge can be learned and further expanded.

Figure 5.2: The (a) construction and (b) destruction operations. Fine dashed lines are newly added connections, crosses are node deletions, and fine dotted lines are connections to be deleted.

The algorithm builds a tree structure Ts. This structure is updated as new observations are received. Once an observation triplet "Subject", "Action", "Patient" arrives, the algorithm checks whether the "Action" is constructive or destructive and then follows one of two pathways. If the "Action" is constructive, a construction routine is used; otherwise a destruction routine is used (refer to Algorithm 2, Algorithm 3, and Algorithm 4 for details). The process continues until the last observation. The two illustrations in Figure 5.2 demonstrate how the construction and the destruction routines work. The parse operation amounts to a chart parser [152], which takes in the three nonterminals and performs bottom-up parsing following the context-free rules from Table 5.1.

Algorithm 2 Dynamic manipulation action tree parsing
  Initialize an empty tree group (forest) Ts
  while New observation (subject s, action a, patient p) do
    if a is a constructive action then
      construction(Ts, s, a, p)
    end if
    if a is a destructive action then
      destruction(Ts, s, a, p)
    end if
  end while

Figure 5.3 shows a typical manipulation action example. The parsing algorithm takes as input a sequence of key observations: "LeftHand Grasp Knife; RightHand Grasp Eggplant; Knife Cut Eggplant; Knife FinishedCut Eggplant; RightHand Ungrasp Eggplant; LeftHand Ungrasp Knife". A sequence of six tree structures is then built up or dissolved along the timeline. We provide more examples in Section 5.4, and a sample implementation of the parsing algorithm at http://www.umiacs.umd.edu/~yzyang/MACFG. For clarity, Figure 5.3 uses a dummy root node to create a tree structure from a forest and numbers the nonterminal nodes.

Algorithm 3 construction(Ts, s, a, p)
  Input: previous tree group (forest) Ts and new observation (subject s, action a, patient p)
  if s is a hand h and p is an object o then
    Find the highest subtrees Th and To from Ts containing h and o. If h or o is not in the current forest, create new subtrees Th and To, respectively.
    parse(Th, a, To) and attach it to update Ts.
  end if
  if s is an object o1 and p is another object o2 then
    Find the highest subtrees T_o^1 and T_o^2 from Ts containing o1 and o2, respectively. If either o1 or o2 is not in the current forest, create the new subtree T_o^1 or T_o^2. If both o1 and o2 are not in the current forest, raise an error.
    parse(T_o^1, a, T_o^2) and attach it to update Ts.
  end if
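The top-level control flow of Algorithm 2 is a simple dispatch on whether the incoming action symbol is constructive or destructive, checked against the constructive-destructive pairing table. A minimal sketch is given below; the pairing table is an illustrative subset, and `construction` and `destruction` stand for Algorithm 3 above and Algorithm 4 below.

```python
# Sketch of Algorithm 2; the pairing table is an illustrative subset.
CONSTRUCTIVE_OF = {          # destructive symbol -> its constructive partner
    "Ungrasp": "Grasp",
    "FinishedCut": "Cut",
    "FinishedSaw": "Saw",
    "FinishedAssemble": "Assemble",
}

def parse_action_stream(observations, construction, destruction):
    """observations: temporally ordered (subject, action, patient) triplets.
    construction / destruction: callables implementing Algorithms 3 and 4."""
    forest = []                              # the tree group Ts
    for subject, action, patient in observations:
        if action in CONSTRUCTIVE_OF:        # destructive action: dissolve
            destruction(forest, subject, action, patient)
        else:                                # constructive action: build
            construction(forest, subject, action, patient)
    return forest
```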
Algorithm 4 destruction(Ts, s, a, p)
  Input: previous tree structures Ts and new observation (subject s, action a, patient p)
  Find the corresponding constructive action of a from the look-up table and denote it as a'
  if there exists a lowest subtree T_a' that contains both s and a' then
    Remove every node on the path that starts from the root of T_a' to a'.
    if T_a' has a parent node then
      Connect the highest subtree that contains s with T_a''s parent node.
    end if
    Leave all the remaining subtrees as individual trees.
  end if
  Set the rest of Ts as the new Ts.

5.3.3 Attention Mechanism with the Torque Operator

It is essential for our cognitive system to have an effective attention mechanism, because the amount of information in real-world images is vast. Visual attention, the process of driving an agent's attention to a certain area, is based on both bottom-up processes defined on low-level visual features and top-down processes influenced by the agent's previous experience and goals [123]. Recently, [153] have provided a vision tool, called the image torque, that captures the concept of closed contours using bottom-up processes. Basically, the torque operator takes simple edges as input and computes, over regions of different sizes, a measure of how well the edges are aligned to form a closed, convex contour. The underlying motivation is to find object-like regions by computing the "coherence" of the edges that support the object. Edge coherence is measured via the cross-product between the tangent vector of the edge pixel and a vector from a center point to the edge pixel, as shown in Figure 5.4(a). Formally, the torque value τ_{pq} of an edge pixel q within a discrete image patch with center p is defined as

\tau_{pq} = \|\vec{r}_{pq}\| \sin\theta_{pq},   (5.1)

where r_{pq} is the displacement vector from p to q, and θ_{pq} is the angle between r_{pq} and the tangent vector at q. The sign of τ_{pq} depends on the direction of the tangent vector, and for this work, our system computes the direction based on the intensity contrast along the edge pixel. The torque of an image patch P is defined as the sum of the torque values of all edge pixels E(P) within the patch:

\tau_P = \frac{1}{2|P|} \sum_{q \in E(P)} \tau_{pq}.   (5.2)

In this work, our system processes the testing sequences by applying the torque operator to obtain possible initial fixation points for the object monitoring process. Figure 5.4 shows an example of the application of the torque operator.

Figure 5.3: Here the system observes a typical manipulation action example, "Cut an eggplant", and builds a sequence of six action trees.

Figure 5.4: (a) Torque for images, (b) a sample input frame, and (c) the torque operator response. Crosses are the pixels with top extreme torque values that serve as the potential fixation points.

The system also employs a top-down attention mechanism; it uses the hand location to guide the attention. Here, we integrate the bottom-up torque operator output with hand tracking. Potential objects under manipulation are found when one of the hand regions intersects a region with high torque responses, after which the object monitoring system (Section 5.3.5) monitors it.

5.3.4 Hand Tracking, Grasp Classification and Action Recognition

With the recent development of a vision-based, markerless, fully articulated model-based human hand tracking system [32] (http://cvrlcode.ics.forth.gr/handtracking/), the system is able to track a 26 degree-of-freedom model of the hand.
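Returning to the attention mechanism of Section 5.3.3, the torque computation of Eqs. (5.1) and (5.2) can be sketched as follows. The sketch assumes an edge mask and per-pixel tangent orientations are already available, and it omits the intensity-contrast rule that fixes the sign of the tangent direction; the function and variable names are illustrative.

```python
import numpy as np

def patch_torque(edge_mask, tangent_angle, center, half_size):
    """Torque of one square image patch P centered at p (Eqs. 5.1-5.2)."""
    cy, cx = center
    # Edge pixels q inside the patch.
    win = edge_mask[cy - half_size:cy + half_size + 1,
                    cx - half_size:cx + half_size + 1]
    ys, xs = np.nonzero(win)
    ys, xs = ys + cy - half_size, xs + cx - half_size
    # Displacement vectors r_pq from the patch center p to each edge pixel q.
    ry, rx = ys - cy, xs - cx
    r_norm = np.hypot(rx, ry)
    # theta_pq: angle between r_pq and the edge tangent at q.
    theta = tangent_angle[ys, xs] - np.arctan2(ry, rx)
    tau_pq = r_norm * np.sin(theta)               # Eq. (5.1)
    area = (2 * half_size + 1) ** 2               # |P|
    return tau_pq.sum() / (2.0 * area)            # Eq. (5.2)
```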
It is worth noting, however, that for a simple classification of movements into a small number of actions, the location of the hands and objects would be sufficient. Moreover, with the full hand model, a finer granularity of description can be achieved by classifying the tracked hand model into different grasp types. We collected training data from different actions, which was then processed, and a set of bio-inspired features, following hand anatomy [33], was extracted.

Figure 5.5: (a) Bones of the human hand. (b) Arches of the hand: (1) one of the oblique arches; (2) one of the longitudinal arches of the digits; (3) transverse metacarpal arch; (4) transverse carpal arch. Source: [7].

Intuitively, the arches formed by the fingers are crucial for differentiating grasp types. Figure 5.5 shows that the fixed and mobile parts of the hand adapt to various everyday tasks by forming bony arches: longitudinal arches (the rays formed by finger bones and associated metacarpal bones) and oblique arches (between the thumb and four fingers). In each image frame, our system computed the oblique and longitudinal arches to obtain an eight-parameter feature vector, as in Figure 5.6(a). We further reduced the dimensionality of the feature space by Principal Component Analysis and then applied k-means clustering to discover the four underlying general types of grasp: Rest, Firm Grasp, Delicate Grasp (Pinch) and Extension Grasp. To classify a given test sequence, the data was processed as described above and the grasp type was then computed using a naive Bayesian classifier. Figure 5.6(c) and (d) show examples of the classification result.

Figure 5.6: (a) One example of fully articulated hand model tracking, (b) a 3-D illustration of the tracked model, and (c-d) examples of grasp type recognition for both hands.

The grasp classification is used to segment the image sequence in time and also serves as part of the action description. In addition, our system uses the trajectory of the mass center of the hands to classify the actions. The hand-tracking software provides the hand trajectories (of the given action sequence between the onset of the grasp and the release of the object), from which our system computed global features of the trajectory, including frequency and velocity components. Frequency is encoded by the first four real coefficients of the Fourier transform in the x, y and z directions, which gives a 24-dimensional vector over both hands. Velocity is encoded by averaging the difference in hand positions between two adjacent timestamps, which gives a six-dimensional vector. These features are combined to yield a 30-dimensional vector that the system uses for action recognition [24] (see the sketch below).

5.3.5 Object Monitoring and Recognition

Manipulation actions commonly involve objects. In order to obtain the information necessary to monitor the objects being worked on, our system applies a new method combining segmentation and tracking [26]. This method combines stochastic tracking [122] with a fixation-based active segmentation [121]. The tracking module provides a number of tracked points. The locations of these points define an area of interest and a fixation point for the segmentation. The data term of the segmentation module uses the color immediately surrounding the fixation points. The segmentation module segments the object and updates the appearance model for the tracker.
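Returning briefly to the trajectory-based action descriptor of Section 5.3.4: the 30-dimensional feature (24 Fourier values plus 6 velocity values over both hands) can be sketched roughly as follows; the exact normalization and coefficient conventions are assumptions rather than the thesis implementation.

```python
import numpy as np

def trajectory_features(left_traj, right_traj):
    """30-D descriptor: frequency (24-D) + velocity (6-D) over both hands.

    left_traj, right_traj: (T, 3) arrays of hand mass-center positions,
    taken between grasp onset and object release.
    """
    feats = []
    for traj in (left_traj, right_traj):
        # First four real Fourier coefficients per axis: 4 x 3 = 12 values.
        for axis in range(3):
            feats.extend(np.fft.rfft(traj[:, axis])[:4].real)
        # Average frame-to-frame displacement per axis: 3 values.
        feats.extend(np.diff(traj, axis=0).mean(axis=0))
    return np.asarray(feats)          # 2 * (12 + 3) = 30 dimensions
```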
Using this method, the system tracks objects as they deform and change topology (two objects can merge, or an object can be divided into parts). For object recognition, our system simply uses color information. The system uses a color distribution model in order to be invariant to various textures or patterns. A function h(x_i) is defined to create a color histogram; it assigns one of the m bins to the color at location x_i. To make the algorithm less sensitive to lighting conditions, the system uses the Hue-Saturation-Value color space with less sensitivity in the V channel (8 × 8 × 4 bins). The color distribution for segment s^(n) is denoted as

p^{(s^{(n)})}(u) = \gamma \sum_{i=1}^{I} k(\|y - x_i\|)\,\delta[h(x_i) - u],   (5.3)

where u = 1 ... m, δ(·) is the Kronecker delta function, and γ is the normalization term \gamma = 1 / \sum_{i=1}^{I} k(\|y - x_i\|). k(·) is a weighting function designed from the intuition that not all pixels in the sampling region are equally important for describing the color model. Specifically, pixels that are farther away from the fixation point are assigned smaller weights:

k(r) = \begin{cases} 1 - r^2 & \text{if } r < a \\ 0 & \text{otherwise} \end{cases},   (5.4)

where the parameter a is used to adapt the size of the region, and r is the distance from the fixation. By applying the weighting function, we increase the robustness of the distribution by weakening the influence of boundary pixels that may belong to the background or be occluded.

Since the objects in our experiments have distinct color profiles, the color distribution model was sufficient to recognize the segmented objects. We manually labelled several examples from each object class as training data and used a k-nearest-neighbours classifier. Figure 5.7 shows sample results.

5.3.6 Detection of Manipulation Action Consequences

Taking an object-centric point of view, manipulation actions can be classified into six categories according to how the manipulation transforms the object, or, in other words, what consequence the action has on the object. These categories are Divide, Assemble, Create, Consume, Transfer, and Deform. To describe these action categories we need a formalism. We use the visual semantic graph (VSG) inspired by the work of [14]. This formalism takes as input computed object segments, their spatial relationships, and the temporal relationships over consecutive frames. Please refer to Chapter 4 for details.

Our system integrated the visual modules in the following manner. Since hands are the most important components in manipulation actions, a state-of-the-art markerless hand tracking system obtains skeleton models of both hands. Using this data, our system classifies the manner in which the human grasps the objects into four primitive categories. On the basis of the grasp classification, our system finds the start and end points of action sequences. Our system then classifies the action from the hand trajectories, the hand grasp, and the consequence on the object (as explained above). To obtain objects, our system monitors the manipulated object using a process that combines stochastic tracking with active segmentation, and then recognizes the segmented objects using color. Finally, based on the monitoring process, our system checks and classifies the effect on the object into one of the fundamental types of "consequences". The final output is a sequence of "Subject" "Action" "Patient" triplets, which the manipulation grammar parser takes as input to build up semantic structures.
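The fixation-weighted color model of Section 5.3.5 (Eqs. (5.3) and (5.4)) amounts to a kernel-weighted HSV histogram. A minimal sketch follows; it assumes that the distance r in Eq. (5.4) is normalized by the region-size parameter a, and the 8 × 8 × 4 binning is taken from the text.

```python
import numpy as np

def weighted_hsv_histogram(pixels_hsv, dist_to_fixation, a, bins=(8, 8, 4)):
    """Color distribution of Eq. (5.3) with the kernel of Eq. (5.4).

    pixels_hsv: (N, 3) HSV values in [0, 1) for the pixels of one segment.
    dist_to_fixation: (N,) distances of those pixels from the fixation point.
    a: region-size parameter of Eq. (5.4).
    """
    r = dist_to_fixation / a
    k = np.where(r < 1.0, 1.0 - r ** 2, 0.0)                 # Eq. (5.4)
    # h(x_i): map each HSV triple to one of 8 * 8 * 4 = 256 bins.
    b = np.asarray(bins)
    idx3 = np.minimum((pixels_hsv * b).astype(int), b - 1)
    idx = idx3 @ np.array([b[1] * b[2], b[2], 1])
    hist = np.bincount(idx, weights=k, minlength=b.prod())
    return hist / hist.sum()                                  # gamma in Eq. (5.3)
```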
5.4 Experiments

The theoretical framework we have presented suggests two hypotheses that deserve empirical tests: (a) manipulation actions performed by single human agents obey a manipulation action context-free grammar that includes manipulators, actions, and objects as terminal symbols; (b) a variant of chart parsing that includes both constructive and destructive actions, combined with methods for hand tracking and for action and object recognition from RGBD data, can parse observed human manipulation actions.

To test the two hypotheses empirically, we need to define a set of performance variables and how they relate to our predicted results. The first hypothesis relates to representations, and we can empirically test whether it is possible to manually generate target trees for each manipulation action in the test set. The second hypothesis has to do with processes, and can be tested by observing how many times our system successfully builds a parse tree from observed human manipulation actions. The theoretical framework for this consists of two parts: (1) a visual system that detects (subject, action, object) triplets from the input sensory data and (2) a variant of the chart parsing system that transforms the sequence of triplets into tree representations. Thus, we further separate the test of our second hypothesis into two sub-tasks: (1) we measure precision and recall by comparing the triplets detected by our visual system with human-labelled ground truth data, and (2) given the ground truth triplets as input, we measure the success rate of our variant of the chart parsing system by comparing its output with the target trees for each action. We consider a parse successful if the generated tree is identical to the manually generated target parse. Therefore, we consider the second hypothesis supported when (1) our visual system achieves high precision and recall, and (2) our parsing system achieves a high success rate. We use the ground truth triplets as input instead of the detected ones because we cannot expect the visual system to generate the triplets with 100% precision and recall, due to occlusions, shadows, and unexpected events. As a complete system, we expect the visual module to have high precision and recall, so that the detected triplets can be used as input to the parsing module in practice.

We designed our experiments with one RGBD camera in a fixed location (we used a Kinect sensor). We asked human subjects to perform a set of manipulation actions in front of the camera while both objects and tools were present within the view of the camera during the activity. We collected RGBD sequences of manipulation actions being performed by one human, and to ensure some diversity, we collected these from two domains, namely kitchen and manufacturing environments.

Table 5.2: "Hands", "Objects" and "Actions" involved in the experiments.
Kitchen | Hands: LeftHand, RightHand | Objects: Bread, Cheese, Tomato, Eggplant, Cucumber | Constructive: Grasp, Cut, Assemble | Destructive: Ungrasp, FinishedCut, FinishedAssemble
Manufacturing | Hands: LeftHand, RightHand | Objects: Plank, Saw | Constructive: Grasp, Saw, Assemble | Destructive: Ungrasp, FinishedSaw, FinishedAssemble

The kitchen action set included "Making a sandwich", "Cutting an eggplant", and "Cutting bread", and the manufacturing action set included "Sawing a plank into two pieces" and "Assembling two pieces of the plank". To further diversify the data set, we adopted two different viewpoints for each action set.
For the kitchen actions, we used a front-view setting; for the manufacturing actions, a side-view setting. The five sequences have 32 human-labelled ground truth triplets. Table 5.2 lists the "Subjects", "Objects", and "Actions" involved in our experiments.

To evaluate the visual system, we applied the vision techniques introduced in Sections 5.3.3 to 5.3.6. Specifically, the grasp type classification module provides a "Grasp" signal when the hand status changes from "Rest" to one of the three other types, and an "Ungrasp" signal when it changes back to "Rest". At the same time, the object monitoring and segmentation-based object recognition module provides the "Object" symbol when either of the hands touches an object. Also, the hand tracking module provides trajectory profiles that enable the trajectory-based action recognition module to produce "Action" symbols such as "Cut" and "Saw". The action "Assemble" did not have a distinctive trajectory profile, so we simply generated it when the "Cheese" merged with the "Bread", based on the object monitoring process. At the end of each recognized action, the corresponding destructive symbol, as defined in Table 5.2, is produced, and the consequence checking module is called to confirm the action consequence.

Figure 5.7: The second row shows the hand tracking and object monitoring. The third row shows the object recognition result, where each segmentation is labelled with an object name and a bounding box in a different color. The fourth and fifth rows depict the hand speed profile and the Euclidean distances between hands and objects. The sixth row shows the consequence detection.

Figure 5.7 shows intermediate and final results of the vision modules on a sequence of a person making a sandwich. In this scenario, our system reliably tracks, segments and recognizes both hands and objects, recognizes "grasp", "ungrasp" and "assemble" events, and generates a sequence of triplets along the time-line. To evaluate the parsing system, given the sequence of ground truth triplets as input, a sequence of trees (or forests) is created or dissolved dynamically using the manipulation action context-free grammar parser (Sections 5.3.1 and 5.3.2).

Our experiments produced three results: (i) we were able to manually generate a sequence of target tree representations for each of the five sequences in our data set; (ii) our visual system detected 34 triplets, of which 29 were correct (compared with the 32 ground truth labels), yielding a precision of 85.3% and a recall of 90.6%; (iii) given the sequence of ground truth triplets, our parsing system successfully parsed all five sequences in our data set into tree representations matching the target parses. Figure 5.8 shows the tree structures built from the sequence of triplets of the "Making a sandwich" sequence. More results for the remaining manipulation actions in the data set can be found at http://www.umiacs.umd.edu/~yzyang/MACFG. Overall, (i) supports our first hypothesis that human manipulation actions obey a manipulation action context-free grammar that includes manipulators, actions, and objects as terminal symbols, while (ii) and (iii) support our second hypothesis that the implementation of our cognitive system can parse observed human manipulation actions.[1] The experimental results support our hypotheses, but we have not tested our system on a large data set with a variety of manipulation actions.
We are currently testing the system on a larger set of kitchen actions and checking whether our hypotheses are still supported.

[1] As the experiments demonstrate, the system was robust enough to handle situations that involve hesitation, in which the human grasps a tool, finds that it is not the desired one, and ungrasps it (as in Figure 5.9).

Figure 5.8: The tree structures generated from the "Make a Sandwich" sequence. Figure 5.7 depicts the corresponding visual processing. Since our system detected six triplets temporally from this sequence, it produced a set of six trees. The order of the six trees is from left to right.

Figure 5.9: Example of the grammar dealing with hesitation. This figure shows key frames of the input visual data and the semantic tree structures.

Chapter 6: The Semantics: Learning Manipulation Action Semantics through Probabilistic Combinatory Categorial Grammar Parsing

6.1 Introduction

Autonomous robots will need to learn the actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. This requires a formal system to represent the action semantics. This representation needs to store the semantic information about the actions, be encoded in a machine-readable language, and be inherently programmable in order to enable reasoning beyond observation. A formal representation of this kind has a variety of other applications, such as intelligent manufacturing, human-robot collaboration, action planning and policy design. In this chapter, we are concerned with manipulation actions, that is, actions performed by agents (humans or robots) on objects, resulting in some physical change of the object. However, most current AI systems require manually defined semantic rules. In this work, we propose a computational linguistics framework, based on probabilistic semantic parsing with Combinatory Categorial Grammar (CCG), to learn manipulation action semantics (lexicon entries) from annotations. We later show that this learned lexicon enables our system to reason about manipulation action goals beyond just observation. Thus the intelligent system can not only imitate human movements, but also imitate action goals.
Understanding actions by observation and executing them are generally con- sidered as dual problems for intelligent agents. The sensori-motor bridge connecting the two tasks is essential, and a great amount of attention in AI, Robotics as well as Neurophysiology has been devoted to investigating it. Experiments conducted on primates have discovered that certain neurons, the so-called mirror neurons, fire during both observation and execution of identical manipulation tasks [9, 10]. This suggests that the same process is involved in both the observation and execution of actions. From a functionalist point of view, such a process should be able to first build up a semantic structure from observations, and then the decomposition of that same structure should occur when the intelligent agent executes commands. Additionally, studies in linguistics [154] suggest that the language faculty de- velops in humans as a direct adaptation of a more primitive apparatus for planning goal-directed action in the world by composing affordances of tools and consequences of actions. It is this more primitive apparatus that is our major interest in this chap- ter. Such an apparatus is composed of a “syntax part” and a “semantic part”. In the syntax part, every linguistic element is categorized as either a function or a basic type, and is associated with a syntactic category which either identifies it as a function or a basic type. In the semantic part, a semantic translation is attached following the syntactic category explicitly. 132 Combinatory Categorial Grammar (CCG) introduced by [155] is a theory that can be used to represent such structures with a small set of combinators such as functional application and type-raising. What do we gain though from such a for- mal description of action? This is similar to asking what one gains from a formal description of language as a generative system. Chomskys contribution to language research was exactly this: the formal description of language through the formulation of the Generative and Transformational Grammar [22]. It revolutionized language research opening up new roads for the computational analysis of language, providing researchers with common, generative language structures and syntactic operations, on which language analysis tools were built. A grammar for action would contribute to providing a common framework of the syntax and semantics of action, so that ba- sic tools for action understanding can be built, tools that researchers can use when developing action interpretation systems, without having to start development from scratch. The same tools can be used by robots to execute actions. In this chapter, we propose an approach for learning the semantic meaning of manipulation action through a probabilistic semantic parsing framework based on CCG theory. For example, we want to learn from an annotated training action corpus that the action “Cut” is a function which has two arguments: a subject and a patient. Also, the action consequence of “Cut” is a separation of the patient. Using formal logic representation, our system will learn the semantic representations of “Cut”: Cut :=(AP\NP )/NP : λx.λy.cut(x, y)→ divided(y) 133 Here cut(x, y) is a primitive function. We will further introduce the representation in Sec. 6.3. Since our action representation is in a common calculus form, it enables naturally further logical reasoning beyond visual observation. 
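To make the target representation concrete, the lexical entry for "Cut" shown above can be encoded, for illustration only, as a curried function whose value carries both the event and the predicted consequence. This Python encoding is an informal stand-in for the typed λ-calculus introduced in Sec. 6.3, not the system's actual representation.

```python
# Illustrative encoding of  Cut := (AP\NP)/NP : \x.\y.cut(x, y) -> divided(y)
def cut(subject):
    def apply_to(patient):
        return {
            "event": ("cut", subject, patient),        # cut(x, y)
            "consequence": ("divided", patient),       # -> divided(y)
        }
    return apply_to

ap = cut("knife")("cucumber")
print(ap["event"])         # ('cut', 'knife', 'cucumber')
print(ap["consequence"])   # ('divided', 'cucumber')
```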
The advantage of our approach is twofold: 1) Learning semantic representa- tions from annotations helps an intelligent agent to enrich automatically its own knowledge about actions; 2) The formal logic representation of the action could be used to infer the object-wise consequence after a certain manipulation, and can also be used to plan a set of actions to reach a certain action goal. We further validate our approach on a large publicly available manipulation action dataset (MANIAC) from [15], achieving promising experimental results. Moreover, we believe that our work, even though it only considers the domain of manipulation actions, is also a promising example of a more closely intertwined computer vision and computational linguistics system. The diagram in Fig.6.1 depicts the framework of the system. 6.2 Related Works Manipulation Action Grammar: As mentioned before, [148] suggested that a minimalist generative grammar, similar to the one of human language, also exists for action understanding and execution. The works closest related to this chapter are [17,18,31]. [17] first discussed a Chomskyan grammar for understanding complex actions as a theoretical concept, and [18] provided an implementation of such a grammar using as perceptual input only objects. More recently, [30] proposed a set of context-free grammar rules for manipulation action understanding, and [29] 134 Figure 6.1: A CCG based semantic parsing framework for manipulation actions. applied it on unconstrained instructional videos. However, these approaches only consider the syntactic structure of manipulation actions without coupling seman- tic rules using λ expressions, which limits the capability of doing reasoning and prediction. Combinatory Categorial Grammar and Semantic Parsing: CCG based semantic parsing originally was used mainly to translate natural language sentences to their desired semantic representations as λ-calculus formulas [156, 157]. [158] presented a framework of grounded language acquisition: the interpretation of lan- guage entities into semantically informed structures in the context of perception 135 and actuation. The concept has been applied successfully in tasks such as robot navigation [159], forklift operation [160] and of human-robot interaction [161]. In this work, instead of grounding natural language sentences directly, we ground in- formation obtained from visual perception into semantically informed structures, specifically in the domain of manipulation actions. 6.3 A CCG Framework for Manipulation Actions Before we dive into the semantic parsing of manipulation actions, a brief in- troduction to the Combinatory Categorial Grammar framework in Linguistics is necessary. We will only introduce related concepts and formalisms. For a complete background reading, we would like to refer readers to [155]. We will first give a brief introduction to CCG and then introduce a fundamental combinator, i.e., functional application. The introduction is followed by examples to show how the combinator is applied to parse actions. 6.3.1 Manipulation Action Semantics The semantic expression in our representation of manipulation actions uses a typed λ-calculus language. The formal system has two basic types: entities and functions. Entities in manipulation actions are Objects or Hands, and functions are the Actions. Our lambda-calculus expressions are formed from the following items: Constants: Constants can be either entities or functions. 
For example, Knife is an entity (i.e., it is of type N) and Cucumber is an entity too (i.e., it is of type N). Cut is an action function that maps entities to entities. When the event Knife Cut Cucumber has happened, the expression cut(Knife, Cucumber) returns an entity of type AP, i.e., an Action Phrase. Constants like divided are status functions that map entities to truth values. The expression divided(cucumber) returns a true value after the event (Knife Cut Cucumber) has happened.

Logical connectors: The λ-calculus expressions use logical connectors such as conjunction (∧), disjunction (∨), negation (¬) and implication (→). For example, the expression connected(tomato, cucumber) ∧ divided(tomato) ∧ divided(cucumber) represents the joint status in which the sliced tomato is merged with the sliced cucumber. It can be regarded as a simplified goal status for "making a cucumber tomato salad". The expression ¬connected(spoon, bowl) represents the status after the spoon has finished stirring the bowl. λx.cut(x, cucumber) → divided(cucumber) states that if the cucumber is cut by x, then the status of the cucumber is divided.

λ expressions: Lambda expressions represent functions with unknown arguments. For example, λx.cut(knife, x) is a function from entities to entities, which applies to any entity of type N that is cut by the knife.

6.3.2 Combinatory Categorial Grammar

The semantic parsing formalism underlying our framework for manipulation actions is that of Combinatory Categorial Grammar (CCG) [155]. A CCG specifies one or more logical forms for each element or combination of elements of a manipulation action. In our formalism, an element of Action is associated with a syntactic "category" which identifies it as a function, and specifies the type and directionality of its arguments and the type of its result. For example, the action "Cut" is a function from a patient object phrase (NP) on the right into predicates, and into functions from a subject object phrase (NP) on the left into a sub-action phrase (AP):

Cut := (AP\NP)/NP.   (6.1)

As a matter of fact, the pure categorial grammar is a context-free grammar presented in the accepting, rather than the producing, direction. The expression (6.1) is just an accepting form for the action "Cut" following the context-free grammar. While it is now convenient to write derivations as follows, they are equivalent to conventional tree structure derivations, as shown in Figure 6.2.

    Knife        Cut            Cucumber
    N            (AP\NP)/NP     N
    NP                          NP
                 -------------------- >
                       AP\NP
    -------------------------------- <
                   AP

Figure 6.2: Example of the conventional tree structure.

The semantic type is encoded in these categories, and their translation can be made explicit in an expanded notation. Basically, a λ-calculus expression is attached to the syntactic category. A colon operator is used to separate the syntactic and semantic expressions, and the right side of the colon is assumed to have lower precedence than the left side. This is intuitive, as any explanation of manipulation actions should first obey syntactic rules, and then semantic rules. Now the basic element, the action "Cut", can be further represented by:

Cut := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y).

(AP\NP)/NP denotes a phrase of type AP, which requires an element of type NP to specify what object was cut, and requires another element of type NP to further complement what effector initiates the cut action. λx.λy.cut(x, y) is the λ-calculus representation for this function.
Since the functions are closely related to the state update, "→ divided(y)" further specifies the status expression after the action has been performed.

A CCG system has a set of combinatory rules which describe how adjacent syntactic categories in a string can be recursively combined. In the setting of manipulation actions, we want to point out that similar combinatory rules are also applicable. In particular, the functional application rules are essential in our system.

6.3.3 Functional application

The functional application rules with semantics can be expressed in the following form:

A/B : f \quad B : g \;\Rightarrow\; A : f(g)   (6.2)
B : g \quad A\backslash B : f \;\Rightarrow\; A : f(g)   (6.3)

Rule (6.2) says that a string with type A/B can be combined with a right-adjacent string of type B to form a new string of type A. At the same time, it also specifies how the semantics of the category A can be compositionally built out of the semantics of A/B and B. Rule (6.3) is the symmetric form of Rule (6.2). In the domain of manipulation actions, the following derivation is an example CCG parse. It shows how the system can parse an observation ("Knife Cut Cucumber") into a semantic representation (cut(knife, cucumber) → divided(cucumber)) using the functional application rules.

    Knife         Cut                                   Cucumber
    N : knife     (AP\NP)/NP :                          N : cucumber
    NP : knife    λx.λy.cut(x, y) → divided(y)          NP : cucumber
                  ------------------------------------------------- >
                  AP\NP : λx.cut(x, cucumber) → divided(cucumber)
    --------------------------------------------------------------- <
                  AP : cut(knife, cucumber) → divided(cucumber)

6.4 Learning Model and Semantic Parsing

Having defined the formalism and the application rules, instead of manually writing down all the possible CCG representations for each entity, we would like to apply a learning technique to derive them from the paired training corpus. Here we adopt the learning model of [156] and use it to assign weights to the semantic representations of actions. Since an action may have multiple possible syntactic and semantic representations assigned to it, we use the probabilistic model to assign weights to these representations.

6.4.1 Learning Approach

First we assume that complete syntactic parses of the observed action are available; in fact, a manipulation action can have several different parses. The parsing uses a probabilistic combinatory categorial grammar framework similar to the one given by [157]. We assume a probabilistic categorial grammar (PCCG) based on a log-linear model. M denotes a manipulation task, L denotes the semantic representation of the task, and T denotes its parse tree. The probability of a particular syntactic and semantic parse is given as

P(L, T \mid M; \Theta) = \frac{e^{f(L,T,M)\cdot\Theta}}{\sum_{(L,T)} e^{f(L,T,M)\cdot\Theta}},   (6.4)

where f is a mapping of the triple (L, T, M) to a feature vector in R^d, and Θ ∈ R^d represents the weights to be learned. Here we use only lexical features, where each feature counts the number of times a lexical entry is used in T. Parsing a manipulation task under the PCCG equates to finding L such that P(L | M; Θ) is maximized:

\arg\max_L P(L \mid M; \Theta) = \arg\max_L \sum_T P(L, T \mid M; \Theta).   (6.5)

We use dynamic programming techniques to calculate the most probable parse for the manipulation task. In this chapter, the implementation from [162] is adopted, where an inverse-λ technique is used to generalize new semantic representations. The generalization of lexicon rules is essential for our system to deal with unknown actions presented during the testing phase.
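The log-linear scoring of Eqs. (6.4) and (6.5) with purely lexical features can be sketched in a few lines. The data layout below (each candidate parse summarized by its semantic form and the multiset of lexical entries it uses) is an assumption made for illustration, not the NL2KR implementation.

```python
import numpy as np

def best_semantic_form(candidates, theta, feat_index):
    """Eqs. (6.4)-(6.5): score candidate (L, T) parses and marginalize over T.

    candidates: list of (L, lexical_entries) pairs, one per parse tree T,
                where L is the semantic form and lexical_entries are the
                entries used in T (the only features in this model).
    theta:      weight vector; feat_index: lexical entry -> dimension.
    """
    def features(entries):
        f = np.zeros(len(theta))
        for e in entries:
            f[feat_index[e]] += 1.0           # count of each lexical entry
        return f

    scores = np.array([features(entries) @ theta for _, entries in candidates])
    p_joint = np.exp(scores - scores.max())
    p_joint /= p_joint.sum()                  # Eq. (6.4)
    p_L = {}
    for (L, _), p in zip(candidates, p_joint):
        p_L[L] = p_L.get(L, 0.0) + p          # Eq. (6.5): sum over trees T
    return max(p_L, key=p_L.get), p_L
```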
6.5 Experiments

6.5.1 Manipulation Action (MANIAC) Dataset

[15] provides a manipulation action dataset with 8 different manipulation actions (cutting, chopping, stirring, putting, taking, hiding, uncovering, and pushing), each of which consists of 15 different versions performed by 5 different human actors.[1] There are in total 30 different objects manipulated in all demonstrations. All manipulations were recorded with the Microsoft Kinect sensor and serve as training data here.

The MANIAC data set contains another 20 long and complex chained manipulation sequences (e.g., "making a sandwich"), which consist of a total of 103 different versions of these 8 manipulation tasks performed in different orders with novel objects under different circumstances. These serve as testing data for our experiments.

[8, 15] developed a semantic event chain based, model-free decomposition approach. It is an unsupervised probabilistic method that measures the frequency of the changes in the spatial relations embedded in event chains, in order to extract the subject and patient visual segments. It also decomposes the long chained complex testing actions into their primitive action components according to the spatio-temporal relations of the manipulator. Since the visual recognition is not the core of this work, we omit the details here and refer the interested reader to [8, 15]. All these features make the MANIAC dataset a great testing bed for both the theoretical framework and the implemented system presented in this work.

[1] Dataset available for download at https://fortknox.physik3.gwdg.de/cns/index.php?page=maniac-dataset.

6.5.2 Training Corpus

We first created a training corpus by annotating the 120 training clips from the MANIAC dataset in the format of observed triplets (subject action patient) and a corresponding semantic representation of the action as well as its consequence. The semantic representations in λ-calculus format were given by human annotators after watching each action clip. A set of sample training pairs is given in Table 6.1 (one from each action category in the training set). Since every training clip contains one single full execution of each manipulation action considered, the training corpus thus has a total of 120 paired training samples.

Table 6.1: Example annotations from the training corpus, one per manipulation action category (snapshot column omitted).
triplet | semantic representation
cleaver chopping carrot | chopping(cleaver, carrot) → divided(carrot)
spatula cutting pepper | cutting(spatula, pepper) → divided(pepper)
spoon stirring bucket | stirring(spoon, bucket)
cup take down bucket | take down(cup, bucket) → ¬connected(cup, bucket) ∧ moved(cup)
cup put on top bowl | put on top(cup, bowl) → on top(cup, bowl) ∧ moved(cup)
bucket hiding ball | hiding(bucket, ball) → contained(bucket, ball) ∧ moved(bucket)
hand pushing box | pushing(hand, box) → moved(box)
box uncover apple | uncover(box, apple) → appear(apple) ∧ moved(box)

We also assume the system knows that every "object" involved in the corpus is an entity of its own type, for example:

Knife := N : knife
Bowl := N : bowl
......

Additionally, we assume the syntactic form of each "action" has the main type (AP\NP)/NP (see Sec. 6.3.2). These two sets of rules form the initial seed lexicon for learning.

6.5.3 Learned Lexicon

We applied the learning technique described in Sec. 6.4, using the NL2KR implementation from [162].
The system learns and generalizes a set of lexicon entries (syntactic and semantic) for each action categories from the training corpus accompanied with a set of weights. We list the one with the largest weight 145 for each action here respectively: Chopping :=(AP\NP )/NP : λx.λy.chopping(x, y) → divided(y) Cutting :=(AP\NP )/NP : λx.λy.cutting(x, y) → divided(y) Stirring :=(AP\NP )/NP : λx.λy.stirring(x, y) Take down :=(AP\NP )/NP : λx.λy.take down(x, y) → ¬connected(x, y) ∧moved(x) Put on top :=(AP\NP )/NP : λx.λy.put on top(x, y) → on top(x, y) ∧moved(x) Hiding :=(AP\NP )/NP : λx.λy.hiding(x, y) → contained(x, y) ∧moved(x) Pushing :=(AP\NP )/NP : λx.λy.pushing(x, y) → moved(y) Uncover :=(AP\NP )/NP : λx.λy.uncover(x, y) → appear(y) ∧moved(x). The set of seed lexicon and the learned lexicon entries are further used to probabilistically parse the detected triplet sequences from the 20 long manipulation activities in the testing set. 146 6.5.4 Deducing Semantics Using the decomposition technique from [8,15], the reported system is able to detect a sequence of action triplets in the form of (Subject Action Patient) from each of the testing sequence in MANIAC dataset. Briefly speaking, the event chain repre- sentation [14] of the observed long manipulation activity is first scanned to estimate the main manipulator, i.e. the hand, and manipulated objects, e.g. knife, in the scene without employing any visual feature-based object recognition method. Solely based on the interactions between the hand and manipulated objects in the scene, the event chain is partitioned into chunks. These chunks are further fragmented into sub-units to detect parallel action streams. Each parsed Semantic Event Chain (SEC) chunk is then compared with the model SECs in the library to decide whether the current SEC sample belongs to one of the known manipulation models or rep- resents a novel manipulation. SEC models, stored in the library, are learned in an on-line unsupervised fashion using the semantics of manipulations derived from a given set of training data in order to create a large vocabulary of single atomic manipulations. For the different testing sequence, the number of triplets detected ranges from two to seven. In total, we are able to collect 90 testing detections and they serve as the testing corpus. However, since many of the objects used in the testing data are not present in the training set, an object model-free approach is adopted and thus “subject” and “patient” fields are filled with segment IDs instead of a specific object name. Fig. 6.3 and 6.4 show several examples of the detected triplets accompanied 147 Figure 6.3: System output on complex chained manipulation testing sequence one. The segmentation output and detected triplets are from [8] . with a set of key frames from the testing sequences. Nevertheless, the method we used here can 1) generalize the unknown segments into the category of object entities and 2) generalize the unknown actions (those that do not exist in the training corpus) into the category of action function. This is done by automatically generalizing the following two types of lexicon entries using the inverse-λ technique from [162]: Object [ID] :=N : object [ID] Unknown :=(AP\NP )/NP : λx.λy.unknown(x, y) Among the 90 detected triplets, using the learned lexicon we are able to parse all of them into semantic representations. Here we pick the representation with the highest probability after parsing as the individual action semantic representa- tion. 
The “parsed semantics” rows of Fig. 6.3 and 6.4 show several example action semantics on testing sequences. Taking the fourth sub-action from Fig. 6.4 as an example, the visually detected triplets based on segmentation and spatial decom- 148 Figure 6.4: System output on the 18th complex chained manipulation testing se- quence. The segmentation output and detected triplets are from [8] . position is (Object 014, Chopping,Object 011). After semantic parsing, the system predicts that divided(Object 011). The complete training corpus and parsed results of the testing set will be made publicly available for future research. 6.5.5 Reasoning Beyond Observations As mentioned before, because of the use of λ-calculus for representing action semantics, the obtained data can naturally be used to do logical reasoning beyond observations. This by itself is a very interesting research topic and it is beyond this chapter’s scope. However by applying a couple of common sense Axioms on the testing data, we can provide some flavor of this idea. Case study one: See the “final action consequence and reasoning” row of Fig. 6.3 for case one. Using propositional logic and axiom schema, we can represent the common sense statement (“if an object x is contained in object y, and object z 149 is on top of object y, then object z is on top of object x”) as follows: Axiom (1): ∃x, y, z, contained(y, x) ∧ on top(z, y)→ on top(z, x). Then it is trivial to deduce an additional final action consequence in this scenario that (on top(object 007, object 009)). This matches the fact: the yellow box which is put on top of the red bucket is also on top of the black ball. Case study two: See the “final action consequence and reasoning” row of Fig. 6.4 for a more complicated case. Using propositional logic and axiom schema, we can represent three common sense statements: 1) “if an object y is contained in object x, and object z is contained in object y, then object z is contained in object x”; 2) “if an object x is contained in object y, and object y is divided, then object x is divided”; 3) “if an object x is contained in object y, and object y is on top of object z, then object x is on top of object z” as follows: Axiom (2): ∃x, y, z, contained(y, x) ∧ contained(z, y)→ contained(z, x). Axiom (3): ∃x, y, contained(y, x) ∧ divided(y)→ divided(x). Axiom (4): ∃x, y, z, contained(y, x) ∧ on top(y, z)→ on top(x, z). With these common sense Axioms, the system is able to deduce several addi- tional final action consequences in this scenario: divided(object 005) ∧ divided(object 010) ∧ on top(object 005, object 012) ∧ on top(object 010, object 012). 150 From Fig. 6.4, we can see that these additional consequences indeed match the facts: 1) the bread and cheese which are covered by ham are also divided, even though from observation the system only detected the ham being cut; 2) the divided bread and cheese are also on top of the plate, even though from observation the system only detected the ham being put on top of the plate. We applied the four Axioms on the 20 testing action sequences and deduced the “hidden” consequences from observation. To evaluate our system performance quantitatively, we first annotated all the final action consequences (both obvious and “hidden” ones) from the 20 testing sequences as ground-truth facts. In total there are 122 consequences annotated. Using perception only [8], due to the decomposition errors (such as the red font ones in Fig. 
6.4) the system can detect 91 consequences correctly, yielding a 74% correct rate. After applying the four Axioms and reasoning, our system is able to detect 105 consequences correctly, yielding a 86% correct rate. Overall, this is a 15.4% of improvement. Here we want to mention a caveat: there are definitely other common sense Axioms that we are not able to address in the current implementation. However, from the case studies presented, we can see that using the presented formal frame- work, our system is able to reason about manipulation action goals instead of just observing what is happening visually. This capability is essential for intelligent agents to imitate action goals from observation. 151 Chapter 7: Procedural Learning: Robot Learning Manipulation Ac- tion Plans by “Watching” Unconstrained Videos from the World Wide Web 7.1 Introduction The ability to learn actions from human demonstrations is one of the major challenges for the development of intelligent systems. Particularly, manipulation actions are very challenging, as there is large variation in the way they can be performed and there are many occlusions. Our ultimate goal is to build a self-learning robot that is able to enrich its knowledge about fine grained manipulation actions by “watching” demo videos. In this work we explicitly model actions that involve different kinds of grasping, and aim at generating a sequence of atomic commands by processing unconstrained videos from the World Wide Web (WWW). The robotics community has been studying perception and control problems of grasping for decades [36]. Recently, several learning based systems were reported that infer contact points or how to grasp an object from its appearance [37, 38]. However, the desired grasping type could be different for the same target object, 152 when used for different action goals. Traditionally, data about the grasp has been acquired using motion capture gloves or hand trackers, such as the model-based tracker of [32]. The acquisition of grasp information from video (without 3D infor- mation) is still considered very difficult because of the large variation in appearance and the occlusions of the hand from objects during manipulation. Our premise is that actions of manipulation are represented at multiple levels of abstraction. At lower levels the symbolic quantities are grounded in perception, and at the high level a grammatical structure represents symbolic information (ob- jects, grasping types, actions). With the recent development of deep neural network approaches, our system integrates a CNN based object recognition and a CNN based grasping type recognition module. The latter recognizes the subject’s grasping type directly from image patches. The grasp type is an essential component in the characterization of manip- ulation actions. Just from the viewpoint of processing videos, the grasp contains information about the action itself, and it can be used for prediction or as a feature for recognition. It also contains information about the beginning and end of action segments, thus it can be used to segment videos in time. If we are to perform the action with a robot, knowledge about how to grasp the object is necessary so the robot can arrange its effectors. For example, consider a humanoid with one parallel gripper and one vacuum gripper. When a power grasp is desired, the robot should select the vacuum gripper for a stable grasp, but when a precision grasp is desired, the parallel gripper is a better choice. 
Thus, knowing the grasping type provides information for the robot to plan the configuration of its effectors, or even the type 153 of effector to use. In order to perform a manipulation action, the robot also needs to learn what tool to grasp and on what object to perform the action. Our system applies CNN based recognition modules to recognize the objects and tools in the video. Then, given the beliefs of the tool and object (from the output of the recognition), our system predicts the most likely action using language, by mining a large corpus using a technique similar to [25]. Putting everything together, the output from the lower level visual perception system is in the form of (LeftHand GraspType1 Object1 Action RightHand GraspType2 Object2). We will refer to this septet of quantities as visual sentence. At the higher level of representation, we generate a symbolic command se- quence. [30] proposed a context-free grammar and related operations to parse ma- nipulation actions. However, their system only processed RGBD data from a con- trolled lab environment. Furthermore, they did not consider the grasping type in the grammar. This work extends [30] by modeling manipulation actions using a prob- abilistic variant of the context free grammar, and explicitly modeling the grasping type. Using as input the belief distributions from the CNN based visual perception system, a Viterbi probabilistic parser is used to represent actions in form of a hier- archical and recursive tree structure. This structure innately encodes the order of atomic actions in a sequence, and forms the basic unit of our knowledge represen- tation. By reverse parsing it, our system is able to generate a sequence of atomic commands in predicate form, i.e. as Action(Subject, Patient) plus the temporal 154 information necessary to guide the robot. This information can then be used to control the robot effectors [139]. Our contributions are twofold. (1) A convolutional neural network (CNN) based method has been adopted to achieve state-of-the-art performance in grasping type classification and object recognition on unconstrained video data; (2) a system for learning information about human manipulation action has been developed that links lower level visual perception and higher level semantic structures through a probabilistic manipulation action grammar. 7.2 Related Works Most work on learning from demonstrations in robotics has been conducted in fully controlled lab environments [14]. Many of the approaches rely on RGBD sensors [18], motion sensors [107, 133] or specific color markers [19]. The proposed systems are fragile in real world situations. Also, the amount of data used for learning is usually quite small. It is extremely difficult to learn automatically from data available on the internet, for example from unconstrained cooking videos from Youtube. The main reason is that the large variation in the scenery will not allow traditional feature extraction and learning mechanism to work robustly. At the high level, a number of studies on robotic manipulation actions have proposed ways on how instructions are stored and analyzed, often as sequences. Work by [163], among others, investigates how to compare sequences in order to reason about manipulation actions using sequence alignment methods, which bor- 155 row techniques from informatics. This chapter proposes a more detailed representa- tion of manipulation actions, the grammar trees, extending earlier work. 
Chomsky in [148] suggested that a minimalist generative grammar, similar to the one of hu- man language, also exists for action understanding and execution. The works closest related to this chapter are [17, 18, 30, 31]. [17] first discussed a Chomskyan gram- mar for understanding complex actions as a theoretical concept, [18] provided an implementation of such a grammar using as perceptual input only objects. [30] pro- posed a set of context-free grammar rules for manipulation action understanding. However, their system used data collected in a lab environment. Here we process unconstrained data from the internet. In order to deal with the noisy visual data, we extend the manipulation action grammar and adapt the parsing algorithm. The recent development of deep neural networks based approaches revolution- ized visual recognition research. The work presented in this chapter shows that with the recent developments of deep neural networks in computer vision, it is possible to learn manipulation actions from unconstrained demonstrations using CNN based visual perception. 7.3 Our Approach We developed a system to learn manipulation actions from unconstrained videos. The system takes advantage of: (1) the robustness from CNN based vi- sual processing; (2) the generality of an action grammar based parser. Figure7.1 shows our integrated approach. 156 Figure 7.1: The integrated system reported in this work. 7.3.1 CNN based visual recognition The system consists of two visual recognition modules, one for classification of grasping types and the other for recognition of objects. In both modules we used convolutional neural networks as classifiers. First, we briefly summarize the basic concepts of Convolutional Neural Networks, and then we present our implementa- tions. 7.3.1.1 Convolutional Neural Network (CNN) is a multilayer learning framework, which may consist of an input layer, a few convolutional layers and an output layer. The goal of CNN is to learn a hi- erarchy of feature representations. Response maps in each layer are convolved with a number of filters and further down-sampled by pooling operations. These pooling operations aggregate values in a smaller region by downsampling functions includ- ing max, min, and average sampling. The learning in CNN is based on Stochastic 157 Gradient Descent (SGD), which includes two main operations: Forward and Back- Propagation. Please refer to [49] for details. We used a seven layer CNN (including the input layer and two perception layers for regression output). The first convolution layer has 32 filters of size 5× 5, the second convolution layer has 32 filters of size 5 × 5, and the third convolution layer has 64 filters of size 5 × 5, respectively. The first perception layer has 64 regression outputs and the final perception layer has 6 regression outputs. Our system considers 6 grasping type classes. 7.3.1.2 Grasping Type Recognition A number of grasping taxonomies have been proposed in several areas of re- search, including robotics, developmental medicine, and biomechanics, each focusing on different aspects of action. In a recent survey [47] reported 45 grasp types in the literature, of which only 33 were found valid. In this work, we use a categorization into six grasping types. First we distinguish, according to the most commonly used classification (based on functionality) into power and precision grasps [48]. 
Power grasping is used when the object needs to be held firmly in order to apply force, such as "grasping a knife to cut"; precision grasping is used for fine-grained actions that require accuracy, such as "pinching a needle". Among the power grasps, we further distinguish whether they are spherical or otherwise (usually cylindrical), and we distinguish the latter according to the grasping diameter into large diameter and small diameter ones. Similarly, we distinguish the precision grasps into large and small diameter ones. Additionally, we also consider a rest position (no grasping performed). Table 7.1 illustrates our grasp categories. We denote this list of six grasps as G in the remainder of the chapter.

Grasping Types   Small Diameter    Large Diameter    Spherical & Rest
Power            Power-Small       Power-Large       Power-Spherical
Precision        Precision-Small   Precision-Large   Rest

Table 7.1: The list of the grasping types.

The input to the grasping type recognition module is a gray-scale image patch around the target hand performing the grasping. We resize each patch to 32 × 32 pixels and subtract the global mean obtained from the training data. For each testing video with M frames, we pass the target hand patches (left hand and right hand, if present) through the network frame by frame, and obtain an output of size 6 × M. We sum it up along the temporal dimension and then normalize it. We use the classification for both hands to obtain (GraspType1) for the left hand and (GraspType2) for the right hand. For a video of M frames, the grasping type recognition system thus outputs two belief distributions of size 6 × 1: PGraspType1 and PGraspType2.

7.3.1.3 Object Recognition and Corpus-Guided Action Prediction

The input to the object recognition module is an RGB image patch around the target object. We resize each patch to 32 × 32 × 3 pixels, and we subtract the global mean obtained from the training data. Similar to the grasping type recognition module, we also use a seven layer CNN. The network structure is the same as before, except that the final perception layer has 48 regression outputs. Our system considers 48 object classes, and we denote this candidate object list as O in the rest of the chapter. Table 7.2 lists the object classes.

apple, blender, bowl, bread, brocolli, brush, butter, carrot, chicken, chocolate, corn, creamcheese, croutons, cucumber, cup, doughnut, egg, fish, flour, fork, hen, jelly, knife, lemon, lettuce, meat, milk, mustard, oil, onion, pan, peanutbutter, pepper, pitcher, plate, pot, salmon, salt, spatula, spoon, spreader, steak, sugar, tomato, tongs, turkey, whisk, yogurt.

Table 7.2: The list of the objects considered in our system.

For each testing video with M frames, we pass the target object patches through the network frame by frame and get an output of size 48 × M. We sum it up along the temporal dimension and then normalize it. We classify two objects in the image: (Object1) and (Object2). At the end of classification, the object recognition system outputs two belief distributions of size 48 × 1: PObject1 and PObject2.

We also need the 'Action' that was performed. Due to the large variations in the video, the visual recognition of actions is difficult. Our system bypasses this problem by using a trained language model, which predicts the most likely verb (Action) associated with the objects (Object1, Object2). In order to do this prediction, we need a set of candidate actions V. Here, we consider the ten most common actions in cooking scenarios: (Cut, Pour, Transfer, Spread, Grip, Stir, Sprinkle, Chop, Peel, Mix).
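Before turning to the corpus statistics, the CNN architecture and per-frame belief aggregation described above can be summarized with a short sketch. The thesis trained its networks with a GPU-based Caffe implementation [54]; the PyTorch stand-in below is only illustrative, and the padding, pooling and activation choices are assumptions not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspTypeCNN(nn.Module):
    """Sketch of the seven-layer network of Section 7.3.1.1: three 5x5
    convolution layers (32, 32, 64 filters) and two perception layers,
    the last one with 6 outputs (one per grasping type).  The object
    network is identical except for 3 input channels and 48 outputs."""
    def __init__(self, in_channels=1, num_classes=6):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=5, padding=2)
        self.fc1 = nn.Linear(64 * 4 * 4, 64)   # 32x32 input, three 2x2 poolings -> 4x4
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):                      # x: (batch, in_channels, 32, 32) patches
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = F.max_pool2d(F.relu(self.conv3(x)), 2)
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)                     # raw scores, one per class

def belief_over_video(model, patches):
    """Aggregate per-frame outputs into one belief distribution, as in
    Sections 7.3.1.2 and 7.3.1.3: classify every patch of an M-frame
    video, sum along the temporal dimension, then normalize."""
    model.eval()
    with torch.no_grad():
        per_frame = torch.softmax(model(patches), dim=1)   # (M, num_classes)
        summed = per_frame.sum(dim=0)
        return summed / summed.sum()                       # e.g. P_GraspType1
```

The object recognition module is aggregated in exactly the same way, yielding P_Object1 and P_Object2 over the 48 object classes.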
The corpus-guided prediction technique used here was applied before to a larger set of candidate actions [25]. We compute from the Gigaword corpus [76] the probability of a verb occurring given the detected nouns, P(Action | Object1, Object2). We do this by computing the log-likelihood ratio [102] of trigrams (Object1, Action, Object2), computed from the sentences in the English Gigaword corpus [76]. Concretely, we extract only the words in the corpus that are defined in O and V (including their synonyms), which gives a reduced corpus sequence from which we obtain our target trigrams. The log-likelihood ratios computed for all possible trigrams are then normalized to obtain P(Action | Object1, Object2). For each testing video, we can then compute a belief distribution over the candidate action set V of size 10 × 1 as:

P_Action = ∑_{Object1 ∈ O} ∑_{Object2 ∈ O} P(Action | Object1, Object2) × P_Object1 × P_Object2.   (7.1)

7.3.2 From Recognitions to Action Trees

The output of our visual system consists of belief distributions over the object categories, grasping types, and actions. However, these are not sufficient for executing actions. The robot also needs to understand the hierarchical and recursive structure of the action. We argue that grammar trees, similar to those used in linguistic analysis, are a good representation for capturing the structure of actions. Therefore we integrate our visual system with a manipulation action grammar based parsing module [30]. Since the output of our visual system is probabilistic, we extend the grammar to a probabilistic one and apply the Viterbi probabilistic parser to select the parse tree with the highest likelihood among the possible candidates.

7.3.2.1 Manipulation Action Grammar

We made two extensions to the original manipulation grammar [30]: (i) since grasping is conceptually different from other actions, and our system employs a CNN based recognition module to extract the grasping type, we assign an additional nonterminal symbol G to represent the grasp; (ii) to accommodate the probabilistic output from the processing of unconstrained videos, we extend the manipulation action grammar into a probabilistic one.

The design of this grammar is motivated by three observations: (i) hands are the main driving force in manipulation actions, so a specialized nonterminal symbol H is used for their representation; (ii) an action (A) or a grasping (G) can be applied to an object (O) directly or to a hand phrase (HP), which in turn contains an object (O), as encoded in Rule (1), which builds up an action phrase (AP); (iii) an action phrase (AP) can be combined either with the hand (H) or with a hand phrase, as encoded in Rule (2), which recursively builds up the hand phrase. The rules shown in Table 7.3 form the syntactic rules of the grammar.

To make the grammar probabilistic, we first treat each sub-rule in Rules (1) and (2) equally, and assign equal probability to each sub-rule. With regard to the hand H in Rule (3), we only consider a robot with two effectors (arms), and assign equal probability to 'LeftHand' and 'RightHand'. For the terminal rules (4-8), we assign the normalized belief distributions (PObject1, PObject2, PGraspType1, PGraspType2, PAction) obtained from the visual processes to each candidate object, grasping type and action.
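Returning briefly to Equation (7.1), the action belief is just a marginalization of the corpus-derived conditional over the two object beliefs. A minimal sketch follows; the array names are illustrative, and the conditional table itself would be filled with the normalized Gigaword trigram log-likelihood ratios described above.

```python
import numpy as np

def predict_action_belief(p_action_given_objects, p_object1, p_object2):
    """Equation (7.1):
    P_Action[a] = sum_{o1, o2} P(a | o1, o2) * P_Object1[o1] * P_Object2[o2].

    p_action_given_objects : shape (|V|, |O|, |O|), normalized trigram scores
    p_object1, p_object2   : belief distributions over O, each of shape (|O|,)
    """
    p_action = np.einsum('aij,i,j->a', p_action_given_objects, p_object1, p_object2)
    return p_action / p_action.sum()   # belief over the 10 candidate actions
```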
AP → G1 O1 | G2 O2 | A O2 | A HP    [0.25 each]     (1)
HP → H AP | HP AP                   [0.5 each]      (2)
H  → 'LeftHand' | 'RightHand'       [0.5 each]      (3)
G1 → 'GraspType1'                   [PGraspType1]   (4)
G2 → 'GraspType2'                   [PGraspType2]   (5)
O1 → 'Object1'                      [PObject1]      (6)
O2 → 'Object2'                      [PObject2]      (7)
A  → 'Action'                       [PAction]       (8)

Table 7.3: A Probabilistic Extension of the Manipulation Action Context-Free Grammar.

7.3.2.2 Parsing and tree generation

We use a bottom-up variation of the probabilistic context-free grammar parser that uses dynamic programming (best known as the Viterbi parser [164]) to find the most likely parse for an input visual sentence. The Viterbi parser parses the visual sentence by filling in the most likely constituent table, using the grammar introduced in Table 7.3. For each testing video, our system outputs the most likely parse tree of the specific manipulation action. By reverse parsing the tree structure, the robot can derive an action plan for execution. Figure 7.3 shows sample output trees, and Table 7.4 shows the final control commands generated by reverse parsing.

7.4 Experiments

The theoretical framework we have presented suggests two hypotheses that deserve empirical tests: (a) the CNN based object recognition module and the grasping type recognition module can robustly classify input frame patches from unconstrained videos into the correct class labels; (b) the integrated system, using the Viterbi parser with the probabilistic extension of the manipulation action grammar, can robustly generate a sequence of execution commands.

To test the two hypotheses empirically, we need to define a set of performance variables and how they relate to our predicted results. The first hypothesis relates to visual recognition, and we test it empirically by measuring precision and recall, comparing the detected object and grasping type labels with the ground truth labels. The second hypothesis relates to execution command generation, and we test it empirically by comparing the generated command predicates with the ground truth ones on the testing videos. To validate our system, we conducted experiments on an extended version of a publicly available unconstrained cooking video dataset (YouCook) [53].

7.4.1 Dataset and experimental settings

Cooking is an activity, requiring a variety of manipulation actions, that future service robots will most likely need to learn. We conducted our experiments on a publicly available cooking video dataset collected from the WWW and fully labeled, called the YouTube cooking dataset (YouCook) [53]. The data was prepared from 88 open-source YouTube cooking videos recorded from an unconstrained third-person view. Frame-by-frame object annotations are provided for 49 of the 88 videos. These features make it a good empirical testing bed for our hypotheses.

We conducted our experiments using the following protocol: (1) 12 video clips, each containing one typical kitchen action, are reserved for testing; (2) all other video frames are used for training; (3) we randomly reserve 10% of the training data as a validation set for training the CNNs.

For training the grasping type CNN, we extended the dataset by annotating image patches containing hands in the training videos. The image patches were converted to gray-scale and then resized to 32 × 32 pixels. The training set contains 1525 image patches labeled with the six grasping types. We used a GPU based CNN implementation [54] to train the neural network, following the structures described above.
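As a sketch of what such a training run looks like (hyperparameters, batch size, and the PyTorch stand-in from the earlier sketch are assumptions; the thesis trained with the GPU-based implementation of [54]):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_grasp_cnn(model, patches, labels, epochs=30, lr=0.01):
    """SGD training on labeled hand patches, following Section 7.3.1.1.
    patches: float tensor of shape (N, 1, 32, 32), mean-subtracted
    labels:  long tensor of shape (N,) with grasping type indices 0..5"""
    loader = DataLoader(TensorDataset(patches, labels), batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # forward pass
            loss.backward()                 # back-propagation
            optimizer.step()
    return model
```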
For training the object recognition CNN, we first extracted annotated image patches from the labeled training videos and then resized them to 32 × 32 × 3. We used the same GPU based CNN implementation to train the neural network, following the structures described above.

For localizing hands on the testing data, we first applied the hand detector from [2] and picked the top two hand patch proposals (left hand and right hand, if present). For objects, we trained general object detectors from the labeled training data using techniques from [165]. Furthermore, we associated each candidate object patch with the left or right hand, depending on which had the smaller Euclidean distance.

7.4.2 Grasping Type and Object Recognition

On the reserved 10% validation data, the grasping type recognition module achieved an average precision of 77% and an average recall of 76%, and the object recognition module achieved an average precision of 93% and an average recall of 93%. Figure 7.2 shows the confusion matrices for grasping type and object recognition, respectively. The figure shows that the recognition is robust.

Figure 7.2: Confusion matrices. Left: grasping type; right: object.

The performance of the object and grasping type recognition modules is also reflected in the commands that our system generated from the testing videos. We observed an overall recognition accuracy of 79% on objects, of 91% on grasping types and of 83% on predicted actions (see Table 7.4). It is worth mentioning that in the generated commands the object recognition performance drops, because some of the objects in the testing sequences, such as "Tofu", have no training data. The grasping type classification performance goes up, because we sum the grasping type belief distributions over the frames, which helps to smooth out wrong labels. The performance metrics reported here empirically support our hypothesis (a).

7.4.3 Visual Sentence Parsing and Commands Generation for Robots

Following the probabilistic action grammar from Table 7.3, we built upon the implementation of the Viterbi parser from the Natural Language Toolkit (NLTK) [166] to generate the single most likely parse tree from the probabilistic visual sentence input. Figure 7.3 shows sample visual processing outputs and the final parse trees obtained using our integrated system. Table 7.4 lists the commands generated by our system on the reserved 12 testing videos, shown together with the ground truth commands (LH: LeftHand; RH: RightHand; PoS: Power-Small; PoL: Power-Large; PoP: Power-Spherical; PrS: Precision-Small; PrL: Precision-Large). The overall percentage of correct commands is 68%. Note that we considered a command predicate wrong if any of the object, grasping type or action was recognized incorrectly. The performance metrics reported here empirically support our hypothesis (b).
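To make the parsing step concrete before turning to the qualitative results in Figure 7.3 and Table 7.4, the following sketch builds a toy version of the grammar of Table 7.3 with NLTK's PCFG and Viterbi parser. The belief values and the tiny candidate sets are made up for illustration; the real system spreads the CNN and language-model belief distributions over all 6 grasping types, 48 objects and 10 actions.

```python
from nltk.grammar import PCFG
from nltk.parse import ViterbiParser

# Toy probabilistic manipulation action grammar in the spirit of Table 7.3.
# Terminal probabilities stand in for the (normalized) visual beliefs.
grammar = PCFG.fromstring("""
HP -> H AP [0.5] | HP AP [0.5]
AP -> G1 O1 [0.25] | G2 O2 [0.25] | A O2 [0.25] | A HP [0.25]
H  -> 'LeftHand' [0.5] | 'RightHand' [0.5]
G1 -> 'PowerSmall' [0.9] | 'PrecisionSmall' [0.1]
G2 -> 'PrecisionSmall' [0.8] | 'PowerLarge' [0.2]
O1 -> 'Knife' [0.85] | 'Spoon' [0.15]
O2 -> 'Lemon' [0.7] | 'Bowl' [0.3]
A  -> 'Cut' [0.6] | 'Chop' [0.4]
""")

# One visual sentence: (LeftHand GraspType1 Object1 Action RightHand GraspType2 Object2)
visual_sentence = ['LeftHand', 'PowerSmall', 'Knife', 'Cut',
                   'RightHand', 'PrecisionSmall', 'Lemon']

parser = ViterbiParser(grammar)
for tree in parser.parse(visual_sentence):   # yields the single most likely parse
    print(tree)
    print('probability:', tree.prob())
```

Reverse parsing the resulting tree, i.e. reading off the grasp, object and action leaves in order, yields command predicates of the form listed in Table 7.4, e.g. Grasp PoS(LH, Knife), Grasp PrS(RH, Lemon), Action Cut(Knife, Lemon).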
Figure 7.3: Upper row: input unconstrained video frames; lower left: color coded (see legend at the bottom) visual recognition output frame by frame along the timeline; lower right: the most likely parse tree generated for each clip.

1. Ground truth: Grasp PoS(LH, Knife); Grasp PrS(RH, Tofu); Action Cut(Knife, Tofu)
   Learned:      Grasp PoS(LH, Knife); Grasp PrS(RH, Bowl*); Action Cut(Knife, Bowl*)
2. Ground truth: Grasp PoS(LH, Blender); Grasp PrL(RH, Bowl); Action Blend(Blender, Bowl)
   Learned:      Grasp PoS(LH, Bowl*); Grasp PoL*(RH, Bowl); Action Pour*(Bowl*, Bowl)
3. Ground truth: Grasp PoS(LH, Tongs); Action Grip(Tongs, Chicken)
   Learned:      Grasp PoS(LH, Chicken*); Action Cut*(Chicken*, Chicken)
4. Ground truth: Grasp PoS(LH, Brush); Grasp PrS(RH, Corn); Action Spread(Brush, Corn)
   Learned:      Grasp PoS(LH, Brush); Grasp PrS(RH, Corn); Action Spread(Brush, Corn)
5. Ground truth: Grasp PoS(LH, Tongs); Action Grip(Tongs, Steak)
   Learned:      Grasp PoS(LH, Tongs); Action Grip(Tongs, Steak)
6. Ground truth: Grasp PoS(LH, Spreader); Grasp PrL(RH, Bread); Action Spread(Spreader, Bread)
   Learned:      Grasp PoS(LH, Spreader); Grasp PrL(RH, Bowl*); Action Spread(Spreader, Bowl*)
7. Ground truth: Grasp PoL(LH, Mustard); Grasp PrS(RH, Bread); Action Spread(Mustard, Bread)
   Learned:      Grasp PoL(LH, Mustard); Grasp PrS(RH, Bread); Action Spread(Mustard, Bread)
8. Ground truth: Grasp PoS(LH, Spatula); Grasp PrS(RH, Bowl); Action Stir(Spatula, Bowl)
   Learned:      Grasp PoS(LH, Spatula); Grasp PrS(RH, Bowl); Action Stir(Spatula, Bowl)
9. Ground truth: Grasp PoL(LH, Pepper); Grasp PoL(RH, Pepper); Action Sprinkle(Pepper, Bowl)
   Learned:      Grasp PoL(LH, Pepper); Grasp PoL(RH, Pepper); Action Sprinkle(Pepper, Pepper*)
10. Ground truth: Grasp PoS(LH, Knife); Grasp PrS(RH, Lemon); Action Cut(Knife, Lemon)
    Learned:      Grasp PoS(LH, Knife); Grasp PrS(RH, Lemon); Action Cut(Knife, Lemon)
11. Ground truth: Grasp PoS(LH, Knife); Grasp PrS(RH, Broccoli); Action Cut(Knife, Broccoli)
    Learned:      Grasp PoS(LH, Knife); Grasp PoL*(RH, Broccoli); Action Cut(Knife, Broccoli)
12. Ground truth: Grasp PoS(LH, Whisk); Grasp PrL(RH, Bowl); Action Stir(Whisk, Bowl)
    Learned:      Grasp PoS(LH, Whisk); Grasp PrL(RH, Bowl); Action Stir(Whisk, Bowl)

Overall recognition accuracy: Object 79%; Grasping type 91%; Action 83%. Overall percentage of correct commands: 68%.

Table 7.4: Ground truth and learned commands for the 12 reserved testing videos; incorrectly learned entities are marked with an asterisk.

7.4.4 Discussion

The performance metrics reported in the experiment section empirically support our hypotheses that: (1) our system is able to robustly extract visual sentences with high accuracy; (2) our system can learn atomic action commands with few errors compared to the ground-truth commands. We believe this preliminary integrated system raises hope for a fully intelligent robot for manipulation tasks that can automatically enrich its own knowledge resources by "watching" recordings from the World Wide Web.

Chapter 8: Concluding Remarks and Future Work

8.1 Concluding Remarks

In Chapter 2, the experiments produced three results: (i) we achieved on average 59% accuracy using the CNN based method for grasp type recognition from unconstrained image patches; (ii) we achieved on average 65% prediction accuracy in inferring human intention using the grasp type only; (iii) using the grasp type temporal evolution, we achieved 78% recall and 80% precision in fine-grained manipulation action segmentation tasks.
Overall, the empirical results support our hypotheses (a-c), respectively. Recognizing the grasp type and using it for inference of human action intention and for fine level segmentation of human manipulation actions are novel problems in computer vision. We have proposed a CNN based learning framework that addresses these problems with decent success. We hope our contributions can help advance the fields of static scene understanding and fine level analysis of human actions, and we hope that they can be useful to other researchers in other applications. Additionally, we augmented a currently available hand data set and a cooking data set with grasp type labels, and provided human action intention labels for a subset of them, for future research.

Chapter 3 has shown a principled approach to integrating large scale language corpora for the purpose of action recognition in videos involving hand-tools. We validated our approach in both supervised and unsupervised scenarios and outperformed the current state-of-the-art STIP+BoW features significantly. These results demonstrate the strength of using language, which encodes the intrinsic relationships between tools and actions, to aid the action recognition task. In that chapter, we also introduced a computationally feasible framework that integrates visual perception with semantic grounding obtained from a large textual corpus for the purpose of generating a descriptive sentence for an image. Experimental results show that our approach produces sentences that are both relevant and readable.

A system for detecting action consequences and classifying videos of manipulation actions according to action consequences has been proposed in Chapter 4. A dataset has been provided, which includes both data that we collected and eligible manipulation action video sequences from other publicly available datasets. Experiments were performed that validate our method and, at the same time, point out several weaknesses for future improvement.

In Chapter 5, we presented a cognitive system for understanding human manipulation actions. The system integrates vision modules that ground semantically meaningful events in perceptual input with a reasoning module based on a context-free grammar and associated parsing algorithms, which dynamically build the sequence of structural representations. Experimental results showed that the cognitive system can extract the key events from the raw input and can interpret the observations by generating a sequence of tree structures.

In Chapter 6 we presented a formal computational framework for modeling manipulation actions based on a Combinatory Categorial Grammar. An empirical study on a large manipulation action dataset validates that 1) with the introduced formalism, a learning system can be devised to deduce the semantic meaning of manipulation actions in λ-schema; 2) with the learned schema and several common-sense axioms, our system is able to reason beyond mere observation and deduce "hidden" action consequences, yielding a decent performance improvement.

In Chapter 7 we presented an approach to learning manipulation action plans from unconstrained videos for cognitive robots. Two convolutional neural network based recognition modules (for grasping types and objects, respectively), as well as a language model for action prediction, compose the lower level of the approach.
The probabilistic manipulation action grammar based Viterbi parsing module sits at the higher level, and its goal is to generate atomic commands in predicate form. We conducted experiments on a cooking dataset consisting of unconstrained demonstration videos. From the performance on this challenging dataset, we conclude that our system is able to recognize and generate action commands robustly.

8.2 Future Work

Our experiments in Chapter 2 indicate that there is still significant room for improving the recognition of grasp type and the inference of human intention. We believe that advances in understanding the high-level cognitive structure underlying human intention can help improve the performance. With the development of deep learning systems and more data, we can also expect a robust grasp type recognition system beyond the seven categories used. Moreover, we believe that progress in natural language processing, such as mining the relationship between grasp types and actions, can advance high-level reasoning about human action intention and thereby improve computer vision methods.

Our approach presented in Chapter 3 goes beyond action recognition and, with a more careful treatment of language, can be extended to other vision problems faced in robotics [167]. Progress in the areas of object recognition, image segmentation and general scene understanding has been slow, as these problems require semantic grounding. Language, when exploited properly, provides for this: for example, using shallow parsing or named-entity recognition to improve PL predictions and subsequently Fd, or performing a dependency parse to reduce the need to use synonyms to extract the tool and related verb from a sentence more accurately. An important limitation of our current approach is that we need to know the action labels and tools of the video in advance. We are currently working on approaches to discover, using attributes of the potential tool and action features obtained from the video, a prediction of the tool and action labels directly from language. The potential world-knowledge embedded in language, along with its complexities, was clearly demonstrated by Watson in Jeopardy!, which set a milestone in AI. We believe it will do the same for vision and robotics in the near future.

There are instances where our strategy presented in Chapter 3 fails to predict the appropriate verbs or nouns (see Fig. 3.18). This is because object/scene detections can be wrong, and noise from the corpus itself remains a problem. Compared to human gold standards, therefore, much work still remains in terms of detecting these objects and scenes with high precision. Currently, at most two object classes are used to generate simple sentences, which was shown in the results to have penalized the relevance score of our approach. This can be addressed by designing more complex HMMs that handle larger numbers of object and verb classes. Another interesting direction for future work would be to detect salient objects, learned from training images plus corpus or eye-movement data, and to verify whether these objects aid in improving the descriptive sentences we generate. Another potential application of representing images using T∗ is that we can easily sort and retrieve images that are similar in terms of their semantic content. This would enable us to retrieve, for example, more relevant images given a verbal search query such as {ride, sit, fly}, returning images where these verbs are found in T∗.
Some results of retrieved images based on their verbal components are shown in Fig. 8.1: many images with dissimilar visual content are correctly classified based on their semantic meaning.

Figure 8.1: Images retrieved from 3 verbal search terms: ride, sit, fly.

In Chapter 4, to avoid the influence of the manipulating hands, especially occlusions caused by the hands, a hand detection and segmentation algorithm can be applied. We can then design a hallucination process to complete the contour of the occluded object under manipulation. Preliminary results are shown in Fig. 8.2. However, resolving the ambiguity between occlusion and deformation from visual analysis is a difficult task that requires further attention.

Figure 8.2: A hallucination process of contour completion (paint stone sequence in MAC 1.0). Left: original segments; middle: contour hallucination with second order polynomial fitting (green lines); right: final hallucinated contour.

In Chapter 5, since the grammar does not assume constraints such as the number of operators, it can be further adapted to process scenarios with multiple agents performing complicated manipulation actions, once the corresponding perception tools have been developed. Moreover, we also plan to investigate operations that enable the system to reason during observation. After the system observes a significant number of manipulation actions, it can build a database of all sequences of trees [168]. By querying this database, we expect the system to predict things such as which object will be manipulated next or which action will follow. Also, the action trees could be learned not only from observation but also from language resources, such as dictionaries, recipes, manuals, etc. This link to computational linguistics constitutes an interesting avenue for future research.

In Chapter 6, due to the limitations of our current testing scenarios, we conducted experiments considering only a relatively small set of seed lexicon rules and logical expressions. Nevertheless, the presented CCG framework can also be extended to learn the formal logic representation of more complex manipulation action semantics. For example, the temporal order of manipulation actions can be modeled by considering a seed rule such as AP\AP : λf.λg.before(f(·), g(·)), where before(·, ·) is a temporal predicate. For actions we consider the seed main type (AP\NP)/NP. For more general manipulation scenarios, based on whether the action is transitive or intransitive, the main types of actions can be extended to include AP\NP. Moreover, the logical expressions can also be extended to include universal quantification ∀ and existential quantification ∃. Thus, a manipulation action such as "knife cut every tomato" can be parsed into a representation such as ∀x.tomato(x) ∧ cut(knife, x) → divided(x) (the parse is given in the following chart). Here, the concept "every" has the main type NP\NP and the semantic meaning ∀x.f(x). The same framework can also be extended to include other combinatory rules such as composition and type-raising [154].
Knife  := NP : knife
Cut    := (AP\NP)/NP : λx.λy.cut(x, y) → divided(y)
every  := NP\NP : ∀x.f(x)
Tomato := NP : tomato

every Tomato               ⇒ (>)  NP : ∀x.tomato(x)
Cut [every Tomato]         ⇒ (>)  AP\NP : ∀y.λx.tomato(y) ∧ cut(x, y) → divided(y)
Knife [Cut every Tomato]   ⇒ (<)  AP : ∀y.tomato(y) ∧ cut(knife, y) → divided(y)

8.3 Final Remarks

The presented framework enables an intelligent agent to predict and reason about action goals from observation, and thus has many potential applications, such as human intention prediction, robot action policy planning and human-robot collaboration. We believe that our formalism of manipulation actions bridges computational linguistics, vision and robotics, and opens further research in Artificial Intelligence and Robotics. As the robotics industry moves towards robots that function safely, effectively and autonomously to perform tasks in real-world unstructured environments, these robots will need to be able to understand the meaning of manipulation actions and acquire human-like common-sense reasoning capabilities (please refer to [169, 170] for pilot studies of scene understanding with common-sense reasoning and knowledge).

Some parts of this thesis have been integrated into a pilot robotic system running on a Baxter research humanoid that visually learns manipulation actions (such as making a drink) from observing a human performing them (a live recording of the system can be found at http://www.umiacs.umd.edu/~yzyang/WaitWhatDemo). These direct applications of the techniques presented in this thesis could potentially 1) reduce the costly reprogramming time needed to teach industrial or domestic robots new tasks, 2) increase the level of flexibility and adaptivity of current robotic systems, and 3) enrich robots' procedural knowledge through a continuous learning process.

Bibliography

[1] John McCarthy, Marvin L Minsky, Nathaniel Rochester, and Claude E Shannon. A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Magazine, 27(4):12, 2006.

[2] Arpit Mittal, Andrew Zisserman, and Philip HS Torr. Hand detection using multiple proposals. In BMVC, pages 1–11. Citeseer, 2011.

[3] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Real-time tracking of non-rigid objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 142–149. IEEE, 2000.

[4] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.

[5] Antonio Torralba, Kevin P. Murphy, William T. Freeman, and Mark A. Rubin. Context-based vision system for place and object recognition. In ICCV, pages 273–280. IEEE Computer Society, 2003.

[6] Chenglong Bao, Yi Wu, Haibin Ling, and Hui Ji. Real time robust L1 tracker using accelerated proximal gradient approach. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1830–1837. IEEE, 2012.

[7] Ibrahim Adalbert Kapandji and LH Honoré. The physiology of the joints. Churchill Livingstone, Boca Raton, FL, 2007.

[8] E. E. Aksoy and F. Wörgötter. Semantic decomposition and recognition of long and complex manipulation action sequences. Computer Vision and Image Understanding (CVIU), under review, 2015.

[9] Giacomo Rizzolatti, Leonardo Fogassi, and Vittorio Gallese. Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2(9):661–670, 2001.
180 [10] V Gazzola, G Rizzolatti, B Wicker, and C Keysers. The anthropomorphic brain: the mirror neuron system responds to human and robotic actions. Neu- roimage, 35(4):1674–1684, 2007. [11] Hedvig Kjellstro¨m, Javier Romero, David Mart´ınez, and Danica Kragic´. Si- multaneous visual recognition of manipulation actions and manipulated ob- jects. In Computer Vision–ECCV 2008, pages 336–349. Springer Berlin Hei- delberg, 2008. [12] Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, and Bernt Schiele. Recognizing fine- grained and composite activities using hand-centric features and script data. International Journal of Computer Vision, pages 1–28, 2015. [13] Bingbing Ni, Vignesh R Paramathayalan, and Philippe Moulin. Multiple gran- ularity analysis for fine-grained action detection. In Computer Vision and Pat- tern Recognition (CVPR), 2014 IEEE Conference on, pages 756–763. IEEE, 2014. [14] E.E. Aksoy, A. Abramov, J. Do¨rr, K. Ning, B. Dellen, and F. Wo¨rgo¨tter. Learning the semantics of object–action relations by observation. The Inter- national Journal of Robotics Research, 30(10):1229–1249, 2011. [15] E E. Aksoy, M. Tamosiunaite, and F. Wo¨rgo¨tter. Model-free incremental learning of the semantics of manipulation actions. Robotics and Autonomous Systems, pages 1–42, 2014. [16] Karinne Ramirez-Amaro, Michael Beetz, and Gordon Cheng. Transferring skills to humanoid robots by extracting semantic representations from obser- vations of human activities. Artificial Intelligence, pages –, 2015. [17] K. Pastra and Y. Aloimonos. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1585):103–117, 2012. [18] D. Summers-Stay, C.L. Teo, Y. Yang, C. Fermu¨ller, and Y. Aloimonos. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4104–4111, Vilamoura, Portugal, 2013. IEEE. [19] Kyuhwa Lee, Yanyu Su, Tae-Kyun Kim, and Yiannis Demiris. A syntactic approach to robot imitation learning using probabilistic activity grammars. Robotics and Autonomous Systems, 61(12):1323–1334, 2013. [20] N. Siddharth, A. Barbu, and J. M. Siskind. Seeing unseeability to see the unseeable. Advances in Cognitive Systems (ACS), 2:77–94, December 2012. [21] Yiannis Aloimonos. Active perception. Psychology Press, 2013. 181 [22] N. Chomsky. Syntactic Structures. Mouton de Gruyter, 1957. [23] Yezhou Yang, Cornelia Fermuller, Yi Li, and Yiannis Aloimonos. Grasp type revisited: A modern perspective on a classical feature for vision. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015. [24] C.L. Teo, Y. Yang, H. Daume, C. Fermu¨uller, and Y. Aloimonos. Towards a watson that sees: Language-guided action recognition for robots. In Proceed- ings of the 2012 IEEE International Conference on Robotics and Automation, pages 374–381, Saint Paul, MN, 2012. IEEE. [25] Yezhou Yang, Ching Lik Teo, Hal Daume´ III, and Yiannis Aloimonos. Corpus- guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 444–454. Asso- ciation for Computational Linguistics, 2011. [26] Yezhou Yang, Cornelia Fermu¨ller, and Yiannis Aloimonos. Detection of ma- nipulation action consequences (MAC). In Proceedings of the 2013 IEEE Con- ference on Computer Vision and Pattern Recognition, pages 2563–2570, Port- land, OR, 2013. 
IEEE. [27] Yezhou Yang, Cornelia Fermu¨ller, Yiannis Aloimonos, and Anupam Guha. A cognitive system for understanding human manipulation actions. Advances in Cognitive Systems, 3:67–86, 2014. [28] Yezhou Yang, Cornelia Fermuller, Yiannis Aloimonos, and Eren Erdal Aksoy. Learning the semantics of manipulation action. In The 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015. [29] Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. Robot learn- ing manipulation action plans by “watching” unconstrained videos from the world wide web. In The Twenty-Ninth AAAI Conference on Artificial Intelli- gence (AAAI-15), 2015. [30] Y. Yang, C. Fermuller, and Y. Aloimonos. A cognitive system for human manipulation action understanding. In the Second Annual Conference on Ad- vances in Cognitive Systems (ACS), 2013. [31] Anupam Guha, Yezhou Yang, Cornelia Fermu¨ller, and Yiannis Aloimonos. Minimalist plans for interpreting manipulation actions. Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5908–5914, 2013. [32] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient model-based 3D track- ing of hand articulations using Kinect. In Proceedings of the 2011 British Machine Vision Conference, pages 1–11, Dundee, UK, 2011. BMVA. 182 [33] Raoul Tubiana, Jean-Michel Thomine, and Evelyn Mackin. Examination of the Hand and the Wrist. CRC Press, 1998. [34] Zhenyao Mo and Ulrich Neumann. Real-time hand pose recognition using low-resolution depth images. In CVPR (2), pages 1499–1505, 2006. [35] Xenophon Zabulis, Haris Baltzakis, and Antonis Argyros. Vision-based hand gesture recognition for human-computer interaction. The Universal Access Handbook. LEA, 2009. [36] Karun B. Shimoga. Robot grasp synthesis algorithms: A survey. The Inter- national Journal of Robotics Research, 15(3):230–266, 1996. [37] Ashutosh Saxena, Justin Driemeyer, and Andrew Y Ng. Robotic grasping of novel objects using vision. The International Journal of Robotics Research, 27(2):157–173, 2008. [38] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. International Journal of Robotics Research, page to appear, 2014. [39] Dan Xie, Sinisa Todorovic, and Song-Chun Zhu. Inferring “dark matter” and “dark energy” from videos. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 2224–2231. IEEE, 2013. [40] Jungseock Joo, Weixin Li, Francis F Steen, and Song-Chun Zhu. Visual persuasion: Inferring communicative intents of images. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 216–223. IEEE, 2014. [41] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. Inferring the why in images. arXiv preprint arXiv:1406.5472, 2014. [42] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. [43] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005. [44] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A re- view and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013. [45] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. In NIPS 2012, 2013. [46] Dan Claudiu Ciresan, Ueli Meier, and Ju¨rgen Schmidhuber. 
Multi-column deep neural networks for image classification. In CVPR 2012, 2012. 183 [47] T. Feix, J. Romero, C. H. Ek, H Schmiedmayer, and D. Kragic. A Metric for Comparing the Anthropomorphic Motion Capability of Artificial Hands. Robotics, IEEE Transactions on, 29(1):82–93, February 2013. [48] Marc Jeannerod. The timing of natural prehension movements. Journal of motor behavior, 16(3):235–254, 1984. [49] Yann LeCun and Yoshua Bengio. The handbook of brain theory and neural networks. chapter Convolutional networks for images, speech, and time series, pages 255–258. MIT Press, Cambridge, MA, USA, 1998. [50] Felix Warneken and Michael Tomasello. Altruistic helping in human infants and young chimpanzees. science, 311(5765):1301–1303, 2006. [51] Dan Song, Nikolaos Kyriazis, Iason Oikonomidis, Chavdar Papazov, Antonis Argyros, Darius Burschka, and Danica Kragic. Predicting human intention in visual observations of hand/object interactions. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 1608–1615. IEEE, 2013. [52] John R Napier. The prehensile movements of the human hand. Journal of bone and joint surgery, 38(4):902–913, 1956. [53] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013. [54] Yangqing Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013. [55] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robot. Auton. Syst., 57:469–483, May 2009. [56] W.R. Schwartz, A. Kembhavi, D. Harwood, and L.S. Davis. Human detection using partial least squares analysis. In International Conference on Computer Vision, 2009. [57] Ivan Laptev and Tony Lindeberg. Space-time interest points. In ICCV, pages 432–439. IEEE Computer Society, 2003. [58] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In ICCV ’09: Proceedings of the Twelfth IEEE International Conference on Computer Vision, Washington, DC, USA, 2009. IEEE Computer Society. 184 [59] Pavan K. Turaga, Rama Chellappa, V. S. Subrahmanian, and Octavian Udrea. Machine recognition of human activities: A survey. IEEE Trans. Circuits Syst. Video Techn., 18(11):1473–1488, 2008. [60] Daniel Weinland, Remi Ronfard, and Edmond Boyer. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 115(2):224–241, 2010. [61] Ana Paula Brando Lopes, Eduardo Alves do Valle, Jussara Marques de Almeida, and Arnaldo Albuquerque de Arajo. Action recognition in videos: from motion capture labs to the web. (arXiv:1006.3506), Jun 2010. Preprint submitted to CVIU. [62] Benjamin Sapp, Alexander Toshev, and Ben Taskar. Cascaded models for articulated pose estimation. In ECCV, 2010. [63] P. Dolla´r, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, October 2005. [64] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV, pages 726–733, 2003. [65] L. Gorelick, M. Blank, E. Shechtman, R. Basri, and M. Irani. Actions as space-time shapes. PAMI, 29(12):2247–2253, 2007. [66] Alper Yilmaz and Mubarak Shah. Actions sketch: A novel action representa- tion. In CVPR, pages 984–989, 2005. 
[67] Alessandro Bissacco, Alessandro Chiuso, and Stefano Soatto. Classification and recognition of dynamical models: The role of phase, independent com- ponents, kernels and optimal transport. IEEE Trans. Pattern Anal. Mach. Intell., 29:1958–1972, November 2007. [68] Rizwan Chaudhry, Avinash Ravichandran, Gregory Hager, and Rene Vidal. Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dy- namical systems for the recognition of human actions. In CVPR, 2009. [69] Pinar Duygulu, Kobus Barnard, Joo F. G. de Freitas, and David A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed im- age vocabulary. In Anders Heyden, Gunnar Sparr, Mads Nielsen, and Peter Johansen, editors, ECCV (4), volume 2353 of Lecture Notes in Computer Sci- ence, pages 97–112. Springer, 2002. [70] Abhinav Gupta and Larry S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In David A. Forsyth, Philip H. S. Torr, and Andrew Zisserman, editors, ECCV (1), volume 5302 of Lecture Notes in Computer Science, pages 16–29. Springer, 2008. 185 [71] Tamara L. Berg, Alexander C. Berg, Jaety Edwards, and David A. Forsyth. Who’s in the picture? In NIPS, 2004. [72] L. Jie, B. Caputo, and V. Ferrari. Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In NIPS, editor, Advances in Neural Information Processing Systems, NIPS. NIPS, December 2009. [73] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In ECCV. 2008. [74] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. [75] A. Gupta, A. Kembhavi, and Larry S. Davis. Observing human-object in- teractions: Using spatial and functional compatibility for recognition. IEEE Trans on PAMI, 31(10):1775–1789, 2009. [76] D. Graff. English gigaword. In Linguistic Data Consortium, Philadelphia, PA, 2003. [77] P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004. [78] Thomas Brox, Christoph Bregler, and Jitendra Malik. Large displacement optical flow. In CVPR, pages 41–48. IEEE, 2009. [79] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. ”grabcut”: inter- active foreground extraction using iterated graph cuts. ACM Trans. Graph., 23(3):309–314, 2004. [80] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local svm approach. In ICPR, 2004. [81] Leonid Sigal, Alexandru O. Balan, and Michael J. Black. Humaneva: Synchro- nized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1- 2):4–27, 2010. [82] Abhinav Gupta and Larry S. Davis. Objects in action: An approach for com- bining action understanding and object perception. In CVPR. IEEE Computer Society, 2007. [83] F. De la Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada, and J. Macey. Guide to the carnegie mellon university multimodal activity (cmu-mmac) database. Technical report, CMU-RI-TR-08-22, Robotics Institute, Carnegie Mellon University, July 2009. 186 [84] Mark Hall, Eibe Frank, Geoffrey Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explo- rations, Volume 11, Issue 1, 2009. [85] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. 
The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results, 2008. [86] Cosimo Urgesi, Valentina Moro, Matteo Candidi, and Salvatore M Aglioti. Mapping implied body actions in the human motor system. J Neurosci, 26(30):7942–9, 2006. [87] Zoe Kourtzi. But still, it moves. Trends in Cognitive Sciences, 8(2):47 – 49, 2004. [88] Bangpeng Yao and Li Fei-Fei. Grouplet: a structured image representation for recognizing human and object interactions. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 2010. [89] Weilong Yang, Yang Wang, and Greg Mori. Recognizing human actions from still images with latent poses. In CVPR, 2010. [90] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001. [91] P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in expo- nential families. In International Conference on Machine Learning (ICML), 2009. [92] Gideon S. Mann and Andrew Mccallum. Simple, robust, scalable semi- supervised learning via expectation regularization. In The 24th International Conference on Machine Learning, 2007. [93] A. Kojima, M. Izumi, T. Tamura, and K. Fukunaga. Generating natural language description of human behavior from video images. In Pattern Recog- nition, 2000. Proceedings. 15th International Conference on, volume 4, pages 728 –731 vol.4, 2000. [94] Ali Farhadi, Seyyed Mohammad Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A. Forsyth. Every picture tells a story: Generating sentences from images. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, ECCV (4), volume 6314 of Lecture Notes in Computer Science, pages 15–29. Springer, 2010. [95] B.Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. I2t: Image parsing to text description. Proceedings of the IEEE, 98(8):1485 –1508, aug. 2010. 187 [96] David Traum, Michael Fleischman, and Eduard Hovy. Nl generation for virtual humans in a complex social environment. In In Proceedings of he AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue, pages 151–158, 2003. [97] Kathy McKeown. Query-focused summarization using text-to-text generation: When information comes from multilingual sources. In Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009), page 3, Suntec, Singapore, August 2009. Association for Computational Linguistics. [98] Dave Golland, Percy Liang, and Dan Klein. A game-theoretic approach to generating spatial descriptions. In Proceedings of EMNLP, 2010. [99] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Dis- criminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/ pff/latent-release4/, 2008. [100] Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C. Weng. A note on platt’s proba- bilistic outputs for support vector machines. Mach. Learn., 68:267–276, Oc- tober 2007. [101] Jinho D. Choi and Martha Palmer. Robust constituent-to-dependency conver- sion for english. In Proceedings of the 9th International Workshop on Treebanks and Linguistic Theories, pages 55–66, Tartu, Estonia, 2010. [102] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993. [103] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACLHLT, 2003. 
[104] David Zajic Bonnie and Bonnie Dorr. Bbn/umd at duc-2004: Topiary. In In Proceedings of the 2004 Document Understanding Conference (DUC 2004) at NLT/NAACL 2004, pages 112–119, 2004. [105] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi. Robust online appearance mod- els for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1296–1311, 2003. [106] V. Papadourakis and A. Argyros. Multiple objects tracking in the presence of long-term occlusions. Computer Vision and Image Understanding, 114(7):835– 846, 2010. [107] Gutemberg Guerra-Filho, Cornelia fermu¨ller, and Yiannis Aloimonos. Discov- ering a language for human activity. In Proceedings of the AAAI 2005 Fall Symposium on Anticipatory Cognitive Embodied Systems, Washington,DC, 2005. AAAI. 188 [108] T.B. Moeslund, A. Hilton, and V. Kru¨ger. A survey of advances in vision-based human motion capture and analysis. Computer vision and image understand- ing, 104(2):90–126, 2006. [109] P. Turaga, R. Chellappa, V.S. Subrahmanian, and O. Udrea. Machine recog- nition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473–1488, 2008. [110] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005. [111] G. Willems, T. Tuytelaars, and L. Van Gool. An efficient dense and scale- invariant spatio-temporal interest point detector. Proceedings of the 2008 IEEE European Conference on Computer Vision, pages 650–663, 2008. [112] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proceedings of the 2005 IEEE International Conference on Computer Vision, volume 2, pages 1395–1402, 2005. [113] A. Yilmaz and M. Shah. Actions sketch: A novel action representation. In Proceedings of the 2005 IEEE Intenational Conference on Computer Vision and Pattern Recognition, volume 1, pages 984–989, San Diego, CA, 2005. IEEE. [114] A. Kale, A. Sundaresan, AN Rajagopalan, N.P. Cuntoor, A.K. Roy- Chowdhury, V. Kruger, and R. Chellappa. Identification of humans using gait. IEEE Transactions on Image Processing, 13(9):1163–1173, 2004. [115] P. Saisan, G. Doretto, Y.N. Wu, and S. Soatto. Dynamic texture recognition. In Proceedings of the 2001 IEEE Intenational Conference on Computer Vision and Pattern Recognition, volume 2, pages 58–63, Kauai, HI, 2001. IEEE. [116] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of ori- ented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Proceedings of the 2009 IEEE In- tenational Conference on Computer Vision and Pattern Recognition, pages 1932–1939, Miami,FL, 2009. IEEE. [117] I.S. Vicente, V. Kyrki, D. Kragic, and M. Larsson. Action recognition and understanding through motor primitives. Advanced Robotics, 21(15):1687– 1707, 2007. [118] M. Sridhar, A.G. Cohn, and D.C. Hogg. Learning functional object-categories from a relational spatio-temporal representation. In Proceedings of the 2008 IEEE European Conference on Artificial Intelligence, pages 606–610, 2008. 189 [119] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Proceedings of the 2010 IEEE Intena- tional Conference on Computer Vision and Pattern Recognition, pages 17–24, 2010. [120] E.A. Locke and G.P. Latham. A theory of goal setting & task performance. Prentice-Hall, Inc, 1990. [121] A. Mishra, C. Fermu¨ller, and Y. Aloimonos. Active segmentation for robots. 
In Proceedings of the 20009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3133–3139, St. Louis,MO, 2009. IEEE. [122] B. Han, Y. Zhu, D. Comaniciu, and L.S. Davis. Visual tracking by contin- uous density propagation in sequential bayesian filtering framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):919–930, 2009. [123] J.K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13(3):423–469, 1990. [124] Y. Yang, M. Song, N. Li, J. Bu, and C. Chen. Visual attention analysis by pseudo gravitational field. In Proceedings of the 2009 ACM International Conference on Multimedia, pages 553–556. ACM, 2009. [125] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max- flow algorithms for energy minimization in vision. Pattern Analysis and Ma- chine Intelligence, IEEE Transactions on, 26(9):1124–1137, 2004. [126] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 2001 IEEE International Conference on Computer Vision, volume 2, pages 416–423, 2001. [127] K. Nummiaro, E. Koller-Meier, L. Van Gool, et al. A color-based particle filter. In First International Workshop on Generative-Model-Based Vision, volume 2002, page 01, 2002. [128] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. Proceedings of the 2004 IEEE European Conference on Computer Vision, pages 25–36, 2004. [129] C. Liu et al. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, MIT, 2009. [130] J. Neumann and etc. Localizing objects and actions in videos with the help of accompanying text. Final Report, JHU summer workshop, 2010. 190 [131] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of the 2011 IEEE International conference on Robotics and Automation, pages 1817–1824. IEEE, 2011. [132] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. Advances in neural information processing systems, pages 831–837, 2001. [133] Yi Li, Cornelia Fermu¨ller, Yiannis Aloimonos, and Hui Ji. Learning shift- invariant sparse representation of actions. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2630–2637, San Francisco,CA, 2010. IEEE. [134] Dariu M Gavrila. The visual analysis of human movement: A survey. Com- puter vision and image understanding, 73(1):82–98, 1999. [135] Piotr Dolla´r, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. Be- havior recognition via sparse spatio-temporal features. In 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pages 65–72, San Diego,CA, 2005. IEEE. [136] Liang Wang and David Suter. Learning and matching of dynamic shape man- ifolds for human action recognition. IEEE Transactions on Image Processing, 16(6):1646–1661, 2007. [137] Jezekiel Ben-Arie, Zhiqian Wang, Purvin Pandit, and Shyamsundar Rajaram. Human activity recognition using multidimensional indexing. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(8):1091–1104, 2002. [138] Changbo Hu, Qingfeng Yu, Yi Li, and Songde Ma. Extraction of parametric human model for posture recognition using genetic algorithm. 
In Proceedings of the 2000 IEEE International Conference on Automatic Face and Gesture Recognition, pages 518–523, Grenoble,France, 2000. IEEE. [139] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Sys- tems, 57(5):469–483, 2009. [140] Georgios E Fainekos, Hadas Kress-Gazit, and George J Pappas. Hybrid con- trollers for path planning: A temporal logic approach. In 44th IEEE Confer- ence on Decision and Control, pages 4885–4890, Philadelphia,PA, 2005. IEEE. [141] N. Dantam and M. Stilman. The motion grammar: Analysis of a linguistic method for robot control. Transactions on Robotics, 29(3):704–718, 2013. [142] F Wo¨rgo¨tter, Eren Erdal Aksoy, N Kruger, Justus Piater, Ales Ude, and Minija Tamosiunaite. A simple ontology of manipulation actions based on hand-object relations. IEEE Transactions on Autonomous Mental Develop- ment, pages 117–134, 2012. 191 [143] Javad Mohamad Aein, Eren Erdal Aksoy, Minija Tamosiunaite, Jeremie Pa- pon, Ales Ude, and Florentin Wo¨rgo¨tter. Toward a library of manipulation actions based on semantic object-action relations. In Proceedings of the 2013 EEE/RSJ International Conference on Intelligent Robots and Systems, pages 4555–4562, Tokyo,Japan, 2013. IEEE. [144] Matthew Brand. Understanding manipulation in video. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 94–99, Killington,VT, 1996. IEEE. [145] Michael S Ryoo and Jake K Aggarwal. Recognition of composite human ac- tivities through context-free grammar based representation. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1709–1718, New York City, NY, 2006. IEEE. [146] Yuri A. Ivanov and Aaron F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872, 2000. [147] Darnell Moore and Irfan Essa. Recognizing multitasked activities from video using stochastic context-free grammar. In Proceedings of the National Confer- ence on Artificial Intelligence, pages 770–776, Menlo Park, CA, 2002. AAAI. [148] Noam Chomsky. Lectures on government and binding: The Pisa lectures. Walter de Gruyter, 1993. [149] Roger C Schank and Larry Tesler. A conceptual dependency parser for natural language. In Proceedings of the 1969 conference on Computational linguistics, pages 1–3, Sanga-Saby, Sweden, 1969. Association for Computational Linguis- tics. [150] Vikram Manikonda, Perinkulam S Krishnaprasad, and James Hendler. Lan- guages, behaviors, hybrid architectures, and motion control. Springer, New York, 1999. [151] R Jackendorff. X-bar-syntax: A study of phrase structure. Linguistic Inquiry Monograph, 2, 1977. [152] Daniel H Younger. Recognition and parsing of context-free languages in time n3. Information and control, 10(2):189–208, 1967. [153] M. Nishigaki, C. fermu¨ller, and D. DeMenthon. The image torque operator: A new tool for mid-level vision. In Proceedings of the 2012 IEEE International Conference on Computer Vision and Pattern Recogntion, pages 502–509, Prov- idence,RI, 2012. IEEE. [154] Mark Steedman. Plans, affordances, and combinatory grammar. Linguistics and Philosophy, 25(5-6):723–753, 2002. 192 [155] Mark Steedman. The syntactic process, volume 35. MIT Press, 2000. [156] Luke S Zettlemoyer and Michael Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. 
In UAI, 2005. [157] Luke S Zettlemoyer and Michael Collins. Online learning of relaxed ccg gram- mars for parsing to logical form. In EMNLP-CoNLL, pages 678–687, 2007. [158] Raymond J Mooney. Learning to connect language and perception. In AAAI, pages 1598–1601, 2008. [159] Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. A joint model of language and perception for grounded attribute learning. In International Conference on Machine learning (ICML), 2011. [160] Stefanie Tellex, Pratiksha Thaker, Joshua Joseph, and Nicholas Roy. Learning perceptually grounded word meanings from unaligned parallel data. Machine Learning, 94(2):151–167, 2014. [161] Cynthia Matuszek, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. Learning from unscripted deictic gesture and language for human-robot interactions. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014. [162] Chitta Baral, Juraj Dzifcak, Marcos Alvarez Gonzalez, and Jiayu Zhou. Using inverse λ and generalization to translate english to formal languages. In Pro- ceedings of the Ninth International Conference on Computational Semantics, pages 35–44. Association for Computational Linguistics, 2011. [163] Moritz Tenorth, Johannes Ziegltrum, and Michael Beetz. Automated align- ment of specifications of everyday manipulation tasks. In IROS. IEEE, 2013. [164] Kenneth Ward Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the second conference on Applied natural language processing, pages 136–143. Association for Computational Linguis- tics, 1988. [165] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In IEEE CVPR, 2014. [166] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. ” O’Reilly Media, Inc.”, 2009. [167] Yezhou Yang, Ching L Teo, Cornelia Fermuller, and Yiannis Aloimonos. Robots with language: Multi-label visual recognition using nlp. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 4256–4262. IEEE, 2013. 193 [168] Yezhou Yang, Anupam Guha, Cornelia Fermu¨ller, and Yiannis Aloimonos. Manipulation action tree bank: A knowledge resource for humanoids. In IEEE/RAS International Conference on Humanoid Robots, 2014. [169] Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. Visual common-sense for scene understanding using perception, semantic parsing and reasoning. In The Twelfth International Symposium on Logical Formalization on Commonsense Reasoning, 2015. [170] Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. From images to sentences through scene description graphs us- ing commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292, 2015. 194