ABSTRACT

Title of dissertation: COGNITIVE ROBOTS FOR SOCIAL INTERACTION
Yi Li, PhD Student, Electrical and Computer Engineering
Dissertation directed by: Professor Yiannis Aloimonos, Computer Science

One of my goals is to work towards developing Cognitive Robots, especially with regard to improving the functionalities that facilitate interaction with human beings and their surrounding objects. Any cognitive system designed to serve human beings must be capable of processing social signals and must eventually enable efficient prediction and planning of appropriate responses.

My main focus during my PhD study is to bridge the gap between the motoric space and the visual space. The discovery of mirror neurons [RC04] shows that the visual perception of human motion (visual space) is directly associated with the motor control of the human body (motor space). This discovery poses a large number of challenges in different fields such as computer vision, robotics, and neuroscience. One of the fundamental challenges is understanding the mapping between 2D visual space and 3D motoric control, and further developing building blocks (primitives) of human motion in the visual space as well as in the motor space.

First, I present my study on the visual-motoric mapping of human actions. This study aims at mapping human actions in 2D videos to a 3D skeletal representation. Second, I present an automatic algorithm to decompose motion capture (MoCap) sequences into synergies along with the times at which they are executed (or "activated") for each joint. Third, I propose to use Granger Causality as a tool to study coordinated actions performed by at least two units. Recent scientific studies suggest that the above "action mirroring circuit" might be tuned to action coordination rather than single action mirroring. Fourth, I present the extraction of key poses in visual space. These key poses facilitate the further study of the "action mirroring circuit". I conclude the dissertation by describing the future of cognitive robotics research.

Cognitive Robots for Social Interaction
by Yi Li

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2010

Advisory Committee:
Professor Yiannis Aloimonos, Chair
Professor John Baras
Professor Min Wu
Doctor Cornelia Fermuller
Professor Gang Qu
Professor David Jacobs, Dean's representative

© Copyright by Yi Li 2010

Acknowledgements

This dissertation would not have been possible without the advice of Professor Yiannis Aloimonos and Doctor Cornelia Fermuller. They helped me in all possible ways they could, and they are definitely among the greatest advisors I have ever seen. I am also indebted to many of my labmates who supported me tremendously. They include Kostas Bitsakos, Justin Domke, Ji Hui, Ching Lik Teo, Yezhou Yang, Xiaodong Yu (in alphabetic order), and many others. I also owe my deepest gratitude to my wife, Haizan Zeng, for her endless support and for always being on my side.

Contents

List of Figures vii
List of Tables xiii
1 Introduction 1
1.1 Cognitive Functions for Human Interaction . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Related work 5
2.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Brief History of Recording and Measuring Human Motion . . . . . . . . . .
5 2.3 Human Pose Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.1 Pose Analysis in Videos . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.1.1 Pose Estimation in Videos . . . . . . . . . . . . . . . . . . . 6 2.3.1.2 Key Poses in Visual Space . . . . . . . . . . . . . . . . . . . 7 2.3.2 Pose Analysis in MoCap data . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2.1 Human Motion Primitives . . . . . . . . . . . . . . . . . . . 9 2.3.2.2 Temporal Segmentation of Human Motion . . . . . . . . . . 9 2.3.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Action Recognition in Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.1 Pose Based Action Analysis . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.2 Trajectory based Action Recognition . . . . . . . . . . . . . . . . . . . 11 2.5 Coordinated Action and its Causality Network . . . . . . . . . . . . . . . . . 11 2.5.1 Autoregressive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5.2 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Learning Visual-Motoric Mappings of Actions 15 3.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.2 Motion based Visual-Motoric Maps . . . . . . . . . . . . . . . . . . . 17 3.2.3 Our Approach in a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . 18 iii 3.2.3.1 The UMD-Sushi Dataset: A Pilot Dataset . . . . . . . . . . . 18 3.2.3.2 Yet Another Cooking Dataset (YACD) . . . . . . . . . . . . 19 3.2.3.3 The Difference between Two Dataset . . . . . . . . . . . . . 20 3.2.4 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 The Visual-Motoric Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.1 Accurate Localization of the Body Parts . . . . . . . . . . . . . . . . . 21 3.3.1.1 Hand Detector . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.1.2 Brief Introduction to Conditional Random Fields (CRF) . . 22 3.3.1.3 Unary potential functions . . . . . . . . . . . . . . . . . . . . 23 3.3.1.4 Pairwise potential functions . . . . . . . . . . . . . . . . . . 24 3.3.1.5 Segmenting Hands and Flow . . . . . . . . . . . . . . . . . . 24 3.3.1.6 Elbow Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.2 Learning the Mapping between 2D video and 3D MoCap Trajectories 27 3.3.3 Partial Least Squares in a Nutshell . . . . . . . . . . . . . . . . . . . . 29 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4.1 Toy Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4.2 Synthetic Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4.3 3D Reconstruction on Real Data: A Pilot Study . . . . . . . . . . . . . 33 3.4.4 3D Reconstruction on the YACD Dataset . . . . . . . . . . . . . . . . 34 3.4.4.1 Computing Body Parts in Videos . . . . . . . . . . . . . . . 34 3.4.4.2 Learning the Mapping from Videos to MoCap . . . . . . . . 34 3.4.5 3D Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.5.1 Partition the YACD by Subjects . . . . . . . . . . . . . . . . 35 3.4.5.2 Partition the YACD by Actions . . . . . . . . . . . . . . . . . 36 3.5 Chapter Conclusion . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 36 4 Learning Shift-Invariant Sparse Representation of Actions 45 4.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 Shift-Invariant Sparse Modeling of Actions . . . . . . . . . . . . . . . . . . . 48 4.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3.2 Formulating the Problem in Frequency Domain . . . . . . . . . . . . 49 iv 4.3.3 Solving the Problem using L1 Minimization . . . . . . . . . . . . . . 51 4.3.3.1 OMP for Solving the Activations . . . . . . . . . . . . . . . 51 4.3.3.2 Split Bregman Algorithm for Solving the Bases . . . . . . . 52 4.4 Preprocessing: Normalization for Handing Actions with Various Speeds . . 53 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5.1 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.5.2 Learning the Basis Functions . . . . . . . . . . . . . . . . . . . . . . . 55 4.5.3 Motion Approximation and Compression . . . . . . . . . . . . . . . . 56 4.5.4 Action Retrieval and Classification . . . . . . . . . . . . . . . . . . . . 57 4.5.4.1 Segment-based Action Retrieval . . . . . . . . . . . . . . . . 58 4.5.4.2 Segment-based Action Classification . . . . . . . . . . . . . 58 4.5.5 Motion Disorder Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . 59 4.6 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5 An Application of Granger Causality to Coordinated Actions Analysis in Orches- tra 65 5.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.3 A Brief Introduction to Granger Causality . . . . . . . . . . . . . . . . . . . . 68 5.3.1 Preliminary: Forecasting Time Series using Autoregressive Processes 68 5.3.2 Granger Causality Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.3.3 Statistical Analysis of Granger Causality . . . . . . . . . . . . . . . . 71 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.1 Data Acquisition and Preprocessing . . . . . . . . . . . . . . . . . . . 72 5.4.1.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.4.1.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.1.3 Creating Non-Causal Data for Comparison . . . . . . . . . 74 5.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.4.2.1 Experiment 1: The Causality between the Conductors and the Players . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.4.2.2 Experiment 2: The Causality within the Players . . . . . . . 75 5.4.2.3 Comparison using the Non-Causal data . . . . . . . . . . . 77 v 5.5 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6 Detecting Discontinuity in Human Movement for Action Representation and Recognition 79 6.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.3 Action Discontinuities in Human Movement . . . . . . . . . . . . . . . . . . 81 6.3.1 Action Discontinuities in Motion Capture Data . . . . . . . . . . . . . 81 6.3.2 Action Discontinuities in Videos . . . . . . . . . . . . . 
. . . . . . . . 84 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.4.1 Effective Detection of Action Discontinuities in Video . . . . . . . . . 87 6.4.1.1 Results on the CMU Dataset . . . . . . . . . . . . . . . . . . 87 6.4.1.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.4.2 Robust Detection of Action Discontinuities in Videos . . . . . . . . . 90 6.4.2.1 Robustness to Changes in Viewpoint . . . . . . . . . . . . . 90 6.4.2.2 Robustness to Changes in Background . . . . . . . . . . . . 92 6.4.3 Action Discontinuities Facilitate Visual Action Recognition . . . . . . 93 6.5 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7 Conclusion 95 7.1 What is Next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7.1.1 Robots Need Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.1.2 Blind Patients Need Vision . . . . . . . . . . . . . . . . . . . . . . . . 96 A Publications 97 B Brief Resume 99 B.1 Education . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 B.2 Awards and Honors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 B.3 Research Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 B.4 Participated Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Bibliography 101 vi List of Figures 3.1 We aim to study the mapping fi( ) from body joint history Xi (i = 1::4) in videos (red curve) to its 3D position (green: lower arm, blue: upper arm, purple: torso, orange: head). . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Two cooking actions in our UMD-Sushi dataset, Scooping (a) and Turning (b), performed by three different subjects in visual space. . . . . . . . . . . . 19 3.3 Examples of our dataset. The actions were captured using jointly the MOVEN motion suit [MOV], which was worn under natural clothes by the subjects S1 to S4, and a HD camera. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Our hand detector was initialized by a pose estimator (a). Then we used the trained color model and the optical flow (b) to localize the hands in the subsequent frames (c). Magnitude of the flow is color coded, and hand regions were coded in yellow. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5 Training the GMM color and flow model. ?+ves? and ?-ves? denote the manually labeled positive and negative regions for the color (top) and flow (bottom) images. Figure courtesy of Ching Lik Teo. . . . . . . . . . . . . . . 25 3.6 Segmenting hands and flow from a test image sequence. Labels with num- bers correspond to the description. (1) Compute the optical flow. (2) Com- pute the unary and pairwise potentials. (3) Perform inference to get the best labels to obtain the hand and flow segmentation in (4). Figure courtesy of Ching Lik Teo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7 The results of the continuous pose localization. Four actions performed by four subjects were shown. Please refer to Fig. 3.1 and 3.4 for the coding of the colors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.8 Illustrating the Partial Least Squares algorithm. Instead of modeling the linear relationship between the responses (output) and the observations (in- put) directly, it attempts to model the linear relationship between their prin- cipal components. . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 29 vii 3.9 A toy example. We generated the (x; y) coordinates based on the letters ?CVPR? (top, blue curves), and used Eq. 3.14 to generate the z coordinates (bottom, blue curves). The reconstructions in red overlap significantly with the blue curves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.10 Synthetic 3D pose reconstruction of ?Right Wrist? (RW) in ?Peeling? and ?Pouring? on the YACD dataset. Blue curves: the original 3D curves in WCS. Red curves: the reconstructed results with the 3D rotation angle (45 , 45 , 45 ). The start pose (blue) and the end pose (pink) of the right arm in an instance are also shown on top. The dashed arrow denotes the motion. . 37 3.11 Robustness test on synthetic data. The 3D trajectories of two joints {RW, LW} were projected to 2D planes using orthographic projection. (a) The reconstruction accuracy for viewpoints (Azimuth and Elevation). (b) For each pair of Azimuth and Elevation, we changed the testing angle between 10 to 10 . The error is the average from the number of viewpoints. . . . . 38 3.12 Computing 2D joint trajectories from videos. We showed the trajectories of three instances per action for six different actions. . . . . . . . . . . . . . . . 39 3.13 Illustrating the proposed approach using real data. Bottom: the blue curves denote the original 3D curves in the WCS and the red curves denote the re- constructed results using the 2D trajectories in unknown (but fixed) camera coordinate system. Top: two instances in the visual domain and the motoric domain are shown. Blue skeleton: the original MoCap data; Red skeleton: the reconstructed 3D positions. . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.14 Results for pose localization. For each instance the first frame and the 2D joint trajectory of the ?Right Wrist? (RW) in the videos are shown. The yellow arrows denote the motion direction. The viewpoint of subject S4 is slightly different from others. . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.15 (a) Confusion matrix of the baseline [MPK] (Naive Bayes) on our dataset. (b) Accuracy of Naive Bayes (NB) and BayesNet (BN) applied to the 2D trajectories and the 3D reconstruction. . . . . . . . . . . . . . . . . . . . . . . 42 3.16 Confusion matrices of Naive Bayes (NB) and BayesNet (BN) applied to the 2D trajectories and the 3D reconstruction. . . . . . . . . . . . . . . . . . . . . 43 viii 4.1 Modeling human motion in MoCap sequences using shift-invariant sparse representation. The short basis functions are sparsely selected and linearly combined to create action units for individual joints. The units may be shifted to different locations where multiple instances of the movement are realized. The time shift is modeled by the convolution (denoted by ?). . . . 47 4.2 Estimating the average action speed by measuring the action discontinu- ities. A sequence ?walking to running? from the University of Bonn dataset [MRC+07] is shown. The poses corresponding to the discontinuities are dis- played as mannequin. The trajectories of the head, the left elbow, and the right ankle are drawn in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3 17 out of the 55 actions in our data set. The rendering is as follows: poses corresponding to the discontinuities are displayed as mannequins; the tran- sitions in between are illustrated by wire-frames; the trajectories of some joints are drawn in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
61 4.4 Side by side comparison between original motion frames (a) and recon- structed motion (b) using the estimated basis functions, demonstrated on two ?salsa? sequences from the CMU dataset [Lab]. . . . . . . . . . . . . . . 61 4.5 The basis functions learned by the algorithm (a) and their usage for individ- ual joints (b) (denoted by the blue diamonds) . . . . . . . . . . . . . . . . . . 62 4.6 Action classification. Four actions, ?walking?, ?marching?, ?running?, and ?salsa? from the CMU dataset, were used in the experiment. A very simple k-nearest neighborhood (kNN, k = 3) classifier was chosen. (a)-(c) show the confusion matrices of the classifier using the weights of the proposed algorithm, the Sparse PCA algorith, and the PCA, respectively. For each algorithm, 50% of the estimated segments in each action category were ran- domly selected as the training samples and the remaining as the test samples. 62 4.7 Collecting Parkinson Disease data. Courtesy of Leonardo Max Batista Claudino. 63 ix 4.8 Parkinson disease diagnosis by measuring the alignment and the average approximation error. From left to right, the results for ?Finger To Nose?, ?Catching a Tennis Ball?, ?Bread Cutting?, and the diagnosis chart, respec- tively. In a)-c), blue triangles denote the healthy controls, and red squares denote the patients. d) suggests a chart for diagnosis. P: patient. H: healthy control. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.1 Illustrating the recording scenario and the kinematic data. (a) The subjects in our experiments. Eight string players (grouped into ?Violin I? and ?Vio- lin II?) were conducted by two conductors, respectively; (b) The locations of the endpoints of the music instruments; (c) The kinematic data of a conduc- tor and a player in the first three seconds of the first movement of Mozart?s Symphony No.40. The time series in red, green, and blue represent the tra- jectories of the markers in X , Y , and Z dimension, respectively. . . . . . . . 68 5.2 The percentage of the time when the conductor leads the players. This figure shows that the professional conductor consistently leads the players more frequently than the amateur conductor. . . . . . . . . . . . . . . . . . . . . . 72 5.3 The error bar of Fc!i, which denotes the causal influence strength from the conductor to the players. This figure shows that the professional conduc- tor?s influence is consistently stronger than the amateur. . . . . . . . . . . . 73 5.4 The average influence and its standard deviation of the causality among the players in the same orchestra section. (a) The players in the ?Violin I?; (b) The players in the ?Violin II? (The 9th-15th piece of music are not shown due to many missing samples). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.5 The percentage of the time when the conductor leads the players in the non- causal data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.6 The error bar of Fc!i, which denotes the causal influence from the conduc- tor to the ith player, in the non-causal data. . . . . . . . . . . . . . . . . . . . 77 5.7 The average strength and its standard deviation of the causality among the players in the same orchestra section using the non-causal data. (a) The players in the ?Violin I?; (b) The players in the ?Violin II?. . . . . . . . . . . 78 x 6.1 I redraw Fig. 4.2 here for convenient purpose. Action discontinuities of the speed-varying action ?walking to running? 
from the University of Bonn dataset [MRC+07]. The jerk envelopes of the six joints are color coded and normalized in magnitude. The purple dashed lines indicate that the loca- tions of the envelope extrema of certain joints coherently occur at the same time. The trajectories of the head, the left elbow, and the right ankle are drawn in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.2 (a) The motion trajectory of a video in the Weizmann dataset [BGS+05] in the GPDM dimensions 1-3 (blue curve). (b) The discontinuity in acceleration for variables x1; :::; x6. Each row represents the locations and the strengths of the local maxima of the 3rd order derivative of a variable. The poses corre- sponding to the center of the discontinuities (dashed red line) in different variables are shown on top. Frame numbers are displayed both below the poses and in the motion trajectory (red dots). . . . . . . . . . . . . . . . . . . 85 6.3 a)-c) Results for action discontinuities in videos; d)-f) Results for action dis- continuities in corresponding MoCap data; g)-i) Results of the baseline algo- rithm. Comparing the figures in each column, we clearly see that the action discontinuities for videos correspond to the ones in MoCap data, and that the results for the baseline algorithm are not consistent. . . . . . . . . . . . . 88 6.4 UMD Gesture dataset [LJD09]. The dataset contains videos of 14 different gestures of military signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.5 Results for a multiple view sequence from the UMD Pose dataset [OKA06]. Each column displays the poses from a different viewpoint. Each row sug- gests that the estimated key poses from different cameras are consistent. The camera (C#) and the frame number (F#) are shown in the lower right corner in each representative frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.6 Action discontinuities of two videos in public domain. For each action, the motion trajectory (blue curve) of the video in the reduced space is shown. The frame numbers corresponding to the key poses are displayed under the individual frames as well as in the GPDM space 1-2 (red dots). The blue arrow denotes the time direction. . . . . . . . . . . . . . . . . . . . . . . . . . 92 xi 6.7 (a) and (b): Results for the video in Fig. 6.2 using foreground (a) and flow fields (b) as the input. Frame numbers are displayed below the images. The magnitude of the flow field is shown in (b). (c) Results for a video (?limping? action, taken with moving camera) using our algorithm. This result shows that each step is clearly identified. (d) Action discontinuities based on the human detection results. . . . . . . . . . . . . . . . . . . . . . . 93 xii List of Tables 3.1 Performance comparison between the Partial Least Square (PLS) and the Multivariate Regression (MR). The performance was measured by the aver- aged standard deviation for each subject. . . . . . . . . . . . . . . . . . . . . 34 4.1 Average fitting error for different MoCap sequences using the basis func- tions learned in Sec. 4.5.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.2 Comparison of different MoCap data compression algorithms. (*): the com- pression rate without quantizing the weights. (**): the compression rate with weight quantization. The ratios of the other algorithms are copied from [Ari06] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
57
4.3 Performance comparison (Bullseye) of action retrieval on the segments of our dataset. Three algorithms, namely Sparse PCA, PCA and our algorithm, were used in the comparison. The segments were normalized for Sparse PCA and PCA. . . . 58
4.4 Parkinson disease patients' age information. The disease level is measured by the Hoehn and Yahr scale, which ranges from 1-5 (shown in parentheses). . . . 60
6.1 Consistency between action discontinuities in videos and MoCap data. The consistency is defined as the zero-mean standard deviation of the differences between corresponding discontinuities after applying dynamic programming (fps=25). . . . 89
6.2 Accuracies of the detection algorithms using different dimension reduction methods. The accuracy is computed using the Hungarian algorithm on the UMD Gesture dataset [LJD09]. . . . 90
6.3 Improving the recognition performance using action discontinuities. The results for [LJD09] are copied from the original paper. . . . 94

Chapter 1
Introduction

One of my goals is to work towards developing the functionalities that facilitate interaction with human beings and their surrounding objects. Any cognitive system designed to serve human beings must be capable of processing visual signals and must eventually enable efficient prediction and planning of appropriate responses.

I have been working on a multi-disciplinary project that involves neuroscientists, robot engineers, and vision scientists. The project pinpoints the necessity of cognition modules that establish the protocols for parsing, generating, and translating signals among body movements, visual objects, and their linguistic descriptions. I strive to develop multi-modal tools for Social Signal Processing, including analyzing human actions in the visual space and the motoric space, and the coordination in human actions.

Previously, I worked on handwriting image analysis, and I developed a few practical tools that are reasonably recognized by the community. During this period, I realized the importance of a cognitive system in applications where human body motion, such as hand movement in writing, is involved. Thus, I planned to go deeper in this direction, and I began to seek research topics that are closely related to the human cognitive system and body motion.

1.1 Cognitive Functions for Human Interaction

My research on the mirroring mechanism for human-robot interaction analysis primarily aims at creating the sensorimotor representation of human action in visual space. Recently, application-specific action recognition has been widely studied. However, current computational approaches have been limited in their abilities to represent unrestricted full-body motion and to learn the intrinsic features for semantic understanding of human action.

The study of Mirror Neurons provides the physiological mechanism for action perception, which states that one recognizes an action because one can represent the same action using one's own motoric representation. The implementation of such a cognitive system will shed light on the advancement of human-centered research and the improvement of the quality of human life, for example through humanoids, intelligent service robots, and the diagnosis and rehabilitation of motion disorder diseases.
In the visual space, my goal is to develop a computational mechanism for mapping 2D movements in images to 3D body joint positions. A module for building a 3D skeleton from video data brings many advantages to computer vision. The skeleton is a natural abstraction of a human, which is viewpoint invariant and subject invariant. Besides being useful for human action analysis and recognition, it can be used in many applications in a number of related areas, including computer graphics, visualization, and human-robot interaction with cognitive robots (e.g., iCub [SMV07]).

In addition, I attempted to extract key poses for effective action recognition in robot-human interaction. It is widely believed that not all poses are created equal, but much of the previous work focused largely on the recognition algorithms and did not address pose extraction sufficiently. I model the key poses as the discontinuities in the second-order derivatives of the latent variables in the reduced visual space, which we obtain using Gaussian Process Dynamical Models (GPDM). Experiments demonstrate that the key poses are consistent under different conditions, such as subject variation, video quality, frame rate, and camera motion, and are robust to viewpoint change. This facilitates human action analysis and improves the action recognition rate significantly by reducing the uncharacteristic poses in both the training and the test sets. The results are also helpful in psychological experiments for action understanding.

In the motoric space, part of my work focuses on decomposing motion capture (MoCap) sequences into synergies (smooth and short basis functions) along with the times at which they are "activated" for each joint. The result will advance humanoid research by providing effective building blocks for robot body movements. Given MoCap sequences, I propose an algorithm to automatically learn the synergies as well as the activations simultaneously using L1 minimization, a novel optimization technique that aims at recovering the exact sparse solution by minimizing the L1 norm (the sum of absolute values) of the variables instead of the traditional sum of squared errors. Human actions are by their nature sparse both in the action space domain and in the time domain. They are sparse in action space, because different actions share similar movements on some joints. They are sparse in the time domain, because we do not want much overlap of the individual movements on a single joint. Two recently developed tools from machine learning, namely Orthogonal Matching Pursuit and the Split Bregman Algorithm, enabled us to alternately solve two large convex minimizations to learn the synergies and the activations simultaneously. Experiments demonstrate the power of the decomposition as a tool for representing and classifying human movements in MoCap sequences. The activations of the synergies characterize the underlying rhythm in the parts of the body, and the weights of the synergies are effective for action retrieval and recognition. In particular, the representation allows us to substantially compress novel MoCap data, and the representation is fundamentally invariant to the sampling rate. Experiments further suggest that our approach is well suited for early diagnosis of motion disorder diseases (e.g., Parkinson's disease), an emerging issue in public health.

A direct application of the analysis of the motoric data is to study human coordination.
Among cognitive studies on human body movements, social coordination has gained more attention than others, because human beings in any society have complex behaviors to coordinate themselves and interact with the real world for defensive, living, and hunting purposes. In fact, recent scientific studies suggest that the "action mirror circuit", which maps actions in the visual space to the motoric space, might be tuned to social coordination rather than single action mirroring. Thus, knowledge of social action coordination will be helpful for understanding the functionalities of the mirroring mechanism of human action, and will eventually result in efficiently estimating appropriate actions in response or in modifying behaviors online. In this project, we attempt to establish a theory for studying the influence within a group of coordinated actions using the concept of Mirror Neurons in neuroscience.

I propose to use Granger Causality as a tool to study coordinated actions performed by at least two units. In Granger Causality, actions are modeled by autoregressive processes, and causality is framed in terms of predictability: if one action causes the other, then knowledge (history) of the first action should help predict future values of the latter. We successfully applied Granger Causality to the kinematic data of a chamber orchestra to test the interaction among players and between conductors and players. As an extremely interesting case of the mirror-like mechanism, the motor systems of musicians support their orchestrated musical execution. We discovered that a good conductor drives the orchestra more frequently than an amateur conductor, with a stronger driving force on average for all players in each piece of music. The above theory can be applied in other human interaction scenarios, such as sports games and dancing, to study the dynamical network of coordination.

1.2 Organization of the Dissertation

The remainder of this dissertation is organized as follows. Chap. 2 reviews the literature on the related problems. Chap. 3 presents the learning approach for mapping 2D actions to 3D space. Chap. 4 presents the algorithm for learning motion primitives in MoCap data. Chap. 5 presents the study of coordinated actions. Chap. 6 presents the algorithm for automatic extraction of key poses in videos. Chap. 7 concludes the dissertation. My publications from 2004-2010 are listed in Appendix A, and a brief resume is included in Appendix B.

Chapter 2
Related work

2.1 Chapter Summary

This chapter reviews a number of areas related to human motion analysis. First, I give a brief introduction to the history of human motion capture. Second, I present the literature on human pose analysis in visual space. Based on the pose analysis, we can recognize human actions in videos. Third, I focus on analyzing human motion in the motoric space. In addition, I discuss the concept of action coordination and present a tool for modeling the coordination.

2.2 Brief History of Recording and Measuring Human Motion

The study of animal motion dates back about 2000 years: the first known document in biomechanics, written by Aristotle [Ari], discusses the relation between locomotion and body structure. The computational study of human motion, however, did not start until recently.
About a hundred years ago, when the "motion picture" was introduced, some pioneers began to explore the possibility of recording human motion using photographic techniques. For example, Eadweard Muybridge set up a camera system to record different people performing different actions in motion [Muy].

The approach Muybridge used is now called "optical motion capture", which is a "non-invasive" approach. Usually multiple cameras are used in a room. Optical motion capture can be further categorized into "marker-based" and "markerless" approaches. In the marker-based approaches, multiple markers are attached to different positions on the subject's body and are tracked across frames in real time [VIC]. The markerless approaches usually use silhouettes [SC08]; therefore, a clean or static background and canvas are favored in the environment.

Unfortunately, due to the limitations of camera frame rates and image resolution, current state-of-the-art computer vision systems cannot easily achieve high temporal and spatial resolution at the same time. In addition, the imperfect results of image segmentation techniques limit their usability and portability. In the field of neuroscience, researchers have developed various devices for measuring human motion; however, these devices are not portable for general purposes. One possible trade-off between usability and accuracy is to use inertial sensors. Recently, highly accurate inertial sensors have been used extensively in the field of computer graphics. Speed, gravity, and acceleration are usually taken into account. For example, the MOVEN suit [MOV] is one of the mature products on the market. Depending on the application, calibration might not be compulsory [YJSB09].

It is natural to use the skeleton as the model to control human motion. This model is usually called the stick-figure model, and it is used in different fields. For example, we can control a humanoid by specifying rotations of the joint angles of a certain stick-figure model. In computer graphics, different animations are usually produced by rendering different stick-figure models.

Joint rotation is frequently represented as Euler angles in MoCap datasets, i.e., the rotation angles about the X, Y, and Z axes, respectively. One typical format, BioVision Hierarchy (BVH), stores the Euler angles directly in plain text format. Since Euler angles have fundamental flaws (e.g., gimbal lock), quaternions [Qua] are widely used as an alternative in the computer graphics community; they have many advantages compared to Euler angles.
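To make the two representations concrete, the following sketch converts a BVH-style Euler-angle triple into a unit quaternion by composing one rotation per axis. It is only an illustration: the Z-X-Y channel order used here is an assumption (BVH files declare the channel order per joint), and production code would normally rely on an existing library.

import numpy as np

def axis_quat(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about a unit axis."""
    w = np.cos(angle / 2.0)
    x, y, z = np.sin(angle / 2.0) * np.asarray(axis, dtype=float)
    return np.array([w, x, y, z])

def quat_mul(q, r):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def euler_zxy_to_quat(z_deg, x_deg, y_deg):
    """Compose per-axis rotations in Z, X, Y order (an assumed BVH channel order)."""
    rz, rx, ry = np.radians([z_deg, x_deg, y_deg])
    q = quat_mul(axis_quat([0, 0, 1], rz),
                 quat_mul(axis_quat([1, 0, 0], rx), axis_quat([0, 1, 0], ry)))
    return q / np.linalg.norm(q)

# A 90-degree rotation about Z alone gives approximately (0.707, 0, 0, 0.707).
print(euler_zxy_to_quat(90.0, 0.0, 0.0))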
2.3 Human Pose Analysis

Human pose analysis in videos and MoCap sequences is very useful for computer vision and neuroscience studies.

2.3.1 Pose Analysis in Videos

2.3.1.1 Pose Estimation in Videos

Detecting humans in images has received significant attention in the area of computer vision. The studies on this topic can be categorized into: 1) pedestrian detection for tracking, 2) 2D pose estimation, and 3) 3D pose estimation.

Pedestrian detection is a very challenging task because of the variations in illumination, shadow, and pose. Many datasets have been collected (e.g., [PSZ08], [DT05a]). Recently, Schwartz et al. [SKHD09a] used the Partial Least Squares (PLS) method to reduce the dimension of the feature space of detection windows and outperformed other algorithms. Since the method was developed in the context of pedestrians, it is limited to full-body human detection and walking actions.

Estimating arbitrary 2D poses from single images has been a challenging problem for a long time. Usually, body part detectors are used first to estimate the potential locations of the limbs, and then skeleton models are imposed as constraints in finding the optimal locations of the limbs. To solve the 2D pose estimation problem, belief propagation and dynamic programming have been used. Ramanan [Ram06] parsed images into body parts using an iterative process to tune the performance. Based on Ramanan's result, Ferrari et al. [FMJZ] estimated a few upper body poses from the TV series "Buffy". Felzenszwalb and Huttenlocher [FH05] developed a part-based approach for object recognition and used it also for pose estimation. Recently, Sapp et al. [STT10] estimated body poses using cascade models. These approaches usually try numerous possible scales and rotations to find the optimal solution.

3D pose estimation directly from 2D images has also been of interest. Closely related to optical motion capture, multiple-camera settings are very useful when calibration data is available [SB06]. Algorithms for single uncalibrated cameras were proposed in [ARS09]. However, the explicit mapping functions were not well addressed in these studies. Urtasun and Darrell [UD08b] used Gaussian Processes as the regression technique to compute the mapping, but their approach is mainly single-image based, and its action recognition capability was not discussed.

2.3.1.2 Key Poses in Visual Space

Key poses have been used in action recognition [RA] and gesture recognition [SNI04] in visual space. Existing methods for pose extraction and temporal segmentation of action videos can be categorized into two classes: 1) dynamics-based approaches and 2) model-based approaches. Dynamics-based approaches, such as Marr's seminal work [MV82], try to detect the discontinuities between actions so that the boundaries can be found. These approaches work well when background subtraction is applied [LN07]. Model-based approaches are tightly coupled with action recognition algorithms; space-time approaches [KSH] and dynamic time warping [LJD09] techniques are frequently adopted.

Pose extraction in motion capture (MoCap) sequences has also been studied. The data from a motion capture suit are time series of three rotation angles each at a number of joints on the human body. Similar to the vision-based extraction, the methods can also be grouped into two categories. Jenkins and Mataric [JM02] use the KCS, a heuristic algorithm, to partition human motion using dynamics. Vecchio et al. [VMP03], Bissacco [Bis05], and Lu and Ferrier [LF04] assume that human motion is governed by autoregressive (AR) processes or state-space models and partition the sequences based on different model parameters. In related work, neuroscientists use velocity, acceleration, and jerk as the measurements for input motion streams [Zat97].

A closely related topic to pose extraction is video temporal segmentation, which has been used for shot boundary detection, key frame extraction [TRE], video content analysis [XG08], and video synopsis [PRAP08]. Such segmentations are helpful for retrieving useful information from a large collection of videos. However, these techniques do not aim at extracting human poses. Action synopsis [ACCO05] is related to our work, but the joint locations need to be labeled using semi-automatic software called Icarus prior to the analysis.

Motion primitives can be conceived as smooth and nicely behaving functions between motion discontinuities. The discontinuities are characterized by the changes in acceleration of the muscle signals [ddSB03]. In model-based approaches, the discontinuities in MoCap data are detected as changes in the model parameters [Bis05].
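To illustrate the dynamics-based view, the sketch below flags candidate discontinuities in a single joint coordinate as local maxima of the jerk (third-derivative) magnitude after smoothing. It is my own didactic simplification of that idea, not the algorithm developed later in this dissertation; the window size and threshold are arbitrary choices.

import numpy as np

def discontinuity_candidates(trajectory, fps=50.0, smooth=7, deriv_order=3, rel_thresh=0.5):
    """Return frame indices where the smoothed derivative magnitude peaks.

    trajectory:  1-D array with one joint coordinate per frame.
    deriv_order: 3 corresponds to jerk, the third time derivative.
    """
    # Moving-average smoothing with edge padding to limit boundary artifacts.
    kernel = np.ones(smooth) / smooth
    padded = np.pad(trajectory, smooth, mode="edge")
    x = np.convolve(padded, kernel, mode="same")[smooth:-smooth]

    # Repeated finite differences approximate the requested derivative.
    d = x.copy()
    for _ in range(deriv_order):
        d = np.gradient(d) * fps

    mag = np.abs(d)
    mag[:smooth] = 0.0    # ignore a margin at both ends where the smoothing is unreliable
    mag[-smooth:] = 0.0
    # Local maxima above a fraction of the global maximum.
    peaks = (mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:]) & (mag[1:-1] > rel_thresh * mag.max())
    return np.where(peaks)[0] + 1

# Toy example: constant-velocity motion with an abrupt change of direction at frame 100.
t = np.arange(200)
trajectory = np.where(t < 100, 0.01 * t, 1.0 - 0.005 * (t - 100))
print(discontinuity_candidates(trajectory))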
In visual space, these motion discontinuities have often been used for key pose detection, following the same dynamics-based [MV82, LN07] and model-based [KSH] lines of work discussed in Sec. 2.3.1.2.

2.3.2 Pose Analysis in MoCap data

A central problem in the analysis of MoCap data is how to decompose motion sequences into primitives. The representations of the primitives can be poses or basic atoms of actions over time.

2.3.2.1 Human Motion Primitives

Finding motion primitives has been studied in a large body of work. D'Avella et al. [ddSB03] discovered that muscles are activated together to perform actions. [CJ06] applied non-negative matrix factorization to study torque patterns. [SMI07] studied motion using a dynamical system. None of these models uses shift-invariant primitives.

Li et al. [LFAJ10] recently proposed an unsupervised learning algorithm for automatically decomposing joint movement sequences into shift-invariant basis functions. These primitives assist the recognition, synthesis, and characterization of human actions. A decomposition into shift-invariant features has also been studied in acoustic signal classification ([BD06]), and shift-invariant sparse coding [RKN07] further improved the classification performance. A major difference between their mathematical formulation and ours is that we enforce the weights to be positive, and the basis functions of individual joints to be shifted coherently to realize an instance of an action.

L1 minimization recently gained much attention with the emergence of compressive sensing [Can06] and has been applied frequently to image denoising [CJLS09], sparse representation of data [DGJL07], and solving non-negative sparse-related problems [DT05b]. Our approach involves solving an L1 norm minimization in many variables. Although it is possible, in principle, to solve an L1 minimization problem by formulating it as a linear program, such an approach is not efficient when many variables are involved. Recently, however, a number of fast algorithms have been developed for approximating the optimal solution. For example, Basis Pursuit [CDS01] solves the L1 minimization by selecting the best bases. Orthogonal Matching Pursuit (OMP) [TG07] can reliably recover a signal with K nonzero entries given a reasonable number of random linear measurements of that signal. Alternatively, the Split Bregman Algorithm [GO09] approximates the optimal solution by iteratively and efficiently solving a few simple sub-problems.
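Before turning to temporal segmentation, the toy sketch below illustrates the generative model behind such shift-invariant decompositions: a joint trajectory is the sum of short basis functions ("synergies") convolved with sparse, non-negative activation trains. The bases and activations here are hand-picked for illustration only; Chapter 4 learns both from data with OMP and the Split Bregman algorithm.

import numpy as np

# Two short, smooth basis functions (synergies); in practice these are learned.
L = 20
s = np.linspace(0.0, 1.0, L)
bases = [np.sin(np.pi * s), np.sin(2 * np.pi * s) * np.hanning(L)]

# Sparse, non-negative activations: each non-zero entry places a scaled copy
# of the corresponding basis at that time instant.
T = 200
activations = [np.zeros(T), np.zeros(T)]
activations[0][[15, 90, 160]] = [1.0, 0.8, 1.2]
activations[1][[50, 130]] = [0.6, 0.9]

# Shift-invariant reconstruction: a sum of convolutions (the time shift is the convolution).
trajectory = sum(np.convolve(a, b)[:T] for a, b in zip(activations, bases))
print(trajectory.shape, int(np.count_nonzero(np.concatenate(activations))))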
2.3.2.2 Temporal Segmentation of Human Motion

A few studies proposed methods for breaking MoCap sequences into small action segments. Jenkins and Mataric [JM02] used a heuristic algorithm to partition human motion. Vecchio et al. [VMP03], Bissacco [Bis05], and Lu and Ferrier [LF04] assumed that human motion is governed by autoregressive (AR) processes or state-space models and partitioned the sequences based on different model parameters. A comparison of partitioning algorithms in motor space can be found in [Bou08].

2.3.2.3 Applications

The action basis could find direct application in action embodiment [PHRA]. The segments are useful for action retrieval [DGL09] and action classification [YJSB09], and can be used for compressing human motion ([LM] and [LFAJ10]).

2.4 Action Recognition in Videos

There is a large body of work on recognizing actions in visual space ([BGS+05], [GKS02], [KWC], [SvG], and [SLC04]). 2D features from image sequences, such as optical flow, silhouettes, and similarities between frames, are computed and mapped directly to semantic concepts such as walking [AT06], running [EBMM03], and dancing [RSH+05]. Lin et al. [LJD09] use poses, colors, and flow as features, and align a test video to possible training sequences. There exist many algorithms for video abstraction (e.g., [PRAP08]); however, these techniques do not aim at analyzing human poses. Action synopsis [ACCO05] is related to our work, but the joint locations need to be labeled using semi-automatic software called Icarus prior to the analysis.

Action recognition can be categorized into three areas: 1) pose-based methods, 2) trajectory-based methods, and 3) interest-point-based methods. I will discuss related work in the first two categories in this chapter.

2.4.1 Pose Based Action Analysis

Lin et al. [LJD09] use pose information, color, and flow as features, and align a test video to possible training sequences. Alternatively, 2D features from image sequences are mapped to the joint space. Urtasun and Darrell [UD08a] recently used sparse probabilistic regression to infer poses from human silhouettes.

Dimension reduction techniques have been used to analyze human motion both in motor space ([CGM+], [WFH08]) and in visual space ([Ple03], [EL04], [WWW08]). Studies show closed curves for periodic patterns in reduced spaces, but action recognition on the reduced signals has not been addressed.

2.4.2 Trajectory based Action Recognition

Action recognition has been intensively studied. We limit our discussion to the algorithms based on the trajectories of detected interest points. Using Space-Time Interest Points [SLC04], Messing et al. [MPK] proposed an algorithm for computing the trajectories of tracked keypoints. Raptis and Soatto [RS] further improved the descriptor, and Matikainen et al. [MHS] outperformed other algorithms by considering pairwise relations in space. None of these approaches uses pose estimators or body joint locations, both of which are more representative than generic interest points.

2.5 Coordinated Action and its Causality Network

Causality is a well-known concept in artificial intelligence [Pea00] and cognitive science [Ste08], where it is essential for reasoning, planning, and knowledge representation [MT97], because a change is possible only through the action of a cause [Kan87]. Given MoCap sequences, we can carry out scientific analyses in related areas, including cognitive science and neuroscience.
Thus, causality has many applications ranging from logic [GLL+04] to humanoid robotics [Sca02]. A quantitative measurement of causality in correlated signals, such as coordinated human actions, will be very helpful for understanding the underlying connections between signals, and the messages passing through those connections.

Granger Causality was originally proposed by Granger to understand the influence between two correlated time series, such as stock market indexes [Gra69]. Granger Causality is based on forecasting models: in the Granger Causality Test, linear predictors are used to evaluate causality quantitatively. This technique was recently adopted extensively in neuroscience to study brain activity [SE07]. Granger Causality is regarded as the flow of information from one part of the brain to another, which helps determine whether a coincidence is the result of one process influencing another process [LDB+09].

Modeling human motion using linear predictors is widely used in computer vision and machine learning [VMP03, LF04]. Linear predictors approximate human motion sufficiently well over short time intervals. Autoregressive (AR) processes are frequently used to model the multi-dimensional trajectories of human body joints recorded by Motion Capture (MoCap) equipment.

Coordinated human action can be conceived as a result of the Mirror Neuron mechanism [FCO05], which maps actions in visual space to motoric space and leads to appropriate responses. Therefore, a causality network might be particularly useful for understanding the mechanism that facilitates anticipating others' actions and planning error corrections in motor control. Coordinated human action in music performance has recently attracted much attention, because action prediction and error correction have already been shown to be a fundamental aspect of music performance [MRPK09]. Thus, Granger Causality may be ideal for developing models and techniques for measuring social interaction in a controlled framework [CVV09].

2.5.1 Autoregressive Model

In signal processing and time series analysis, an autoregressive (AR) model is a random process used to model and predict stochastic signals [Hay]. An AR process of order p predicts the current sample as a linear combination of its p most recent samples plus noise, x_t = a_1 x_{t-1} + ... + a_p x_{t-p} + e_t. Related concepts include the moving average (MA) model, the autoregressive moving average (ARMA) model, and linear predictive coding. Calculation of the AR parameters requires computing the autocorrelation matrix of the process; this is done using the Yule-Walker equations or spectral methods.
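To make the link between AR forecasting and the Granger test concrete (the actual analysis appears in Chapter 5), the sketch below fits a restricted AR model that predicts y from its own past and an unrestricted model that also uses the past of x, and reports the log-ratio of the residual variances as the strength of the x -> y influence. It is a didactic simplification with least-squares fitting and an arbitrary model order, not the implementation used in our experiments.

import numpy as np

def lagged_design(series_list, order):
    """Stack the `order` most recent samples of each series into one row per time step."""
    T = len(series_list[0])
    rows = []
    for t in range(order, T):
        feats = []
        for s in series_list:
            feats.extend(s[t - order:t][::-1])
        rows.append(feats)
    return np.asarray(rows)

def granger_strength(x, y, order=5):
    """Log-ratio of restricted vs. unrestricted residual variance for predicting y."""
    target = y[order:]
    # Restricted model: y predicted from its own past only.
    A_r = lagged_design([y], order)
    res_r = target - A_r @ np.linalg.lstsq(A_r, target, rcond=None)[0]
    # Unrestricted model: y predicted from the past of both y and x.
    A_u = lagged_design([y, x], order)
    res_u = target - A_u @ np.linalg.lstsq(A_u, target, rcond=None)[0]
    return float(np.log(res_r.var() / res_u.var()))

# Toy check: y is a delayed, noisy copy of x, so the x -> y strength should dominate.
rng = np.random.default_rng(0)
x = rng.standard_normal(500).cumsum()
y = np.roll(x, 3) + 0.1 * rng.standard_normal(500)
print(granger_strength(x, y), granger_strength(y, x))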
Partial Least Squares [Abd07] was developed to find a linear regression model of the principal components in a latent space, and it considers the covariance in the output dimensions as well as the input dimensions.

Chapter 3
Learning Visual-Motoric Mappings of Actions

3.1 Chapter Summary

The visual descriptions of actions used in current vision algorithms are appearance-based, and thus viewpoint dependent. Humans, however, can recognize actions under varying viewpoints, because we use multimodal representations, combining information from the visual and motor space. Here we develop a 3D representation of actions based on motor and visual information. We learn the function that maps 2D image positions to 3D human body joint trajectories using synchronized videos in conjunction with Motion Capture (MoCap) data. Using this function, we then map video data from a single viewpoint to 3D motion descriptions, on which we perform action recognition.

As image data we use the positions of the hands, which are localized using color and motion in a Conditional Random Field (CRF) based segmentation, and the elbow locations, which are acquired using particle filtering. The function that maps 2D data to 3D is approximated using the Partial Least Squares (PLS) method. For every joint, the PLS model is trained to predict the 3D locations of the joint in future frames using its 2D motion history. We collected two datasets of 8 and 10 cooking actions, respectively, each repeatedly performed 8-10 times by 4 subjects. Synthetic and real experiments showed that our approach can robustly map 2D to 3D data; the recognition accuracy on the 3D action data is improved by 17% in comparison to the 2D data and outperforms a state-of-the-art motion descriptor by 7.3% on our dataset.

This chapter is based on the paper submitted to the IEEE Conference on Computer Vision and Pattern Recognition 2011 and the paper submitted to the IEEE Conference on Robotics and Automation 2011. Please refer to [1, 2] in Appendix A for details.

3.2 Introduction

3.2.1 Motivation

Humans have an amazing capability of visually recognizing other humans' actions [Joh73], despite the large variability in visual appearance due to differences in subjects' style and viewpoint. Evidence from neurophysiological studies on the so-called "mirror neurons" points to mechanisms of multimodal representations of actions in humans. It is considered that the "action mirroring" [RC04] mechanism enables an agent to understand the actions of others in terms of his/her own motor representations, and thus this mechanism is at the core of the very basic yet essential capability of primates to perform imitation. This necessitates a visual module that parses, maps, and reproduces human actions from videos into the agent's own 3D skeletal representation.

State-of-the-art vision algorithms focus on visual processing only, and use appearance-based approaches to represent actions. Clearly, these approaches have their limitations for generalizing to different viewing conditions. On the other hand, some studies analyze 3D data from motion capture suits, but do not consider vision. The challenging mapping problem from vision to action space has not been sufficiently addressed, and it is still unknown exactly what action and vision representations are associated with each other in primates. A few studies have discussed the association of visual data with action processes and proposed learning approaches to link the two modalities [MSN+06, AGC10].
In this study we propose a computational mechanism for mapping 2D movements in images to 3D body joint positions. We express this mapping by a set of functions, which we approximate using linear methods (Fig. 3.1).

A module for building a 3D skeleton from video data brings many advantages to computer vision. The skeleton is a natural abstraction of a human, which is viewpoint invariant and subject invariant. Besides being useful for human action analysis and recognition, it can be used in many applications in a number of related areas, including computer graphics, visualization, and human-robot interaction with cognitive robots (e.g., iCub [SMV07]).

Figure 3.1: We aim to study the mapping f_i(·) from the body joint history X_i (i = 1, ..., 4) in videos (red curve) to its 3D position (green: lower arm, blue: upper arm, purple: torso, orange: head).

3.2.2 Motion based Visual-Motoric Maps

We attempt to obtain the 3D movement of humans, as this leads to invariance. Thus we are faced with the challenging problem of mapping 2D motion in images to 3D motion in space. Certainly, it is theoretically impossible to establish a one-to-one mapping between image points in a single view and their corresponding 3D points, because the third dimension is lost in the projective transformation. Therefore, we will use the motion trajectories from video sequences to alleviate this problem.

We will map image points at joint locations, which we estimate using visual processes, to 3D joint locations obtained from MoCap data. The parameters of the mapping are computed from training data using regression. As the 3D representation is in a human-centered coordinate system, it is intrinsically viewpoint invariant. However, the learned 2D to 3D mapping depends on the viewpoint.

In order to use the proposed approach for general video interpretation, we would first need to determine the viewpoint from which the video is taken. Although we do not address this issue in this chapter, establishing the viewpoint from uncalibrated video may be done by detecting the orientation of human poses (from the face and the shoulder alignment) with respect to the camera. In our experiments, we used the frontal-parallel viewpoint.

3.2.3 Our Approach in a Nutshell

We learn a mapping of the endpoints of human body limbs in 2D videos to 3D skeletons. The mapping is represented using a linear approximation. Learning the parameters of the mapping is formulated as a time series problem, specifically, a multivariate multiple regression problem.

Our image representations are the 2D locations of the face, the elbows, and the hands. We first use a state-of-the-art pose estimator to identify the hand locations and initialize a color model for the hand segmentation, which allows the approach to localize the hands in multiple frames in the videos. We also run an efficient face detector. Using the estimated locations of the face and the hands, a particle filter is then used to detect the elbows. As the 3D representation, we compute the 3D positions of the limbs, which are obtained from the MoCap data. Then we use the statistical linear method called Partial Least Squares (PLS) to compute the weights of the mapping.
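To make the regression setup concrete, the sketch below shows how such a 2D-history-to-3D mapping could be fit with an off-the-shelf PLS implementation (scikit-learn's PLSRegression as a stand-in for the PLS procedure of Secs. 3.3.2-3.3.3). The arrays are random placeholders for the tracked 2D trajectories and the MoCap targets, the window length is arbitrary, and all joints are regressed jointly here, whereas our method fits one model per joint.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def history_features(traj_2d, window):
    """Stack the last `window` 2D positions of every tracked joint into one row per frame.

    traj_2d: array of shape (T, J, 2) with image coordinates of J joints over T frames.
    Returns an array of shape (T - window, J * 2 * window).
    """
    T = traj_2d.shape[0]
    flat = traj_2d.reshape(T, -1)
    return np.stack([flat[t - window:t].ravel() for t in range(window, T)])

# Placeholder stand-ins for synchronized video tracks and MoCap joint positions.
rng = np.random.default_rng(1)
T, J, window = 400, 4, 10
traj_2d = rng.standard_normal((T, J, 2)).cumsum(axis=0)    # 2D joint trajectories
traj_3d = rng.standard_normal((T, J, 3)).cumsum(axis=0)    # 3D joint positions (targets)

X = history_features(traj_2d, window)                       # (390, 80)
Y = traj_3d[window:].reshape(T - window, -1)                 # (390, 12)

pls = PLSRegression(n_components=8).fit(X[:300], Y[:300])   # train on the first frames
Y_hat = pls.predict(X[300:])                                  # predict 3D poses for new frames
print(Y_hat.shape)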
The actions were captured simultaneously with the MOVEN motion suit at 100 frames per second (fps) and a high definition camera at 30 fps. The MoCap and video sequences were both resampled to 50 fps. Each action sequence consists of at least 10 repetitions of the same action. The MoCap coordinate system is the conventional World Coordinate System (WCS), where the z coordinate increases vertically upward; the coordinate system is left-handed.

Fig. 3.2 shows the same actions performed by different subjects in the visual domain. One can see that there is reasonable intra-class variation for the same action type.

Figure 3.2: Two cooking actions in our UMD-Sushi dataset, Scooping (a) and Turning (b), performed by three different subjects in visual space.

3.2.3.2 Yet Another Cooking Dataset (YACD)

To estimate the mapping we need to detect the body poses accurately for a number of actions, and we need the corresponding MoCap data. The publicly available datasets (e.g., [lTFHB+, TBB]) do not satisfy these two requirements at the same time. Therefore, we captured our own recording, which will be made available in the public domain.

We recorded ten cooking actions performed by four subjects. The actions are "Cleaning", "Cutting", "Flipping", "Peeling", "Picking", "Pressing", "Placing", "Sprinkling", "Stirring", and "Turning". The actions were captured jointly with the MOVEN motion suit [MOV], which was worn under natural clothes (Fig. 3.3), at 100 frames per second (fps) and a high definition camera at 30 fps. The videos feature half-body motions.

Similar to the UMD-Sushi dataset, the MoCap and video sequences were both resampled to 50 fps. Each action sequence consists of 8-10 repetitions of the same action. The MoCap coordinate system is the right-handed World Coordinate System (WCS), with the z coordinate increasing vertically upward.

Figure 3.3: Examples of our dataset. The actions were captured jointly with the MOVEN motion suit [MOV], worn under natural clothes by the subjects S1 to S4, and an HD camera.

3.2.3.3 The Difference between the Two Datasets

A significant difference between these two datasets is the viewpoint. The UMD-Sushi dataset primarily focuses on the hand actions, and the colors of the motion capture suit can be easily detected. Therefore, it serves as a pilot dataset.

The YACD is a more formal dataset. Advanced computer vision algorithms are required to compute the human poses and to further localize the hand positions. It is a more challenging dataset and is therefore useful for demonstrating the effectiveness of the algorithm in real applications.

3.2.4 Our Contributions

1. We proposed a representation for the visual-motoric mapping of actions as a tool for recognition, using a linear approximation from image points to 3D poses.

2. Experiments on real and synthetic data demonstrate that our method improves action recognition over state-of-the-art 2D visual representations by 7.3%.

3. We recorded an upper-body cooking dataset using both a MoCap suit and an HD camera, which complements current datasets.

The rest of the chapter is organized as follows. Sec. 3.3 presents the mapping algorithm. Sec. 3.4 demonstrates the usefulness of our algorithm on synthetic and real data, and Sec. 3.5 concludes the chapter.

Figure 3.4: Our hand detector was initialized by a pose estimator (a).
Then we used the trained color model and the optical flow (b) to localize the hands in the subsequent frames (c). The magnitude of the flow is color coded, and hand regions are coded in yellow.

3.3 The Visual-Motoric Mapping

To estimate the visual-motoric mapping, we extract body joint information in the videos and compute the 3D positions of the human body from the MoCap data. A MoCap sequence is the time series of three rotation angles at each of a number of joints on the human body. In our approach, we used linear regression to approximate the mapping.

3.3.1 Accurate Localization of the Body Parts

Our goal is to accurately extract the body poses in video. Ideally, one could run a pose estimator on every frame. However, current pose estimators do not consider the smoothness of the poses over time; this is a limitation due to the quantization in scale and orientation used in estimating the pose. Furthermore, it is inefficient. To address these issues, we run a pose estimator [STT10] only once every 30 frames and detect the body joints in between. We describe the procedure in the following sections.

3.3.1.1 Hand Detector

Our hand detector is based on color and motion. The color model is automatically trained using the results of the pose estimator in [STT10] (Fig. 3.4a). For flow estimation, the implementation in [BM10] was adopted (Fig. 3.4b). Then, a Conditional Random Field (CRF) model was used to localize the hand regions in the subsequent frames (Fig. 3.4c).

3.3.1.2 Brief Introduction to Conditional Random Fields (CRF)

We begin by providing the basic notation and background used to describe the CRF model in this chapter. Consider a discrete random variable X defined over the pixel lattice of the input image I = {1, ..., N} with a neighborhood system E. Each pixel i in the lattice can then be viewed as a random variable x_i in X that takes a value from the set of m possible labels L = {l_1, ..., l_m}. E is the set of edges that connect neighboring pixels x_i within a clique c. The set of all cliques is denoted as C. We denote by x_c the set of random variables that are conditionally dependent on each other within each clique c, and by x a possible labeling (assignment) over I taken from the superset \mathbf{L} = L^N. \mathbf{L} therefore represents the space of all possible segmentations of I.

The goal of the CRF is to choose the labeling that satisfies certain constraints with the highest probability. The probability of a labeling x is denoted by P(x), and the constraints are modeled as potential functions \psi. By the Hammersley-Clifford theorem, the posterior distribution P(x | I), where I here denotes the observable data computed from the image, is a Gibbs distribution:

    P(x | I) = (1/Z) \exp( -\sum_{c \in C} \psi(x_c) )    (3.1)

where Z is the partition function and \psi(x_c) are the potentials defined over all x_i within the clique c, conditioned on the data I. Taking the negative log of (3.1) yields the Gibbs energy function:

    E(x) = -\log P(x | I) - \log Z = \sum_{c \in C} \psi(x_c)    (3.2)

Inference in the CRF involves minimizing Eq. 3.2, which gives us the maximum a posteriori (MAP) labeling x*:

    x* = \arg\min_{x \in \mathbf{L}} E(x)    (3.3)

which is the desired segmentation of image I. For the purpose of segmentation, Eq. 3.2 consists of potentials defined over unary and pairwise cliques:

    E(x) = \sum_{i \in I} \psi_i(x_i) + \sum_{(i,j) \in E} \psi_{ij}(x_i, x_j)    (3.4)

The unary potentials \psi_i encode the likelihood that a label in L is assigned to pixel i, computed from a learned distribution for the constraints that we are seeking. The pairwise potentials \psi_{ij} encode consistency within segments: they favor similarly labeled pixels within a clique and penalize the energy function when dissimilar labels appear within c.
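To make Eqs. 3.2-3.4 concrete, the following is a minimal numpy sketch of the Gibbs energy on a 4-connected pixel grid together with an approximate MAP labeling. The chapter minimizes this energy with graph-cut moves (alpha-expansion, described later in this section); the simple iterated-conditional-modes (ICM) sweep below is only an illustrative stand-in, and the array names (`unary`, `wh`, `wv`) are our own.

```python
import numpy as np

def gibbs_energy(labels, unary, wh, wv):
    """Eq. 3.4 on a 4-connected grid: unary potentials plus a contrast-weighted
    Potts pairwise term. `unary` is (H, W, L); `wh` and `wv` hold the pairwise
    weights f(x_i, x_j) of horizontal and vertical edges (Eqs. 3.6-3.7)."""
    H, W, _ = unary.shape
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    e += (wh * (labels[:, 1:] != labels[:, :-1])).sum()   # horizontal edges
    e += (wv * (labels[1:, :] != labels[:-1, :])).sum()   # vertical edges
    return e

def icm(unary, wh, wv, sweeps=5):
    """Approximate MAP labeling (Eq. 3.3) by iterated conditional modes:
    greedily re-label one pixel at a time, starting from the unary-only
    solution. A stand-in for the alpha-expansion optimizer used here."""
    H, W, L = unary.shape
    labels = unary.argmin(axis=2)
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                for l in range(L):
                    if j > 0:     costs[l] += wh[i, j - 1] * (labels[i, j - 1] != l)
                    if j < W - 1: costs[l] += wh[i, j]     * (labels[i, j + 1] != l)
                    if i > 0:     costs[l] += wv[i - 1, j] * (labels[i - 1, j] != l)
                    if i < H - 1: costs[l] += wv[i, j]     * (labels[i + 1, j] != l)
                labels[i, j] = costs.argmin()
    return labels
```

The unary array would hold the negative log-likelihoods of the color and flow models introduced next, and the edge weights the contrast term of Eq. 3.7.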
3.3.1.3 Unary potential functions

We encode two unary potential functions into Eq. 3.4:

    \psi_i(x_i) = \lambda_{col} \psi_{col}(x_i) + \lambda_{flow} \psi_{flow}(x_i)    (3.5)

where \lambda_{col} and \lambda_{flow} are the weighting parameters for the color potential \psi_{col} and the optical flow potential \psi_{flow}, respectively. \psi_{col} is obtained from a Gaussian Mixture Model (GMM) of color, learned from training data over the CIELab color space. \psi_{flow} is a unary prior that encodes the motion of hand and tool; using it for segmentation favors moving regions. \psi_{flow} is obtained from a bimodal GMM of optical flow learned from the training data. We use the implementation by Brox and Malik [BM10] to compute the flow. By combining the color and flow potentials, we induce a strong prior on hand-like regions that are moving.

The weighting parameters \lambda_{col} and \lambda_{flow} can be adjusted to bias the final segmentation either towards the color model (to get more hand-like regions) or towards the flow model (to get regions with higher flow). To obtain a segmentation that favors the hand color model, we set \lambda_{col} > \lambda_{flow}, so that the posterior likelihood computed from the hand color GMM is greatly increased. This means that pixels with color closer to the hand color will be labeled as hand, even if the flow in that region is small. Conversely, to obtain a segmentation that favors regions of high flow, we set \lambda_{flow} > \lambda_{col}.

3.3.1.4 Pairwise potential functions

In order to enforce consistent labels within segments, the pairwise potentials are defined as:

    \psi_{ij}(x_i, x_j) = 0 if x_i = x_j, and f(x_i, x_j) otherwise    (3.6)

where f(x_i, x_j) is an edge-based contrast function defined over the image gradient, color and flow differences:

    f(x_i, x_j) = \theta \exp( -\beta \| x_i - x_j \|^2 )    (3.7)

where \beta = ( 2 \langle (x_i - x_j)^2 \rangle )^{-1}, \langle \cdot \rangle denotes taking the mean, and \theta is a constant parameter that is set separately for the gradient, color and flow features. Eq. 3.7 favors color and flow constancy within regions of similar color or flow by penalizing the potentials less when similar labels are assigned within a clique. This formulation has been widely used in several state-of-the-art color segmentation algorithms ([BJ01, RKB04]) with impressive results.

First we have to obtain the potentials in E(x). Then, given a test image, we apply Eq. 3.3 to compute the hand and flow segments, which are used to localize the tools in the hands. The following sections elaborate these processes in detail.

3.3.1.5 Segmenting Hands and Flow

The system must initially be trained to obtain the GMM color and flow models. This is done either by randomly sampling 5 frames from the training sequence (each of length 100 frames), in which binary labels of the color classes (hand and non-hand regions) and flow classes (large flow and small flow) are manually assigned (Fig. 3.5), or by using a pose estimator to initialize the hand regions. This training process is equivalent to the service robot being shown typical examples of hand and flow training data during a short period of teaching.

Figure 3.5: Training the GMM color and flow models. "+ves" and "-ves" denote the manually labeled positive and negative regions for the color (top) and flow (bottom) images. Figure courtesy of Ching Lik Teo.

Figure 3.6: Segmenting hands and flow from a test image sequence. The numbered labels correspond to the steps described below. (1) Compute the optical flow.
(2) Compute the unary and pairwise potentials. (3) Perform inference to get the best labels, which yields the hand and flow segmentations in (4). Figure courtesy of Ching Lik Teo.

Since the class labels are typically invariant across many testing scenes (e.g., hand color and hand actions are unlikely to change drastically), it is expected that training will only be necessary during the initial stages of the service robot's life or when large changes to its environment or actions occur (e.g., from typical hand movements to large body movements).

During the testing phase, a sequence of test image frames is presented to the algorithm, which computes S_h and S_f in the following steps, as illustrated in Fig. 3.6:

1. Compute the optical flow F between the previous and current image frames using the implementation of [BM10].

2. Compute the unary and pairwise potentials from Eqs. 3.5 and 3.6.

3. Obtain two energy functions E(x), one for the hands and one for the flow, by adjusting the unary potential weights \lambda_{col} and \lambda_{flow} in Eq. 3.4. This is done by setting \lambda_{col} > \lambda_{flow} to favor the hand regions in the first case, and \lambda_{flow} > \lambda_{col} to favor the high-flow regions in the second case. Since we still want consistent labels within the segments, the pairwise potentials \psi_{ij} are the same in both cases.

4. Perform \alpha-expansion optimization on both energy functions using Eq. 3.3 to obtain S_h and S_f.

3.3.1.6 Elbow Tracker

We used particle filtering to estimate the elbow positions, based on the hand and face detection results. The Viola-Jones face detector was first used to locate the faces. We used the offset between the face and the shoulder given by the pose estimator in Sec. 3.3.1.1 to compute the shoulder positions; together with the hand positions, we then ran a particle filter on each limb to estimate the positions of the elbows over time. In the particle filter, we randomly generate 300 hypotheses for the pose positions and keep the best k = 10 at each stage. For each hypothesis, we measure the similarity by the difference in color distribution between the current and the previous regions of the upper arm and the lower arm, respectively.

Fig. 3.7 shows the results of our 2D pose localization on four different actions performed by four subjects. The torso was detected using the Calvin upper-body detector (http://www.vision.ee.ethz.ch/~calvin/calvin_upperbody_detector/). Poses in consecutive frames were correctly estimated. Please refer to Figs. 3.1 and 3.4 for the color coding.

Figure 3.7: Results of the continuous pose localization for four actions (Peeling, Pouring, Cleaning, Sprinkling) performed by four subjects. Please refer to Figs. 3.1 and 3.4 for the color coding.
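A minimal sketch of the elbow tracking step just described: sample 300 elbow hypotheses around the previous estimate, score each by the change in color distribution of the upper- and lower-arm regions between frames, and keep the best k = 10. The patch size, the Gaussian proposal, and the helper names are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def color_hist(patch, bins=8):
    """Coarse RGB histogram of an image patch, normalized to sum to 1."""
    h, _ = np.histogramdd(patch.reshape(-1, 3), bins=(bins,) * 3,
                          range=((0, 256),) * 3)
    return h.ravel() / max(h.sum(), 1)

def arm_similarity(frame, prev_frame, shoulder, elbow, hand, prev):
    """Score an elbow hypothesis by how little the color distributions of the
    upper- and lower-arm regions changed between the previous and current frame."""
    def region(img, a, b):
        c = ((a + b) / 2).astype(int)            # midpoint of the limb segment
        y, x = c[1], c[0]
        return img[max(y - 10, 0):y + 10, max(x - 10, 0):x + 10]
    d = 0.0
    for a, b, pa, pb in [(shoulder, elbow, prev["shoulder"], prev["elbow"]),
                         (elbow, hand, prev["elbow"], prev["hand"])]:
        d += np.abs(color_hist(region(frame, a, b)) -
                    color_hist(region(prev_frame, pa, pb))).sum()
    return -d                                     # higher score = more similar

def track_elbow(frame, prev_frame, shoulder, hand, prev, n=300, k=10, sigma=8.0):
    """One particle-filter step: sample n elbow hypotheses around the previous
    estimate, keep the k best-scoring ones, and return their mean."""
    particles = prev["elbow"] + np.random.randn(n, 2) * sigma
    scores = np.array([arm_similarity(frame, prev_frame, shoulder, p, hand, prev)
                       for p in particles])
    best = particles[np.argsort(scores)[-k:]]
    return best.mean(axis=0)
```

Here `prev` is assumed to hold the previous-frame shoulder, elbow and hand positions as (x, y) arrays, and `frame`/`prev_frame` are RGB images.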
3.3.2 Learning the Mapping between 2D video and 3D MoCap Trajectories

Without loss of generality, we assume that the videos and the MoCap data are synchronized. Denote the 2D locations of the two endpoints of a body part over n time instances as

    s = [p_1, p_2, ..., p_t, ..., p_n]    (3.8)

and the corresponding 3D trajectories of these points as

    S = [P_1, P_2, ..., P_t, ..., P_n]    (3.9)

where p_t and P_t are column vectors. Due to the projective transformation, a one-to-one mapping between p_t and P_t is theoretically impossible. Therefore, a practically feasible approach is to predict the next position P_{t+1} given the history p_t, p_{t-1}, ..., p_{t-m+1}, where m is the window size. This approach is segmentation-free, tractable, and can be implemented by many existing learning algorithms.

By concatenating the coordinates of the histories of the p_t's and the P_t's, respectively, we create two matrices X and Y, where X is n x 4m and Y is n x 6. The problem can then be formulated as

    Y = f(X)    (3.10)

where the unknown function f(.) is the mapping between the data points in the training set. This mapping is a multi-input multi-output system and can be approximated by linear equations in practice. In this chapter, we used Partial Least Squares (PLS) regression [Abd07] to estimate the function f(.).

3.3.3 Partial Least Squares in a Nutshell

Partial Least Squares regression is one of the statistical methods that handle multivariate multiple linear regression. Instead of finding linear coefficients between the responses and the independent observations directly, it first projects the input variables and the observations to a reduced space using Principal Component Analysis (PCA), and then fits linear regression models in the reduced space. It therefore attempts to use the major components of the observations to explain the major components of the responses. This method is particularly suited to our setting, where the matrix of observations is multicollinear. Fig. 3.8 illustrates this idea.

Figure 3.8: Illustrating the Partial Least Squares algorithm. Instead of modeling the linear relationship between the responses (output) and the observations (input) directly, it models the linear relationship between their principal components.

We briefly describe the PLS procedure here. Given the number of principal components p, PLS decomposes the zero-mean n x 4m matrix X and the zero-mean n x 6 matrix Y in Eq. 3.10 into

    X = T P^T + E    (3.11)
    Y = U q^T + e    (3.12)

where T and U are n x p matrices, P is a 4m x p matrix, and q is a 6 x p matrix. The n x 4m matrix E and the n x 6 matrix e are the residuals.

The PLS method then uses the nonlinear iterative partial least squares (NIPALS) algorithm to construct a set of weight vectors W = {w_1, w_2, ..., w_p} such that

    [cov(t_i, u_i)]^2 = \max_{|w_i| = 1} [cov(X w_i, Y)]^2    (3.13)

where t_i is the i-th column of T, u_i is the i-th column of U, and cov(.) is the covariance. Please refer to [Abd07] and [SKHD09a] for details.
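A minimal sketch of the PLS-based 2D-to-3D mapping of Secs. 3.3.2-3.3.3, using scikit-learn's PLSRegression on synthetic stand-in trajectories. The window size m = 5 and the 3 latent components follow the settings used later in the experiments; the feature layout and the toy "camera" are assumptions for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def history_matrix(traj_2d, m=5):
    """Stack the 2D coordinates of the two limb endpoints over the last m
    frames: one row per time step, 4*m columns (the input X of Eq. 3.10)."""
    n = traj_2d.shape[0]
    return np.stack([traj_2d[t - m + 1:t + 1].ravel() for t in range(m - 1, n)])

# Synthetic stand-in for synchronized video / MoCap trajectories:
# traj_3d is n x 6 (two 3D endpoints), traj_2d is n x 4 (their 2D projections).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
traj_3d = np.stack([np.sin(t), np.cos(t), 0.5 * t,
                    np.sin(2 * t), np.cos(2 * t), 0.3 * t], axis=1)
R = rng.standard_normal((3, 2))                       # arbitrary fixed projection
traj_2d = np.hstack([traj_3d[:, :3] @ R, traj_3d[:, 3:] @ R])

m = 5
X = history_matrix(traj_2d, m)        # (n - m + 1) x 4m motion histories
Y = traj_3d[m - 1:]                   # corresponding 3D endpoint positions

pls = PLSRegression(n_components=3)   # 3 latent components, as in Sec. 3.4
pls.fit(X[:200], Y[:200])             # train on the first part of the sequence
err = np.linalg.norm(pls.predict(X[200:]) - Y[200:], axis=1).mean()
print(f"mean 3D reconstruction error on held-out frames: {err:.3f}")
```

The same structure applies to the real data: the rows of X come from the 2D joint histories of Sec. 3.3.1, and Y from the MoCap endpoint positions.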
3.4 Experiments

The following four experiments demonstrate the usefulness of our approach. First, we present a toy example that gives the intuition behind the approach. Second, we show on synthetic data that the approach can effectively reconstruct 3D poses and is robust to viewpoint changes. Third, we use real data from the UMD-Sushi and YACD datasets to train and test the mapping module; we show that PLS is superior to multiple linear regression on each response, because it considers the covariance of the input and the output together. Finally, we demonstrate that the 3D reconstruction leads to better action recognition rates.

Learning the mapping amounts to computing the weights in Eq. 3.13 from the training data. We set the window size to 5 and the number of principal components in the PLS method to 3 in all experiments.

3.4.1 Toy Example

We first present a toy example to demonstrate the intuition behind the proposed approach. The goal of this example is to recover the 3D coordinates from the 2D trajectories on a projection plane. Four non-parametric, nonlinear 3D curves were generated and then projected to a 2D plane using orthographic projection.

We used the four capital letters "CVPR" to generate the (x, y) coordinates of the curves C_i (i = 1, ..., 4) (shown as blue curves in the first row of Fig. 3.9), each of which is 1 pixel wide. The z coordinates of C_i were generated using the following parametric formula (blue curves in the second row of Fig. 3.9):

    z = 50 \sin( \log\sqrt{x} + \log\sqrt{y} )    (3.14)

Figure 3.9: A toy example. We generated the (x, y) coordinates based on the letters "CVPR" (top, blue curves), and used Eq. 3.14 to generate the z coordinates (bottom, blue curves). The reconstructions in red overlap significantly with the blue curves.

We projected C_i (i = 1, ..., 4) to a 2D plane using orthographic projection. Here, we chose the viewpoint (45, 45, 45) degrees as Azimuth, Elevation, and Roll with respect to the WCS. The 3D trajectories were rotated and projected to the xy plane of the new space to obtain the 2D projections.

To learn the mapping, we randomly selected 33% of the data points for training. Then, all the projected points were used to estimate the 3D positions for testing. The red curves in Fig. 3.9 show the reconstruction in the original xy plane and the corresponding z coordinate, respectively.

Referring to the figure, one can see that the blue and red curves overlap significantly, demonstrating that the reconstructions approximate the original 3D curves very well. The average reconstruction error is 0.84 units per data point. Since the curves C_i(x, y, z) (i = 1, ..., 4) are nonlinear, this example shows that the proposed approach is able to reconstruct curves of different types.

3.4.2 Synthetic Example

The goal of the synthetic example is to show that the learned mapping is effective. Each MoCap sequence contains a few instances of the same action, and we projected the 3D positions of the body joints in the MoCap sequences to arbitrary 2D planes, generating synthetic 2D curves. In this experiment, all the actions of subjects S1 and S2 were used for training, and all the actions of the other subjects for testing. To simplify the discussion, orthographic projection was used, and the "Left Wrist" (LW) and the "Right Wrist" (RW) were used for demonstration.

Fig. 3.10 shows four examples of the original 3D trajectory and its reconstruction for the RW in the actions "Peeling" and "Pouring", for the viewpoint (45, 45, 45) degrees. The start and end poses of the right arm in these actions are also shown on top. Comparing the same action performed by different subjects (blue curves in each column of Fig. 3.10), one can see the intra-class variance in 3D. Our mapping algorithm robustly handles this variance by modeling the underlying process, and our reconstruction in the WCS (red curves) approximates the original trajectories (blue curves) reasonably well.

To statistically analyze the robustness of the module to changes of camera viewpoint, we iterated over all possible Azimuth and Elevation angles, while keeping the Roll at zero. For each viewpoint, the average error was obtained by averaging the mean accumulated errors of each joint over all time frames.

Fig. 3.11a illustrates the average error in a color diagram. It shows that the expected error of each joint in each frame ranges from 2 cm to 2.6 cm over the different viewpoints. In particular, the error map has some local minima, which may indicate that the actions in our YACD dataset are best approximated and understood in 3D from these viewpoints.
For instance, the viewpoint near 45 degrees in elevation and 0 degrees in azimuth gives the best fitting results, which corresponds to the typical viewing angle in commercial cooking shows.

Furthermore, we would like to demonstrate that our approach is robust when the test viewpoint differs from the training viewpoint. Given a viewpoint for training, we changed the test viewpoint to a slightly different angle and used the trained model to predict the 3D data. Fig. 3.11b shows the error for changes in viewpoint between -10 and 10 degrees. The error is averaged over a number of viewpoints. The result suggests that the reconstruction is accurate when the change of viewpoint is small, and degrades gracefully when the change becomes large.

This synthetic experiment demonstrates that PLS can accurately recover the 3D positions from arbitrary projections, and that its performance is robust to reasonable changes in viewpoint between the training and testing samples.

3.4.3 3D Reconstruction on Real Data: A Pilot Study

We carried out a pilot study on the UMD-Sushi dataset. In this pilot study, the body joints can be easily tracked, so the 3D reconstruction can be analyzed more intuitively.

We first detect the trajectories of four body joints, namely "Left Wrist", "Left Elbow", "Right Wrist", and "Right Elbow", in the videos, and use the results to demonstrate the usefulness of the mapping from 2D to 3D. Fig. 3.12 shows the trajectories (yellow curves) of the right wrist for six different actions in our UMD-Sushi dataset. Three instances performed by the same subject are visualized.

One can see in Fig. 3.12 that the 2D positions of the joints in the different actions may be similar, but their trajectories are different. This suggests that using the motion history is sufficient to distinguish the actions and estimate the 3D positions. In this example, all actions of four body joints from two subjects were used for training, and the remaining subject's data for testing.

First, we low-pass filtered the 2D trajectories to reduce noise, and selected the "active" trajectories whose range exceeded a threshold. Then, we applied the proposed algorithm to the active trajectories in 2D and in 3D for each body joint.

Fig. 3.13 illustrates the result. The blue curves denote the original 3D curves in the WCS, and the red curves denote the results reconstructed from the 2D trajectories in an unknown (but fixed) camera coordinate system (bottom, Fig. 3.13). Two instances in the visual domain and in the motoric domain are shown (top, Fig. 3.13). The blue skeleton denotes the original MoCap data, and the red skeleton denotes the reconstructed 3D positions.

Table 3.1: Performance comparison between Partial Least Squares (PLS) and Multivariate Regression (MR). The performance was measured by the averaged standard deviation (cm per frame) for each subject.

Subject       S1     S2      S3      S4
PLS (cm/f)    8.28   9.48    11.62   11.69
MR (cm/f)     8.61   10.68   12.51   12.96

3.4.4 3D Reconstruction on the YACD Dataset

Real data from the camera and the MoCap suit were used in this experiment. We first localized the upper body parts using the method described in Sec. 3.3.1, and then mapped the 2D trajectories to 3D. In the following examples, we performed a manual temporal segmentation of the YACD and normalized the data in each action segment to 50 frames.

3.4.4.1 Computing Body Parts in Videos

Our efficient pose localization detects poses effectively in videos.
Fig. 3.14 shows 16 examples of our pose estimation on the YACD dataset. The first frame of each action instance and the corresponding trajectory of the RW are shown. Referring to the figure, one can see that the viewpoint of subject S4 is slightly different from those of the other subjects, because we intentionally changed the angle during recording to evaluate the robustness of the approach.

Fig. 3.14 demonstrates that the 2D visual processing is robust across subjects. The smooth trajectories of the body joints further enable processing with time series techniques.

3.4.4.2 Learning the Mapping from Videos to MoCap

All actions of subjects S1 and S2 were used for training, and all four subjects' data were used for testing. First, we low-pass filtered the 2D trajectories to reduce noise. Then, we applied the proposed algorithm to the smoothed trajectories in 2D and in 3D for each body joint.

Table 3.1 shows the performance measured by the standard deviation of the error between the reconstruction and the ground truth (MoCap data). Compared to Fig. 3.11, the experiments on real data have higher errors on average than the synthetic data. The approximations on novel subject data (S3 and S4 in Table 3.1) also have reasonably larger errors (10%-15%) compared to the training subjects. In addition, S4 has a marginally larger error than S3 because of the slight viewpoint change.

We further compared the PLS method to Multivariate Regression (MR). As can be seen from the table, the average error of PLS is smaller. This is because PLS considers a linear relationship between the covariances instead of the original data, and therefore models the process better. Because of the loss of the third dimension in the projective transformation, and because of errors in the 2D pose localization, the reconstruction cannot be perfect. We show next how the mapping facilitates action recognition.

3.4.5 3D Action Recognition

The following experiments demonstrate that the 3D mapping facilitates action recognition. Two different schemes of partitioning the data into training and testing sets were used, and the performance was compared against a trajectory-based 2D action recognition algorithm. In the first scheme, we partition the YACD dataset by subjects: all ten actions performed by a randomly chosen subject were used for training, and the rest for testing. In the second scheme, four randomly chosen actions performed by all subjects were used for training, and the remaining six actions from all subjects were used for testing.

3.4.5.1 Partitioning the YACD by Subjects

In this scheme, one subject S_i (i = 1..4) was randomly chosen for training the mapping, and all the actions of the remaining three subjects were used as a test set TS_i. As a baseline, the algorithm in [MPK] was used; a 5-fold cross validation on each TS_i was performed, and Fig. 3.15a shows the averaged confusion matrix.

We then chose two classifiers, Naive Bayes (NB) and BayesNet (BN), and performed a 5-fold cross validation on each TS_i using the original 2D trajectories and the reconstructed 3D trajectories, respectively. The averaged confusion matrices are shown in Fig. 3.16. The recognition rate of NB on the 2D trajectories is lower than the baseline because we only considered the 2D joints, as opposed to a large number of tracked keypoints. However, our 3D representation outperformed the baseline descriptor by 7.3%, because we directly modeled the underlying motion process.
One can also see that the reconstructed 3D data are preferable because they helped both classifiers, increasing the average accuracy from 60% to 77% (NB) and from 74% to 82% (BN), respectively. This experiment demonstrates that the mapping recovers the 3D locations of seen actions performed by novel subjects, and thereby improves the action recognition accuracy.

3.4.5.2 Partitioning the YACD by Actions

The evaluation procedure was similar to the one in Sec. 3.4.5.1, except that we partitioned the actions instead of the subjects. Four actions performed by all subjects were randomly chosen for training the mapping module, and the other six actions were used for testing. To evaluate the performance, we ran the above procedure 20 times and averaged the accuracy for each action.

Fig. 3.15b plots the results of the two classifiers on the 2D and 3D inputs, respectively. Clearly, the information in the recovered third dimension improved the performance of both classifiers. For instance, the increase was significant for the actions "Cutting" and "Flipping", where the 2D information may not be sufficient. On average, we obtained improvements of 4.8% (NB) and 11% (BN) using the 3D reconstruction, respectively. This experiment suggests that we can use the 3D information of seen actions to predict novel actions.

3.5 Chapter Conclusion

We have presented a visual-motoric mapping for action representation, obtained using the statistical learning method Partial Least Squares. The method involves detecting the body joints, and then mapping the 2D joint trajectories in visual space to 3D motion in motoric space. Two datasets, consisting of eighteen cooking actions performed by four subjects in a kitchen environment, were collected. Experiments on synthetic and real data showed that our method robustly represents human actions.

Figure 3.10: Synthetic 3D pose reconstruction of the "Right Wrist" (RW) in "Peeling" and "Pouring" on the YACD dataset. Blue curves: the original 3D curves in the WCS. Red curves: the reconstructed results for the 3D rotation angle (45, 45, 45) degrees. The start pose (blue) and the end pose (pink) of the right arm in one instance are also shown on top. The dashed arrow denotes the motion.

Figure 3.11: Robustness test on synthetic data. The 3D trajectories of two joints {RW, LW} were projected to 2D planes using orthographic projection. (a) The reconstruction accuracy for different viewpoints (Azimuth and Elevation). (b) For each pair of Azimuth and Elevation, we changed the testing angle between -10 and 10 degrees. The error is the average over the viewpoints.

Figure 3.12: Computing 2D joint trajectories from videos: (a) Peeling, (b) Pickup, (c) Pressing, (d) Scooping, (e) Transferring, (f) Turning. The trajectories of three instances per action are shown for six different actions.

Figure 3.13: Illustrating the proposed approach using real data. Bottom: the blue curves denote the original 3D curves in the WCS, and the red curves denote the results reconstructed from the 2D trajectories in an unknown (but fixed) camera coordinate system. Top: two instances in the visual domain and the motoric domain. Blue skeleton: the original MoCap data; red skeleton: the reconstructed 3D positions.

Figure 3.14: Results of the pose localization. For each instance, the first frame and the 2D joint trajectory of the "Right Wrist" (RW) in the video are shown. The yellow arrows denote the motion direction. The viewpoint of subject S4 is slightly different from the others.
41 (a) (b) Figure 3.15: (a) Confusion matrix of the baseline [MPK] (Naive Bayes) on our dataset. (b) Accuracy of Naive Bayes (NB) and BayesNet (BN) applied to the 2D trajectories and the 3D reconstruction. 42 Figure 3.16: Confusion matrices of Naive Bayes (NB) and BayesNet (BN) applied to the 2D trajectories and the 3D reconstruction. 43 Chapter 4 Learning Shift-Invariant Sparse Representation of Actions 4.1 Chapter Summary A central problem in the analysis of motion capture (MoCap) data is how to decompose motion sequences into primitives. Ideally, a description in terms of primitives should fa- cilitate the recognition, synthesis, and characterization of actions. We propose an unsu- pervised learning algorithm for automatically decomposing joint movements in human motion capture (MoCap) sequences into shift-invariant basis functions. Our formulation models the time series data of joint movements in actions as a sparse linear combination of short basis functions (snippets), which are executed (or ?activated?) at different positions in time. Given a set of MoCap sequences of different actions, our algorithm finds the de- composition of MoCap sequences in terms of basis functions and their activations in time. Using the tools of L1 minimization, the procedure alternately solves two large convex min- imizations: Given the basis functions, a variant of Orthogonal Matching Pursuit solves for the activations, and given the activations, the Split Bregman Algorithm solves for the basis functions. Experiments demonstrate the power of the decomposition in a number of ap- plications, including action recognition, retrieval, MoCap data compression, and as a tool for classification in the diagnosis of Parkinson (a motion disorder disease). This chapter is based on the paper appeared in the IEEE Conference on Computer Vision and Pattern Recognition 2010. Please refer to [7] in Appendix A for details. 45 4.2 Introduction Interpreting human behavior is a newly emerging area that has attracted increasing at- tention in computer vision. One of the intellectual challenges in modeling human motion is to come up with formalisms for describing and recognizing human actions in motion capture (MoCap) sequences. Fundamentally, the primitives should assist the recognition, synthesis, and characterization of human actions. From this perspective, the formalism of the primitives is essential to action representation. Human actions by their nature are sparse both in action space domain and time do- main. They are sparse in action space, because different actions share similar movements on some joints, and also different joints share similar movements. They are sparse in the time domain, because we do not want much overlap of the individual movements on a sin- gle joint. These observations make the concept of shift-invariant sparse representation as the primitives of human actions very attractive, where shift invariant means that the out- put does not depend explicitly on time, e.g., the same action can have multiple realizations at different times. Let us get into more detail. We are given many MoCap sequences. The data from a motion capture suit are time series of three rotation angles each at a number of joints on the human body. Each of these sequences consists of a number of instances of different actions (where an instance of an action could be a step of a ?running? sequence, or a single ?kick?, or ?jump?). 
Our goal is to obtain from these action sequences a set of basis functions that could be used for approximating the entire set of the actions. Our basis functions are chosen to be smooth functions and about the length of an in- stance of an action (Fig. 4.1). This enables us to achieve a useful underlying representation of different actions. The joint movements in an instance of an action are approximated by a sparse linear combination of basis functions (?Action Unit? in Fig. 4.1). To achieve a meaningful behavioral interpretation the weights are defined to be positive. Multiple instances of the same joint movement are realized by executing (or ?activating?) the linear combination of basis functions at different instances of time, but with different strength (?Activation? in Fig. 4.1). That is, all basis functions involved in the representation of a single joint are activated simultaneously. But different joints are activated separately. Our action representation then is the weights of the basis functions along with their 46 Figure 4.1: Modeling human motion in MoCap sequences using shift-invariant sparse rep- resentation. The short basis functions are sparsely selected and linearly combined to create action units for individual joints. The units may be shifted to different locations where multiple instances of the movement are realized. The time shift is modeled by the convo- lution (denoted by ?). activations. Once these shift-invariant basis functions are learned, we can approximate any novel action sequence using a weighted combination of a number of these basis functions. In our learning procedure, we solve for both the basis functions of the actions and the times when these functions are ?activated?. Solving them together would amount to a complicated non-convex optimization with a large number of variables. However, the op- timization problem is convex in either the basis functions or the activations. Our method thus solves alternately for the two set of parameters. Recently developed L1 minimization techniques allow us to solve these two problems effectively. Given a set of basis functions, a variant of Orthogonal Matching Pursuit [TG07] is used to obtain the activations by solv- ing a non-negative L1 minimization problem with a large number of variables. Given the activations, the Split Bregman Algorithm [GO09] is used to solve an L1 regularized liner least square problem. The characteristics of our decomposition approach are: 1. Our unsupervised algorithm learns a high-level sparse representation (the primi- 47 tives) of action which allows recognizing actions in MoCap sequences effectively. 2. The shift-invariant modeling naturally handles the MoCap sequence composed of multiple instances of different actions. 3. The sparse activations explicitly express the coordination among different joints. The rest of the chapter is organized as follows. Sec. 4.3 presents the algorithm for learn- ing the basis functions. Sec. 4.4 summarizes an algorithm used for normalizing the length of MoCap sequences. Sec. 4.5 demonstrates the usefulness of our action representation on four applications, and Sec. 4.6 concludes the chapter. 4.3 Shift-Invariant Sparse Modeling of Actions A MoCap sequence consists of the time series of three rotation angles each at a number of joints on the human body. In our approach, we approximate each time series by sparse linear combinations of shift-invariant sparse features. For simplicity, we start our journey from the following example. 
A body joint rotation s (a 1D time series) consists of the movements of multiple instances of the same action. Given N short basis functions b_i (i = 1, 2, ..., N), whose length is about that of one instance of an action, we would like to approximate s as follows:

    s \approx \sum_i a \ast w_i b_i    (4.1)

where a is the sparse activation for s and \ast is the convolution operator (Fig. 4.1). The variables a and w_i are non-negative. Eq. 4.1 is equivalent to

    s \approx \sum_i w_i a \ast b_i = \sum_i a_i \ast b_i    (4.2)

where a_i = w_i a. This means we can model the shift of each individual basis function separately, with the additional constraint that all the activations a_i (i = 1, 2, ..., N) must have non-zero values at the same times when used to approximate s. In the following, we first discuss a solution for Eq. 4.2 in Secs. 4.3.1 and 4.3.2. The additional constraint is enforced when we solve for the activations in Sec. 4.3.3.1.

4.3.1 Problem Formulation

Given a set of M 1D signals s_j, j = 1, 2, ..., M, each of length l and representing the time series of a joint movement, we want to approximate every s_j as the convolution between activations and basis functions, i.e.,

    s_j = \sum_i a_i^j \ast b_i + n_j    (4.3)

where a_i^j (j = 1, 2, ..., M; i = 1, 2, ..., N) are the sparse non-negative activations of the i-th basis function in the j-th signal, and n_j is noise. We enforce that the activations are sparse and that the basis functions are sparse in the Fourier domain. The modeling therefore poses the following L1-regularized optimization problem:

    \min_{(a_i^j, b_i)} \sum_j | s_j - \sum_i a_i^j \ast b_i |^2 + \lambda_1 \sum_{i,j} | a_i^j |_1 + \lambda_2 \sum_i | F \hat{b}_i |_1    (4.4)

where | . |_p is the L_p norm of a vector, F is the Fourier transform matrix, and \hat{b}_i are the zero-padded b_i of length l. Solving for the activations and the basis functions together would amount to a non-convex optimization with a large number of variables. In Sec. 4.3.2, we reformulate the problem in the frequency domain.

4.3.2 Formulating the Problem in the Frequency Domain

We show that the optimization problem is convex in either the basis functions or the activations. Therefore, a coordinate descent algorithm is used to alternately solve two large convex L1-regularized problems.

Convolution in the time domain is equivalent to pointwise multiplication in the frequency domain. Therefore, Eq. 4.3 is equivalent to

    S_j \approx \sum_i A_i^j \odot \hat{B}_i    (4.5)

where \odot is the pointwise multiplication operation, A_i^j = F a_i^j, and \hat{B}_i = F \hat{b}_i. Denoting by diag(X) the square matrix whose diagonal is X, we have

    A_i^j \odot \hat{B}_i = diag(A_i^j) \hat{B}_i = diag(\hat{B}_i) A_i^j    (4.6)

Therefore, Eq. 4.5 is equivalent to

    S_j \approx [ diag(\hat{B}_1) ... diag(\hat{B}_N) ] [ A_1^j; ...; A_N^j ]
        = [ diag(\hat{B}_1) ... diag(\hat{B}_N) ] blockdiag(F, ..., F) [ a_1^j; ...; a_N^j ]
        = B a^j    (4.7)

where a^j = [a_1^j, ..., a_N^j]^T and B = [ diag(\hat{B}_1) ... diag(\hat{B}_N) ] blockdiag(F, ..., F). Similarly, Eq. 4.5 can be rewritten as

    S_j \approx [ diag(A_1^j) ... diag(A_N^j) ] [ \hat{B}_1; ...; \hat{B}_N ]
        = [ diag(A_1^j) ... diag(A_N^j) ] blockdiag(F, ..., F) [ \hat{b}_1; ...; \hat{b}_N ]
        = [ diag(A_1^j) ... diag(A_N^j) ] blockdiag(F_l, ..., F_l) [ b_1; ...; b_N ]
        = A^j \mathcal{F} b    (4.8)

where A^j = [ diag(A_1^j) ... diag(A_N^j) ], b = [b_1, ..., b_N]^T, F_l consists of the columns of F that multiply the non-zero (unpadded) part of \hat{b}_i, and \mathcal{F} = blockdiag(F_l, ..., F_l).

Let S, A, a and b denote the concatenations of all S_j, A^j, a^j and b_i in column form. Eq. 4.4 is convex in either a or b, so we solve it alternately as two convex optimization problems. Given the basis functions b in the time domain, from Eq. 4.4 we obtain

    \min_a | S - B a |^2 + \lambda_1 | a |_1    (4.9)
Eq. 4.9 is the sum of M independent subproblems, all of which are convex, so we can solve them separately. Given the activations a, from Eq. 4.4 we obtain

    \min_b | S - A \mathcal{F} b |^2 + \lambda_2 | \mathcal{F} b |_1    (4.10)

Both Eq. 4.9 and Eq. 4.10 are convex. Therefore, the objective function in Eq. 4.4 is non-increasing under the alternating updates. Eq. 4.9 is solved using Orthogonal Matching Pursuit, and Eq. 4.10 is solved using the Split Bregman iterative algorithm. To avoid trivial results, we normalize the basis functions in amplitude in each iteration.

4.3.3 Solving the Problem using L1 Minimization

4.3.3.1 OMP for Solving the Activations

We use a variant of Orthogonal Matching Pursuit (OMP) to solve Eq. 4.9. The variant amounts to running Orthogonal Matching Pursuit in a batch mode. Orthogonal Matching Pursuit is a greedy algorithm: it progressively picks the new basis that minimizes the residual. The major advantages of this algorithm are its ease of implementation and its speed. The approach can easily be extended to related problems, such as finding non-negative bases [DT05b].

In our modeling, a single body joint movement is a sparse combination of the basis functions, with non-negative weights and the additional constraint that the activations must be coherent. This is solved as follows. Given the movement (1D time series) of a joint, we progressively pick the new basis at the locations found in the previous steps, and minimize the residual over all the time series. We enforce the solution to be positive by checking which basis to choose and by checking the weights found in the least-squares minimization. During the optimization, a sparse subset of basis functions is automatically selected. In our implementation, we allow a maximum of 4 basis functions per activation at one joint, with the total number of basis functions being 15. This makes it easier to compare the weights of the same joint in action retrieval and classification.

4.3.3.2 Split Bregman Algorithm for Solving the Bases

As defined in the literature, the Split Bregman iterative algorithm is applied to the following problem:

    \min_u J(u) + H(u)    (4.11)

with u \in R^n, where J(u) = | \Phi(u) |_1 is the L1 norm of a function of u and is continuous but not differentiable, and H(u) is the L2 norm of a function of u and is continuously differentiable. In our case, J(u) = | \mathcal{F} b |_1 and H(u) is the L2 norm of the approximation error. By introducing the auxiliary variable d \approx \Phi(u) and E(u, d) = | d |_1 + H(u), we rewrite Eq. 4.11 as

    \min_{(u,d)} E(u, d) + (\mu / 2) | d - \Phi(u) |^2    (4.12)

The solution is given by iteratively updating the following three equations:

    u^{k+1} <- \arg\min_u H(u) + (\mu / 2) | d^k - \Phi(u) - p^k |^2    (4.13)
    d^{k+1} <- \arg\min_d | d |_1 + (\mu / 2) | d - \Phi(u^{k+1}) - p^k |^2    (4.14)
    p^{k+1} <- p^k + \Phi(u^{k+1}) - d^{k+1}

This "splits" Eq. 4.12 into subproblems. Eq. 4.13 is a smooth (twice continuously differentiable) problem that can be solved efficiently. Eq. 4.14 is solved by the shrinkage operation, shrinkage(x, \gamma) = sgn(x) max(|x| - \gamma, 0). By alternating these updates in the Split Bregman algorithm, we obtain the optimal sparse solution for Eq. 4.10.
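Two building blocks of this decomposition can be sketched compactly: the generative model of Eqs. 4.2-4.3 (sparse activations convolved with short basis functions) and the shrinkage operator used in the d-update of Eq. 4.14. This is only a minimal numpy illustration, not the full OMP/Split Bregman solver; the toy bases and activation times are assumptions.

```python
import numpy as np

def synthesize(activations, bases, length):
    """Eq. 4.3: a joint-angle time series as a sum of sparse activations
    convolved with short basis functions (noise omitted here)."""
    s = np.zeros(length)
    for a, b in zip(activations, bases):
        s += np.convolve(a, b, mode="full")[:length]
    return s

def shrinkage(x, gamma):
    """Soft-thresholding operator for the d-update in Eq. 4.14:
    shrinkage(x, gamma) = sgn(x) * max(|x| - gamma, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

# Tiny example: two smooth bases, each "activated" twice with positive weights,
# mimicking two repetitions of the same joint movement.
L, q = 400, 100                          # signal length, basis length
t = np.linspace(0, np.pi, q)
bases = [np.sin(t), np.sin(2 * t) * np.exp(-3 * t / np.pi)]
activations = [np.zeros(L), np.zeros(L)]
activations[0][[40, 250]] = [1.0, 0.6]   # same unit executed at two times
activations[1][[40, 250]] = [0.5, 0.8]
s = synthesize(activations, bases, L)

print("synthesized signal length:", len(s))
print("shrinkage keeps", np.count_nonzero(shrinkage(np.array([0.3, -0.05, 1.2]), 0.1)),
      "of 3 coefficients above the threshold")
```

Within the real solver, OMP would recover the activation positions and weights, and the Split Bregman iteration would re-estimate the bases.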
4.4 Preprocessing: Normalization for Handling Actions with Various Speeds

It is important to handle action sequences of different speeds. For this we use our action segmentation algorithm (see Chapter 6 for a detailed analysis of the temporal segmentation of human actions). This algorithm breaks an action sequence into action segments. We then compute the average length of the action segments and use it to normalize the sequence.

Our goal is to find the discontinuities in the third-order derivative of the time series. The motivation for this approach comes from the work of d'Avella et al. [ddSB03], who found that the change of the muscle force indicates the time of action change, and that the change of muscle force is proportional to the third-order derivative of the time series. Our algorithm partitions a MoCap sequence by minimizing the sum of the pairwise distances of the envelope extrema of the different joints.

In this algorithm, we use the quaternion representation for rotation. The quaternion series of a joint is a 4D vector

    X(t) = [x_1(t), x_2(t), x_3(t), x_4(t)]^T    (4.15)

The jerk of X(t) is computed as

    J(t) = | d^3 X(t) / dt^3 |_2    (4.16)

To minimize the error in computing the derivative, we smooth the data using a low-pass filter. To measure the jerk better, we compute the jerk envelope

    Env(t) = | Hilbert( J(t) ) |    (4.17)

for every joint, where Hilbert(.) is the Hilbert transform. This is a standard approach for computing the signal envelope [Bra99]. We then filter Env(t) with a low-pass Butterworth filter and compute the extrema of the filtered envelope for every joint.

Figure 4.2: Estimating the average action speed by measuring the action discontinuities. A "walking to running" sequence from the University of Bonn dataset [MRC+07] is shown. The poses corresponding to the discontinuities are displayed as mannequins. The trajectories of the head, the left elbow, and the right ankle are drawn in red.

Fig. 4.2 gives an example of the speed-varying action "walking to running" from the University of Bonn dataset [MRC+07]. Here, we show the jerk envelopes of the joints "Hip", "Knee", and "Ankle" of both legs. The envelopes are color coded and normalized in magnitude. The breaking poses (mannequins) are selected by finding the optimal "alignment" of the envelope extrema of the different joints (the purple dashed lines in Fig. 4.2). The alignment can be solved as an optimization problem over the envelope extrema of all joints. It is applied until the average pairwise distance is larger than a threshold.
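A minimal sketch of this preprocessing step, assuming scipy is available: a third-order difference as the jerk (Eq. 4.16), the Hilbert envelope (Eq. 4.17), a low-pass Butterworth filter, and the envelope extrema as candidate action discontinuities. The filter order and cutoff frequency are illustrative choices, not the values used in the thesis.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, find_peaks

def jerk_envelope(quat, fs=100.0, cutoff=5.0):
    """One joint's preprocessing: jerk magnitude of the 4D quaternion series
    (Eq. 4.16), its Hilbert envelope (Eq. 4.17), a low-pass Butterworth filter,
    and the indices of the envelope's local maxima."""
    jerk = np.linalg.norm(np.diff(quat, n=3, axis=0), axis=1)  # 3rd-order difference
    env = np.abs(hilbert(jerk))                                 # signal envelope
    b, a = butter(2, cutoff / (fs / 2))                         # 2nd-order low-pass
    env = filtfilt(b, a, env)
    peaks, _ = find_peaks(env)
    return env, peaks

# Toy quaternion-like series: a slow rotation with two bursts of fast motion.
fs = 100.0
t = np.arange(0, 6, 1 / fs)
angle = (0.2 * t + 0.5 * np.exp(-((t - 1.5) ** 2) / 0.05)
               + 0.5 * np.exp(-((t - 4.0) ** 2) / 0.05))
quat = np.stack([np.cos(angle / 2), np.sin(angle / 2),
                 np.zeros_like(t), np.zeros_like(t)], axis=1)

env, peaks = jerk_envelope(quat, fs)
print("candidate action discontinuities (s):", peaks / fs)
```

In the full algorithm, the extrema of all joints would then be aligned jointly to select the breaking poses.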
4.5 Experiments

The following four experiments demonstrate the usefulness of our representation. First, we present the basis functions learned from our own dataset (Sec. 4.5.2). Second, we show that our basis functions are well suited for approximating novel actions, which allows us to substantially compress novel MoCap data (Sec. 4.5.3). Third, the experiments in Sec. 4.5.4 demonstrate that action retrieval and classification can be solved effectively using only the magnitudes of the activations. Finally, we show that the activations and the fitting error alone are very useful for diagnosing a motion-related disease, demonstrating the intuitive nature of the description (Sec. 4.5.5).

We used three datasets: our own, the University of Bonn dataset, and the CMU MoCap dataset [Lab]. Our own dataset consists of 55 different actions, captured with the MOVEN motion capture suit [MOV] at 100 fps. Each action sequence consists of at least 6 repetitions of the same action. Fig. 4.3 shows some of the actions in our dataset.

The convergence speed primarily depends on the OMP and Split Bregman algorithms. OMP is a greedy algorithm that takes linear time, and the Bregman iteration has proved to be very efficient for many problems that are difficult to solve by other means [GO09]. Thus, our algorithm is very efficient: it takes only 10-15 iterations to converge (about 5 minutes in MATLAB for our dataset).

4.5.1 Parameters

The parameters \lambda_1 and \lambda_2 in Eq. 4.4 determine the balance between the fidelity of the objective function and the sparsity of the variables in the optimization. In all the experiments, they were both set to 1/2. The parameter \mu in the Split Bregman algorithm (Eq. 4.12) penalizes the auxiliary variable; it was set to 1000 in our experiments. The length of the basis functions was set to the frame rate of our MoCap suit (100). In our experiments, normalization speeds up the convergence of the algorithm; therefore, the MoCap sequences were also approximately normalized to 100 samples per action instance. The initialization of Eq. 4.4 is randomly generated: this optimization is a large non-convex problem, and a common practice is to start from a random guess.

4.5.2 Learning the Basis Functions

Fig. 4.3 visualizes 17 out of the 55 actions in our dataset. First, we applied our normalization algorithm. By visual inspection, the discontinuities in the action sequences estimated by this algorithm correspond to the intuitive poses separating actions.

After normalization, the action decomposition algorithm is applied to the action sequences. Fig. 4.5a shows the fifteen basis functions learned by the algorithm; each column is a color-coded basis function. In our modeling, individual joints are described by four basis functions, drawn from the above set of fifteen learned basis functions. Fig. 4.5b shows the basis functions used by the individual joints. We can see that different joints may share the same basis functions. Note that it is the combination of the basis functions that composes a joint movement; different joints may use different combinations for an action, and it is therefore clearer to plot the functions individually.

Despite these very small numbers, the approximation is very good. The first column of Table 4.1 shows that the residual error of the approximation was very small. The residual was measured by the total fitting error divided by the number of frames and the number of joints in the dataset. On average, our representation approximates the training sequences with an error of only 2.36 degrees per joint per frame. The result shows that the primitives are an effective and compact representation of the actions in the datasets.

Table 4.1: Average fitting error (degrees per joint per frame) for different MoCap sequences using the basis functions learned in Sec. 4.5.2.

Sequence   Training   Walking (Bonn)   Running (Bonn)
Error      2.36       3.18             3.56

4.5.3 Motion Approximation and Compression

We use the basis functions to approximate novel actions, which further leads to effective compression of MoCap data. In this experiment, the novel sequences were first normalized, and Eq. 4.9 was then used to compute the activations and approximate the sequences. An averaging filter is used to handle possible discontinuities between actions.

A useful representation of action should have the generalization capability to express unseen actions. First, we used the basis functions learned from our dataset to approximate two sets of sequences in the Bonn dataset, which were captured with an optical motion capture system on different subjects. We then measured the fitting error. As shown in the 2nd and 3rd columns of Table 4.1, the errors are very small.

Comparing lossy compression results objectively is very difficult. As pointed out by [Ari06], the fitting error may not be a good predictor of visual quality. Therefore, subjective judgments were used. Fig. 4.4b visualizes the approximation using two "salsa" dances from the CMU dataset.
We can see that the poses of the reconstructed movement (Fig. 4.4b) approximate those in the original sequences (Fig. 4.4a) very well. This side-by-side comparison shows that the shift-invariant decomposition effectively handles complicated novel actions.

Another advantage of our decomposition approach is that it leads to a high compression rate. Effectively compressing MoCap sequences composed of arbitrary actions is very useful both for storage and for visualization, but it is also a challenging problem. In our approach, a joint movement that has 100 data samples can be described by only four coefficients. Thus, we achieve approximately a 25:1 compression rate by default. (For complicated actions, we may use more activations to approximate the time series, based on the normalization ratio. In addition, a small amount of overhead is required, e.g., storing the scaling factor and the basis functions.) Table 4.2 compares the compression rates of several algorithms on the CMU dataset.

Table 4.2: Comparison of different MoCap data compression algorithms. (*): the compression rate without quantizing the weights. (**): the compression rate with weight quantization. The ratios of the other algorithms are copied from [Ari06].

Algorithm   Ours                  Arikan   Wavelet   Zip
Ratio       19:1 (*), 37:1 (**)   30:1     6:1       1.4:1

We achieved a competitive compression rate compared to the state-of-the-art algorithm. More importantly, the primitive-based compression is fundamentally invariant to the frame rate. A major difference between our approach and previous approaches is that we explicitly model the human actions. Therefore, a change in frame rate only changes the number of samples in the basis functions, not the activation positions in time.

4.5.4 Action Retrieval and Classification

In the following experiments, we demonstrate the usefulness of our description for action classification and retrieval. First, our preprocessing algorithm breaks the action sequences into action segments; each segment is treated as one complete action. Then, we decompose every single action using Eq. 4.9, allowing for only one activation. Finally, action retrieval and classification can be solved effectively using only the magnitudes of the activations as the weights.

We considered it more helpful to convey the intuition of the usefulness of the representation using a simple Euclidean distance and a nearest neighbor classifier. This demonstrates how much the representation itself contributes to retrieval and classification, without tuning the parameters of a sophisticated classifier. Therefore, we chose to compare our representation to other decomposition methods.

4.5.4.1 Segment-based Action Retrieval

We compared our algorithm with the Sparse Principal Component Analysis algorithm [DGJL07] and the Principal Component Analysis algorithm. For both algorithms the segments are normalized to have the same length.

Retrieval is evaluated on our dataset using the so-called Bullseye test [LJ07]. Six segments per action sequence were selected (for action sequences with a larger number of segments, we randomly selected six). A leave-one-out trial was performed for every segment. The retrieval rate is the ratio of correct hits among the top 12 candidates over all trials. The performance of the three algorithms is shown in Table 4.3. Our algorithm achieved the highest accuracy (86.07%) in the Bullseye test. This indicates that the repeated action segments in an action sequence have similar representations. The result demonstrates that our decomposition algorithm has the power of finding similar movements.
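A sketch of the Bullseye protocol just described, run on stand-in activation-weight vectors: leave-one-out Euclidean retrieval with hits counted among the top 12 candidates. The exact normalization of the score is our assumption; Table 4.3 below gives the actual results.

```python
import numpy as np

def bullseye(features, labels, top=12):
    """Leave-one-out Bullseye retrieval: for every segment, rank all other
    segments by Euclidean distance on the activation weights and count how
    many of the same class appear among the `top` nearest candidates."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # exclude the query itself
    hits, possible = 0, 0
    for i in range(len(features)):
        nearest = labels[np.argsort(d[i])[:top]]
        same = (labels == labels[i]).sum() - 1     # other segments of this class
        hits += (nearest == labels[i]).sum()
        possible += min(same, top)
    return hits / possible

# Stand-in data: 10 action classes, 6 segments each, 15-D activation weights.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(10), 6)
features = rng.standard_normal((10, 15))[labels] + 0.3 * rng.standard_normal((60, 15))
print(f"Bullseye score: {bullseye(features, labels):.2%}")
```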
Table 4.3: Performance comparison (Bullseye) of action retrieval on the segments of our dataset. Three algorithms, namely Sparse PCA, PCA and our algorithm, were used in the comparison. The segments were normalized for Sparse PCA and PCA. Algorithm Ours Sparse PCA PCA Accuracy 86.07% 82.64% 78.87% 4.5.4.2 Segment-based Action Classification We classify actions performed by different subjects. Four actions (?walking?, ?marching?, ?running?, and ?salsa?) from the CMU dataset, were used in the experiment. We compared our algorithm with the Sparse Principal Component Analysis algorithm and the Principal Component Analysis algorithm. To demonstrate the usefulness of the weights, we chose a very simple k-nearest neighborhood (kNN, k = 3) classifier. For each partitioning algorithm, we randomly selected 50% of the estimated segments in each action category as the training samples, and used the remaining as the test samples. Figs. 4.6a-c show the confusion matrices of the classification using the coefficients obtained by our algorithm, the Sparse PCA and the PCA algorithm. 4For action sequences which had a larger number of segments, we randomly selected 6 segments 58 Results show that our representation gives the best classification performance. This demonstrate that our shift-invariant representation models the nature of the human action, and the sparse linear decomposition facilitates the performance of classification. 4.5.5 Motion Disorder Diagnosis The activations are a natural measurement for describing body coordination. If the activa- tions for different joints are not well aligned, the subject might have a problem in control- ling her/his motions. Another measure is the approximation error. We demonstrate the applicability of the basis functions in modeling the Parkinson mo- tion disorder, which is characterized by degenerative muscle movements. The primary symptoms are the results of decreased coordination caused by insufficient control. This problem is highly difficult because the correct modeling for the coordination among dif- ferent body parts is challenging. In this experiment, we captured the MoCap data for four patients diagnosed with the PD disease and four healthy controls (Table. 4.4). Fig. 4.7 shows the scenario when the experiments were performed. Subjects were asked to perform a number of actions repeat- edly. For this application, we learned the basis functions and the activations for each subject individually. Three common actions, ?Finger To Nose?, ?Catching a Tennis Ball?, and ?Bread Cutting?, were recorded. Figs. 4.8a-c show the plot of the activation alignment score and the average approximation error. The alignment score between two sequences is the zero-mean standard deviation of the differences between corresponding elements. The activation alignment score is defined as the largest value of the pairwise alignment scores. Fig. 4.8d shows the chart for classifying the patients. As can be seen the two measure- ments are sufficient to separate controls from patients. Referring to 4.8a-c, the data points are well separated. The diagnosis shows that the activations in our decomposition approach characterize the underlying rhythm in the parts of the bodies. This suggests that our approach is well suited for further understanding the principles of coordinated actions. 59 Table 4.4: Parkinson disease patients? age information. The disease level is measured by the Hoehn and Yahr scale which ranges from 1-5 (shown in parentheses). 
Controls | 4 healthy subjects
Patients | 63 (2.5), 63 (2.5), 60 (2.5), 60 (3)

4.6 Chapter Conclusion

This chapter presented an algorithm for finding basic primitives to represent human motion data. Body movements in MoCap sequences are decomposed into shift-invariant basis functions and their activations. The decomposition is solved by alternately updating two large convex problems using L1 minimization techniques. Experiments show that the compact representation is effective for motion approximation, MoCap data compression, action retrieval, and classification, with an application to disease diagnosis.

Figure 4.3: 17 of the 55 actions in our dataset. The rendering is as follows: poses corresponding to the discontinuities are displayed as mannequins; the transitions in between are illustrated by wire-frames; the trajectories of some joints are drawn in red.

Figure 4.4: Side-by-side comparison between original motion frames (a) and reconstructed motion (b) using the estimated basis functions, demonstrated on two "salsa" sequences from the CMU dataset [Lab].

Figure 4.5: The basis functions learned by the algorithm (a) and their usage for individual joints (b) (denoted by the blue diamonds).

Figure 4.6: Action classification. Four actions, "walking", "marching", "running", and "salsa", from the CMU dataset were used in the experiment. A very simple k-nearest neighbor (kNN, k = 3) classifier was chosen. (a)-(c) show the confusion matrices of the classifier using the weights of the proposed algorithm, the Sparse PCA algorithm, and the PCA algorithm, respectively. For each algorithm, 50% of the estimated segments in each action category were randomly selected as the training samples and the remaining were used as the test samples.

Figure 4.7: Collecting Parkinson Disease data. Courtesy of Leonardo Max Batista Claudino.

Figure 4.8: Parkinson disease diagnosis by measuring the alignment and the average approximation error. From left to right: the results for "Finger To Nose", "Catching a Tennis Ball", and "Bread Cutting", and the diagnosis chart, respectively. In (a)-(c), blue triangles denote the healthy controls, and red squares denote the patients. (d) suggests a chart for diagnosis. P: patient. H: healthy control.

Chapter 5

An Application of Granger Causality to Coordinated Action Analysis in an Orchestra

5.1 Chapter Summary

Causal knowledge is a concept well known in artificial intelligence and developmental learning. In many commonsense reasoning tasks, it is studied in a setting where the causes and the effects are discrete events. In this chapter we use the concept of causality to evaluate the coordination of continuous human actions. "Coordination of actions" refers to the degree of synchrony or complementarity between actions performed by different individuals. We propose to use Granger Causality for measuring causal interactions between human actions. In our approach, actions are modeled as autoregressive processes, and Granger Causality is framed in terms of predictability. The underlying idea is that if one action causes the other, then knowledge (the history) of the first action should help predict future values of the latter. We carried out experiments using the trajectories of the instruments of eight orchestra musicians and the batons of two conductors. Our data demonstrate the existence of a dynamical network of causal interactions among players and conductors.
The modulation of the network is related to the degree of direction that the conductor is able to convey to the musicians. The study suggests that the concept of causality allows us to quantitatively study the coordination of actions, and that the same idea may be applied to other human action scenarios.

This chapter is based on a paper accepted by the IEEE Conference on Developmental Learning 2010. I did not submit the final version of the paper because the co-authors would like to revise it and submit it to a prestigious neuroscience journal.

5.2 Introduction

Causality, defined as "the relation between a cause and its effect, or between regularly correlated events or phenomena" by the Merriam-Webster Dictionary, is one of the fundamental mechanisms in many commonsense reasoning tasks. The analysis of causality is also very helpful for modeling the functionalities of cognitive systems, especially with regard to efficient estimation of appropriate responses to causes and online modification of an agent's behavior.

The common conception is that causes always precede effects. Typically, causes and effects are regarded as discrete events. Therefore, most of the previous work in artificial intelligence and developmental learning was concerned with the causal links between discrete events. However, causality also exhibits itself in continuous signals. In this chapter, we focus on causality in coordinated human actions.

Coordination is one of the prerequisites for social interaction among intelligent agents. In fact, all animal species grouping for defensive, reproductive, or hunting needs have evolved complex communicative behaviors to obtain coordinated actions [Fri08]. Recent studies further suggested that the "action mirroring circuit" might be tuned to action coordination rather than single action perception [FCO05].

The causality network plays an important role in supporting the coordination of actions [SHK09]. Coordination, at the individual level, can be modeled conceptually as a computation transforming visual information into motor control parameters. In such a context, the movement kinematics of one individual must have a statistical causal relation with the kinematics generated by another individual. Thus, understanding this underlying causality network will facilitate research on many developmental learning problems, such as cognitive robotics, multiagent systems, and human-computer interaction.

Despite the progress in the study of causality, quantitative measurements of a dynamical causality network for coordinated signals remain elusive. The reason is that the formal analysis of causality networks requires an effective definition of causality between pairwise signals. For example, if we merely record the kinematics of two agents behaving randomly, we would always find some correlated parameters (e.g., velocity). However, correlation does not necessarily imply causation. Therefore, a tool for formally studying the causality network of coordinated signals is necessary.

Originating in econometrics and now widely used in neuroscience, Granger Causality [Gra69] is a statistical concept that provides a computationally feasible way to quantitatively describe pairwise causal influence. Granger's proposal is that if a time series y causes (or has an influence on) x, then knowledge of y should help predict future values of x. Thus, causality is framed in terms of predictability.
In our study, we examined the causality network of the coordinated musicians in a chamber orchestra using the kinematic data of their instruments. An orchestra is an interesting case because movement coordination is the main concern of the musicians, and successful communication is a pivotal skill for which they train for years. A chamber orchestra is divided into multiple sections (e.g., "Violin I" and "Violin II" in our study). Theoretically, players in the same orchestra section should have the same motion. However, each player might play slightly differently according to his/her style and understanding.

Our aim is to verify whether complex coordinated behaviors may be influenced by the causal interaction expressed by individuals. We recorded the trajectories of the endpoints of the instruments used by eight violin players and two conductors (Fig. 5.1a). The players played fifteen pieces with a professional and an amateur conductor. In this experiment, our goal was to find out whether the expertise of the conductor is related to his/her influence on the musicians' performance, and whether this in turn has an influence on the communication among the musicians.

Our results demonstrate the existence of a dynamical network of causal interactions among the players and the conductors. The strength of causal influence is related to the degree of influence that the conductor is able to express to the musicians. Thus, our study suggests that we may quantitatively express the causality of coordinated actions.

This chapter is organized as follows. First, we present the concept of Granger Causality as a statistical inference of the causal interaction between pairwise coordinated signals. Then, we show experiments and discuss the causality network. Finally, we review the literature related to causality, coordination, and human motion modeling, and conclude the chapter.

Figure 5.1: Illustrating the recording scenario and the kinematic data. (a) The subjects in our experiments. Eight string players (grouped into "Violin I" and "Violin II") were conducted by two conductors, respectively. (b) The locations of the endpoints of the music instruments. (c) The kinematic data of a conductor and a player in the first three seconds of the first movement of Mozart's Symphony No. 40. The time series in red, green, and blue represent the trajectories of the markers in the X, Y, and Z dimensions, respectively.

5.3 A Brief Introduction to Granger Causality

The Granger Causality Test is a technique for determining whether one time series is useful in forecasting another. First, we discuss how to model time series using Autoregressive (AR) processes. Then, we describe the idea of Granger Causality, which uses linear predictors.

5.3.1 Preliminary: Forecasting Time Series using Autoregressive Processes

Granger Causality is based on the residual in forecasting time series using linear predictors. An autoregressive model or process is a stochastic process. An AR(k) process, where k is the order of the model, has the following form:

y(t) = \sum_{i=1}^{k} a(i)\, y(t-i) + e(t),   (5.1)

where a(i), i = 1, ..., k, are the coefficients for the lagged observations, and e is the estimation residual.

Two criteria are widely used for selecting the optimal order of the linear predictor in Eq. 5.1: Akaike's Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC) [BJ90].
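As an illustration of Eq. 5.1 and of BIC-based order selection, the following is a minimal sketch that fits an AR(k) model by least squares and picks k by BIC. It assumes a 1D, zero-mean signal; the helper names are mine, and a library such as statsmodels provides equivalent functionality.

```python
import numpy as np

def fit_ar(y, k):
    """Least-squares fit of y(t) = sum_i a(i) y(t-i) + e(t) (Eq. 5.1)."""
    Y = y[k:]                                                  # targets
    X = np.column_stack([y[k - i:len(y) - i] for i in range(1, k + 1)])
    a, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return a, Y - X @ a                                        # coefficients, residuals

def bic(resid, k, n):
    """Schwarz's Bayesian Information Criterion for a Gaussian AR(k) fit."""
    return n * np.log(np.mean(resid ** 2)) + k * np.log(n)

def select_order(y, max_k=15):
    n = len(y)
    scores = {k: bic(fit_ar(y, k)[1], k, n - k) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get)

# Toy usage: BIC typically recovers the order of a synthetic AR(2) process.
rng = np.random.default_rng(0)
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
print("selected order:", select_order(y))
```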
Lütkepohl [Lut07] compared these and other order selection criteria in a simulation study and found that BIC chose the correct model order most often and led, on average, to the smallest mean-squared prediction error among the linear models.

In many real-world tasks, the time series is multi-dimensional. Human motion, for instance, is characterized by a number of continuous signals, such as the 3D trajectory of an instrument's endpoint in our study. We could certainly treat each dimension as a 1D time series and use Eq. 5.1 to model the data. However, these 1D time series could be highly correlated (e.g., they might accelerate at the same time). Therefore, it is better to model them jointly. The Vector Autoregressive (VAR) model [NS01] is a natural extension of the traditional AR model. This predictor models a vector series as follows:

Y(t) = \sum_{i=1}^{k} A(i)\, Y(t-i) + E(t),   (5.2)

where Y(t) is the d-by-1 observation vector (d is the number of dimensions), E is the d-by-1 residual vector, and A(i), i = 1, ..., k, are the d-by-d coefficient matrices, which can be learned using maximum likelihood estimation or a Bayesian prior (Bayesian Vector Autoregressive Process) [LP09].

5.3.2 Granger Causality Test

Given two time series X_1 and X_2, we may either model them separately using two individual AR processes (denoted as L1), or model them jointly in one AR process (denoted as L2).

If X_1 and X_2 are modeled separately, we have:

X_1(t) = \sum_{i=1}^{k} A_1(i)\, X_1(t-i) + E_1(t),   (5.3)
X_2(t) = \sum_{i=1}^{k} A_2(i)\, X_2(t-i) + E_2(t),   (5.4)

where A_1 and A_2 are the coefficients for the lagged observations, and E_1 and E_2 are the residuals.

Alternatively, we can use the more complicated model L2 to model the two signals jointly:

X_1(t) = \sum_{i=1}^{k} B_{11}(i)\, X_1(t-i) + \sum_{i=1}^{k} B_{12}(i)\, X_2(t-i) + \tilde{E}_1(t),   (5.5)
X_2(t) = \sum_{i=1}^{k} B_{21}(i)\, X_1(t-i) + \sum_{i=1}^{k} B_{22}(i)\, X_2(t-i) + \tilde{E}_2(t),   (5.6)

where B_{11}, B_{12}, B_{21}, and B_{22} are the coefficients for the lagged observations, and \tilde{E}_1 and \tilde{E}_2 are the residuals.

The Granger Causality Test focuses on the improvement of the residual. Denote

e_i = \sum_t E_i(t)^2,   (5.7)
\tilde{e}_i = \sum_t \tilde{E}_i(t)^2,   (5.8)

for i = 1, 2. Since L2 involves more parameters than L1 and has more degrees of freedom, L2 necessarily improves the prediction performance (\tilde{e}_i \le e_i). However, the improvement could be due to the knowledge (the history values) of the other signal, or it could simply be due to the increased degrees of freedom of the model. Therefore, an F-test is used to validate this hypothesis. The F-test is a classical approach for comparing statistical models fit to data, and for identifying the model that best fits the population from which the data were sampled. Denote

F_{2 \to 1} = \frac{(e_1 - \tilde{e}_1)/(p_2 - p_1)}{\tilde{e}_1/(n - p_2)} = \frac{e_1 - \tilde{e}_1}{\tilde{e}_1} \cdot \frac{n - p_2}{p_2 - p_1},   (5.9)

and

F_{1 \to 2} = \frac{(e_2 - \tilde{e}_2)/(p_2 - p_1)}{\tilde{e}_2/(n - p_2)} = \frac{e_2 - \tilde{e}_2}{\tilde{e}_2} \cdot \frac{n - p_2}{p_2 - p_1},   (5.10)

where p_1 and p_2 are the degrees of freedom in L1 and L2, respectively, and n is the total number of observations in the input signal. In our study, p_1 = 2kd^2 and p_2 = 4kd^2. Under the null hypothesis, F_{2 \to 1} and F_{1 \to 2} follow an F-distribution with (p_2 - p_1, n - p_2) degrees of freedom. F_{2 \to 1} (respectively, F_{1 \to 2}) measures the improvement in predicting X_1 (X_2) obtained from the history of X_2 (X_1). There is no simple arithmetic relationship between F_{2 \to 1} and F_{1 \to 2}. We consider F_{2 \to 1} the "causal influence" that X_2 passes to X_1, and similarly F_{1 \to 2} the causal influence that X_1 passes to X_2.
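The sketch below illustrates the test of Eqs. 5.3-5.10 for the simplest case of two 1D signals with a given lag order k; it is not the code used in this study. The per-equation degrees of freedom used here (k and 2k) correspond to d = 1, and the function and variable names are mine.

```python
import numpy as np
from scipy.stats import f as f_dist

def lagged(x, k):
    """Matrix of k lagged copies of x, aligned with x[k:]."""
    return np.column_stack([x[k - i:len(x) - i] for i in range(1, k + 1)])

def rss(y, X):
    """Residual sum of squares of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def granger_f(x1, x2, k):
    """F-statistics F_{2->1} and F_{1->2} for two 1D signals (Eqs. 5.9-5.10)."""
    n = len(x1) - k
    e1, e2 = rss(x1[k:], lagged(x1, k)), rss(x2[k:], lagged(x2, k))   # model L1
    X_joint = np.hstack([lagged(x1, k), lagged(x2, k)])
    e1t, e2t = rss(x1[k:], X_joint), rss(x2[k:], X_joint)             # model L2
    p1, p2 = k, 2 * k                                                 # per-equation dof (d = 1)
    F21 = (e1 - e1t) / e1t * (n - p2) / (p2 - p1)
    F12 = (e2 - e2t) / e2t * (n - p2) / (p2 - p1)
    pval = lambda F: 1.0 - f_dist.cdf(F, p2 - p1, n - p2)
    return (F21, pval(F21)), (F12, pval(F12))

# Toy usage: x1 drives x2 with a one-step delay, so F_{1->2} should dominate.
rng = np.random.default_rng(0)
x1 = rng.normal(size=3000)
x2 = 0.8 * np.roll(x1, 1) + 0.2 * rng.normal(size=3000)
(F21, _), (F12, _) = granger_f(x1, x2, k=2)
print(f"F_2->1 = {F21:.2f},  F_1->2 = {F12:.2f}")
```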
The Granger Causality Test claims that X_1 "Granger causes" X_2 if and only if

F_{1 \to 2} > F_{2 \to 1}.   (5.11)

For the orchestra performance considered here, we may interpret this result as the causal influence from X_1 to X_2 being larger than the one from X_2 to X_1, such that X_1 "drives" X_2.

5.3.3 Statistical Analysis of Granger Causality

In many real-world scenarios, signals are non-linear, and linear models are regarded as local approximations of the non-linear signals. Thus, the time series is divided into multiple windows, and linear analysis is carried out in each window.

To analyze the overall causal influence between two non-linear time series, we compute the Granger Causality for each window, and use kernel density estimation [Par62] to summarize the causal influences between the signals. Kernel density estimation is a non-parametric method for estimating the probability density function of a random variable. Isotropic kernels are frequently used if no domain-specific prior is given [DHS00]. This provides a more robust measurement of the expected value of the variable in many studies.

5.4 Experiments

In this section, we present our experiments and suggest how Granger Causality can be used for studying coordinated human actions. We demonstrate a causality network that is modulated by the influences among coordinated actions of the members of an orchestra. This network can be conceived as a conversation between several individuals.

First, we present the data acquisition and preprocessing procedures. Second, we show the results of the experiments. Specifically, we show that the professional conductor has a greater influence than the amateur on average, and influences the orchestra more frequently. Furthermore, we show that the musicians in the same orchestra section tend to rely less on the influence of peers when the influence from the conductor is greater. Finally, we compare our results to non-causal data obtained by scrambling the causal data, and demonstrate the significance of the above results.

Figure 5.2: The percentage of the time when the conductor leads the players. This figure shows that the professional conductor consistently leads the players more frequently than the amateur conductor.

5.4.1 Data Acquisition and Preprocessing

5.4.1.1 Data Acquisition

We recorded the absolute positions of the endpoints of the music instruments. Data acquisition was performed using a Qualisys system consisting of three cameras recording the positions of the markers on the players' bows and the conductors' batons at 240 Hz. Fig. 5.1b illustrates the positions of the instrument endpoints in the environment, and Fig. 5.1c shows two fragments of the recordings.

Figure 5.3: The error bars of F_{c \to i}, which denotes the strength of the causal influence from the conductor to the players. This figure shows that the professional conductor's influence is consistently stronger than the amateur's.

5.4.1.2 Data Preprocessing

We used the spline method [Deb78] to handle missing data in the 3D trajectories and low-pass filtered the signal for artifact removal. The data was then divided into overlapping windows. Two consecutive windows overlap by 1/3. The window size is set to 2 seconds to incorporate a reasonable amount of information.

We select the order of the AR processes as follows. First, we computed the AR model order in each window using the BIC criterion. Then, we used kernel density estimation to compute the expected value of the order, and set k = 8.
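A rough sketch of this windowed analysis is given below: it slides 2-second windows with 1/3 overlap over two signals, computes the pairwise F-statistics in each window (reusing the granger_f helper from the previous sketch), and summarizes them by the mode of a Gaussian kernel density estimate. The exact window bookkeeping and the use of the KDE mode as the "expected value" are my assumptions, not details stated in the text.

```python
import numpy as np
from scipy.stats import gaussian_kde

def sliding_windows(x, win, step):
    """Yield overlapping windows; win and step are given in samples."""
    for start in range(0, len(x) - win + 1, step):
        yield x[start:start + win]

def summarize_influence(x1, x2, k=8, fs=240, win_sec=2.0, overlap=1/3):
    """Per-window Granger F-statistics, summarized by the mode of an
    isotropic Gaussian kernel density estimate (a robust expected value)."""
    win = int(win_sec * fs)
    step = int(win * (1 - overlap))
    F21, F12 = [], []
    for w1, w2 in zip(sliding_windows(x1, win, step), sliding_windows(x2, win, step)):
        (f21, _), (f12, _) = granger_f(w1, w2, k)   # granger_f from the previous sketch
        F21.append(f21)
        F12.append(f12)
    def kde_mode(values):
        kde = gaussian_kde(values)
        grid = np.linspace(min(values), max(values), 512)
        return float(grid[np.argmax(kde(grid))])
    return kde_mode(F21), kde_mode(F12)
```

The same per-window F-values also feed the leading-time and in-group influence statistics used in the experiments below.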
The AR model assumes that the input time series is stationary, which means the eigenvalues of the model must lie inside the unit circle [Hay]. If this does not hold, one common practice is to differentiate the signal, because the derivative may provide better results in practice. In our experiment, we found that the second-order derivative of the 3D motion series satisfies this assumption. Essentially, we used the acceleration of the kinematic data as the input.

Using higher-order derivatives may seem to make the system more sensitive to noise. In our experiment, we found that the acceleration is reasonably accurate, and we use a Gaussian filter to remove the noise. The usefulness of the acceleration further shows the importance of causality in physical interaction. This may be very important for understanding how a human being causes or changes the behavior of others by modifying their planned actions.

5.4.1.3 Creating Non-Causal Data for Comparison

In order to test the statistical significance of our estimates, we need to verify the Granger Causality result against the null hypothesis of non-causal relations. In other words, the Granger Causality Test may indicate causal interactions between two random actions that have no causal relation; therefore, a causal relationship is considered valid only when the value is larger than a statistical threshold obtained from non-causal data.

Intuitively, the movements of two players at different times should have no causal interaction statistically. Therefore, we randomized the windows for each musician to create the non-causal data for comparison in our experiments.

5.4.2 Experimental Results

In the experiments, we numerically quantified how the musicians adapt their performance according to the causal information passed by other musicians in a causality network. The idea is that different conductors exert different influences on the musicians. This difference, in turn, also affects the causal influences among the players.

First, we computed the causality flow between the conductors and the players. Then, we computed the average causal influence within the same orchestra sections under the direction of the two conductors. Finally, we compared the above results to the outcomes from the non-causal data, to demonstrate that the results of the previous experiments are statistically significant.

5.4.2.1 Experiment 1: The Causality between the Conductors and the Players

This experiment shows that the professional conductor influenced the orchestra more frequently than the amateur conductor, with a stronger average causal interaction. Denote

D_i(m) = \begin{cases} 1, & \text{if } F_{c \to i}(m) > F_{i \to c}(m) \\ 0, & \text{otherwise} \end{cases}   (5.12)

where F_{c \to i}(m) denotes the causality from the conductor to the i-th player in the m-th window. Fig. 5.2 shows the percentage of the time (i.e., the expected value of D_i(m)) when the conductor leads the players, for the fifteen pieces of music and the eight players. This experiment shows that the professional conductor has a higher percentage than the amateur conductor in 95.8% of the cases. Interestingly, Fig. 5.2 also shows that in some pieces the amateur conductor has less than 50% "leading time" for some players (e.g., Player 1 in the 5th piece), which means those players actually "lead" the conductor.

We further show that the professional conductor has a stronger influence than the amateur conductor on average. We computed the average causality flow from the conductor to the players for each piece of music (Fig. 5.3).
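For concreteness, the expected value of D_i(m) in Eq. 5.12 (the leading-time percentage plotted in Fig. 5.2) can be computed from per-window F-statistics as in the small sketch below; the inputs are assumed to come from a windowed analysis such as the one sketched earlier, and the function name is mine.

```python
import numpy as np

def leading_percentage(F_c_to_i, F_i_to_c):
    """Expected value of D_i(m) in Eq. 5.12: the fraction of windows in which the
    conductor's causal influence on player i exceeds the reverse influence."""
    F_c_to_i, F_i_to_c = np.asarray(F_c_to_i), np.asarray(F_i_to_c)
    return float(np.mean(F_c_to_i > F_i_to_c))

# Toy usage with made-up per-window F-statistics for one player.
print(leading_percentage([4.1, 2.5, 3.3, 1.2], [1.0, 2.8, 1.5, 0.9]))  # 0.75
```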
The error bars show that 77.51% of the time the professional conductor has a stronger causal influence than the amateur.

The analysis demonstrates that successful coordination is due to effective causal information passing between individuals. The causality, on one hand, helps an agent understand the peers' intentions based on contextual information. On the other hand, it enables efficient prediction, anticipation, and planning of appropriate actions in response.

5.4.2.2 Experiment 2: The Causality within the Players

This experiment demonstrates that the players in the same orchestra section (i.e., "Violin I" and "Violin II") behave differently in terms of causal interactions when they play under the direction of the two conductors (Fig. 5.4). We represent the level of coordination for each player using the maximal value of the causal interactions from his/her peers in the m-th window as follows:

F_i(m) = \max_j F_{j \to i}(m),   (5.13)

where i and j are players in the same orchestra section. Denote \bar{F}(m) as the expected value of F_i(m) over all players in the same orchestra section. It can be regarded as the average "in-group" influence in the m-th time window.

Figure 5.4: The average influence and its standard deviation of the causality among the players in the same orchestra section. (a) The players in "Violin I"; (b) the players in "Violin II" (the 9th-15th pieces of music are not shown due to many missing samples).

Figure 5.5: The percentage of the time when the conductor leads the players in the non-causal data.

Fig. 5.4 shows the average in-group influence in these two orchestra sections for each piece of music. Under the direction of the professional conductor (whose influence is larger), the players have less influence on each other. "Violin I" and "Violin II" have a weaker average in-group causal interaction under the professional conductor than under the amateur, by 6.18% and 6.04% on average, respectively.

This experiment demonstrates that a dynamical causality network supports the coordination of the musicians. Each player has to follow the fundamental direction of the conductor and accommodate his/her performance to that of the peers. If the conductor is unable to convey enough causal influence, the players adjust their internal communication by increasing the necessary causal interactions among themselves.

Figure 5.6: The error bars of F_{c \to i}, which denotes the causal influence from the conductor to the i-th player, in the non-causal data.

5.4.2.3 Comparison using the Non-Causal Data

We repeated the above experiments using non-causal data obtained by randomizing the trajectory windows. The same procedures as in the previous experiments were used to test the statistical significance of the results.

Figs. 5.5 and 5.6 show the results for the non-causal data. Compared to Experiment 1, the percentage of leading time (Fig. 5.5) is close to 50% for both conductors, and the average influence (Fig. 5.6) is 3.04 (37.1% smaller than in Fig. 5.3). These figures demonstrate that there is no significant difference between the two conductors in the randomized data. Thus, the results of Experiment 1 are meaningful for assessing the causality network.

Fig. 5.7 shows that the average in-group influences in the non-causal data are 1.92 and 2.01 for the professional conductor and the amateur, respectively. Compared to Fig. 5.4, the in-group influence in the causal data is 31% larger.
Fig. 5.7 also does not suggest a difference in coordination between the players under the two conductors. The results show that the causal interactions among the players are significant, and that these interactions contribute to the coordination of actions. This further demonstrates the importance of a dynamical causality network in the coordinated actions of a music performance.

These two comparisons suggest that the results of Experiments 1 and 2 are valid and meaningful. The experiments show the usefulness of Granger Causality in analyzing coordinated actions. The difference between the two conductors is due to the causal information passed between individuals in a dynamical causality network.

Figure 5.7: The average strength and its standard deviation of the causality among the players in the same orchestra section using the non-causal data. (a) The players in "Violin I"; (b) the players in "Violin II".

5.5 Chapter Conclusion

The aim of this study was to quantitatively demonstrate the existence of a dynamical network of causal interactions among coordinated actions. Specifically, we examined the Granger Causality between the conductors and the players (Experiment 1), and the in-group causal interaction among the players under different conductors (Experiment 2). The study suggests that causality is a useful concept for analyzing coordinated actions, and that it may be applied to many human interaction scenarios.

Chapter 6

Detecting Discontinuity in Human Movement for Action Representation and Recognition

6.1 Chapter Summary

We propose an approach for detecting discontinuities of human movements in videos and in motion capture data. Our approach aims at providing natural temporal segmentations of human actions, which can serve as a tool for action analysis and recognition. The idea is based on the jerks (third-order derivatives) of the signals. For videos, we first reduce the dimension of the visual space using Gaussian Process Dynamical Models (GPDM). Then the discontinuities are estimated by finding the clusters of maxima of the jerks in the embedded signals. We demonstrate by examples that the action discontinuities in videos consistently correspond to changes in the underlying body dynamics. Experiments suggest that the approach is robust to changes in viewpoint and background, and that the proposed algorithm finds consistent poses, which lead to better recognition rates.

This chapter is based on the paper that appeared in the First Workshop on Social Signal Processing, held in conjunction with the IEEE Conference on Affective Computing 2009. A video lecture is available at: http://sspnet.eu/2010/08/ieee-ssp-workshop-september-2009/.

6.2 Introduction

Many recent studies in computer vision have been concerned with recognizing human activity. Ideally, we would like a formalism that allows us to describe actions in a way that facilitates recognition, characterization, and visualization. Although it is not yet clear what this formalism should be, it is very clear that, whatever the approach, we need to be able to break human movements into meaningful, simpler, shorter movements.

How can we find these break points? If we have muscle signals over time, it is natural to partition the data at the points where we have discontinuities in the acceleration [ddSB03], an empirical observation supported by Newton's law. The poses corresponding to the discontinuities become natural segmentations of human movements, which in turn facilitate human action analysis.
In this chapter, we present an intuitive and efficient algorithm for detecting the discontinuities in unrestricted human movement, both in videos and in motion capture (MoCap) data. We empirically found that the segmentations in videos and in MoCap data are consistent with each other. The likely reason for this consistency is that groups of muscles are exercised consistently, as has been found in [ddSB03], where they are termed "muscle synergies". The estimated discontinuities provide effective measurements for human action analysis in videos that are robust under different conditions, including subject variation, video quality, frame rate, camera motion, and changes in viewpoint.

In motoric space, we find that different body joints are activated coherently. The activations of the different joints are estimated as the maxima of the changes in acceleration of the individual joint angles. An action discontinuity occurs when a number of joints are activated at nearly the same time. Based on this observation, we propose an alignment algorithm to robustly detect the clusters of jerk maxima of different joints over time.

In visual space, we certainly cannot measure the muscle signals in videos from an arbitrary view. However, we can empirically transfer the idea of discontinuities to the visual domain using dimension reduction. We detect the discontinuities in a reduced visual space obtained by the Gaussian Process Dynamical Model (GPDM). This model was proposed recently for analyzing human MoCap data, and we extend it here to videos of single subjects moving in uncluttered scenes. We found that when there is an action discontinuity in the video, the changes in acceleration occur simultaneously along the different GPDM dimensions. This allows us to use the same alignment algorithm for detecting action discontinuities in videos as well as in MoCap data.

Our key contributions are:

1. We propose an algorithm for creating natural temporal segments of human actions in videos and in MoCap data, based on the concept of action discontinuities.
2. We demonstrate that the action segments in visual space and in motoric space are consistent.
3. We show that the poses corresponding to the action discontinuities effectively improve action recognition rates on public datasets.

The remainder of the chapter is organized as follows. Secs. 6.3.1 and 6.3.2 present the algorithm for detecting action discontinuities in MoCap data and in videos. Sec. 6.4 shows experiments and comparisons, and Sec. 6.5 concludes the chapter with a discussion.

6.3 Action Discontinuities in Human Movement

This section presents the discontinuity detection algorithm for human movements. In Sec. 6.3.1, we analyze MoCap data. We directly compute the change in acceleration in the motion capture data, and propose an alignment algorithm for finding the clusters of changes across different joints. In Sec. 6.3.2, we use a dimension reduction algorithm to map videos to a low-dimensional space. On this data we then detect the discontinuities using the same alignment algorithm as for MoCap data.

6.3.1 Action Discontinuities in Motion Capture Data

The data from a motion capture suit are time series of three rotation angles at each of a number of joints on the human body. The quaternion series of a certain joint is a 4D vector

X(t) = [x_1(t), x_2(t), x_3(t), x_4(t)]^T.   (6.1)

The jerk of X(t) is computed as

J(t) = \left\| \frac{d^3 X(t)}{dt^3} \right\|,   (6.2)

where \| \cdot \| is the L2 norm.
Figure 6.1: Fig. 4.2 is reproduced here for convenience. Action discontinuities of the speed-varying action "walking to running" from the University of Bonn dataset [MRC+07]. The jerk envelopes of the six joints are color coded and normalized in magnitude. The purple dashed lines indicate that the locations of the envelope extrema of certain joints coherently occur at the same time. The trajectories of the head, the left elbow, and the right ankle are drawn in red.

We need a robust measurement to estimate the jerk, because the estimation of higher-order derivatives may be sensitive to noise. We thus compute the jerk envelope as

Env(t) = \| \mathrm{Hilbert}(J(t)) \|   (6.3)

for every joint, where Hilbert(·) is the Hilbert transform. This is a standard approach for computing a signal envelope [Bra99]. We then filter the envelope with a low-pass Butterworth filter, and compute the extrema of the filtered signal Env(t) for every joint.

An action discontinuity occurs when a number of joints are activated at nearly the same time. This is demonstrated in Fig. 6.1 for the speed-varying action "walking to running" from the University of Bonn dataset [MRC+07]. Here, we show the jerk envelopes of the joints "Hip", "Knee", and "Ankle" of both legs. The envelopes are color coded and normalized in magnitude. Since not all joints need to be activated together in an action, our goal is to robustly detect the temporal clusters of the activated joints. We model this problem as finding the optimal "alignment" of the envelope extrema of the different joints (the purple dashed lines in Fig. 6.1). We define the optimal alignment as the collection of envelope extrema of the different joints with the minimal sum of pairwise distances. The goal is to choose one of the discontinuities from each latent variable such that the sum of the pairwise distances is minimal.

The alignment algorithm automatically detects the activated joints over time. This clustering method is applicable to many scenarios where other measurements, such as global curvature, may fail. Alternative alignment approaches, such as dynamic programming, focus on global alignment and may align discontinuities at completely different locations.

We use an indicator variable \gamma_m^i for the location I_m^i of the i-th discontinuity of the m-th latent variable, such that

\gamma_m^i = \begin{cases} 1, & \text{if } I_m^i \text{ is selected} \\ 0, & \text{otherwise} \end{cases}   (6.4)

for all possible i, and

\sum_i \gamma_m^i = 1,   (6.5)

for m = 1, ..., 4. The distance between the locations of two discontinuities I_m^i and I_n^j (m \ne n) is defined as

d(I_m^i, I_n^j) = \exp\!\left( \frac{1}{\sigma} \left| I_m^i - I_n^j \right| \right) - 1,   (6.6)

where \sigma is the parameter of the distance measurement. The cost of the consistency between a set of discontinuities in different latent variables is then the sum of the pairwise distances,

c = \sum_m \sum_{n \ne m} \sum_i \sum_j \gamma_m^i \gamma_n^j \, d(I_m^i, I_n^j).   (6.7)

The optimization problem is therefore to find the \gamma_m^i that minimize c,

\gamma^* = \arg\min_{\gamma} c,   (6.8)

subject to the constraints stated in Eq. 6.5.

Eq. 6.8 is a constrained quadratic integer programming problem. It can be reformulated as an unconstrained quadratic programming problem by relaxation and a change of variables. \gamma_m^i is relaxed to

0 \le \gamma_m^i \le 1,   (6.9)

and the variables are changed to

\gamma_m^i = \frac{\exp \beta_m^i}{\sum_{i'} \exp \beta_m^{i'}},   (6.10)

where

-\infty < \beta_m^i < \infty.   (6.11)

Therefore, if we pre-compute the cost matrix M, where M(m, n, i, j) = d(I_m^i, I_n^j), we simply need to find the \beta_m^i that minimize

C = \sum_m \sum_{n \ne m} \sum_i \sum_j \frac{\exp \beta_m^i}{\sum_{i'} \exp \beta_m^{i'}} \cdot \frac{\exp \beta_n^j}{\sum_{j'} \exp \beta_n^{j'}} \, M(m, n, i, j).   (6.12)
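The following is a compact sketch of this alignment step under the above relaxation: given candidate extremum locations for each latent variable (e.g., the filtered jerk-envelope extrema), it minimizes the relaxed cost of Eq. 6.12 with SciPy's L-BFGS-B and reads off one location per variable. The variable names, the use of SciPy instead of minFunc, and the toy candidate lists are mine; this is an illustration of the formulation, not the code used in the dissertation.

```python
import numpy as np
from scipy.optimize import minimize

def align_extrema(candidates, sigma=10.0):
    """Pick one extremum location per latent variable so that the selected
    locations are mutually close (Eqs. 6.6-6.12), via softmax relaxation."""
    M = len(candidates)
    sizes = [len(c) for c in candidates]
    offsets = np.cumsum([0] + sizes)

    def cost(beta):
        # Softmax over each latent variable's candidates (Eq. 6.10).
        gammas = [np.exp(beta[offsets[m]:offsets[m + 1]]) for m in range(M)]
        gammas = [g / g.sum() for g in gammas]
        c = 0.0
        for m in range(M):
            for n in range(M):
                if n == m:
                    continue
                # Pairwise distance d = exp(|Ii - Ij| / sigma) - 1 (Eq. 6.6).
                D = np.exp(np.abs(np.subtract.outer(candidates[m], candidates[n])) / sigma) - 1.0
                c += gammas[m] @ D @ gammas[n]
        return c

    res = minimize(cost, np.zeros(offsets[-1]), method="L-BFGS-B")
    picks = []
    for m in range(M):
        beta_m = res.x[offsets[m]:offsets[m + 1]]
        picks.append(candidates[m][int(np.argmax(beta_m))])
    return picks

# Toy usage: three latent variables with candidate extrema near frame 40.
cands = [np.array([12.0, 41.0, 80.0]), np.array([38.0, 95.0]), np.array([15.0, 43.0])]
print(align_extrema(cands))   # expected to pick 41, 38, 43
```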
After minimizing Eq. 6.12 with respect to \beta_m^i, we set \gamma_m^i = 1 if and only if \exp \beta_m^i / \sum_{i'} \exp \beta_m^{i'} is maximal over all possible i in the m-th latent variable, and \gamma_m^i = 0 otherwise.

The number of variables in Eq. 6.12 could be large (e.g., 100-500), so an efficient solver for this quadratic optimization is necessary. We use L-BFGS in the proposed approach.¹ The alignment algorithm is applied iteratively until the value of C is smaller than a threshold.

¹ www.cs.ubc.ca/~schmidtm/Software/minFunc.html

6.3.2 Action Discontinuities in Videos

A human motion video is represented by the variables of a Gaussian Process Dynamical Model (GPDM) in a low-dimensional space. The discontinuities are characterized by the maxima of the third-order derivatives of the latent variables, and the same alignment algorithm as in Sec. 6.3.1 is used for locating the discontinuities.

Gaussian Processes have become popular for regression and classification in machine learning. A Gaussian Process is a collection of random variables which have joint Gaussian distributions [RW05]. It can be regarded as a probability distribution over functions in Bayesian inference.

Figure 6.2: (a) The motion trajectory of a video in the Weizmann dataset [BGS+05] in GPDM dimensions 1-3 (blue curve). (b) The discontinuities in acceleration for the variables x_1, ..., x_6. Each row represents the locations and strengths of the local maxima of the third-order derivative of a variable. The poses corresponding to the centers of the discontinuities (dashed red lines) in the different variables are shown on top. Frame numbers are displayed both below the poses and in the motion trajectory (red dots).

The Gaussian Process Dynamical Model (GPDM) [WFH08] combines a mapping from a high-dimensional data space to a low-dimensional latent space with a dynamical model in the latent space. This allows high-dimensional data to be compressed into a latent space using Gaussian priors. In [WFH08], the GPDM maps the joint angles in human MoCap data to a 3D latent space. The relation between the latent variables X and the original data Y is as follows:

p(X, Y, \alpha, \beta, W) = p(Y \mid X, \beta, W)\, p(X \mid \alpha)\, p(\alpha)\, p(\beta)\, p(W),   (6.13)

where \alpha, \beta, and W are the parameters. Given Y, one estimates X, \alpha, \beta, and W. Refer to [WFH08] for details.

We first apply the GPDM to videos. A high-dimensional vector of image intensities is formed for each frame by concatenating the pixel values from left to right and then from top to bottom. The GPDM embeds the sequence of intensity vectors in a low-dimensional latent space. In our experiments, 6 dimensions are generally sufficient. Fig. 6.2 visualizes the trajectory of the latent variables of a video in GPDM dimensions 1-3. Then, the alignment algorithm of Sec. 6.3.1 is used for locating the discontinuities.

Our approach does not require foreground/background segmentation. The dimension reduction algorithm handles smooth background changes between subsequent frames implicitly, while still retaining the changes in acceleration of the human actor. In Sec. 6.4.2, we further show that background subtraction may not be necessary for our algorithm.

6.4 Experiments

Experiments show that our algorithm finds consistent human poses in MoCap data and in videos, and that this consistency facilitates action recognition. First, we test the algorithm on 30 actions from the CMU MoCap dataset [Lab] for which both MoCap data and videos have been recorded. The evaluation shows that the action discontinuities in motoric space and in visual space are consistent.
Second, we show that the detection of action discontinuities in videos outperforms a baseline algorithm, and that the approach is robust to changes in viewpoint and background. Finally, we demonstrate the potential of our method for recognition by showing that the representation based on action discontinuities improves the performance of the action recognition algorithm in [LJD09].

Our approach works for a reasonable range of parameter values, because the localization of the local extrema is not sensitive to the parameters. In all experiments, we resize the image frames to the same height (50 pixels) while preserving the aspect ratio for dimension reduction. The parameters for filtering the MoCap data and the reduced visual signal depend on their frame rates. The distance parameter \sigma in Eq. 6.6 is fixed to 10 in all experiments.

Our algorithm is also very efficient. L-BFGS is a very fast algorithm. The alignment algorithm takes only approximately a second on average for a one-minute video on a laptop with a 2 GHz CPU and 2 GB of memory, and the current GPDM implementation [WFH08] takes one minute on average.

6.4.1 Effective Detection of Action Discontinuities in Video

In this section, we demonstrate the effectiveness of the action discontinuity detection in videos.

6.4.1.1 Results on the CMU Dataset

We evaluated the performance of our algorithm on the CMU dataset using simultaneous MoCap data and videos. Our goal is to directly compare whether changes in joint accelerations correspond to jerk extrema of the signals in the reduced visual space learned from video.

We first selected 30 sequences for which both MoCap data and video are available. The actions are "running" (10), "walking" (10), and sport activities (10, including "boxing", "swordplaying", "jumping", etc.). Then we ran the algorithms on the MoCap data and the videos, respectively. Figs. 6.3a-6.3c show the results of the action discontinuity detection in videos for the activities "walking", "running", and "dancing", and Figs. 6.3d-6.3f show the results for the corresponding MoCap data. We overlaid the human subjects for the purpose of visualization.

From Fig. 6.3 we can see that the poses from the videos correspond to the ones from the simultaneous MoCap data. This suggests that the dimension reduction is capable of encoding the dynamics of human activities in the reduced space.

The videos and the MoCap data in the CMU dataset were not synchronized. Therefore, in order to evaluate the consistency, we downsampled the MoCap data to 25 fps, and then aligned the results for the videos and for the MoCap data using dynamic programming. The consistency is defined as the zero-mean standard deviation of the differences between corresponding discontinuities. The first row of Table 6.1 shows the consistency for the three categories of actions. On average, the standard deviation is within 4 frames (about 0.16 seconds). This consistent result shows that the discontinuities in videos are closely related to the changes in the underlying human body dynamics.

Figure 6.3: (a) Walking; (b) Running; (c) Dancing. (a)-(c) Results for action discontinuities in videos; (d)-(f) results for action discontinuities in the corresponding MoCap data; (g)-(i) results of the baseline algorithm. Comparing the figures in each column, we can clearly see that the action discontinuities for the videos correspond to the ones in the MoCap data, and that the results of the baseline algorithm are not consistent.
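A possible implementation of this consistency measure is sketched below, assuming that "dynamic programming" refers to a standard DTW-style alignment of the two discontinuity lists and interpreting the zero-mean standard deviation as the root-mean-square of the matched frame differences; both are assumptions about details not spelled out in the text, and the function names are mine.

```python
import numpy as np

def dtw_pairs(a, b):
    """Dynamic-programming (DTW) alignment of two sorted lists of
    discontinuity frame indices; returns matched pairs of frame values."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pairs.append((a[i - 1], b[j - 1]))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return pairs[::-1]

def consistency(video_disc, mocap_disc):
    """Zero-mean standard deviation of the frame differences between
    corresponding discontinuities (the measure reported in Table 6.1)."""
    diffs = np.array([u - v for u, v in dtw_pairs(video_disc, mocap_disc)])
    return float(np.sqrt(np.mean(diffs ** 2)))   # standard deviation about zero

print(consistency([10, 42, 75, 110], [12, 40, 78, 108]))   # roughly 2.3 frames
```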
Table 6.1: Consistency between action discontinuities in videos and MoCap data. The consistency is defined as the zero-mean standard deviation of the differences between corresponding discontinuities after applying dynamic programming (fps = 25).

Algorithm          | Walking | Running | Sport Activities
Proposed Algorithm | 2.3     | 3.1     | 4.0
Baseline           | 10.1    | 6.4     | 9.5

6.4.1.2 Evaluation

First, we compared the performance of our algorithm on the selected sequences of the CMU dataset to a baseline algorithm. Then, we evaluated different dimension reduction algorithms using labeled sequences in the UMD Gesture dataset [LJD09].

We used a baseline algorithm [OKA06] for comparison. This algorithm uses the local extrema of the change in the optical flow to compute the action discontinuities. The results show that the baseline algorithm is useful for actions whose trajectories are parallel to the image plane (Fig. 6.3b), where the perspective projection can be approximated by an orthographic projection. However, it may not provide consistent results when the perspective distortion is severe (Fig. 6.3a) or when the actions are complicated (Fig. 6.3c). We used the results for the MoCap data as the ground truth. The second row of Table 6.1 further shows that the standard deviations of the baseline algorithm are much larger. The results show that our algorithm is effective and outperforms the baseline algorithm.

Figure 6.4: UMD Gesture dataset [LJD09]. The dataset contains videos of 14 different gestures of military signals.

Finally, we evaluated the accuracy of our detection algorithm on manually labeled data. The UMD Gesture dataset consists of 14 different gestures of military signals (Fig. 6.4). Each gesture was performed by three subjects, and the same gesture was repeated three times by each subject. In the dataset, the first and the last pose of each action were manually labeled. This allows us to quantitatively evaluate the accuracy of the discontinuity detection.

The estimated discontinuities naturally divide the video sequences into intervals. We computed the optimal one-to-one mapping between the estimated intervals and the ground-truth intervals. The accuracy is then defined as the sum of the overlaps between corresponding pairs divided by the total length of the intervals. This "assignment" problem can be solved by the Hungarian algorithm (see [YZDJ08] for details). Three other algorithms, namely PCA, Laplacian Eigenmaps [Niy03], and Isomap [TdSL00], were evaluated for comparison.

Compared to the ground truth, Table 6.2 shows that the GPDM has a better accuracy than the other algorithms. This result suggests that the GPDM models the process of human activities in videos better than dimension reduction algorithms that are based only on the pairwise relationships of data samples. Therefore, it is more suitable for human motion analysis in a reduced dimension.

This set of experiments demonstrates that action discontinuity detection in video is closely related to the physical dynamics and visual perception of human actions. As a result, the poses at the computed discontinuities are useful representations for processing human actions.

Table 6.2: Accuracies of the detection algorithms using different dimension reduction methods. The accuracy is computed using the Hungarian algorithm on the UMD Gesture dataset [LJD09].

Algorithm | Ours  | PCA   | Laplacian | Isomap
Rate      | 92.3% | 78.6% | 85.7%     | 87.9%
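The interval-matching accuracy can be computed with the Hungarian algorithm as sketched below (using SciPy's linear_sum_assignment). Normalizing by the combined length of the estimated and ground-truth intervals is my assumption about the exact definition of "total length"; the function names and toy intervals are mine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def overlap(a, b):
    """Length of the overlap between two intervals a = (start, end), b = (start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def interval_accuracy(estimated, ground_truth):
    """One-to-one matching of estimated and ground-truth intervals that maximizes
    total overlap (Hungarian algorithm), reported as matched overlap / total length."""
    cost = -np.array([[overlap(e, g) for g in ground_truth] for e in estimated])
    rows, cols = linear_sum_assignment(cost)
    matched = -cost[rows, cols].sum()
    total = sum(e[1] - e[0] for e in estimated) + sum(g[1] - g[0] for g in ground_truth)
    return 2.0 * matched / total

est = [(0, 30), (30, 62), (62, 100)]
gt = [(0, 28), (28, 60), (60, 100)]
print(f"accuracy: {interval_accuracy(est, gt):.1%}")   # 96.0% on this toy example
```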
6.4.2 Robust Detection of Action Discontinuities in Videos

In this section, we demonstrate that our proposed algorithm produces robust results under changes in viewpoint and smooth changes in background. Then, in Sec. 6.4.3, we show how to use these results for action recognition.

6.4.2.1 Robustness to Changes in Viewpoint

We show that our detection algorithm provides consistent results on actions viewed from different directions. This makes the results useful for representing actions in multiple-view videos. We tested our algorithm on synchronized cameras. Fig. 6.5 shows an action from the UMD Pose dataset [OKA06]. The walking sequences of the same person were taken by six synchronized cameras in an indoor environment.

Figure 6.5: Results for a multiple-view sequence from the UMD Pose dataset [OKA06]. Each column displays the poses from a different viewpoint. Each row suggests that the estimated key poses from the different cameras are consistent. The camera (C#) and the frame number (F#) are shown in the lower right corner of each representative frame.

Fig. 6.5 shows the results for all six cameras. Each column displays the estimated poses from a different viewpoint. Comparing the frame numbers in the individual rows, one can see that the action poses from the different cameras are consistent. The likely reason for this consistency is that all the action discontinuities in the different views correspond to changes in the same human motoric dynamics. This consistency and robustness make the detection results useful for action recognition.

6.4.2.2 Robustness to Changes in Background

We demonstrate that our approach is robust to smooth background changes using publicly available videos. Fig. 6.6a shows an indoor video, which was captured with a moving camera in a gymnastics room. The discontinuities in the action are correctly identified even though the changes in viewpoint and background are large. Fig. 6.6b, also taken with a moving camera, shows a more challenging situation. The algorithm captures the action discontinuities when the girl moves from left to right and vice versa.

Figure 6.6: Action discontinuities of two publicly available videos. For each action, the motion trajectory (blue curve) of the video in the reduced space is shown. The frame numbers corresponding to the key poses are displayed under the individual frames as well as in the GPDM space 1-2 (red dots). The blue arrow denotes the time direction.

We further demonstrate that background subtraction, even with static cameras, may not be necessary for detecting the action discontinuities. For static cameras, background subtraction can be applied to compute the human masks. In this example, we use the pre-computed masks in the Weizmann dataset to obtain and align the foreground images using correlation, and apply the proposed algorithm to the aligned foreground images. Compared to the poses in Fig. 6.2, the result for the masks (Fig. 6.7a) is not consistent. The performance of the discontinuity detection degrades because of inaccurate segmentation and the accumulation of quantization error at the pixel level. This can easily be improved by using optical flow as the input (Fig. 6.7b), which provides results similar to those in Fig. 6.2.

Performing background segmentation on videos captured by a moving camera is even more challenging.
The current state of the art is to use human body detectors to locate humans and to align their centers of gravity for tracking. We use an outdoor sequence in which the subject performs a "limping" action. Fig. 6.7c shows the results of our algorithm. One can see that the asymmetric walking style is correctly identified. Fig. 6.7d is the result using a state-of-the-art detector [SKHD09b] as the input. The human detection (yellow bounding boxes) is useful for tracking, but it may not be useful for pose analysis and for detecting the poses that correspond to the action discontinuities.

Figure 6.7: (a) and (b): Results for the video in Fig. 6.2 using the foreground (a) and the flow fields (b) as the input. Frame numbers are displayed below the images. The magnitude of the flow field is shown in (b). (c) Results for a video ("limping" action, taken with a moving camera) using our algorithm. This result shows that each step is clearly identified. (d) Action discontinuities based on the human detection results.

The above results show that our algorithm is robust against changes in background. Such consistent results can be adopted for pose and action recognition. In addition, they may be used for action synopsis and representation.

6.4.3 Action Discontinuities Facilitate Visual Action Recognition

In this experiment, we show that the poses corresponding to the action discontinuities lead to accurate action recognition. The primary reason is that these "key poses" reduce the number of uncharacteristic images in both the training and the test datasets. By removing the uncharacteristic poses that are not unique to individual categories, the inter-class variation is increased.

We evaluated the action recognition performance by comparing the results against the algorithm in [LJD09], using different sets of poses as input. In our experiments, we employ shape, color, and motion cues of the key poses. Three datasets, the UMD Gesture dataset, the KTH action dataset, and the Weizmann dataset, were used for comparison.

The results show that the poses extracted by our algorithm improve the performance of action recognition (Table 6.3). For each dataset, we compared the recognition rates using the key poses only and using all cues together, respectively. For the UMD Gesture dataset, we process the videos in a slightly different way than in the original paper: we use all three action instances in each video as the input for the alignment algorithm, instead of using each instance separately.

Table 6.3: Improving the recognition performance using action discontinuities. The results for [LJD09] are copied from the original paper.

Dataset  | [LJD09], Pose only | Ours, Pose only | [LJD09], All Cues | Ours, All Cues
UMD      | 53.57%             | 69.44%          | 91.07%            | 94.44%
Weizmann | 81.11%             | 83.33%          | 100%              | 100%
KTH S1   | 71.95%             | 76.46%          | 98.83%            | 99.04%
KTH S2   | 61.33%             | 69.74%          | 94%               | 94.52%
KTH S3   | 53.03%             | 60.13%          | 94.78%            | 96.32%
KTH S4   | 57.36%             | 65.54%          | 95.48%            | 96%

In Table 6.3, the recognition rates using the key poses are consistently higher than those of the other method. This demonstrates that the key poses are useful for action recognition, which in turn suggests that our modeling of action discontinuities in videos is practical.

6.5 Chapter Conclusion

We proposed an action-independent algorithm for detecting human movement discontinuities (defined as the maxima of the third-order derivatives of time series) in video and in MoCap data, based on the dynamics of human motion.
Experiments demonstrate that our algorithm is useful for extracting consistent poses from different videos, and that it improves the performance of recognition algorithms by providing more discriminative poses as input.

The current approach is designed for a single subject in an uncluttered scene. Therefore, it is limited in its ability to handle complicated scenes of human interactions or activities of a group of subjects. We may be able to handle such scenes using better human visual filters that label the pixels of the body parts. This way, a more refined part-based action analysis may be used.

Chapter 7

Conclusion

As seeing agents, human beings have the basic ability to collaborate with other human beings in daily tasks such as cooking, crafting, and cleaning. Action mirroring is a low-level mechanism that allows agents to reproduce 3D human poses directly from 2D videos. In this dissertation, I performed a number of studies to advance the understanding of the computational methods underlying this mechanism.

First, I suggest that the mapping between the 3D trajectories of a body joint and their 2D projections in video can be learned using a statistical method called Partial Least Squares (PLS). This mirroring module can infer the 3D positions of the body joint from visual data instantly. Synthetic and real experiments showed that the regression framework can robustly mirror human actions.

Second, I studied the primitives in 3D MoCap data of individual human subjects. In our learning procedure, we solve for both the basis functions of the actions and the times at which these functions are "activated". The sparse activations explicitly express the coordination among different joints.

Third, I used Granger Causality to analyze the MoCap data of a group of human subjects in an orchestra. The evidence shows a causality network among human motions in a coordinated setting.

Prof. Rosenfeld once said that we need to "See Far Away".¹ Therefore, I would like to conclude this dissertation by pointing out emerging areas and applications for computer vision and human action analysis.

¹ This is possibly why he named his center "CfAR".

7.1 What is Next?

I have a dream that one day Cognitive Robots will rise up and be our friends. There are two applications directly related to this dream: robots and blind patients.

7.1.1 Robots Need Vision

We want to build intelligent machines. Among all components, human action analysis is critical because it facilitates human-robot interaction and robot-object interaction. As Aristotle said, humans are social animals. We cannot survive without interaction, and neither can cognitive robots.

My PhD study in Maryland will certainly help towards building these humanoid robots. Imitating human actions and designing immediate control strategies for response will both entertain users in the living room and help them work in the garden. Certainly, this long-term goal of Cognitive Robots is still elusive, but let us believe that we will see prototypes in the near future.

7.1.2 Blind Patients Need Vision

"There will be light!" This is what we tell blind patients today, and it will eventually come true within 10 years. Bionic eyes have recently become a reality, and human trials have been carried out in the UK, the US, and Australia.

However, this visual restoration is limited to a very low resolution. The visual representation is, and will remain, approximately 32 by 32 pixels. Therefore, we must use computer vision algorithms to extract useful information from the environment.
As a result, human action analysis, scene and context understanding, and saliency detection will be very critical. Visual Processing for the Bionic Eye, the project I am currently working on, will be a useful attempt to see how we can merge humans with machines.

Appendix A

Publications

1. Y. Li, C. Teo, C. Fermuller, and Y. Aloimonos, "Learning Visual-Motoric Mappings of Actions", submitted to Computer Vision and Pattern Recognition (CVPR), 2011.
2. Y. Li, C. Teo, Y. Yang, C. Fermuller, and Y. Aloimonos, "Action Mirroring for Home Assistant Robots", submitted to Intl. Conf. on Robotics and Automation (ICRA), 2011.
3. C. Teo, Y. Li, Y. Yang, C. Fermuller, and Y. Aloimonos, "Where is My Tool?", submitted to Intl. Conf. on Robotics and Automation (ICRA), 2011.
4. G. Zhu, X. Yu, Y. Li, and D. Doermann, "Language Identification for Handwritten Document Images Using A Shape Codebook", Pattern Recognition (PR), 2008.
5. Y. Li, Y. Zheng, D. Doermann, and S. Jaeger, "Script-Independent Text Line Segmentation in Freestyle Handwritten Documents", IEEE Trans. Pattern Anal. Machine Intell. (T-PAMI), vol. 30, no. 8, pp. 1313-1329, August 2008.
6. Y. Li, Z. Wang, and H. Zeng, "Correlation Filter: An Accurate Approach to Detect and Locate Low Contrast Character Strings in Complex Table Environment", IEEE Trans. Pattern Anal. Machine Intell. (T-PAMI), vol. 26, no. 12, pp. 1639-1644, December 2004.
7. Y. Li, C. Fermuller, and Y. Aloimonos, "Learning Shift-Invariant Sparse Representation of Actions", Computer Vision and Pattern Recognition (CVPR) 2010.
8. Y. Li, C. Fermuller, and Y. Aloimonos, "Illusory lightness perception due to signal compression and reconstruction", Vision Sciences Society (VSS) Conference 2010.
9. Y. Li, K. Bitsakos, C. Fermuller, and Y. Aloimonos, "Real-Time Shape Retrieval for Robotics Using Skip Tri-Grams", In Proc. of Intl. Conf. on Intell. Robots and Sys. (IROS) 2009.
10. G. Zhu, X. Yu, Y. Li, and D. Doermann, "Learning Visual Shape Lexicon for Document Image Content Recognition", In Proc. of European Conf. on Computer Vision (ECCV) 2008, pp. 745-758.
11. K. Bitsakos, H. Yi, Y. Li, and C. Fermuller, "Bilateral symmetry of object silhouettes under perspective projection", In Proc. of Intl. Conf. on Pattern Recognition (ICPR) 2008.
12. Y. Li and D. Jacobs, "Efficiently Determining Silhouette Consistency", In Proc. of Computer Vision and Pattern Recognition (CVPR) 2007.
13. X. Yu, Y. Li, C. Fermuller, and D. Doermann, "Object Detection Using A Shape Codebook", In Proc. of British Machine Vision Conference (BMVC) 2007.
14. K. Bitsakos, Y. Li, and C. Fermuller, "Combining Motion from Texture and Lines for Visual Navigation", In Proc. of Intl. Conf. on Intell. Robots and Sys. (IROS) 2007, pp. 232-239.
15. Y. Li, Y. Zheng, and D. Doermann, "Detecting Text Line in Handwritten Documents", In Proc. of Intl. Conf. on Pattern Recognition (ICPR) 2006, pp. 1030-1033.
16. Y. Li, Z. Wang, and H. Zeng, "String Extraction From Color Airline Coupon Image Using Statistical Approach", In Proc. of Intl. Conf. on Doc. Anal. and Rec. (ICDAR) 2003, pp. 289-294.

Appendix B

Brief Resume

B.1 Education

2004-present, Ph.D. student in Electrical and Computer Engineering, University of Maryland, College Park, MD 20742.
2001-2004, M.Eng. in Computer Science (First Rank), South China Univ. of Tech., China.
1998-2001, B.Sc. in Computer Science (First Honor), South China Univ. of Tech., China.

B.2 Awards and Honors

2008-present, FUTURE FACULTY FELLOW, A. James Clark School of Engineering, University of Maryland.
2007, SECOND PLACE, the First Semantic Robot Vision Challenge, AAAI 2007, Vancouver, Canada.
2006, BEST STUDENT PAPER, the Intl. Conf. on Frontiers in Handwriting Recognition (ICFHR) 2006.

B.3 Research Experience

2007-present, RESEARCH ASSISTANT, Computer Vision Lab, University of Maryland.
2004-2007, RESEARCH ASSISTANT, Language and Media Lab, University of Maryland.
2001-2004, STUDENT TECHNICAL MEMBER, Huagong Tomorrow Co. Ltd., China.

B.4 Participated Projects

2010-present, VISUAL PROCESSING FOR BIONIC EYE, National ICT Australia and Bionic Vision Australia.
2007-present, POETICON, European Commission FP7.
2007-2008, PARKINSON'S DISEASE ANALYSIS, National Institutes of Health.
2004-2007, CONTRACT MDA-9040-2C-0406, Department of Defense.