ABSTRACT Title of dissertation: NEURO-INSPIRED AUGMENTATIONS OF UNSUPERVISED DEEP NEURAL NETWORKS FOR LOW-SWAP EMBODIED ROBOTIC PERCEPTION E. Jared Shamwell Doctor of Philosophy, 2017 Dissertation directed by: Professor Donald Perlis Department of Computer Science Despite the 3-4 saccades the human eye undergoes per second, humans perceive a stable, visually constant world. During a saccade, the projection of the visual world shifts on the retina and the net displacement on the retina would be identical if the entire visual world were instead shifted. However, humans are able to perceptually distinguish between these two conditions and perceive a stable world in the first condition, and a moving world in the second. Through new analysis, I show how biological mechanisms theorized to enable vi- sual positional constancy implicitly contain rich, egocentric sensorimotor representations and with appropriate modeling and abstraction, artificial surrogates for these mechanisms can enhance the performance of robotic systems. In support of this view, I have developed a new class of neuro-inspired, unsuper- vised, heterogeneous, deep predictive neural networks that are approximately 5,000%- 22,000% faster (depending on the network configuration) than state-of-the-art (SOA) dense approaches and with comparable performance. Each model in this new family of network architectures, dubbed LightEfference (LE) (Chapter 2), DeepEfference (DE) (Chapter 2), Multi-Hypothesis DeepEfference (MHDE) (Chapter 3), and Inertial DeepEfference (IDE) (Chapter 4) respectively, achieves its substantial runtime performance increase by leveraging the embodied nature of mobile robotics and performing early fusion of freely available heterogeneous sensor and mo- tor/intentional information. With these architectures, I show how embedding extra-visual information meant to encode an estimate of an embodied agent’s immediate intention supports efficient computations of visual constancy and odometry and greatly increases computational efficiency compared to comparable single-modality SOA approaches. NEURO-INSPIRED AUGMENTATIONS OF UNSUPERVISED DEEP NEURAL NETWORKS FOR LOW-SWAP EMBODIED ROBOTIC PERCEPTION by Earl Jared Shamwell Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2017 Advisory Committee: Professor Don Perlis, Chair/Advisor Professor Dan Butts Professor Avis Cohen Professor Timothy Horiuchi Dr. William Nothwang Professor Peter Carruthers, Dean’s Representative c© Copyright by Earl Jared Shamwell 2017 Dedicated to my parents. ii Acknowledgments I owe my deepest gratitude to the many people who have made this dissertation possible. Looking back, it’s hard not to wonder at the multitude of fortuitous events that have culminated in this doctoral dissertation. I would like to thank my Advisor, Professor Don Perlis, for his tireless patience and faith in me as I explored my curiosity. I would also like to thank my Mentor, Dr. William Nothwang, for his guidance, support, and helping me to see things in myself that I did not always know were there. For support (financial and otherwise), I would like to thank the Sensors and Elec- tron Devices Directorate, Army Research Laboratory (ARL), and in particular, Dr. Brett Piekarski. Without Don, Will, and ARL, this dissertation would not have been possible. 
I would like to acknowledge Professors Dan Butts and Avis Cohen for the many conversations over the years that continue to shape my thinking as a scientist and engineer. I would also like to acknowledge my committee members Professors Timothy Horiuchi and Peter Carruthers for helping shape my dissertation. For instilling in me a stubbornness and belief that anything is possible and support- ing me in everything I have pursued, my parents, to whom this dissertation is dedicated, deserve a special thank you. Finally, I must acknowledge Lunet Luna with deep thanks for being a constant source of encouragement, even across three-time zones. iii Table of Contents Dedication ii Acknowledgements iii List of Tables vii List of Figures viii List of Abbreviations x 1 Introduction 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivations and Background . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Saccadic Suppression . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1.1 Mathematical Representation . . . . . . . . . . . . . . 4 1.2.1.2 Failures of Saccadic Suppression . . . . . . . . . . . . 4 1.2.2 Efference Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.2.1 Mathematical Representation . . . . . . . . . . . . . . 7 1.2.3 Failures of Efference Copy (EC) . . . . . . . . . . . . . . . . . . 8 1.2.4 Alternate Theories and Synthesis . . . . . . . . . . . . . . . . . . 12 1.2.4.1 Referent Control (RC) and High-Level Interactions . . 12 1.2.4.2 Landmark Theory . . . . . . . . . . . . . . . . . . . . 14 1.2.4.3 Mathematical Representation . . . . . . . . . . . . . . 15 1.2.5 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.2.6 Feed-Forward Feature Selection . . . . . . . . . . . . . . . . . . 21 1.2.7 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.3 Chapter Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.3.1 Chapter 2 - DeepEfference: Learning to Predict the Sensory Con- sequences of Action Through Deep Correspondence . . . . . . . 26 1.3.2 Chapter 3 - A Deep Neural Network Approach to Fusing Vision and Heteroscedastic Motion Estimates for Low-SWaP Robotic Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 iv 1.3.3 Chapter 4 - An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics . . . . . . 28 2 DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence 29 2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.1 Deep Approaches to Spatial Transformation Encoding and Learning 33 2.3.2 Extra-Visual Motion Estimates . . . . . . . . . . . . . . . . . . . 34 2.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4.1 Training and Loss Rule . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.2 Pathway 1: Local, Pixel-Level Shifter . . . . . . . . . . . . . . . 36 2.4.3 Pathway 2: Global Spatial Transformer . . . . . . . . . . . . . . 37 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . 40 2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7 Discussion . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . 45 3 A Deep Neural Network Approach to Fusing Vision and Heteroscedastic Motion Estimates for Low-SWaP Robotic Applications 50 3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.1 Visual Odometry and Multi-Sensor Fusion . . . . . . . . . . . . 54 3.3.2 Deep Spatial Transformations . . . . . . . . . . . . . . . . . . . 56 3.3.3 Extra-Modal Motion Estimates and Heteroscedastic Noise . . . . 58 3.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.1 Winner-Take-All (WTA) Loss Rule . . . . . . . . . . . . . . . . 59 3.4.2 Pathway 1: Global Spatial Transformer . . . . . . . . . . . . . . 61 3.4.3 Pathway 2: Local, Pixel-Level Shifter . . . . . . . . . . . . . . . 62 3.5 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5.1 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.9.1 Extended Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 69 3.9.2 Extended Training Procedures . . . . . . . . . . . . . . . . . . . 70 4 An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics 75 4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 v 4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.5.1 Pathway 1: Global Shifter . . . . . . . . . . . . . . . . . . . . . 82 4.5.2 Pathway 2: Local, Pixel-Level Shifter . . . . . . . . . . . . . . . 83 4.5.3 Spatial Transformations . . . . . . . . . . . . . . . . . . . . . . 83 4.6 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.6.1 Datasets and Data Generation . . . . . . . . . . . . . . . . . . . 86 4.6.1.1 EuRoC MAV . . . . . . . . . . . . . . . . . . . . . . . 87 4.6.1.2 KITTI Odometry . . . . . . . . . . . . . . . . . . . . 89 4.6.2 Network Parameters and Training Procedures . . . . . . . . . . . 90 4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.7.1 EuRoC Ground Truth Generation . . . . . . . . . . . . . . . . . 91 4.7.2 KITTI Ground Truth Generation . . . . . . . . . . . . . . . . . . 92 4.8 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.9 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 102 Bibliography 105 vi List of Tables 2.1 Average runtimes for DM, LE, and DE . . . . . . . . . . . . . . . . . . . 43 2.2 Pixel errors for DM, LE, and DE on KITTI and SceneNet . . . . . . . . . 
48 3.1 Average runtimes for DeepMatching (DM) and Multi-Hypothesis Deep- Efference (MHDE) with equivalent frames per second (FPS) . . . . . . . 67 4.1 Pixel MSE for the EuRoC dataset . . . . . . . . . . . . . . . . . . . . . 100 4.2 Mean Pixel Error for the KITTI dataset . . . . . . . . . . . . . . . . . . 101 4.3 Average runtimes with equivalent frames per second (FPS) . . . . . . . . 102 vii List of Figures 2.1 Sample results from KITTI Odometry. A: Sample source image. B: Sam- ple target image. C: DeepEfference output reconstruction of the source image in A using pixel intensity values sampled from the target image in B. D and E: Source and target images with marked correspondence points computed by DeepEfference. . . . . . . . . . . . . . . . . . . . . . . . 30 2.2 DeepEfference network diagram showing the linked global and local learn- ers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.3 Pixel error boxplots for DM, LE, and DE using DM and FAST key- points. Y-axis is actual mean pixel error. Middle lines are the medians and whiskers indicate 1.5 interquartile of the lower and upper quartiles. Note outliers are not shown for clarity and instead minimum and maxi- mums are presented in the table below. . . . . . . . . . . . . . . . . . . 42 2.4 Unusual DE and LE sample results from KITTI and SceneNet. The red boxes highlight areas in the reconstructed image that were imagined by DeepEfference. Green boxes highlight instances where DE was able to better predict object positions in a scene with strong depth contrast while LE generated a poorer reconstruction. The last row shows a failure case where both LE and DE were unable to generate a reconstruction. This is most likely due to the extreme transformation between the camera at the time of source image capture versus target image capture. . . . . . . . . 49 3.1 Sample MHDE outputs from different hypothesis pathways. A-E: MHDE ouputs from 5 pathways. D shows the output from an inactive pathway (i.e. a pathway that the network did not optimize). F-E: Reconstruction error for the hypotheses shown in A-E. From F we can see that the recon- struction shown in A had the lowest error (yellow-dashed box). . . . . . 53 3.2 MHDE network diagram with two hypotheses shown for brevity. We experimented with up to 8 hypotheses in this work. . . . . . . . . . . . . 56 3.3 Heteroscedastic noise as a function of transform magnitude for the X and Y components of the transform input over the test set for a network with a noise parameter α = 0.25. . . . . . . . . . . . . . . . . . . . . . . . . 64 viii 3.4 Inverse mean pixel error (higher is better) for several noise conditions produced by MHDE networks trained to generate 1, 2, 4, or 8 maximum hypotheses. Dashed line is DM (SOA) error. . . . . . . . . . . . . . . . . 72 3.5 Inverse mean pixel error (higher is better) for several noise conditions pro- duced by MHDE networks. Results are from the same networks shown in Fig. 3.4 but are instead plotted as a function of active pathways learned by each network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.6 Activation by pathway for the different noise conditions. Only networks with maximum hypotheses of 4 or 8 are shown. . . . . . . . . . . . . . . 74 4.1 IDE network diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.2 Histogram of error between the ground truth and k-means cluster to which that exemplar was assigned. . . . . . . . . . . . . . . . . . . . . . . . . 
87 4.3 Inverse pixel error (higher is better) for each condition for EuRoC and KITTI. The KM condition for EuRoC is omitted for plotting convenience but included in the table below. . . . . . . . . . . . . . . . . . . . . . . 93 4.4 Sample KITTI correspondence results. Note that only every other key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5 Sample KITTI correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.6 Sample KITTI correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.7 Sample EuRoc correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.8 Sample EuRoc correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.9 Sample EuRoc correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 ix List of Abbreviations A1 Primary Auditory Cortex BPTT Back Propagation Through Time CD Corollary Discharge CNN Convolutional Neural Network CNS Central Nervous System DE DeepEfference DM DeepMatching DNN Deconvolutional Neural Network DOF Degree of Freedom DSP Deformable Spatial Pyramids EC Efference Copy FAST Features from the Accelerated Segment Test FC Fully Connected FFMS Feed-Forward Motor Signal GPS Global Positioning System GPU Graphics Processing Unit HLEC High Level Efference Copy IDE Inertial DeepEfference IMU Inertial Measurement Unit INS Inertial Navigation System KLT Kanade-Lucas-Tomasi LIDAR Light Detection and Ranging LK Lucas-Kanade MAV Micro Aerial Vehicle MF Motor Feedback MHDE Multi-Hypothesis DeepEfference MLP Multi-Layer Perceptron PID Proportional-Integral-Derivative x RANSAC Random Sample Consensus RC Referent Control RCS Referent Command Signal ReLU Rectified Linear Unit RGB-D Red Green Blue Depth RMSE Root Mean Squared Error RTK Real-Time Kinematic SC Superior Colliculus SGD Stochastic Gradient Descent SOA State-of-the-Art ST Spatial Transformer SWaP Size Weight and Power V1 Primary Visual Cortex/Striate Cortex VI Visual Inertial VIO Visual Inertial Odometry VO Visual Odometry VSLAM Visual Simultaneous Localization and Mapping WTA Winner Take All xi Chapter 1: Introduction 1.1 Overview Despite the 3-4 saccades the human eye undergoes per second, humans perceive a stable, visually constant world. During a saccade, the projection of the visual world shifts on the retina and the net displacement on the retina would be identical if the entire visual world were instead shifted. However, humans are able to perceptually distinguish between these two conditions and perceive a stable world in the first condition (where visual positional constancy is maintained), and a moving world in the second (where visual positional constancy is not maintained). 
I argue that biological mechanisms supporting visual constancy contain a rich, egocentric representation of the environment and, with appropriate modeling and abstraction, can serve as prototypical models of both early heterogeneous sensor fusion and early sensorimotor fusion/interactions. Artificial surrogates of these prototypical mechanisms can enable enhanced:

• Prediction of the sensory consequences of self-induced actions;
• Action-based and perceptual anomaly detection; and
• Robotic visual odometry and localization.

In support of this view, I have developed a class of neuro-inspired, unsupervised deep predictive neural networks for robotic applications that are approximately 5,000%-22,000% faster (depending on the network configuration) than state-of-the-art (SOA) dense approaches, with comparable performance.

Each model in this new family of network architectures, dubbed LightEfference (LE) (Chapter 2), DeepEfference (DE) (Chapter 2), Multi-Hypothesis DeepEfference (MHDE) (Chapter 3), and Inertial DeepEfference (IDE) (Chapter 4) respectively, performs early fusion of heterogeneous information. With these architectures, I show how embedding extra-visual information meant to encode an estimate of an embodied agent's immediate intention supports efficient computations of visual constancy and odometry and greatly increases computational efficiency compared to comparable single-modality SOA approaches.

The remainder of this chapter is organized as follows: Section 1.2 describes the motivations and background for this dissertation, and Section 1.3 summarizes the remaining chapters.

1.2 Motivations and Background

In this section, I provide a limited discussion of the theoretical motivations for this work. Additional detail can be found in Chapter 2, Chapter 3, and Chapter 4.

As alluded to earlier, visual positional constancy is a well-studied phenomenon that grants us a window into the brain's ability to separate the self from the other in low-level, sensorimotor representations.

Experimental evidence and theoretical analysis have suggested that the maintenance of constancy is achieved through various physiological mechanisms including efference copy (or "effort of will") (Holst and Mittelstaedt, 1950; Sperry, 1950), saccadic suppression (Matin, 1974; Noda, 1975; Ross et al., 2001; Thilo et al., 2004; Volkmann, Schick, and Riggs, 1968), and low-latency matching of sequential visual sensory information (Deubel, Schneider, and Bridgeman, 1996; Deubel, Koch, and Bridgeman, 2010).

Despite conflicting experimental results and analysis on visual positional constancy in the neuroscience literature, I argue that a synthesis of the landmark theory (Deubel, Bridgeman, and Schneider, 2004; Deubel, Koch, and Bridgeman, 2010) and a relaxed version of classical efference copy (EC) (Holst and Mittelstaedt, 1950; Sperry, 1950) is better able to pose solutions to the problem of visual constancy and the larger question of how biological systems separate the self from the other.

1.2.1 Saccadic Suppression

Early theories of saccadic visual positional constancy relied on saccadic suppression, a phenomenon where visual perception seems to be inhibited during a saccade (Matin, 1974; Noda, 1975; Ross et al., 2001; Thilo et al., 2004; Volkmann, Schick, and Riggs, 1968).
Saccades are among the human body's fastest movements and can reach speeds of up to 1000 degrees per second (Bridgeman, Heijden, and Velichkovsky, 1994). A longstanding view of saccadic eye movements has held that saccades are ballistic, open-loop movements with time courses too fast to benefit from ongoing modulation by sensory feedback, and thus generally unable to use feedback control and sensory information.

1.2.1.1 Mathematical Representation

Using a theory relying on saccadic suppression for visual constancy, constancy could then be confirmed post-saccade by comparing a post-saccadic state to an estimate of a predicted post-saccadic state. This can be expressed as:

C = \left\| X^{t+1} - F[X^t] \right\|^2, \qquad Y = \begin{cases} \text{True}, & \text{if } C \le \phi \\ \text{False}, & \text{if } C > \phi \end{cases} \qquad (1.1)

where F[\cdot] represents the desired relationship that transforms the state X^t (pre-saccade) to the state X^{t+1} (post-saccade), and \phi is some threshold under which constancy is maintained. An appropriate choice of the function F[\cdot] would then produce a mapping that transforms the pre-saccadic state X^t to the post-saccadic state X^{t+1}, and values of C greater than \phi would signify failures of constancy.

1.2.1.2 Failures of Saccadic Suppression

In conflict with suppressive accounts of transsaccadic visual constancy, studies have found that subsets of V1 neurons fail to exhibit classical suppression during microsaccades and saccades (Troncoso et al., 2015; McFarland et al., 2015). Additionally, while perceptually humans report experiencing saccadic suppression, this does not preclude the possibility that perisaccadic visual information is available to the visual system. Indeed, recent work in oculomotor adaptation has shown the dependence of saccadic adaptation on visual information briefly presented perisaccadically (Panouillères et al., 2016). This finding suggests not only a link between perisaccadic visual information and sensorimotor control, but that perisaccadic visual information is actively used for sensorimotor control (additional detail on the limitations of saccadic suppression is discussed in Section 1.2.3).

1.2.2 Efference Copy

Classically, a forward model with an efference copy (EC) (Holst and Mittelstaedt, 1950; Sperry, 1950) has been proposed as the relational mechanism, and extra-visual signals have been extensively documented around the time of saccades (Bremmer et al., 2009; Duhamel, Colby, and Goldberg, 1992; Ibbotson and Krekelberg, 2011; Kagan, Gur, and Snodderly, 2008; Kayama et al., 1979; Rajkai et al., 2008; Sommer and Wurtz, 2006).

EC has been theorized as a critical component of biological sensorimotor control and has been used to explain a myriad of observed sensorimotor phenomena including visual positional constancy. The traditional theory of EC relies on the direct motor commands that produce actions being used to generate the EC and has generally supposed that the effects of ECs result in suppression of neural activity. For example, the vestibulo-ocular reflex (VOR) is a reflex arc that activates eye muscles to maintain an image in the center of view during head motions detected by the vestibular system. While the VOR triggers automatic compensatory movements, self-generated eye movements initiated during a head movement can prevent the VOR from controlling gaze. Without an EC or an EC-like mechanism, the VOR would instead override any intentional eye movement and leave us unable to self-direct gaze during head movements.
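To make the comparator in Eq. 1.1 concrete, the following is a minimal sketch of the thresholded constancy test for an arbitrary forward model. The identity forward model, the toy state vectors, and the threshold value are illustrative placeholders rather than quantities drawn from this dissertation's experiments.

```python
import numpy as np

def constancy_check(x_post, x_pre, forward_model, phi):
    """Comparator of Eq. 1.1: predict the post-saccadic state from the
    pre-saccadic state and test the squared error against a threshold phi.
    Returns (C, constancy_maintained)."""
    prediction = forward_model(x_pre)
    C = float(np.sum((x_post - prediction) ** 2))  # C = ||x_{t+1} - F[x_t]||^2
    return C, C <= phi

# Toy example: an identity forward model predicts "no change" across the saccade.
identity_model = lambda x: x
x_pre = np.array([1.0, 2.0])
x_post_stable = np.array([1.01, 1.98])   # small discrepancy: constancy maintained
x_post_moved = np.array([3.0, -1.0])     # large discrepancy: constancy violated

print(constancy_check(x_post_stable, x_pre, identity_model, phi=0.05))
print(constancy_check(x_post_moved, x_pre, identity_model, phi=0.05))
```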
In now-famous experiments with the fly Eristalis, von Holst and Mittelstaedt (Holst and Mittelstaedt, 1950) demonstrated a major shortcoming of classical reflex chain theory and introduced their principle of reafference. Describing the prevailing view of the central nervous system (CNS) as that of "a sort of automat, which reflexly delivers a given ticket when a particular coin is inserted in it", they were among the first to realize and demonstrate the biological necessity of separating reafferent (resulting from the organism's actions) from exafferent (resulting from the external world) sensory information and its consequences for sensorimotor processing.

Studying the optokinetic reflex of the fly Eristalis, von Holst and Mittelstaedt saw that reafferent and exafferent sensory information were quantitatively and qualitatively the same, and yet organisms are nonetheless able to distinguish between the two. A classical example of the optokinetic reflex can be demonstrated by placing the fly Eristalis inside of a hollow cylinder that is painted with vertical black and white stripes. With the fly hovering inside the cylinder, rotation of the cylinder causes the fly to rotate itself in the same direction to maintain the same view (i.e., it rotates its body so the projections of the stripes on its retina remain identically located). However, the fly is still able to initiate self-generated motion and independently rotate itself without snapping back to its starting point at the cessation of movement, somehow overriding the optokinetic reflex.

Von Holst and Mittelstaedt challenged the classical explanation that the optokinetic reflex is simply inactive during spontaneous movements by rotating the fly's head 180 degrees and repeating the experiments. By essentially creating a positive feedback loop, the voluntary movement behaviors previously seen in the normal fly were no longer seen: the rotated fly would spin in tight circles until eventually freezing. They thus concluded that the optokinetic reflex is not simply inactive during spontaneous movements but instead that the sensory consequences of an action are modulated based on very specific sensory feedback expectations in response to an action's associated motor commands.

They theorized that motor commands associated with a self-generated movement were provided to the CNS through what they termed an EC of the motor command. Their work suggested that the CNS anticipated self-generated sensory signals and led to a theory of EC that has been incorporated into forward models of sensorimotor processing in many systems and in many organisms. The theory broadly encompasses signals generated in motor regions that target circuits engaged in sensory processing. The advantage of the motor system providing the CNS with an EC is that the EC is available even before the movement begins, whereas sensory information is available only afterward.

1.2.2.1 Mathematical Representation

In a simplified linear case, as has often been used to model forward dynamics in the visuomotor system (Wolpert, Ghahramani, and Jordan, 1995; Mehta and Schaal, 2002; Miall, 1995; Wolpert, Ghahramani, and Jordan, 1994), F[\cdot] would incorporate motor information U^t and might take the form

F[X^t, U^t] = A X^t + B U^t \qquad (1.2)

where X^t is a representation of the current state, and A and B are constant state and input matrices, respectively.
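As a worked instance of the linear forward model in Eq. 1.2, the sketch below predicts a post-saccadic state from a current state and an efference-copy-like motor command. The matrices A and B and the example values are illustrative assumptions chosen only to show the structure of the computation, not parameters used anywhere in this dissertation.

```python
import numpy as np

def linear_forward_model(x_t, u_t, A, B):
    """Eq. 1.2: predict the next state from the current state x_t and an
    efference-copy-like motor signal u_t, F[x_t, u_t] = A x_t + B u_t."""
    return A @ x_t + B @ u_t

# Illustrative 2D retinal state (e.g., a target's position on the image plane)
# and a 1D motor command encoding a rightward saccade of 3 units.
A = np.eye(2)                      # the external scene is assumed static
B = np.array([[-1.0], [0.0]])      # a rightward saccade shifts the projection leftward
x_t = np.array([4.0, 1.0])
u_t = np.array([3.0])

x_pred = linear_forward_model(x_t, u_t, A, B)
print(x_pred)  # expected post-saccadic position: [1.0, 1.0]

# Because A x_t and B u_t are simply summed, the motor term is independent of the
# visual state -- the uncoupling noted in the paragraph that follows.
```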
However, experimental results and theoretical analysis suggest that a purely motor-based prediction is insufficient for perceptual visual constancy and fails to explain post-saccadic target blanking (Deubel, Schneider, and Bridgeman, 1996), afterimage effects following saccadic movements (Grüsser, Krizič, and Weiss, 1987), and, in general, the observed sensory dependence of saccades (Bridgeman, Hendry, and Stark, 1975; Deubel, Bridgeman, and Schneider, 2004; Deubel, Koch, and Bridgeman, 2010; Laidlaw and Kingstone, 2010; Walker, McSorley, and Haggard, 2006). The failure to incorporate sensory dependence can be predicted from Eq. 1.2, where X^t and U^t are uncoupled. This suggests a non-linear formulation of Eq. 1.2 would be needed to explain the myriad of sensorimotor dependencies observed experimentally to affect saccadic visual constancy.

1.2.3 Failures of Efference Copy (EC)

Recent experimental evidence and theoretical analysis have challenged EC as a mechanism of visual constancy. As mentioned earlier, the classical theory of EC relies on motor commands themselves being used to generate the copy.

In the case of vocalization, this suggests that the information provided to the auditory cortex should directly reflect the actual action being undertaken. However, results from the auditory cortex suggest that the EC signals may instead correspond to the expected results of an action (Niziolek, Nagarajan, and Houde, 2013).

An implication of EC is that the separation of reafferent from exafferent sensory feedback is performed by means of neural response depression, where primary sensory afferents that correspond to reafferent sensory information are depressed. However, subsequent experimental results showing mixed suppressive and enhancement sensory processing effects in the auditory cortex during self-induced vocalization (Behroozmand et al., 2009; Eliades and Wang, 2008; Flinker et al., 2010; Greenlee et al., 2011; Eliades and Wang, 2003) and new theoretical analysis on postural stability (Bridgeman, 2007; Feldman, 2009; Ostry and Feldman, 2003) have demonstrated that the physiological mechanisms presumably used to separate the reafferent from the exafferent are far more complex than simple subtractions. Recently, neurons in the primate striate cortex have been shown to exhibit biphasic response modulation during saccadic eye movements that results in a period of reduced activity followed by enhancement (McFarland et al., 2015). However, in McFarland et al., the authors also found a small subset of striate neurons (677) that exhibited a reversed response of enhanced excitation followed by inhibition. In Troncoso et al. (Troncoso et al., 2015), a subset of striate neurons (16145) also failed to exhibit the suppressive response modulation observed for microsaccades in the majority of neurons. Collectively, these studies point to more complex sensorimotor interactions and encodings than the simple shunting or linear subtraction previously imagined.

EC also fundamentally fails to explain phenomena associated with saccadic suppression. Studies on saccadic suppression have shown that when saccadic visual targets are artificially shifted during the saccade, under certain circumstances, the visual target can still appear stable while instability is perceived for surrounding objects (Deubel, Bridgeman, and Schneider, 2004; Deubel, Koch, and Bridgeman, 2010).
This is in conflict with theories of EC, as EC does not allow for differences in perception of objects within a single visual scene.

Investigations in retinal afterimage displacement during saccades present another line of evidence against EC as the sole mediator of visual constancy. Grüsser et al. (Grüsser, Krizič, and Weiss, 1987) had subjects perform auditory saccades in a dark room from a bright light that was turned off at saccadic onset to a new fixation location cued by an auditory tone. When subjects saccaded at rates below one saccade per second, afterimages were perceived to displace with an amplitude consistent with the saccadic amplitude. In other words, the afterimage appeared to shift across the retinal plane and was perceived at an egocentrically consistent retinal position. However, above frequencies of 1.5 saccades per second, the perception of afterimage displacement decreased with saccadic frequency. Failures in egocentric visual constancy at saccadic rates well within normal behavior further suggest that EC alone is unable to provide the perceptually consistent world we experience and that an additional mechanism dependent on the currently perceived visual scene is necessary. The dependence of displacements in retinal coordinates on saccadic frequency further suggests that efference alone is unable to provide an accurate forward model to predict displacements in the visual scene caused by saccades.

The inaccuracy of saccadic control (reviewed in (Bridgeman, 2007)) suggests that the brain does not have a sufficiently high-accuracy motor forward model to enable visual positional constancy based solely on an EC of a motor signal and instead, other supplementary mechanisms must at a minimum be used in conjunction with an EC-based forward model. For example, when a stimulus at the target location of a saccade is shifted perisaccadically, target displacement may not be detected up to a threshold of one-third of the total saccadic displacement (Bridgeman, Hendry, and Stark, 1975; Deubel, Schneider, and Bridgeman, 1996). If extra-visual forward modeling were the only mechanism by which visual constancy is determined, then the visual system would have no idea where targets in the visual world were located if positional errors of one-in-three could be tolerated. If constancy can be maintained in the face of such a large error, then the brain must use some additional mechanism to determine constancy and localization.

Investigations into saccadic drifts emphasize the dependence of saccadic trajectories on visual sensory information. Saccadic trajectories were originally thought to represent the maximal acceleration and velocity profiles physiologically achievable by the eye (Clark and Stark, 1975; Enderle and Wolfe, 1987), but subsequent experimental characterization of the structured error in saccadic end-points (e.g., the speed-accuracy trade-off) (Abrams, Meyer, and Kornblum, 1989) established alternate models of saccadic trajectories including minimum variance (Harris and Wolpert, 1998; Harwood, Mezey, and Harris, 1999), smoothness maximizing (Wiegner and Wierzbicka, 1992), and optimal control models (Harris, 1998; Harris and Wolpert, 2006). While saccadic trajectories generally exhibit a propensity to curve toward either the horizontal or vertical meridians, visual context influences both the magnitude and direction of trajectory curvature (Laidlaw and Kingstone, 2010).
Depending on the stimuli present at the time of the saccade, saccadic trajectories may curve toward or away from distracting stimuli, and the relationship between trajectory deviation and saccade latency shows an increasing linear trend (Walker, McSorley, and Haggard, 2006). The predictability of the target location has also been found to be important in modulating the direction of distractor-related trajectory deviations. If a target location is predictable, then saccades reliably deviate away from the distractor. If, however, the location is unpredictable, then trajectory deviation follows the pattern dictated by the latency of the saccade (Walker, McSorley, and Haggard, 2006). These interactions suggest a sensory contextual dependence of saccadic visual constancy.

1.2.4 Alternate Theories and Synthesis

Results from the experiments listed in the previous sub-sections have influenced new theoretical models of sensorimotor interactions to explain phenomena traditionally attributed solely to EC. For visual positional constancy, the landmark theory and active sensing/referent control (RC) (Feldman, 2009) provide alternatives to classical EC and are discussed below. I argue that a synthesis of a relaxed version of classical EC that inherits attributes from RC and the landmark theory is better able to pose solutions to the problem of visual constancy.

1.2.4.1 Referent Control (RC) and High-Level Interactions

In referent control and active sensing (Feldman, 2009), direct copies of motor commands are not theorized to be used; rather, new positions are set with referent command signals (RCS) that indicate a desired position for the body in some generalized coordinate system. The active sensing theory parallels EC in that visual information is not needed (Feldman, 2016). Additionally, the sensorimotor system would still need to pre-determine how the world would appear post-saccade, implying that both EC and active sensing necessitate highly accurate forward models to explain human behavioral data. While active sensing allows for part of the model to be abstracted to other areas of the brain (Feldman, 2016), it still requires an extremely accurate forward model to predict resulting sensory information following a referent shift. This implies that the brain contains a sufficiently high-accuracy model of eye movements and is able to carefully predict the outcomes of saccadic eye movements and the resulting sensory perceptions, which conflicts with findings to the contrary (see Section 1.2.3 for a review).

Extra-visual RCSs may share features of both decisions and intentions (as they are described in (Carruthers, 2015)). Decisions are momentary events that give rise to intentions. Intentions then initiate corresponding chains of actions to satisfy the original decision. Intentions exist throughout the entire process of acting. There are arguments for describing the extra-visual RCSs as either decisions or intentions: a decision in that information may not need to be retained throughout the entire saccade but only encoded at its beginning, and an intention in that intentions are presumably more specific as they are hierarchically closer and more similar to action and thus may include more accurate information on the spatial and temporal characteristics of the transformation from a pre-saccadic state to a post-saccadic state.
While decisions and intentions can be either for the here-and-now or for the future (Carruthers, 2015), it is generally thought that the extra-visual information being embedded through either ECs or RCs always represents what an agent is immediately about to attempt. Things may not go as planned (e.g., internally, the noisy motor commands may muck things up or, externally, an object may move), but it is certain that an attempt will at least be made. In this case, the resulting behavior can be thought of as reflexive from the time at which the extra-visual signal is generated. However, it remains unclear if contributions from higher-level mechanisms (e.g., working memory or some type of high-level efference copy (HLEC)-like mechanism containing higher-level cognitive or semantic information) are also included in the extra-visual signals that bias the early sensory system around the time of self-generated action. For example, saccadic target selection is tied to higher-level reasoning processes (e.g., attention, goal direction, etc.). While the specific mechanism addressed in the models is reflexive and of a character closer to that of the here-and-now¹, this does not preclude the possibility of using HLEC extra-visual information. For example, in Chapter 4, I experiment with using multiple extra-visual, heterogeneous information sources in a single model and hope to extend the architectures to use new sources of higher-level information that will provide the feed-forward network with context.

¹An implementation intention, or an intention that is moved immediately to the here-and-now when some implementing condition has been met (Carruthers, 2015), is perhaps closer in describing the extra-visual information primarily used in this dissertation.

1.2.4.2 Landmark Theory

The landmark theory (Deubel, Koch, and Bridgeman, 2010) proposes a solution to visual constancy emphasizing the importance of sensory context. The theory suggests that the brain uses visual landmarks to update a visual constancy map following a saccade. By incorporating stimulus context dependence, the theory posits that visual objects present pre-saccade and post-saccade can act as landmarks to localize objects across the visual scene.

The landmark theory suggests an alternative explanation for the sensory dependence of visual constancy and posits that, rather than using an efference copy, the visual system determines constancy by matching a sparse gist of landmarks from the previous fixation point to the new fixation point. Should the points in the first scene be inconsistently projected to the second (presumably according to some threshold), then consistency is not maintained. Here, the state representation is in the visual domain and thus theoretically able to provide the sufficiently complex state representation needed to explain observed experimental results.

1.2.4.3 Mathematical Representation

However, the landmark theory fails to adequately include the feed-forward (Rajkai et al., 2008; Zirnsak and Moore, 2014; Joiner, Cavanaugh, and Wurtz, 2013; Burr, Morgan, and Morrone, 1999) and extra-visual (Bremmer et al., 2009; Duhamel, Colby, and Goldberg, 1992; Ibbotson et al., 2008; Kagan, Gur, and Snodderly, 2008; Kayama et al., 1979; Rajkai et al., 2008; Sommer and Wurtz, 2006) interactions observed experimentally, as well as recent results suggesting visual space constancy maintenance through optimal fusion of visual and extra-visual information (Ostendorf and Dolan, 2015).
This can be emphasized by formalizing a landmark model as:

C = \min_{\theta} \sum_{i,j} \left\| p_j^{t+1} - F[p_i^t; \theta] \right\|^2, \qquad Y = \begin{cases} \text{True}, & \text{if } C \le \phi \\ \text{False}, & \text{if } C > \phi \end{cases} \qquad (1.3)

where p_j^{t+1} is a point in a retinal location post-saccade, p_i^t is a corresponding point in a retinal location pre-saccade, F[\cdot] is again a function that transforms a point p^t to a point p^{t+1} according to parameters \theta, and \phi is some threshold under which constancy is maintained. The landmark theory fails to offer a satisfactory characterization of the relational function F[\cdot] in Eq. 1.3 and fails to provide an explanation of the role of feed-forward extra-visual information.

The landmark theory has two large consequences for visual constancy and odometry: (1) only sparse information from previous fixation points is needed transsaccadically; and (2) motor-related information may not be needed for constancy.

First, very little visual information need be carried between saccades to effectively re-localize objects in the visual scene and maintain egocentric and exocentric visual constancy. A sparse set of landmarks, or key-points, should then be chosen that best allow matching between two scenes. While estimates of the amount of transsaccadic information needed to enable visual constancy remain elusive, the landmark theory assumes at most a sparse gist from previous visual locations is needed. This would mean that following a saccade, the brain must match the gist of previous features to the current scene and generate features for the current scene. Optimally, features should be selected in a frame that will not only remain present in the second frame, but will also be maximally discriminative in that second frame. If operating with limited memory, features should be selectively chosen to maximize inter-frame matching.

Second, feed-forward, intentional information can constrain the scene matching process to provide a better informed decision. The landmark theory discards feed-forward intentional input and generally fails to explain how feed-forward information is used to maintain constancy. This is contrary to the abundance of experimental evidence of extra-visual and feed-forward effects on visual processing during saccades in the early visual system (Burr, Morgan, and Morrone, 1999; Joiner, Cavanaugh, and Wurtz, 2013; Rajkai et al., 2008; Zirnsak and Moore, 2014; Bremmer et al., 2009; Duhamel, Colby, and Goldberg, 1992; Ibbotson et al., 2008; Kagan, Gur, and Snodderly, 2008; Sommer and Wurtz, 2006; Sylvester and Rees, 2006). An unconstrained key-point match between two images across a large temporal window and spatial extent is at least exponentially complex (Brox, Malik, and Bregler, 2009). If all keypoints correspond to static features in both scenes, then the matching problem can be greatly simplified by framing the 2D optical flow problem as a 1D disparity estimation problem, thus turning a 2D exponentially complex problem into a 1D polynomially complex problem (Brox, Malik, and Bregler, 2009). Alternatively, by constraining matching to a particular region of the visual scene, computational requirements can also be reduced. Previous work has applied extra-visual feedback signals from IMUs or GPS (Maimone, Cheng, and Matthies, 2007) and signals from an elementary motion model assuming constant velocities (Davison, 2003) to constrain the matching process. Either considering only static features or constraining the matching process to be consistent with a narrow range of transforms could lead to increased performance relative to computational requirements and processing time.
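The following sketch instantiates the landmark objective of Eq. 1.3 for the special case in which F[p; θ] is restricted to a 2D image-plane translation, so that θ has a closed-form least-squares solution. The landmark coordinates and threshold are invented for illustration and do not come from the datasets used later in this dissertation.

```python
import numpy as np

def landmark_constancy(p_pre, p_post, phi):
    """Sketch of Eq. 1.3 with F[p; theta] restricted to a 2D translation:
    fit theta by least squares over matched landmarks, then threshold the
    residual to decide whether constancy is maintained."""
    theta = np.mean(p_post - p_pre, axis=0)          # least-squares translation
    residuals = p_post - (p_pre + theta)
    C = float(np.sum(residuals ** 2))                # minimized matching cost
    return theta, C, C <= phi

# Illustrative landmarks: the whole scene shifts consistently by (5, -2) pixels,
# so the fitted translation explains the displacement and constancy holds.
p_pre = np.array([[10.0, 20.0], [40.0, 15.0], [25.0, 30.0]])
p_post = p_pre + np.array([5.0, -2.0])
print(landmark_constancy(p_pre, p_post, phi=1.0))

# If one landmark moves independently, it drags the fitted transform and inflates
# the residual, so the constancy test fails -- the bias discussed above.
p_post_moving = p_post.copy()
p_post_moving[2] += np.array([12.0, 9.0])
print(landmark_constancy(p_pre, p_post_moving, phi=1.0))
```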
To further explain the computational consequences of the landmark theory and begin to introduce potential solutions, the next subsection will discuss computer and machine vision approaches to solving the similar problem of visual odometry (VO) (for a more in-depth review, see Fraundorfer and Scaramuzza, 2012).

1.2.5 Visual Odometry

Visual odometry (VO) can be formalized by imagining a rigidly attached camera moving through an environment and recording images at discrete time instants k. The set of images taken by such a camera is then I_{0:n} = \{I_0, ..., I_n\}. Two sequential camera images taken at times k-1 and k are related by the transform T_{k,k-1} \in \mathbb{R}^{4 \times 4} of the form:

T_{k,k-1} = \begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix} \qquad (1.4)

where R_{k,k-1} \in SO(3) is the rotation matrix and t_{k,k-1} \in \mathbb{R}^{3 \times 1} is the translation vector. The main task of VO is to compute the camera transformations T_{k,k-1} from the images I_k and I_{k-1} and then concatenate these transformations to recover the most recent camera pose C_n. A camera pose is determined by beginning with a known camera pose C_0, computing the next camera pose with C_1 = C_0 T_{1,0}, and proceeding iteratively until the current camera pose C_n = C_{n-1} T_{n,n-1} is found.

There are two general approaches to VO. Global methods use the entire image, and feature-based methods only use particular image keypoints or features. Feature-based approaches are both faster and more accurate and will be the focus of this discussion. Ego-motion can be estimated by matching points in a visual scene between two sequential frames and optimizing for a velocity that would result in the points in one scene projecting to their observed locations in the next scene. Approaches can be generalized as feature-tracking and feature-matching. Feature-matching approaches independently find features in all image frames and then match features between frames. Feature-tracking approaches find features in a frame and then detect their locations in subsequent frames. Feature-tracking approaches are best when each frame is spatially close to the previous frame, and feature-matching is best for large translations between frames.

A common feature-tracking approach is the Kanade-Lucas-Tomasi (KLT) tracker (Tomasi and Kanade, 1991) (which is based on the more general Lucas-Kanade (LK) method (Lucas, Kanade, et al., 1981)). Assuming brightness constancy and small motion between frames, optical flow can be calculated by solving for the velocities in the x and y directions (u and v, respectively) given the spatial image gradient at a point and the temporal derivative, using a truncated Taylor series expansion:

0 = I_t(p_i) + \nabla I(p_i) \cdot \begin{bmatrix} u \\ v \end{bmatrix} \qquad (1.5)

This equation is under-determined as there are two unknowns. By including the additional constraint that motion in a neighborhood of pixels should be approximately the same, an overdetermined system of equations can be found:

A d = b \qquad (1.6)

where

A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_n) & I_y(p_n) \end{bmatrix}, \quad d = \begin{bmatrix} u \\ v \end{bmatrix}, \quad b = -\begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_n) \end{bmatrix} \qquad (1.7)

A least-squares minimization of d of the form A^T A d = A^T b will thus provide a reasonable estimate of u and v, i.e., the transform that best explains the movement of features from the first image to the second.
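The sketch below assembles the system of Eqs. 1.5-1.7 for a single image patch and solves the normal equations A^T A d = A^T b via a standard least-squares routine. The synthetic image pair and patch location are stand-ins for real data; a practical KLT implementation would additionally use image pyramids, iterative refinement, and feature selection.

```python
import numpy as np

def lk_patch_flow(img1, img2, y, x, half_win=7):
    """Solve the Lucas-Kanade normal equations (Eqs. 1.5-1.7) for a single
    patch centered at (y, x): stack spatial gradients into A, temporal
    differences into b, and solve A^T A d = A^T b for d = [u, v]."""
    Iy, Ix = np.gradient(img1)                 # spatial image gradients
    It = img2 - img1                           # temporal derivative
    sl = (slice(y - half_win, y + half_win + 1),
          slice(x - half_win, x + half_win + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)  # equivalent to solving A^T A d = A^T b
    return d                                   # estimated [u, v]

# Synthetic example: a smooth intensity pattern translated by (u, v) = (1, 2) pixels.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
img1 = np.sin(0.1 * xx) + np.cos(0.08 * yy)
img2 = np.sin(0.1 * (xx - 1.0)) + np.cos(0.08 * (yy - 2.0))
print(lk_patch_flow(img1, img2, y=32, x=32))   # approximately [1.0, 2.0]
```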
However, the solution to the above system of equations will only approach optimality in the presence of an otherwise static scene. In the presence of independent motion, optimization will be biased by the inconsistent transforms of independently moving feature points compared to the transforms of static points in the scene whose motion is wholly influenced by ego-motion. This is partially because the least-squares approach of the LK-based solutions (Lucas, Kanade, et al., 1981) allows only for zero-mean Gaussian noise in the underlying image data. Structured noise, such as errors from inconsistent motion profiles, violates this condition. Additionally, independently moving points in the scene move inconsistently with their statically moving neighbors, adding additional error to the least-squares fit. Outlier detection can help mitigate the effects from independently moving objects. However, the problem of outlier removal is equivalent to the NP-complete maximum clique problem and thus cannot be solved optimally. Approximate solutions are computationally expensive and can still fail with sufficiently large amounts of independent motion.

Feature-matching approaches similarly fall prey to independent motion. Solutions that best explain a transform from points in one scene to points in another scene will be biased by independent motion and cause any estimate of ego-motion to be biased by inconsistent movement profiles stemming from mixed static and dynamic objects in the visual scene. Furthermore, an accurate matching method must ensure that the same features are present pre- and post-saccade with a high probability. A moving object that may be out of the field of view post-saccade may then be a poor choice as a key-point. If using a sparse transsaccadic memory to transfer landmarks between saccades and aid in localization, then the presence of even a small number of landmarks that correspond to dynamic objects could greatly bias ego-motion estimates.

1.2.6 Feed-Forward Feature Selection

The early visual system could benefit from information on which areas of the visual scene correspond to static features and which to dynamic features. Problems of differentiating between areas of the image plane composed of static versus independent features and choosing features to compare in the future can both be solved with predictive forward modeling in a fused sensorimotor space based on vision and extra-visual inputs. Indeed, evidence from the early visual system suggests a feed-forward solution to this problem. While a data-driven approach is possible, we see changes in the early visual system before the onset of saccadic eye movements (Duhamel, Colby, and Goldberg, 1992; McFarland et al., 2015), suggesting a forward approach where extra-retinal signals (potentially from the superior colliculus (SC)) modulate neural activity in accordance with expected future stimuli. Similarly, results from V4 and IT in primates show a shift in receptive field location pre-saccade (Zirnsak and Moore, 2014; Connor, 2001; Ni, Murray, and Horwitz, 2014). Computationally, forward sensorimotor approaches could allow the early visual system to selectively determine features that will best enable global re-localization following a particular action.

While perceptually humans report experiencing saccadic suppression, this does not preclude the possibility that perisaccadic visual information is available to the visual system. Indeed, recent work in oculomotor adaptation has shown the dependence of saccadic adaptation on visual information briefly presented perisaccadically, demonstrating that not only is information available to the visual system, but that it is actively used (Panouillères et al., 2016).
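As an illustration of the kind of feed-forward feature selection described in this section, the sketch below uses a coarse, efference-like prediction of the global image shift caused by an intended movement to discard keypoints that are expected to leave the field of view. The keypoints, predicted shift, and margin are hypothetical values, and the simple in-view rule is a simplification of the idea rather than a mechanism implemented in this dissertation.

```python
import numpy as np

def select_keypoints(keypoints, predicted_shift, img_shape, margin=5):
    """Illustrative feed-forward feature selection: given keypoint locations
    (x, y) and a coarse prediction of the global image shift caused by an
    intended movement, keep only keypoints expected to remain inside the
    field of view (with a safety margin) after the movement."""
    h, w = img_shape
    predicted = keypoints + predicted_shift
    in_view = ((predicted[:, 0] >= margin) & (predicted[:, 0] < w - margin) &
               (predicted[:, 1] >= margin) & (predicted[:, 1] < h - margin))
    return keypoints[in_view], predicted[in_view]

# A planned leftward camera rotation is predicted to shift image content ~80 px
# to the right; keypoints near the right border would leave the frame and are dropped.
keypoints = np.array([[20.0, 60.0], [300.0, 120.0], [580.0, 200.0]])
kept, expected = select_keypoints(keypoints,
                                  predicted_shift=np.array([80.0, 0.0]),
                                  img_shape=(480, 640))
print(kept)      # the keypoint at x=580 is discarded
print(expected)  # predicted post-movement locations, usable to constrain matching
```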
1.2.7 Neural Networks

An important aspect of learning in a convolutional neural network (CNN) can be thought of as a sub-sampling or noise-reduction problem (Bradley, 2010). Modules in a network can remove signal components that are unrelated to the current task and enhance task-relevant components. One way this can be accomplished is by filtering a signal to reduce information along certain dimensions while preserving, or enhancing, information along others. In other words, by devoting additional bandwidth to certain dimensions of a signal and reducing bandwidth for other dimensions, a signal can be compressed by selectively preserving relevant information. In contrast with most uses of CNNs, the predictive task of this sensorimotor domain requires the network not just to learn a condensed representation of the sensory input space, but instead to selectively learn a compressed space that best allows the application of an efferent-related transform to a target location.

Complex learning machines can be built by stacking filtering modules on top of one another to form a network where each layer transforms an input into a representation at a higher and more abstract level by enhancing relevant information and attenuating irrelevant information. However, applying linear transforms from stacked linear filters is equivalent to a transform by a single linear filter, and a linear classifier can only carve its input space into very simple regions (namely half-spaces separated by a hyperplane (LeCun, Bengio, and Hinton, 2015)). As most real-world problems are non-linear and inadequately approximated with purely linear functions, including a non-linearity after each linear filter has become a common method to enhance a network's discriminative abilities and make its response simultaneously sensitive to minute details and insensitive to large irrelevant variations such as background noise.

A CNN is trained using stochastic gradient descent (SGD) and the back-propagation algorithm, where the loss E is computed and the error gradient is then computed at each layer i with respect to that layer's weight parameters W_i. Weights are updated according to

V_i^{t+1} = \mu V_i^t - \alpha \nabla E(W_i^t) \qquad (1.8)

W_i^{t+1} = W_i^t + V_i^{t+1} \qquad (1.9)

where V_i^{t+1} is the update applied to the weights W_i^t, and \mu and \alpha are hyperparameters representing the momentum and learning rate, respectively (Bottou, 2012).

Computing E at the final layer is relatively straightforward, as we know the true label of the target and can easily compute the error between the output layer's prediction and the true label. To illustrate this, if a network of stacked filters includes n layers, then the final layer of the network (the output layer) produces output activations:

X_n = F_n(X_{n-1}, W_n) \qquad (1.10)

where X_{n-1} is the input to layer n, W_n are the filter weights, and F_n is the function that filters X_{n-1} with W_n. For a dense output, the loss E of the output layer can be computed by taking the L2 error between the network output and the ground-truth training label L:

E = \| L - X_n \|^2 \qquad (1.11)

Then the remaining layers of the network can similarly be defined as

X_i = F_i(X_{i-1}, W_i), \quad \forall i \in [1, n] \qquad (1.12)
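As a concrete, toy instance of Eqs. 1.8-1.11, the sketch below trains a single linear layer with momentum SGD in NumPy. The layer sizes, learning rate, and momentum are arbitrary illustrative choices and are unrelated to the settings used for the networks in later chapters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single "layer" F(X, W) = W X trained to match a target mapping L = W_true X,
# using the L2 loss of Eq. 1.11 and the momentum SGD update of Eqs. 1.8-1.9.
W_true = rng.normal(size=(2, 3))
W = np.zeros((2, 3))          # weights W_i
V = np.zeros_like(W)          # momentum buffer V_i
mu, alpha = 0.9, 0.001        # momentum and learning rate hyperparameters

for step in range(500):
    X = rng.normal(size=(3, 8))           # a mini-batch of inputs
    L = W_true @ X                        # ground-truth training "labels"
    X_out = W @ X                         # forward pass, Eq. 1.10
    E = np.sum((L - X_out) ** 2)          # loss, Eq. 1.11
    grad_W = -2.0 * (L - X_out) @ X.T     # gradient of E with respect to W
    V = mu * V - alpha * grad_W           # Eq. 1.8
    W = W + V                             # Eq. 1.9

print(np.round(W - W_true, 3))            # should be close to zero after training
```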
At lower levels (i.e., every layer except the final output layer), we have no explicit training signal or label with which we can calculate intermediate errors. However, we do have the error from layer n and thus can compute \partial E / \partial X_i for i = n-1. Applying the chain rule, we can then compute \partial E / \partial W_i with:

\frac{\partial E}{\partial W_i} = \frac{\partial E}{\partial X_i} \frac{\partial F_i(X_{i-1}, W_i)}{\partial W_i} \qquad (1.13)

By again applying the chain rule, we can compute \partial E / \partial X_{i-1} as:

\frac{\partial E}{\partial X_{i-1}} = \frac{\partial E}{\partial X_i} \frac{\partial F_i(X_{i-1}, W_i)}{\partial X_{i-1}} \qquad (1.14)

We then have recurrence equations that allow computation of the gradient of the loss function with respect to each layer's weight parameters.

SGD relies on the ability to estimate the gradient of each model parameter and model input with respect to overall model error. However, this requires not only that each function implemented in the network be differentiable and at least nearly smooth; approaching optimality in a learning task also requires that these transforms preserve sufficient information about their inputs. At the very least, the network needs a mechanism to estimate the true generating input so as to better estimate the non-linear transformation function in order to optimize parameters with respect to that function. This can also be thought of in terms of identifiability, where the range of a function's identifiability can be considered a correlate of the function's ability to reconstruct its input. This partially explains the successes of the rectified linear unit (ReLU) in deep networks trained with SGD, where its inclusion has led to increased classification performance (He et al., 2014; He et al., 2015; Maas, Hannun, and Ng, 2013; Sermanet et al., 2013; Simonyan and Zisserman, 2014). The ReL function is generally implemented as:

Y = \max(0, WX) \qquad (1.15)

where W could be a scalar or vector of weights and X is some input. Taking \partial Y / \partial X yields

\frac{\partial Y}{\partial X} = \begin{cases} W, & \text{if } WX > 0 \\ 0, & \text{if } WX \le 0 \end{cases} \qquad (1.16)

where the derivative is W when WX is greater than zero and zero when WX is less than or equal to zero. Thus, when the output is greater than zero, the ReLU non-linearity preserves substantial information about its input through its gradient. Put another way, the linear operation with a subsequent ReL is identifiable when the unit is active, leading to enhanced differentiability. Additionally, with a random network initialization, only 50% of units in the network are active at a given time (Glorot, Bordes, and Bengio, 2011).

1.3 Chapter Summaries

1.3.1 Chapter 2 - DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence

Chapter 2, titled DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence, introduces the DeepEfference architecture. DeepEfference is a bio-inspired, unsupervised, deep sensorimotor network that learns to predict the sensory consequences of self-generated actions. DeepEfference computes dense image correspondences (Scharstein and Szeliski, 2002) at over 500 Hz and uses only a single monocular grayscale image and a low-dimensional extra-modal motion estimate as data inputs. Designed for robotic applications, DeepEfference employs multi-level fusion via two parallel pathways to learn dense, pixel-level predictions and correspondences between source and target images. Quantitative and qualitative results from the SceneNet RGBD (McCormac et al., 2016) and KITTI Odometry (Geiger et al., 2013) datasets are presented and demonstrate an approximate runtime decrease of over 20,000% with only a 12% increase in mean pixel matching error compared to DeepMatching (Revaud et al., 2016) on KITTI Odometry.
26 1.3.2 Chapter 3 - A Deep Neural Network Approach to Fusing Vision and Heteroscedastic Motion Estimates for Low-SWaP Robotic Ap- plications Chapter 3, titled A Deep Neural Network Approach to Fusing Vision and Het- eroscedastic Motion Estimates for Low-SWaP Robotic Applications, presents Multi- Hypothesis DeepEfference (MHDE) which is a multi-hypothesis extension of the DeepEf- ference architecture that learns to intelligently combine noisy heterogeneous sensor data to predict several probable hypotheses for the dense, pixel-level correspondence between a source image and an unseen target image. MHDE is augmented to handle dynamic, het- eroscedastic sensor and motion noise and compute hypothesis image mappings and pre- dictions at 150-400 Hz depending on the number of hypotheses being generated. MHDE fuses noisy, heterogeneous sensory inputs using two parallel architectural pathways and n (1, 2, 4, or 8 in this work) multi-hypothesis generation subpathways to generate n pixel- level predictions and correspondences between source and target images. I evaluated MHDE on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against DeepEfference (Shamwell, Nothwang, and Perlis, 2017) and DeepMatching (Revaud et al., 2016) by root mean squared (RMSE) pixel error and runtime. MHDE with 8 hypothe- ses outperformed DeepEfference in root mean squared (RMSE) pixel error by 103% in the maximum heteroscedastic noise condition and by 18% in the noise-free condition. MHDE with 8 hypotheses was over 5,000% faster than DeepMatching with only a 3% increase in RMSE. 27 1.3.3 Chapter 4 - An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics Chapter 4, titled An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics, introduces Inertial DeepEfference (IDE), which builds upon the DeepEfference and Multi-Hypothesis DeepEfference architectures by using raw sensor information. IDE is unique in that it uses real sensor data for an end- to-end trainable dense correspondence network that is orders of magnitude faster than other SOA deep approaches. We evaluated IDE on the EuRoC MAV dataset (Burri et al., 2016) and on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against the deformable spatial pyramid (DSP) (Kim et al., 2013) and DeepMatching (Re- vaud et al., 2016) dense correspondence approaches. The IMU+surrogate feed-forward motor signal (FFMS) IDE network was 167x faster than DM and 516x faster than DSP. 28 Chapter 2: DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence 2.1 Abstract As the human eyeball saccades across the visual scene, humans maintain egocentric visual positional constancy despite retinal motion identical to an egocentric shift of the scene. Characterizing the underlying biological computations enabling visual constancy can inform methods of robotic localization by serving as a model for intelligently inte- grating complimentary, heterogeneous information. Here we present DeepEfference, a bio-inspired, unsupervised, deep sensorimotor network that learns to predict the sensory consequences of self-generated actions. DeepEfference computes dense image correspon- dences (Scharstein and Szeliski, 2002) at over 500 Hz and uses only a single monocular grayscale image and a low-dimensional extra-modal motion estimate as data inputs. 
De- signed for robotic applications, DeepEfference employs multi-level fusion via two parallel pathways to learn dense, pixel-level predictions and correspondences between source and target images. We present quantitative and qualitative results from the SceneNet RGBD (McCormac et al., 2016) and KITTI Odometry (Geiger et al., 2013) datasets and demon- strate an approximate runtime decrease of over 20,000% with only a 12% increase in 29 mean pixel matching error compared to DeepMatching (Revaud et al., 2016) on KITTI Odometry. 2.2 Introduction (A) (B) (C) (D) (E) Figure 2.1: Sample results from KITTI Odometry. A: Sample source image. B: Sample target image. C: DeepEfference output reconstruction of the source image in A using pixel intensity values sampled from the target image in B. D and E: Source and target images with marked correspondence points computed by DeepEfference. For an autonomous agent (be it robotic or organic), understanding how self-produced 30 actions affect the environment is critically important to survival and successful operation in the real-world. Similarly important is an agent’s understanding of how its actions affect its sensory perceptions. In the case of visual positional constancy, this corresponds to an agent’s ability to separate motion across the retinal plane induced by self-motion (e.g., from a saccade) from motion induced externally (e.g., from a charging predator). A comparable understanding of the perceptual consequences of self-induced actions could be used by autonomous robots to measure action-based and perceptual anomalies (among others). Take for example the act of turning to the left. In the case of the former, this action should result in an object within the visual field-of-view (e.g., a soda can) shifting to the right on the imaging plane by a commensurate amount. If this shift does not occur, it could mean that the action was not properly performed and that there could be a problem with the system’s actuators. For the latter, the same expectation violation might mean that the soda can moved independently (e.g., was blown by the wind or kicked by a passerby) and subsequently, would be a poor choice as a landmark for visual dead-reckoning1. We argue that biological mechanisms supporting visual constancy contain a rich, egocentric representation of the environment and with appropriate models and compu- tational architectures, these representations can be extracted to enable enhanced robotic 1While the immediate focus of this paper is on sensorimotor modeling and prediction in the visual domain, this work is situated within a broader class of issues including that of how an agent can respond appropriately to anomalies in a complex world (see (Kohashi and Oda, 2008; Korn and Faber, 2005) for Mauthner Cell anomaly detectors in teleost fish; (Brody, Perlis, and Shamwell, 2015; Brody et al., 2016) for EC in the auditory domain for human-robot interaction; (Kumar et al., 2015; Irani and Anandan, 1998; Nelson, 1991) for independent motion detection). 31 visual navigation and localization (e.g., dead-reckoning). Humans maintain perceptual stability and visual constancy despite the 3-4 saccades the human eyeball undergoes per second. 
While the shift in the projection of the visual world on the retina elicited by a saccade is identical to the shift that alternatively would be elicited by a quick, external shift of the visual world, humans are able to perceptually distinguish between the two conditions and perceive a stable world in the first, and a moving world in the second. The apparent conundrums of human visual positional constancy can be resolved when considering humans as complex, embodied agents with access to information from multiple overlapping sensory modalities including vision, audition, proprioception, ‘thought perception’(Bhargava et al., 2012; Shamwell et al., 2012), and intentional/motor informa- tion (Holst and Mittelstaedt, 1950; Sperry, 1950; Brody, Perlis, and Shamwell, 2015). For example, Efference Copy (EC) (Holst and Mittelstaedt, 1950) and the closely related Corollary Discharge (CD) (Sperry, 1950) neural theories have long been implicated in the brain’s ability to maintain visual positional constancy trans-saccadically. EC and CD posit that early sensory centers access extra-modal information about intended actions to influence subsequent sensory processing by priming early sensory centers with a prior on expected incoming sensory signals. We drew inspiration from the theories of EC and CD in developing a computational solution to robotic visual localization. Similar to how EC can be used by biological systems to estimate expected sensory information, robotic systems often have access to information with which they can glean an estimate of ego-motion from a separate, non- visual modality. If intelligently integrated, this extra-visual estimate can serve as a prior on expected post-movement visual perceptions and improve visual motion estimates. 32 Paralleling the biological theory of EC where visual processing centers receive motion/intention-related information to aid in sensory processing, we have designed Deep- Efference as an unsupervised, feed-forward, heterogeneous, deep network that computes dense correspondence (Scharstein and Szeliski, 2002) and performs next-frame prediction at over 500 Hz. Critical to achieving this update rate, DeepEfference uses monocular images and only processes the source image from each pair. The network learns (x,y) pixel locations of where to sample in the target image to best reconstruct the source image. This translates to learning which pixels in the source image best correspond to the target image, and thus, a correspondence mapping between source and target images. The remainder of the paper is organized as follows: Section 2.3 describes the moti- vations for this work; Section 2.4 outlines the DeepEfference network architecture; Sec- tion 2.5 describes the datasets and experiments used for validation; Section 2.6 discusses results from the validation experiments; and Section 2.7 offers concluding thoughts and directions for future work. 2.3 Background 2.3.1 Deep Approaches to Spatial Transformation Encoding and Learn- ing Learning spatial transformations and relationships between successive images has been a topic of great interest both in computer vision and robotics and deep, bio-inspired, solutions have already begun to show promise for the correspondence problem (see (Scharstein 33 and Szeliski, 2002) for a review of the correspondence problem). 
In computer vision, multiplicative interactions have been used to great success for relationship learning be- tween images (Memisevic and Hinton, 2007; Ranzato and Hinton, 2010; Memisevic and Hinton, 2010; Hinton, Krizhevsky, and Wang, 2011; Kivinen and Williams, 2011; Memi- sevic, 2013). However, both the initial and transformed image are required as inputs and there is no readily-available means to provide the model extra information from another modality as a motion prior. Both points (but in particular the latter) have implications for the correspondence problem and image relationships for robotics. These and other deep approaches (Revaud et al., 2015; Revaud et al., 2016) to spatial transformation encoding rely on siamese-like networks where both source and target images are available and computed on. If we want to deploy deep approaches on SWaP-constrained systems, networks require significant size reductions. 2.3.2 Extra-Visual Motion Estimates For any two visual measurements taken successively, robots often have an indepen- dent measurement of self-motion between those two images. These measurements could come from IMUs, GPS, LIDAR, ultrasonic ranging sensor, or input motor commands. When estimating motion based on the movement of feature points on the visual imaging plane, additional non-visual motion estimates could be used as a prior for estimating cam- era motion. Similar in spirit to this work is (Ciliberto et al., 2012) where heteroscedastic models were learned for independent motion detection for an actuated camera. However, camera motions were limited to pure rotations, which are not affected by varying depths 34 within a scene. Additionally, (Ciliberto et al., 2012) was constrained to use Gaussian process models and was not end-to-end trainable. 2.4 Approach Deconvolutional decoderConvolutional encoder Source image Local Pixel Shifts Reconstructed source image Bilinear Sampler Transform input Localization pathway 2D Affine matrix Spatial transformer Global pixel shifts Target image for sampling Figure 2.2: DeepEfference network diagram showing the linked global and local learners DeepEfference is an unsupervised, deep heterogeneous neural network that learns to predict how source images correspond to unseen (i.e., unprocessed) target images. Rather than learning how to transform each pixel (e.g., via a fully-connected layer), we employ a trainable 2D spatial transformer to impose a global estimate of image motion. Inspired by the Landmark Theory of visual positional constancy (Bridgeman, Heijden, and Velichkovsky, 1994), DeepEfference carries only a sparse gist of the previous visual scene forward and instead uses the currently perceived image from which to sample. As shown in Fig. 2.2, DeepEfference has two interconnected pathways: one for de- 35 termining the global 2x3 affine 2D transformation matrix, and a second encoder-decoder pathway that predicts local, pixel-level shifts to be applied to the affine-transformed im- age. The network does not generate images from scratch, but rather learns how to sample from a target image to recreate the initial image. Given a source image and an estimated transform, DeepEfference learns coordinates (x, y) at which to sample in a target image to reconstruct the source image. The result is a correspondence map between pixel loca- tions in the source image and pixels in the target image (see Fig. 2.1 for example learned correspondences). 
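To make the sampling-as-correspondence idea concrete, the short NumPy sketch below uses a hand-built sampling map in place of the coordinates DeepEfference would predict; reconstructing the source by gathering target pixels at those coordinates and reading the same map as a displacement field are one and the same operation. The network itself uses differentiable bilinear sampling rather than the nearest-neighbour gather shown here.

# Illustrative only: 'sample_x'/'sample_y' stand in for the coordinates the
# network would learn; here they encode a hand-built 2-pixel horizontal shift.
import numpy as np

H, W = 64, 64
target = np.random.rand(H, W)              # target image I_t
ys, xs = np.mgrid[0:H, 0:W]                # pixel grid of the source image

# Dense sampling map: for each source pixel, where to look in the target.
sample_x = np.clip(xs + 2, 0, W - 1)
sample_y = ys

# Reconstruction of the source by gathering target pixels at those locations.
reconstruction = target[sample_y, sample_x]

# The same map, read the other way, is a dense correspondence field:
# source pixel (y, x)  <->  target pixel (sample_y[y, x], sample_x[y, x]).
flow_x = sample_x - xs                     # per-pixel displacement in x
flow_y = sample_y - ys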
2.4.1 Training and Loss Rule DeepEfference is trained to minimize reconstruction errors between a given source image and a reconstruction of that source image generated by selectively sampling from a target image. We compute Euclidean error and use it to train the network via backprop- agation. DeepEfference is trained by minimizing the following loss function: L(θ, It, Is) = argmin θ ‖Ir(θ, It)− Is‖2 (2.1) where Ir is an image reconstruction, It is the image target, and Is is the image source being reconstructed. 2.4.2 Pathway 1: Local, Pixel-Level Shifter DeepEfference’s first pathway provides localized, object-level shift information. We implemented this pathway as a convolutional-deconvolutional encoder-decoder. The 36 encoder compresses the source image through a series of convolutional filtering opera- tions and the decoder generates magnitudes of pixel shifts by expanding the compressed convolutional outputs using deconvolutions2. We used five convolutional layers followed by five deconvolutional layers. All convolutional and deconvolutional layers used filters of size 3, pad of 1, and stride of 2. The first convolutional layer outputted 32 feature maps and the number of output maps doubled for each subsequent convolutional layer with the fifth and final convolutional layer outputting 512 feature maps. The output sizes of the generative deconvolutional layers were arranged oppositely with the first layer outputting 512 maps and the final layer outputting 32 maps. The local, pixel-level shifts of DeepEfference are similar to mappings learned by the recent view synthesis method (Zhou et al., 2016) that has been used to render new, unseen views of objects and scenes. Besides different network structures and inclusion of a global spatial transformer module, the largest difference between their method and our own is that rather than learning to generate novel viewpoints of objects or scenes, we learn how to reconstruct a source image using pixel locations in a target image. 2.4.3 Pathway 2: Global Spatial Transformer While the output of the first encoder/decoder pathway provides estimates of lo- calized, pixel-level movement to account for depth, non-rigidity, etc., the second spa- tial transformer (ST) pathway provides an estimate of the global transformation between source and target images. 2We use the term ’deconvolution’ as is common in the deep learning literature but the operation we use is more properly referred to as a transposed convolution 37 ST modules (Jaderberg et al., 2015) enable parametrized geometric transformations to be applied to inputs or intermediate feature-maps in deep networks. While the param- eters for the geometric transformation can either be learned or provided to the network as an input, we provided the network with an estimate of the true 3D transformation between source and target images (δx, δy, δz, δα, δβ, δγ) and used four fully-connected layers (each followed by a rectified linear unit (ReLU) (Glorot, Bordes, and Bengio, 2011)) to approximate the true linear 3D warp matrix as a 2D affine transformation. However, the failure of a 2D ST-only approach is seen with translational camera movements in scenes with varying depth. Following a camera translation, the new loca- tion of an object in the image frame will depend on its distance from the camera: objects that are closer to the camera exhibit greater displacements on the imaging plane compared to objects further from the camera. 
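Before turning to the ST pathway, the encoder-decoder pathway of Section 2.4.2 can be sketched roughly as below using tf.keras; the actual implementation used Caffe, and the final two-channel output head standing in for the per-pixel (x, y) shifts is an assumption added here for illustration.

# Rough sketch (not the Caffe implementation used in this work).
import tensorflow as tf

def local_shift_pathway(input_shape=(224, 224, 1)):
    """Five stride-2 convolutions (32 -> 512 maps) mirrored by five
    transposed convolutions (512 -> 32 maps), filter size 3 (Section 2.4.2)."""
    x = inp = tf.keras.Input(shape=input_shape)
    for maps in (32, 64, 128, 256, 512):                       # encoder
        x = tf.keras.layers.Conv2D(maps, 3, strides=2, padding="same",
                                   activation="relu")(x)
    for maps in (512, 256, 128, 64, 32):                       # decoder
        x = tf.keras.layers.Conv2DTranspose(maps, 3, strides=2, padding="same",
                                            activation="relu")(x)
    # Two output channels standing in for per-pixel (dx, dy) shifts (assumed).
    shifts = tf.keras.layers.Conv2D(2, 3, padding="same")(x)
    return tf.keras.Model(inp, shifts)

model = local_shift_pathway()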
A purely 2D affine transformation in the absence of depth cannot accurately warp an image with varied scene depths and thus the localization pathway can at best learn parameters that correspond to a dominant plane of a fixed depth. As we show, ST modules can be used to efficiently embed action or motor in- formation in a standard deep network that may then be trained end-to-end with back- propagation. These motor-related signals can be derived from high-level actions, referent signals to PID controllers, GPS, or IMU measurements. We implemented an ST module in Caffe (Jia et al., 2014) using Nvidia CUDA Deep Neural Network library (cuDNN) primitives. We created two layers, one to perform the affine transformation and output target coordinates and a second layer that performs the bilinear sampling given coordinates and an image. Three fully connected layers take the input estimate 3D camera pose transformation and generate a 2x3 2D affine transforma- 38 tion matrix for the spatial transformer module. Although the sampling component of our ST module takes an input image as input, no learnable parameters are based on image content and thus our global pathway is a function only of the input transformation esti- mate and is image content-independent. 2.5 Experiments We primarily experimented with two different network architectures. The first ar- chitecture, LightEfference, only used the first global pathway. The second architecture, DeepEfference, implemented both the global pathway and the local pathway. LightEf- ference and DeepEfference were evaluated on the SceneNet RGB-D (McCormac et al., 2016) and KITTI Odometry (Geiger et al., 2013) datasets and compared against corre- spondence matching results from the SOA DeepMatching approach (Revaud et al., 2016). We also experimented with a third architecture that only used the local pathway. However, networks trained with this architecture failed to converge or decrease network loss (see Section 2.7 for additional discussion on this point). SceneNet RGB-D (McCormac et al., 2016) is a dataset of 5 million photo-realistically rendered images from a dynamically moving camera in a total of 15 different scenes. Im- ages in SceneNet are rendered at 1 Hz and groundtruth camera pose and depth are pro- vided at each camera exposure. All objects in the visual scenes are rigid, thus fulfilling the static scene assumption and allowing for ground truth to be computed from scene depth and camera position (described in Section 2.5.1). The KITTI Visual Odometry dataset (Geiger et al., 2013) is a benchmark dataset 39 for the evaluation of visual odometry and LIDAR-based navigation algorithms. Images in KITTI were captured at 10 Hz from a Volkswagen Passat B6 as it traversed city, resi- dential, road, and campus environments. Groundtruth poses at each camera exposure are provided by an RTK GPS solution and depth is provided with coincident data from a Velo- dyne laser scanner. Groundtruth pixel projections were calculated just as for SceneNet. 2.5.1 Experimental Methods For SceneNet and KITTI, data was separated into train (80%) and test (20%) sets. For SceneNet RGB-D, we used a total of 44, 850 image pairs with 80% (35, 880) for training and 20% (8, 970) for testing. For KITTI, we used a total of 23, 190 image pairs with 80% (18, 552) for training and 20% (4, 638) for testing. In all experiments, we randomly selected an image for the source, used the successive image for the target, and subtracted the two 6-DOF camera poses for the transform input. 
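Returning briefly to the ST module of Section 2.4.3, its two custom layers can be sketched in plain NumPy as follows: an affine layer that maps a normalized source-pixel grid through a 2x3 matrix to target sampling coordinates, and a bilinear sampler that reads the target image at those coordinates. The image and the near-identity transform below are placeholders, not values from the experiments.

import numpy as np

def affine_grid(theta, H, W):
    """Apply a 2x3 affine matrix to a normalized [-1, 1] pixel grid,
    returning (x, y) sampling coordinates for every output pixel."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])   # 3 x (H*W)
    xt, yt = theta @ grid                                       # 2 x (H*W)
    return xt.reshape(H, W), yt.reshape(H, W)

def bilinear_sample(img, xt, yt):
    """Bilinear sampling of img at normalized coordinates (xt, yt)."""
    H, W = img.shape
    x = (xt + 1) * 0.5 * (W - 1)            # back to pixel coordinates
    y = (yt + 1) * 0.5 * (H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2); x1 = x0 + 1
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2); y1 = y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

img = np.random.rand(224, 224)
theta = np.array([[1.0, 0.0, 0.05],         # illustrative near-identity warp
                  [0.0, 1.0, 0.00]])
xt, yt = affine_grid(theta, 224, 224)
reconstruction = bilinear_sample(img, xt, yt)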
For each image in each dataset, we cropped the middle 224x224 pixel region for network inputs. Predicted pixel correspondences between source and target images were evalu- ated against groundtruth correspondence and SOA DeepMatching correspondence pre- dictions. With access to scene depth and true camera pose for both KITTI and SceneNet, groundtruth pixel shifts were calculated by applying a 3D warp to 3D pixel locations in the source images to generate the expected pixel locations in the target images. We projected each 3D point in the frame of camerat0 to the world frame using the derived projection matrix for camerat0 and then reprojected these points in the world frame to camerat1 using the inverse projection matrix for camerat1. Finally, we transformed points in the 40 frame of camerat1 to the image plane. This resulted in a correspondence map between pixel locations in camerat0 and camerat1 for each point where depth was available (e.g., where ray tracing did not go to infinity in the case of SceneNet or depth was outside of the Velodyne laser scanner’s range for KITTI). We evaluated DeepEfference and LightEfference using keypoints generated us- ing the feature points detected by DeepMatching and from the accelerated segment test (FAST) feature detector (Rosten, Porter, and Drummond, 2010). For each type of feature, we measured how the keypoints detected in the source images were projected into the target images. The projection errors were compared to groundtruth projections and were used to determine mean pixel errors for each method. We trained DeepEfference and LightEfference for 500, 000 iterations on KITTI Odometry scenes 1 − 11 and a subset of SceneNet RGBD (10 randomly selected tra- jectories of 300 image pairs for each of the 15 different scene types). We used the Adam solver with batch size=32, momentum=0.9, momentum=0.99, gamma=0.5, and a step learning rate policy of 100, 000 for all experiments. We used a Euclidean loss rule to train all networks. All experiments were performed with a Nvidia Titan X GPU and the Caffe deep learning framework (Jia et al., 2014). 2.6 Results Tab. 2.6 details the comparative runtimes between DeepMatching, LightEfference, and DeepEfference3. Fig. 2.3 and Tab. 2.2 detail the predictive error for LightEffer- 3We used a CPU version of DeepMatching for these comparisons. The latest available version of the GPU implementation of DeepMatching took over 7 seconds per image to run on our workstation on a 41 Figure 2.3: Pixel error boxplots for DM, LE, and DE using DM and FAST keypoints. Y-axis is actual mean pixel error. Middle lines are the medians and whiskers indicate 1.5 interquartile of the lower and upper quartiles. Note outliers are not shown for clarity and instead minimum and maximums are presented in the table below. ence, DeepEfference, and DeepMatching on the KITTI Odometry and SceneNet RGBD datasets. Using keypoints generated by DeepMatching on KITTI, DeepEfference shows a 1,100% performance increase in mean pixel error over LightEfference (significant with t(2.94e5) = 60.01, p < 1e−5)4 while DeepMatching showed a 12% increase over Deep- Efference (significant with t(5.33e5) = 31.33, p < 1e−5). When using FAST keypoints, DeepEfference outperformed LightEfference by 1,200% on KITTI odometry (significant with t(6.18e5) = 87.87, p < 1e−5). However, in runtime performance, LightEfference was 447% faster than DeepEfference, and DeepEfference was over 23,000% faster than DeepMatching. 
256x256 image so we elected to use the faster CPU version for all experiments 4The distributions of pixel errors appeared non-uniform but as we had greater than 200,000 samples to test in each condition, we elected to include t-test analysis. We used Welch’s t-test where the degrees of freedom are approximated by Satterthwaite’s method. 42 Table 2.1: Average runtimes for DM, LE, and DE Mean Standard Deviation Median Frames Per Second DM 0.35225 sec. 0.0094525 sec. 0.351561 2.8 LE 0.000332 sec. 2.95788e-05 sec. 0.000318 3012 DE 0.0014864 sec. 2.03606e-05 sec. 0.00148484 672 The performance gap between LightEfference and DeepEfference narrowed on the SceneNet dataset. DeepEfference outperformed LightEfference by 11% (significant with t(2.78e6) = 58.65, p < 1e−5) and was outperformed by DeepMatching by 240% (sig- nificant with t(2.70e6) = 382.56, p < 1e−5) using DeepMatching generated keypoints. With FAST keypoints, DeepEfference performed only 13% better than LightEfference (significant with t(4.76e6) = 68.64, p < 1e−5). The performance differences between DeepEfference and DeepMatching may be influenced by the differences in movement statistics between the SceneNet and KITTI datasets. For SceneNet, rotational speeds5 (in deg/s) were typically much larger (mean=5.14, std=9.82, median=2.843, min=0.039, max=177.84) while translational speeds (in m/s) were smaller (mean=0.16, std=0.091, median=0.14, min=0.003, max=0.66). Motions in KITTI followed reversed distributions where rotational speeds (in deg/s) were typically small (mean=0.997, std=6.65, median=0.086, min=0.0001, max=85.72) while transla- tional speeds (in m/s) were much larger (mean=9.22, std=4.20, median=9.04, min=0.005, 5SceneNet only contains renders of 1 in every 25 frames so these quantities are based on the differences in positions between successively rendered frames 43 max=26.41). The larger variety of movements in SceneNet may have proved too difficult for the current version of DeepEfference to learn a consistent motion model. Future work will include experiments on deeper versions of DeepEfference. The similar performance of LightEfference and DeepEfference on SceneNet may have been influenced by the depth of objects in the dataset. Image scenes in KITTI had both larger and more varied depths (mean=12.97, std=10.005, median=9.46, min=5.00, max=79.99) compared to SceneNet (mean=3.92, std=2.45, median=3.40, min=0.00, max=19.99). This might explain why LightEfference’s average pixel prediction error was within 11- 13% of DeepEfference as errors from translations of objects at different depths would have been smaller. For example, the green boxes in Fig. 2.4 highlight an unusual instance in SceneNet where LightEfference was unable to rectify the depth differences between the foreground features and background features. The poorer performance of DeepEfference and LightEfference on SceneNet may also be due to shifts between successive images resulting in objects in the source image no longer being present in the target image. In SceneNet, transformations between successive camera frames often resulted in occlusions of large areas of the field of view which may have led DeepEfference and LightEfference to incorrectly sample from the target images. We were surprised by the large performance difference between DeepEfference and LightEfference on the KITTI dataset. 
As LightEfference is composed entirely of fully-connected layers and neither network uses dropout, one possibility is that LightEfference overfit to the training dataset. This is unlikely, as DeepEfference's training Euclidean loss on KITTI was ≈50 at 500,000 training iterations while LightEfference's loss was ≈5x larger at ≈250. This suggests that LightEfference was also performing poorly on the training data and thus most likely not overfitting.

A second possibility is that the large range of depths in KITTI scenes prevented the limited, 2D-only transformations of LightEfference from learning a single coherent transformation model. This possibility is supported by the large number of outliers produced by LightEfference, which suggests that LightEfference was unable to successfully process the full range of input transforms. For DeepEfference, the mean pixel error scores at the 5th, 25th, 50th, and 99th percentiles were percentile(5) = 0.45, percentile(25) = 1.07, percentile(50) = 1.74, and percentile(99) = 11.69, while for LightEfference they were percentile(5) = 0.51, percentile(25) = 1.22, percentile(50) = 2.21, and percentile(99) = 1109.48. While LightEfference has higher error scores across percentiles, it is the score at the 99th percentile that demonstrates its large number of high-error outliers, which cause its mean pixel error to be an order of magnitude greater than that of DeepMatching and DeepEfference.

2.7 Discussion

We have shown that providing a network with heterogeneous inputs and combining a parametrized global transformation pathway with a pixel-level, local pathway allows for far more computationally efficient predictions with minimal degradation in predictive performance.

Agents must understand which elements in the environment their actions do and do not have the power to affect. A potentially powerful future use for DeepEfference lies in teaching systems what they can and cannot interact with. While deep learning approaches have traditionally been limited in their applications by their need for large, annotated training sets, DeepEfference's ability to learn without supervision can allow robots to learn meaningful sensorimotor relationships via bootstrapping, simply by operating in an environment.

While the aim of DeepEfference is to generate correspondences between pixels in the source image and pixels in the target image, an unintended side-effect of the predictive training is the generation of image areas where there is no actual overlap between source and target images. Several of these cases are shown in Fig. 2.4. In the first row of Fig. 2.4, DeepEfference learned to sample from different areas in the target image to imagine what the front of the van looked like despite it not being present in the target image. The same can be seen in the second row, where DeepEfference imagines what the left side of the house looked like.

Performance of the dual-pathway DeepEfference architecture surpassed the global-pathway-only LightEfference architecture in all experimental conditions. As mentioned briefly, we also attempted to use a local-pathway-only architecture, but networks trained with this local-only architecture failed to converge. Additional work is needed to determine why these networks failed to learn, but we suspect that it is due to the fully-connected layers attempting to learn a complex pixel-level transform beyond their capacity.

Currently, the actual displacements of the camera between source and target frames are used as transform inputs to DeepEfference.
One possible source for this information in real-world robotic applications is from IMUs. However, constant velocity motions may prove difficult for DeepEfference if the expected transforms are being generated from IMU signals. In these cases, it may instead be possible to use a motor command as a 46 surrogate transform signal, but this has yet to be investigated. Finally, the transform estimates fed to DeepEfference are computed from ground- truth camera poses and do not exhibit noise characteristics that will most likely be found in real-world applications where we will rarely, if ever, have access to a comparably clean extra-visual motion measurement. One possibility for overcoming measurement noise is to expand DeepEfference to produce n image reconstruction predictions and include an additional decision node that chooses the best reconstruction. DeepEfference’s runtime could allow for many possible reconstructions to be generated similar to learning and sampling from a noise distribution. 47 Table 2.2: Pixel errors for DM, LE, and DE on KITTI and SceneNet KITTI DeepMatching Keypoints Mean Standard Deviation Median Min Max DeepMatching 2.1 2.6 1.6 0.0 187.1 LightEfference 29.2 242.3 2.2 0.0 4661.4 DeepEfference 2.3 3.7 1.7 0.0 268.2 KITTI FAST Keypoints Mean Standard Deviation Median Min Max LightEfference 31.6 261.3 2.0 0.0 4791.2 DeepEfference 2.4 3.7 1.7 0.0 251.3 SN DeepMatching Keypoints Mean Standard Deviation Median Min Max DeepMatching 3.3 16.2 1.3 0.0 2620.9 LightEfference 12.9 19.8 7.8 0.0 2596.1 DeepEfference 11.5 19.2 5.9 0.0 2691.8 SN FAST Keypoints Mean Standard Deviation Median Min Max LightEfference 14.8 28.1 8.0 0.0 2957.8 DeepEfference 13.1 26.2 5.9 0.0 3055.4 48 Figure 2.4: Unusual DE and LE sample results from KITTI and SceneNet. The red boxes highlight areas in the reconstructed image that were imagined by DeepEfference. Green boxes highlight instances where DE was able to better predict object positions in a scene with strong depth contrast while LE generated a poorer reconstruction. The last row shows a failure case where both LE and DE were unable to generate a reconstruction. This is most likely due to the extreme transformation between the camera at the time of source image capture versus target image capture. 49 Chapter 3: A Deep Neural Network Approach to Fusing Vision and Het- eroscedastic Motion Estimates for Low-SWaP Robotic Ap- plications 3.1 Abstract Due both to the speed and quality of their sensors and restrictive on-board com- putational capabilities, current state-of-the-art (SOA) size, weight, and power (SWaP) constrained autonomous robotic systems are limited in their abilities to sample, fuse, and analyze sensory data for state estimation. Aimed at improving SWaP-constrained robotic state estimation, we present Multi-Hypothesis DeepEfference (MHDE) - an unsu- pervised, deep convolutional-deconvolutional sensor fusion network that learns to intel- ligently combine noisy heterogeneous sensor data to predict several probable hypotheses for the dense, pixel-level correspondence between a source image and an unseen tar- get image. This new multi-hypothesis formulation of our previous architecture, Deep- Efference (Shamwell, Nothwang, and Perlis, 2017), has been augmented to handle dy- namic heteroscedastic sensor and motion noise and computes hypothesis image mappings and predictions at 150-400 Hz depending on the number of hypotheses being generated. 
MHDE fuses noisy, heterogeneous sensory inputs using two parallel architectural path- 50 ways and n (1, 2, 4, or 8 in this work) multi-hypothesis generation subpathways to gener- ate n pixel-level predictions and correspondences between source and target images. We evaluated MHDE on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against DeepEfference (Shamwell, Nothwang, and Perlis, 2017) and DeepMatching (Revaud et al., 2016) by mean pixel error and runtime. MHDE with 8 hypotheses out- performed DeepEfference in root mean squared (RMSE) pixel error by 103% in the max- imum heteroscedastic noise condition and by 18% in the noise-free condition. MHDE with 8 hypotheses was over 5, 000% faster than DeepMatching with only a 3% increase in RMSE. 3.2 Introduction The sensing and processing pipelines of autonomous and semi-autonomous robotic systems pose a fundamental limit on how fast these systems may safely travel through an environment. For example, when moving at 20 m/s, a 30 Hz sensor-derived state estimate update rate means that a given robot will travel 0.66 meters between state updates. While traveling those 0.66 meters, the robot will effectively be blind to any unexpected changes in the environment (e.g., a tree branch blown by a wind gust or an unexpectedly opened door). As a result, current size, weight, and power (SWaP) constrained autonomous and semi-autonomous robotic systems are forced to move very slowly through their environ- ments. The slow operational speeds of SWaP-constrained autonomous systems are espe- cially pronounced for mobile robots operating in dynamic, gps-/communications-denied 51 environments where safe navigation must be performed only with on-board sensors and computational resources. For unmanned aerial vehicles (UAVs), navigation is typically performed through a fusion of visual odometry (VO) estimates, inertial measurements, and simplified predictive linear motion models in a Kalman filter framework. These SWaP-constrained VO-pipelines force the use of lightweight feature matching approaches for visual correspondence that are out-performed by computationally heavier SOA ap- proaches. For example, the visual matching algorithm DeepMatching has enabled SOA matching and optical flow (Revaud et al., 2015) but the correspondence-finding step alone can require from 16 seconds to 6.3 minutes per RGB image pair depending on the param- eter regime used for matching (Revaud et al., 2016). For real-time operation on SWaP- constrained systems, correspondence must be computed orders of magnitude faster (e.g., a minimum of 33 ms per matching pair for a 30 FPS camera commonly used for SWaP- constrained robotic applications). We argue that contextual information can greatly reduce the computational burden for image correspondence approaches and enable both higher-quality and lower-latency state estimation. One way to provide context is by fusing measurements from multiple sensory modalities. However, intelligently integrating multimodal information into low- level sensory processing pipelines remains challenging, especially in the case of SWaP- constrained robotic systems. We have previously shown that our architecture DeepEfference (Shamwell, Noth- wang, and Perlis, 2017) can efficiently fuse visual information with motion-related infor- mation to greatly increase runtime performance ( 20, 000%) with minimal performance degradation ( 12%) for dense image correspondence matching. 
However, in our previous 52 Figure 3.1: Sample MHDE outputs from different hypothesis pathways. A-E: MHDE ouputs from 5 pathways. D shows the output from an inactive pathway (i.e. a pathway that the network did not optimize). F-E: Reconstruction error for the hypotheses shown in A-E. From F we can see that the reconstruction shown in A had the lowest error (yellow- dashed box). work, we used motion estimates as inputs to DeepEfference that were accurate to within approximately 10 cm of actual pose. In the real-world, systems will rarely have access to comparatively clean signals. Additionally, real noise sources are often heteroscedastic and input-dependent. With the original DeepEfference’s fast runtime, we saw the possibility of generat- ing many different hypothetical outputs for each input image and then selecting the most accurate at execution time. By learning how to produce n image reconstruction predic- tions, the DeepEfference architecture could be expanded to better handle real-world noise sources. 53 In this work, we introduce Multi-Hypothesis DeepEfference (MHDE) which is an extension of DeepEfference (Shamwell, Nothwang, and Perlis, 2017) that mitigates per- formance impacts of noisy motion estimates. A side-effect of this multi-hypothesis ap- proach is enhanced performance even in the absence of added noise that achieves a mean pixel error within 3% of SOA approaches with an over 5, 000% decrease in runtime. By learning how to generate multiple hypothetical outputs, MHDE can effectively sample the space of possible image transformations. This is enabled by a multi-pathway net- work architecture and novel loss rule that enables the network to explicitly learn multiple, independent network pathways. The remainder of the paper is organized as follows: Section 2.3 describes the back- ground and motivations for MHDE; Section 2.4 outlines our deep network approach to fusing noisy heterogeneous sensory inputs and describes the MHDE architecture; Section 2.5 outlines our experimental and evaluation approaches; Section 2.6 presents our exper- imental results; Section 2.7 discusses the results from Section 2.6; and Section 3.8 offers a summary, concluding thoughts, and directions for future work. 3.3 Background 3.3.1 Visual Odometry and Multi-Sensor Fusion In VO as well as many other vision tasks such as motion understanding and stere- opsis, a key challenge is discovering quantitative relationships between temporally or spatially adjacent images. Within the last decade, bio-plausible approaches for the visual task of object recognition have set new benchmarks and are now the defacto standard. 54 We agree strongly with Memisevic that bio-plausible, local filtering-based approaches similarly hold promise for the correspondence problem (Memisevic, 2013). A known failure mode for visual odometry (VO) is in highly dynamic scenes. Most VO algorithms are subject to the static scene assumption whereby additional error is in- troduced when independently moving points in the scene move inconsistently with their dependently moving neighbors. Feedback outlier detection approaches based on algorithms such as RANSAC (Kitt, Moosmann, and Stiller, 2010) seek to discover the most likely motion that has caused a given transform. However an unconstrained key-point match between two images across a large temporal window and spatial extent is at least exponentially complex (Brox, Ma- lik, and Bregler, 2009). 
By fusing sensor information from separate modalities, we can effectively constrain the matching process. Constraining the matching process to be consistent with a narrow range of trans- forms gleamed from another modality can lead to increased VO performance relative to computational requirements and processing time. Previous work has applied extra-visual feedback signals from IMUs or GPS (Maimone, Cheng, and Matthies, 2007; Agrawal and Konolige, 2006) to constrain the matching process. Simple motion models (Enkelmann, 1991; Davison, 2003) have also been used to predict future images based on previously observed image motion. These approaches have been extended to use quadratic motion models (Lefaix, Marchand, and Bouthemy, 2002) which showed improved performance in specific environments (e.g., on flat roads). However, these models implicitly sacrifice responsiveness as they wait for changes in an underlying sensory distribution rather than detecting dominant motion from a separate extra-visual modality. 55 Deconvolutional decoderConvolutional encoder Source image Local pixel shifts Reconstructed source images Global pixel shifts Localization pathways 2D Affine matrices Spatial transformers Transform input Target image for sampling Sampler Hypothesis i Hypothesis i+i Sampler Figure 3.2: MHDE network diagram with two hypotheses shown for brevity. We experi- mented with up to 8 hypotheses in this work. 3.3.2 Deep Spatial Transformations The correspondence problem describes the challenge of determining how the pix- els in one image spatially correspond to the pixels in another image. Traditionally, the correspondence problem has been tackled with closed-form, analytical approaches (see (Scharstein and Szeliski, 2002) for a review) but recently, deep, bio-inspired, solutions have also begun to show promise. These deep approaches solve the correspondence prob- lem by learning to estimate the 3D spatial transformations between image pairs. In computer vision, siamese-like deep network architectures such as those based on 56 multiplicative interactions have been used successfully for relationship learning between images (Memisevic and Hinton, 2007; Ranzato and Hinton, 2010; Memisevic and Hinton, 2010; Hinton, Krizhevsky, and Wang, 2011; Kivinen and Williams, 2011; Memisevic, 2013). However,there are two problems with these and other deep approaches (e.g. the DeepMatching (Revaud et al., 2015; Revaud et al., 2016) algorithm described earlier) to image transformation learning. First, these approaches require expensive computation on both initial and target images. They employ siamese architectures that require parameter-heavy learning and expensive computations to be performed on both source and target images. For SWaP- constrained robots, the number of computational operations required by these siamese networks must be significantly reduced. Approaches such as L1 and group lasso-based pruning (Han, Mao, and Dally, 2015; Wen et al., 2016; Anwar, Hwang, and Sung, 2015) offer potential mechanisms to reduce the size of networks but fundamentally still require extensive computation on both source and target images. Second, these approaches do not provide a mechanism to include extra information from another modality as a motion prior while maintaining end-to-end trainability. For robotic applications, heterogeneous sensor information is often available that can be lever- aged and may allow for reduced computational constraints and increased performance (see Section 3.3.3). 
57 3.3.3 Extra-Modal Motion Estimates and Heteroscedastic Noise Unlike algorithms in pure computer vision domains, algorithms intended for robotic applications need not rely solely on vision. For example, when estimating a robotic sys- tem’s egomotion by tracking changes in feature point locations on a robot’s camera’s imaging plane, additional non-visual motion estimates can be fused with visual informa- tion(i.e. to bias or serve as a motion prior) to improve egomotion estimation. On real-world systems, additional non-visual motion estimates could be derived from measurements taken from IMUs, GPS, LIDARs, ultrasonic ranging sensors, or the actual input motor commands given to the system. Furthermore, motor errors exhibit heteroscedastic noise properties where larger movements generate larger sources of noise (Ciliberto et al., 2012). Any approach that seeks to leverage extra-modal motion estimates needs to be robust to real-world heteroscedastic noise. 3.4 Approach MHDE is an unsupervised deep heterogeneous neural network that employs multi- ple separable pathways to fuse noisy, heterogeneous sensory information and predict how source images correspond to unseen target images. MHDE effectively reverses the prediction pipeline - rather than using the previous image to reconstruct the future image, it uses the target image to reconstruct the source image. The network receives a noisy estimate of the change in 3D camera position be- tween source and target frame acquisitions and learns: 58 1. 2D affine transformation parameters that are applied as a global spatial transform; and 2. Local, pixel-level shifts that encapsulate aberrations due to varied scene depth, non- rigid scene objects, etc. The affine transformations and localized shifts are learned and applied via two inter- connected architectural pathways: one for determining global 2x3 affine 2D transforma- tion matrices, and a second encoder-decoder pathway that predicts localized, pixel-level shifts that are not captured by the global, approximated 2D affine transformation (see (Shamwell, Nothwang, and Perlis, 2017) for more information on the DeepEfference ar- chitecture). Unlike the original DeepEfference, MHDE generates several hypothetical recon- structions which enable increased robustness to noisy inputs. Thus, while DeepEfference only has two architectural pathways, MHDE has the same two architectural pathways plus n additional hypothesis generation pathways (2− 8 in this work). 3.4.1 Winner-Take-All (WTA) Loss Rule MHDE generates multiple hypothesis reconstructions to enable robustness to stochas- tic, heteroscedastic, input noise such as found in the real-world. The previous DeepEf- ference architecture that generated only a single predicted reconstruction used Euclidean error to train the network by minimizing the loss function L(θ, It, Is) = argmin θ ‖Ir(θ, It)− Is‖2 (3.1) 59 where Ir is an image reconstruction, It is the image target, and Is is the image source being reconstructed. If instead of generating a single reconstruction Ir, the network generated n recon- structions I ir, i ∈ N , the loss rule would need to be expanded to train across all hypothesis pathways in the new network. A naive way to compute error for such a multi-hypothesis network would be to simply sum the Euclidean error from all hypotheses and divide by the total number of hypotheses. 
Then, the network would be trained by minimizing the loss function

L(\theta, I_t, I_s) = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \| I_r^i(\theta, I_t) - I_s \|^2    (3.2)

where I_r^i is a hypothesis image reconstruction and the remaining terms are the same as before.

The naive multi-hypothesis loss rule of Eq. 3.2 would lead the network to optimize all pathways simultaneously with each update. However, this may not be optimal for increased robustness to noise. Effectively, we desire the network to generate distinct predictive hypotheses by sampling from a noise distribution that the network implicitly learns. For example, consider when the network has perfectly optimized the loss function of Eq. 3.2:

L(\theta, I_t, I_s) = \frac{1}{N} \sum_{i=1}^{N} \| I_r^i(\theta, I_t) - I_s \|^2 \approx 0    (3.3)

In this case, \| I_r^i(\theta, I_t) - I_s \|^2 \approx 0 for all i \in \{1, \dots, N\}, which means that the hypothesis reconstructions I_r^i(\theta, I_t) are all approximately equal. As the network is trained and converges to a local minimum, the loss will affect parameters in each pathway approximately equally and drive outputs from all pathways to a common approximate solution. This is the opposite of what we want from MHDE. Effectively, such a loss rule is equivalent to the standard Euclidean loss rule used in (Shamwell, Nothwang, and Perlis, 2017), where a single prediction is generated, and it fails to leverage the multiple outputs that can be generated by MHDE.

To leverage its multiple outputs, we train MHDE using what we call a winner-take-all (WTA) Euclidean loss rule:

I_r^{*}(\theta, I_t) \leftarrow \arg\min_{i} \| I_r^i(\theta, I_t) - I_s \|^2    (3.4)

L(\theta, I_t, I_s) = \| I_r^{*}(\theta, I_t) - I_s \|^2    (3.5)

where I_r^{*} is the lowest-error hypothesis. Loss is then computed only for this one hypothesis and error is backpropagated only to parameters in that one pathway. Now, only parameters that contributed to the winning hypothesis are updated and the remaining parameters are left untouched (a short illustrative sketch appears later in this section).

3.4.2 Pathway 1: Global Spatial Transformer

Spatial transformer (ST) modules (Jaderberg et al., 2015) apply parametrized geometric transformations to feature maps (either data inputs or intermediate outputs) in deep networks. The parameters for these transformations (2D affine transformations in our case) can be directly provided to the network as input or can be learned and optimized alongside the other network parameters (e.g., network weights and biases).

MHDE was provided with estimates of the true 3D transformation between source and target images (δx, δy, δz, δα, δβ, δγ). Note, however, that the visual input to MHDE was a single grayscale source image without any depth information. Even if the provided 3D transformation were noise-free and perfectly accurate, it is not possible to analytically perform a 3D warp (assuming translation) on a 2D image due to unknown scene depth at each pixel location. Thus, MHDE approximated 3D warps as 2D affine transformations through a linear-nonlinear optimization using four fully-connected layers, each followed by an additional rectified linear unit (ReLU) (Glorot, Bordes, and Bengio, 2011) non-linearity layer.

We modified the standard ST module in TensorFlow (Abadi et al., 2015) by splitting the layer into two layers: one to perform the affine transformation on grids of source pixel locations (x_s, y_s) and output target pixel coordinates (x_t, y_t), and a second layer to perform bilinear sampling given pixel coordinates and an image to sample from.
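A minimal NumPy sketch of the winner-take-all selection from Section 3.4.1 (Eqs. 3.4 and 3.5), using random stand-in reconstructions; in the trained network, the winning error is backpropagated only through the parameters of the sub-pathway that produced it.

import numpy as np

rng = np.random.default_rng(0)
H, W, N = 224, 224, 8
I_s = rng.random((H, W))                  # source image being reconstructed
hyps = rng.random((N, H, W))              # N hypothesis reconstructions I_r^i

# Per-hypothesis Euclidean errors ||I_r^i - I_s||^2  (Eq. 3.4)
errors = np.sum((hyps - I_s) ** 2, axis=(1, 2))
winner = int(np.argmin(errors))

# WTA loss: only the winning hypothesis contributes (Eq. 3.5), so gradients
# would flow only into the sub-pathway that produced hyps[winner].
loss = errors[winner]

Essentially the same behaviour can be obtained in TensorFlow by taking the minimum over the per-hypothesis losses, since the gradient of a minimum flows only to the minimizing branch.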
Although the sampling component of our ST module takes an input image as input, no learn-able parameters are based on input image content and thus our global pathway is a function only of the input transformation estimate and is image content-independent. 3.4.3 Pathway 2: Local, Pixel-Level Shifter The pixel-level encoder/decoder pathway refines the ST estimate from the first pathway and provides localized estimates of pixel movement to account for depth, non- rigidity, etc. We implemented this pathway as a convolutional-deconvolutional encoder-decoder. 62 First, the convolutional encoder compresses a source image through a cascade of convo- lutional filtering operations. The output of the convolutional encoder is concatenated with intermediate outputs from the fully-connected layers from the first, global pathway (the black and blue vertical lines in the center of Fig. 3.2). This concatenated representation is then expanded using a deconvolutional decoder to generate n pairs of (xt′ , yt′) pixel loca- tions that are summed with the target pixel coordinates (xt, yt) from the global pathway before bilinear sampling (see (Shamwell, Nothwang, and Perlis, 2017) for more details). 3.5 Experimental Methods We conducted experiments with MHDE using four different noise conditions and four different architectures. All architectures were based on DeepEfference (Shamwell, Nothwang, and Perlis, 2017) and implemented both global pathway and local pathways. MHDE was evaluated on the KITTI Odometry dataset (Geiger et al., 2013) and results were benchmarked against correspondence matching results from the SOA DeepMatch- ing approach (Revaud et al., 2016) (see (Shamwell, Nothwang, and Perlis, 2017) and the Appendix for more information). We experimented with four noise conditions where α was 0.0 , 0.1, 0.25, or 0.5. We trained networks with 1, 2, 4 or 8 hypothesis generation pathways. For each noise and and hypothesis combination, we trained three networks for a total of 48 different networks. 63 Figure 3.3: Heteroscedastic noise as a function of transform magnitude for the X and Y components of the transform input over the test set for a network with a noise parameter α = 0.25. 3.5.1 Noise As shown in Fig. 3.3, we simulated real-world noise conditions by applying het- eroscedastic noise to each transform input. For each transform T = (δx, δy, δz, δα, δβ, δγ), we introduced heteroscedastic noise to create network input T ∗ according to: T ∗ = T +N (0, α √ T ) (3.6) where α was a constant modifier that was either 0.0, 0.1, 0.25, or 0.5. 3.5.2 Evaluation We evaluated MHDE by measuring the mean pixel error of MHDE projections of DeepMatching keypoints from source images to target images. The projection errors for 64 each method compared to groundtruth projections were used to determine mean pixel errors for each method (see (Shamwell, Nothwang, and Perlis, 2017) and Appendix A. for a more thorough explanation of the experimental evaluation). 3.5.3 Training We trained MHDE for 200, 000 iterations on KITTI Odometry scenes 1− 11 for all experiments. We used the Adam solver with batch size=32, momentum1=0.9, momentum2=0.99, gamma=0.5, learning rate=1e − 4, and an exponential learning rate policy for all exper- iments. All networks were trained using our modified WTA loss rule. All experiments were performed with a Nvidia Titan X GPU and Tensorflow (see (Shamwell, Nothwang, and Perlis, 2017) and Appendix B. for a more thorough explanation of training proce- dures). 3.6 Results Fig. 
3.4 shows the performance of MHDE with various maximum hypotheses compared to DM. A network's maximum hypotheses is the maximum number of hypothesis generation pathways a given network was allowed to learn. Because of our WTA loss rule, this does not mean that the network effectively learned how to use all pathways. For example, in Fig. 3.6(d), a network with four maximum hypotheses predominantly trained and used a single pathway.

This can also be seen in Fig. 3.5, where the same results from Fig. 3.4 are plotted as a function of the total active hypotheses. Active hypotheses are hypothesis pathways that performed better than all other pathways for at least one testing exemplar (for reference, Fig. 3.1(d) shows the network output of an inactive pathway).

There is a positive relationship between performance and both maximum hypotheses and active hypotheses. This is true for all noise conditions as well as the no-noise condition. We also see that the rate of improvement when moving from one to six active hypotheses is greater for higher noise levels.

Fig. 3.6 shows the activations by pathway for networks trained with four or eight maximum hypotheses. Surprisingly, we see no strong relationship between active pathways (pathways that produced the best result for at least one test exemplar) and noise level.

Tab. 3.1 details the comparative runtimes between DeepMatching and MHDE with various numbers of hypotheses¹. MHDE runtime scales linearly with the number of hypotheses. Overall, the runtime gains of MHDE compared to DM show that providing a strong prior on camera motion allows for far more computationally efficient image predictions and matchings.

¹We used a CPU version of DeepMatching for these comparisons. The latest available version of the GPU implementation of DeepMatching took over 7 seconds per image on our workstation for a 256x256 image, so we elected to use the faster CPU version for all experiments.

Table 3.1: Average runtimes for DeepMatching (DM) and Multi-Hypothesis DeepEfference (MHDE) with equivalent frames per second (FPS)

         # Hypoth.   Mean         StDev.     Med.         FPS
DM       N/A         0.4115 s     0.00132    0.407 s      2.4
MHDE     1           0.0024 s     0.00008    0.00238 s    417.4
MHDE     2           0.00303 s    0.00009    0.00302 s    330
MHDE     4           0.00422 s    0.00010    0.00421 s    237.2
MHDE     8           0.00675 s    0.00016    0.00677 s    148.2

3.7 Discussion

We have shown the unsupervised learning of correspondence between static grayscale images in a deep sensorimotor fusion network with noisy sensor data.

We were concerned that MHDE networks might only optimize a single pathway. For example, if one pathway consistently produced the lowest estimate error at the beginning of training, then perhaps only that pathway would be updated and thus the network would not be used to its fullest potential. As seen in Fig. 3.6, this generally was not the case, as networks were able to learn to use multiple pathways without intervention outside of the WTA loss rule.

Future work will look at how to include pure sensor measurements (e.g., from an IMU) and how to encourage networks to train and use all available hypothesis pathways. Like its predecessor, MHDE only uses single grayscale images as inputs. Another possible avenue of research is to use multiple images as input, or an LSTM-like architecture to give the network additional temporal context.
Future work will look at how to include pure sensor measurements (e.g., from an IMU) and how to encourage networks to train and use all available hypothesis pathways. Like its predecessor, MHDE only uses single grayscale images as inputs. Another possible avenue of research is to use multiple images as input, or an LSTM-like architecture to give the network additional temporal context.

One of the more important aspects of this network is that it does not generate images from scratch and instead works mostly in the space of pixel locations rather than pixel intensities. Given that geometry is consistent across image domains even though image content varies, this network architecture is a promising candidate for leveraging transfer learning.

While we used noise-corrupted motion estimates derived from ground truth for the MHDE transform input, IMUs are a possible real-world source for this information. However, IMUs only measure accelerations, and thus we speculate that using raw IMU measurements as MHDE inputs will result in poor performance during constant velocity maneuvers. Additional work is needed to determine a suitable real-world analog for deriving the motion estimates needed by MHDE.

We hope to experiment with this architecture on other visual odometry datasets. Specifically, we seek a larger dataset with a wider range of movements. Without a wide range of movements, we speculate that trained networks will only be able to transfer to new, previously unseen datasets that follow similar movement statistics as the datasets on which they were trained. To overcome many of these limitations, we are currently working to collect a multi-modal dataset with stereo imagery, depth imagery, high-resolution IMU data, action commands, low-level motor commands, and ground-truth VICON poses. With this dataset, we will be able to better address limitations inherent in the current MHDE architecture.

3.8 Conclusion

While increased performance in the noise-free conditions was an unintended consequence of the multi-hypothesis formulation, the central contribution of this work is in the handling of noise-contaminated input data. In summary, we have shown the unsupervised learning of correspondence between static grayscale images in a deep sensorimotor fusion network with noisy sensor data. In this work, we have presented a multi-hypothesis formulation of our previous DeepEfference architecture. MHDE outperformed DE by 103% in RMSE in our maximum noise condition, by 18% in the noise-free condition, and was 181% slower (417 FPS vs 148 FPS). Compared to DM, MHDE was 5192% faster with 8 hypotheses (2.8 FPS vs 148 FPS) and was outperformed by 3% in the noise-free condition with 8 hypotheses and by 57% in the maximum noise condition with 8 hypotheses.

3.9 Appendix

The following methods are largely reproduced from (Shamwell, Nothwang, and Perlis, 2017) and are included here for completeness.

3.9.1 Extended Evaluation

As in (Shamwell, Nothwang, and Perlis, 2017), we evaluated MHDE on the KITTI Visual Odometry dataset (Geiger et al., 2013). KITTI is a benchmark dataset for the evaluation of visual odometry and LIDAR-based navigation algorithms. Images in KITTI were captured at 10 Hz from a Volkswagen Passat B6 as it traversed city, residential, road, and campus environments. Groundtruth poses at each camera exposure were provided by an RTK GPS solution and depth is provided with coincident data from a Velodyne laser scanner. All objects in the visual scenes are rigid, thus fulfilling the static scene assumption and allowing ground truth to be computed from scene depth and camera position. Predicted pixel correspondence between source and target images was evaluated against groundtruth correspondence and SOA DeepMatching correspondence predictions.
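The evaluation metric itself is straightforward; a minimal NumPy sketch of the mean pixel error computation is shown below. The array layout and the validity mask are my own conventions for illustration, not taken from the evaluation code used for the reported results.

```python
import numpy as np

def mean_pixel_error(predicted_xy, groundtruth_xy, valid):
    """Mean Euclidean pixel error between predicted and ground-truth
    target-image coordinates of the evaluated keypoints.

    predicted_xy   : (N, 2) predicted (x, y) locations in the target image.
    groundtruth_xy : (N, 2) ground-truth (x, y) locations.
    valid          : (N,) boolean mask, True where ground-truth depth (and
                     hence a ground-truth correspondence) exists.
    """
    errors = np.linalg.norm(predicted_xy - groundtruth_xy, axis=1)
    return float(errors[valid].mean())
```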
With access to scene depth and true camera pose for KITTI, groundtruth pixel shifts were calculated by applying a 3D warp to 3D pixel locations in the source images to generate the expected pixel locations in the target images. We projected each 3D point in the frame of camera_t0 to the world frame using the derived projection matrix for camera_t0 and then reprojected these points in the world frame to camera_t1 using the inverse projection matrix for camera_t1. Finally, we transformed points in the frame of camera_t1 to the image plane. This resulted in a correspondence map between pixel locations in camera_t0 and camera_t1 for each point where depth was available (i.e., when depth was within the Velodyne laser scanner's range).

3.9.2 Extended Training Procedures

For training and evaluation, data was separated into train (80%) and test (20%) sets. We used a total of 23,190 image pairs with 80% (18,552) for training and 20% (4,638) for testing. In all experiments, we randomly selected an image for the source, used the successive image for the target, and subtracted the two 6-DOF camera poses for the transform input. For each image in each dataset, we cropped the middle 224x224 pixel region for network inputs.

Figure 3.4: Inverse mean pixel error (higher is better) for several noise conditions produced by MHDE networks trained to generate 1, 2, 4, or 8 maximum hypotheses. Dashed line is DM (SOA) error.

Figure 3.5: Inverse mean pixel error (higher is better) for several noise conditions produced by MHDE networks. Results are from the same networks shown in Fig. 3.4 but are instead plotted as a function of active pathways learned by each network.

Figure 3.6: Activation by pathway for the different noise conditions ((a) no noise, (b) noise = 0.1, (c) noise = 0.25, (d) noise = 0.5). Only networks with maximum hypotheses of 4 or 8 are shown.

Chapter 4: An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics

4.1 Abstract

Estimating the correspondences between pixels in sequences of images is a critical first step for many computer vision and robotics tasks such as visual odometry (VO) and visual simultaneous localization and mapping (VSLAM). While VO and VSLAM are used extensively for localization and navigation in low-SWaP robotic applications, they are usually forced to rely on computationally lightweight, sparse correspondence approaches. We contend that an earlier inclusion of extra-visual sensory information related to system motion will allow for computationally efficient dense image matching even in low-texture regions that cause issues for other, heavier dense correspondence approaches. We introduce Inertial DeepEfference (IDE) - an unsupervised deep matching network that learns to compute image mappings by fusing heterogeneous sensor streams. As an extension of our previous approaches (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted), IDE is unique in that it uses real sensor data for an end-to-end trainable dense correspondence network that is orders of magnitude faster than other SOA deep approaches. We evaluated IDE on the EuRoC MAV dataset (Burri et al., 2016) and on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against the deformable spatial pyramid (DSP) (Kim et al., 2013) and DeepMatching (Revaud et al., 2016) dense correspondence approaches. Our IMU+surrogate feed-forward motor signal (FFMS) network was 167x faster than DM and 516x faster than DSP.
4.2 Introduction

State estimation for size, weight, and power (SWaP) constrained autonomous robotic systems is limited by the lightweight and low-power sensing and computational hardware that they are forced to use. When viewed as complex, embodied agents, robotic systems can generate and access a wide variety of sensory information, and as such, a popular approach to mitigating the negative influences of noisy SWaP sensors is to fuse estimates from an array of heterogeneous sensors deployed on the robot.

SWaP-constrained, GPS-denied navigation has been greatly influenced by this sensor fusion philosophy, and visual-inertial odometry (VIO), where a sensor array will commonly consist of a camera and an inertial measurement unit (IMU), has been a topic that has seen particular success for GPS-denied, SWaP-constrained localization and navigation. However, solutions to the correspondence problem (see (Scharstein and Szeliski, 2002) for a review) to date remain dominated by bottom-up, vision-only approaches and neglect other available sources of complementary information that can increase efficiency. We propose to fully exploit the embodied nature of robotic systems and begin fusing heterogeneous sensor measurements as early as possible by learning a feed-forward model that takes heterogeneous sensor data as input to compute dense image correspondences.

Previously, we introduced DeepEfference (DE) (Shamwell, Nothwang, and Perlis, 2017) and Multi-Hypothesis DeepEfference (MHDE) (Shamwell, Nothwang, and Perlis, Accepted) to perform early fusion of vision data and extra-modal motion-related data. In both of these previous approaches, we used the ground truth pose to derive the extra-modal motion input to the network (in the case of MHDE, an artificially noise-corrupted ground truth pose). We showed that the DE and MHDE architectures could learn how to approximate a 3D transform given image data and an estimate of the 3D motion transform derived directly from ground truth.

In this work, we present results from experiments where networks were instead provided with raw sensor data in the form of a single grayscale image and either IMU measurements, motor feedback, a surrogate feed-forward motor command, or some combination thereof. We hypothesized that IMU data will be able to at least partially replace the ground-truth-derived motion estimates but will be more challenged during constant velocity maneuvers (e.g., IMU data recorded from a car) compared to high-agility maneuvers (e.g., a MAV flying aggressively in an indoor environment). In cases of constant velocity, we further hypothesized that additional state information conveying intention will help mitigate the effects of constant velocity motions, which are unmeasurable by an IMU's accelerometer, by providing the model with supplementary information not captured in an IMU signal stream. Taking note from biology, we believe that a surrogate motor-/intention-related signal is a prime candidate for such a role (see (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted) for more information).
The remainder of the chapter is organized as follows: Section 4.3 describes the background and motivations for the network; Section 4.4 presents related work; Section 4.5 outlines our deep network approach to fusing noisy heterogeneous sensory inputs and describes the network architecture; Section 4.6 describes our experimental approach; Section 4.7 describes our evaluation approaches; Section 4.8 presents and discusses our experimental results; and Section 4.9 offers concluding thoughts and directions for future work.

4.3 Background

When GPS is available, inertial navigation systems (INS) can leverage the high temporal resolution of IMU-derived measurements while periodically correcting integrated IMU drift error. In GPS-denied environments, vision has been successfully used to correct IMU error in place of GPS. By tracking the projected positions of visual scene elements on the imaging plane, inter-scene relative position differences can be derived. VO-based dead reckoning requires knowledge of how pixels in one image relate to pixels in another image. The problem of determining this relationship is known as the correspondence problem.

In VO and VSLAM for robotic applications, image correspondence is performed in isolation as it would be in a pure computer vision domain. However, unlike in pure computer vision domains, algorithms intended for robotic applications need not rely solely on vision. For example, when estimating a robotic system's egomotion by tracking changes in feature point locations on a robot's camera's imaging plane, additional non-visual motion estimates can be fused with visual information to improve egomotion estimation and overall system performance (see (Martinelli, 2012) for a review of VIO).

In VO as well as many other vision tasks such as motion understanding and stereopsis, a key challenge is discovering quantitative relationships between temporally or spatially adjacent images. Within the last decade, bio-plausible approaches for the visual task of object recognition have set new benchmarks and are now the de facto standard. Bio-plausible/local filtering-based approaches also hold promise for the correspondence problem and VO/VIO/VSLAM (Memisevic, 2013).

4.4 Related Work

Figure 4.1: IDE network diagram.

The correspondence problem can be viewed as part of the more general problem of determining how images relate to one another. While the correspondence problem has been traditionally addressed through closed-form, analytical approaches (for example, Deformable Spatial Pyramid Matching (Kim et al., 2013) and DeepMatching (Revaud et al., 2016); see (Scharstein and Szeliski, 2002) for a review of the correspondence problem), recent bio-inspired, deep neural network approaches that estimate the 3D spatial transformations between image pairs have begun to show increasing success.
Unsupervised, siamese-like deep network architectures such as those based on multiplicative interactions (Memisevic and Hinton, 2007; Ranzato and Hinton, 2010; Memisevic and Hinton, 2010; Hinton, Krizhevsky, and Wang, 2011; Kivinen and Williams, 2011; Memisevic, 2013; Wohlhart and Lepetit, 2015) and triplet learning rules (Wang and Gupta, 2015) have been used successfully for relationship learning between RGB images at the cost of computational runtime. These architectures require expensive computations to be performed on multiple images per run, which can greatly increase the number of needed computations and the model complexity due to the high-dimensional nature of image data.

Other learning approaches have relied on explicit supervised labeling, such as the random decision forest based approaches of (Taylor et al., 2012; Shotton et al., 2013; Brachmann et al., 2014) and the semantic segmentation approaches of (Long, Shelhamer, and Darrell, 2015; Hariharan et al., 2015). The supervised nature of these approaches requires expensive and time-consuming labeling that greatly limits the size of usable datasets.

While (Byravan and Fox, 2017) used depth data as input to their network, we do not provide IDE with depth. This is critical because, given the 3D locations of points in the image scene, a 3D affine transformation can be directly performed to project points on the image plane at time t_i to some time t_{i+1}. As IDE is designed for SWaP-constrained applications where only intensity information from a single imager might be available, our IDE network takes as input only a single grayscale intensity image and uses its local pathway (see Section 4.5 and (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted) for more information) to infer depth and non-rigidity from a single 2D grayscale image.

The self-supervised visual descriptor approach of (Schmidt, Newcombe, and Fox, 2017) used a learning rule that requires a priori labeling such that points in the source image (the image at t_i) and target image (the image at t_{i+1}) are already aligned, and thus that correspondence has already been solved for the training set. Our approach is instead unsupervised and requires only the raw source and target images for training - in fact, we only compute ground truth correspondence between source and target images to validate our results. Otherwise, no labeling is required.

The view synthesis approach of (Zhou et al., 2016) learned local, pixel-level shifts in order to render new, unseen views of objects and scenes. That approach is similar to the mappings that are learned by the second pathway of our DE architectures. Besides different network structures and the inclusion of a global spatial transformer module, the largest difference between their method and our own is that rather than learning to generate novel viewpoints of objects or scenes, we learn how to reconstruct a source image using pixel locations in a target image.

4.5 Approach

IDE is designed as an unsupervised dense correspondence network. It receives as input a single grayscale intensity image I_i taken at time t_i and an extra-visual estimate of camera motion M_{i→i+1} between time t_i and time t_{i+1}. The goal of the network is to use the motion estimate M_{i→i+1} and the grayscale image I_i to predict the new image coordinate in I_{i+1} for each scene element captured in I_i. In other words, IDE learns the correspondence between pixels in images I_i and I_{i+1}.
The network architecture can be thought of as an extension of an autoencoder. However, rather than learning features by minimizing the reconstruction error between an input projected into feature-space and then re-projected into an output-space, IDE is trained by minimizing the reconstruction error between an input and a reconstruction based on sampled values from a previously unseen target image I_{i+1}. Thus, IDE is trained according to the following loss rule:

L(θ, I_{i+1}, I_i) = argmin_θ ||I_r(θ, I_{i+1}) − I_i||²    (4.1)

where I_r is an image reconstruction, I_{i+1} is the image target, and I_i is the image source being reconstructed.

Like its predecessors, IDE learns:

1. 2D affine transformation parameters that are applied as a global spatial transform; and

2. Local, pixel-level shifts that encapsulate aberrations due to varied scene depth, non-rigid scene objects, etc., and that are applied following the spatial transformation and before bilinear sampling.

4.5.1 Pathway 1: Global Shifter

The first pathway of IDE (the top pathway shown in Fig. 4.1) is the Global Shifter. Given a motion estimate (e.g., IMU data), it uses several fully-connected (FC) layers to approximate a 3D transformation as a 2D transformation by learning to compute the parameters for a 2D affine transformation matrix. The Global Shifter then applies this 2D affine transformation to generate expected coordinate shifts in the form of an output of size HxWx2 that represents pixel locations at which to sample from the target image (additional detail in Section 4.5.3 below).

4.5.2 Pathway 2: Local, Pixel-Level Shifter

The second pathway of IDE (the other pathway shown in Fig. 4.1) is the Local Shifter pathway. It receives a source image as input and uses a convolutional-deconvolutional encoder-decoder to also generate an HxWx2 output of pixel shifts. However, these shifts are intended only to modify the coordinate shifts calculated by the Global Shifter pathway for varying scene depth, non-rigidity, etc. (in practice, the Local Shifter outputs are sparse).

4.5.3 Spatial Transformations

Spatial transformation in the form of a modified spatial transformer module (Jaderberg et al., 2015) is an integral component of the IDE network architecture, and an explanation helps to elucidate the workings of IDE. To perform a spatial transformation, we assume that output pixels are defined to lie on a regular grid G = {G_j} of pixels G_j = (x_j^t, y_j^t), forming an output feature map (here, the reconstruction I_r). Sampling coordinates in the target image are generated from this grid by the 2D affine transformation predicted by the Global Shifter,

[x_j^s, y_j^s]^T = [θ_11 θ_12 θ_13; θ_21 θ_22 θ_23] [x_j^t, y_j^t, 1]^T    (4.2)

and are then adjusted by the local, pixel-level shifts. Each output pixel is computed by bilinearly sampling the target image U at its sampling coordinates:

I_r,j^c = Σ_h^H Σ_w^W U_hw^c max(0, 1 − |x_j^s − w|) max(0, 1 − |y_j^s − h|)    (4.3)

This sampling is sub-differentiable, with partial derivatives

∂I_r,j^c / ∂U_hw^c = max(0, 1 − |x_j^s − w|) max(0, 1 − |y_j^s − h|)    (4.4)

∂I_r,j^c / ∂x_j^s = Σ_h^H Σ_w^W U_hw^c max(0, 1 − |y_j^s − h|) · { 0 if |w − x_j^s| ≥ 1; 1 if w ≥ x_j^s; −1 if w < x_j^s }    (4.5)

∂I_r,j^c / ∂y_j^s = Σ_h^H Σ_w^W U_hw^c max(0, 1 − |x_j^s − w|) · { 0 if |h − y_j^s| ≥ 1; 1 if h ≥ y_j^s; −1 if h < y_j^s }    (4.6)

which allow the reconstruction loss in Eq. 4.1 to be backpropagated through the sampling coordinates to both the global and local pathways.
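As an illustration of how the sampled reconstruction in Eq. 4.1 is formed from the predicted coordinates, the NumPy sketch below implements the bilinear sampling of Eq. 4.3 (in its equivalent four-neighbor form) and the resulting reconstruction error. It is a forward-only illustration with assumed clipping at the image borders; the actual network uses a differentiable spatial transformer sampler so that the gradients in Eqs. 4.5 and 4.6 can be backpropagated to both pathways.

```python
import numpy as np

def bilinear_sample(target, coords):
    """Reconstruct the source image by bilinearly sampling the target image
    at the predicted coordinates (sum of global and local shifts).

    target : (H, W) grayscale target image I_{i+1}.
    coords : (H, W, 2) sampling locations (x, y) in target-image pixels.
    """
    H, W = target.shape
    x = np.clip(coords[..., 0], 0.0, W - 1.0)
    y = np.clip(coords[..., 1], 0.0, H - 1.0)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Weighted sum of the four neighboring target pixels -- equivalent to the
    # max(0, 1 - |.|) kernel form of the spatial transformer sampler.
    return ((1 - wx) * (1 - wy) * target[y0, x0] + wx * (1 - wy) * target[y0, x1] +
            (1 - wx) * wy * target[y1, x0] + wx * wy * target[y1, x1])

def reconstruction_error(source, target, coords):
    """Mean squared reconstruction error, a stand-in for the squared L2 norm of Eq. 4.1."""
    return float(np.mean((bilinear_sample(target, coords) - source) ** 2))
```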
4.6 Methods

We tested networks with motion information taken from the following sources:

1. Ground Truth Pose (GP): Differences in ground truth camera pose at the capture times of the source and target images were input to the network as the extra-visual motion transform.

2. Raw Measurements from an Inertial Measurement Unit (IMU): The IMU data recorded between the capture times of the source and target images were input to the network as the extra-visual motion transform.

3. Surrogate Feed-forward Motor Signals (FFMS): Because direct motor-command inputs were not available, a k-means fitting of clusters was used on the ground truth position differences to generate noisy estimates of the approximate direction of motion (see Fig. 4.2 for the associated error). This is unlike (Shamwell, Nothwang, and Perlis, Accepted), where the continuous GP poses were contaminated with heteroscedastic noise. Here, cluster indices were encoded as one-hot vectors and had only an indirect relationship to the motion of the vehicle.

4. Motor Feedback (MF): The motor feedback data recorded between the capture times of the source and target images were input to the network as the extra-visual motion transform. This only applied to EuRoC, as this information was not available in KITTI.

5. IMU+FFMS: IMU and FFMS data as above were both input.

6. IMU+MF: IMU and MF data as above were both input.

4.6.1 Datasets and Data Generation

For training and evaluation, data was separated into train (80%) and test (20%) sets. For each image in each dataset, we cropped the middle 224x224 pixel region for network inputs. Details about each dataset and dataset-specific data generation follow below.

Figure 4.2: Histogram of error between the ground truth and the k-means cluster to which each exemplar was assigned ((a) EuRoC, (b) KITTI).

4.6.1.1 EuRoC MAV

The EuRoC MAV datasets are a collection of visual-inertial datasets collected on-board an AscTec Firefly hex-rotor helicopter while traversing two indoor environments (referred to as Machine Hall and Vicon Room). EuRoC contains data collected with a VI-Sensor (Nikolic et al., 2014), which captures stereo image data at 20 Hz and IMU data at 200 Hz. A pointcloud of the Vicon Room environment was generated using a Leica Multistation and is also included in the datasets. In six of twelve datasets, synchronized 6-DOF position and attitude groundtruth is provided from a Vicon motion capture system (the Vicon Room environment). In three of these six Vicon Room datasets, motor feedback from the AscTec hex-rotor is also provided at 100 Hz. Because we were interested in fusing these motor-feedback signals, we only used the V01_01_easy, V01_02_medium, and V01_03_difficult datasets from the Vicon Room environment.

The total number of usable exemplars in the V01_01_easy, V01_02_medium, and V01_03_difficult datasets (i.e., exemplars that did not include the first or last example in a dataset, provided sufficient IMU data before and after the example, etc.) was 6,494, which is significantly smaller than the KITTI odometry dataset. We augmented EuRoC by including not only pairs of sequential frames, but pairs separated by up to four frames. This resulted in 26,976 total examples, of which 80% (21,588) were used for training and 20% (5,388) were used for testing.

For the GP conditions, we generated the pose motion estimate input by decomposing the 4x4 transformation matrix into a translation and Euler rotation. For the IMU, MF, IMU+FFMS, and IMU+MF conditions, because the lookahead could be anywhere from one to four frames and thus anywhere between 50 ms and 200 ms, the IMU and motor feedback inputs for the EuRoC models used a vector of size 50x6. For all exemplars, regardless of lookahead size, the first 10 entries correspond to the 50 ms prior to the image capture and the next 10 entries correspond to the 50 ms following capture. For exemplars with a lookahead of one frame, the remainder of the vector was zeros; for lookaheads of two, the last 20 entries were zeros; for lookaheads of three, the last 10 entries were zeros; and for lookaheads of four, the vector was fully populated.
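The fixed-size, zero-padded inertial input described above can be illustrated with the following sketch. The helper name, the 200 Hz rate constant, and the exact selection of samples are assumptions made for illustration rather than the data pipeline actually used.

```python
import numpy as np

def build_imu_input(imu_before, imu_after, lookahead, rate_hz=200, window_ms=50):
    """Pack IMU samples into the fixed 50x6 network input used for EuRoC (sketch).

    imu_before : (>=10, 6) IMU samples covering the 50 ms before the source image.
    imu_after  : (>=10*lookahead, 6) IMU samples following the source image.
    lookahead  : 1-4, number of frames between source and target images
                 (50-200 ms at the 20 Hz camera rate).
    """
    samples_per_window = rate_hz * window_ms // 1000      # 10 samples per 50 ms
    packed = np.zeros((50, 6))
    # First 10 rows: the 50 ms before capture.
    packed[:samples_per_window] = imu_before[-samples_per_window:]
    # Next 10 * lookahead rows: IMU data up to the target image; the rest stays zero.
    n_after = samples_per_window * lookahead
    packed[samples_per_window:samples_per_window + n_after] = imu_after[:n_after]
    return packed
```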
For EuRoC, the motor feedback was not synchronized to the VI-Sensor camera and IMU streams. To temporally synchronize these streams, we leveraged the fact that the AscTec FCU also provided IMU data. We took the norm of the angular velocities recorded by the FCU IMU and the VI-Sensor IMU and then cross-correlated the two signals. The aligned FCU motor feedback was included similarly to the IMU data for the MF conditions.

To generate the surrogate FFMS, we performed k-means clustering on the ground truth position differences to generate 20 clusters. These were encoded as one-hot vectors.

4.6.1.2 KITTI Odometry

KITTI odometry is a benchmark dataset for the evaluation of visual odometry and LIDAR-based navigation algorithms. KITTI images were recorded at 10 Hz from cameras on-board a Volkswagen Passat B6 as it traversed city, residential, road, and campus environments. Groundtruth poses at each camera exposure were provided by an RTK/INS/GPS solution and depth is provided with coincident data from a Velodyne laser scanner.

For KITTI, we used sequences 00-10, excluding sequence 03 because the corresponding raw file 2011_09_26_drive_0067 was not online at the time of publication. This resulted in a total of 22,362 image pairs with 80% (17,878) for training and 20% (4,484) for testing. In all experiments, we randomly selected an image for the source, used the successive image for the target, and subtracted the two 6-DOF camera poses for the transform input. Corresponding 100 Hz IMU data was collected from the KITTI raw datasets, and the preceding 100 ms and following 100 ms of IMU data were included for each example, yielding a 20x6 vector.

To generate the surrogate FFMS, we performed k-means clustering on the ground truth position differences to generate 20 clusters. These were encoded as one-hot vectors. Motor-feedback data is not included in KITTI, and thus the MF conditions were not used with KITTI.

4.6.2 Network Parameters and Training Procedures

For the GP IDE network, four FC layers of size 512, 4096, 4096, and 512 were used to generate the 2x3 affine transformation matrix. The same architecture was also used for the IMU, FFMS, and MF network configurations. For the IMU+FFMS and IMU+MF configurations, which each had two sources of extra-visual motion estimates, each extra-visual estimate was processed through four FC layers of 512, 4096, 4096, and 512 before being concatenated into a vector of length 1024.

The convolutional-deconvolutional encoder-decoder that composed the Local Shifter pathway used 5x5 convolutional kernels with a stride of two. The encoder used five layers of 32, 64, 128, 256, and 512 filters, and the decoder was reversed, using 512, 256, 128, 64, and 32 filters. All results described in this chapter used a Local Shifter pathway with these parameters. As shown in Fig. 4.1, the output of the fifth convolutional layer is concatenated with the last FC layer of the Global Shifter pathway and is then fed into a single FC layer of size 4096 before being fed into the first deconvolutional decoder layer.

We trained three networks for each condition and dataset. All results presented are the averages of the three networks.
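For reference, the layer sizes listed above can be sketched with tf.keras as below. This is a structural sketch of the Global Shifter FC stack and the Local Shifter encoder only; the final projection to the six affine parameters, the ReLU activations, and the 'same' padding are assumptions, and the concatenation, decoder, and sampler are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_global_shifter(motion_dim):
    """FC stack mapping an extra-visual motion estimate to 2x3 affine parameters."""
    motion = layers.Input(shape=(motion_dim,))
    x = motion
    for width in (512, 4096, 4096, 512):
        x = layers.Dense(width, activation='relu')(x)
    theta = layers.Dense(6)(x)   # final projection to 6 affine parameters (assumed)
    return tf.keras.Model(motion, [x, theta])

def build_local_shifter_encoder():
    """Convolutional encoder of the Local Shifter pathway (5x5 kernels, stride 2)."""
    image = layers.Input(shape=(224, 224, 1))
    x = image
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, 5, strides=2, padding='same', activation='relu')(x)
    return tf.keras.Model(image, x)
```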
4.7 Evaluation

Predicted pixel correspondence between source and target images was evaluated against ground truth correspondence, correspondence computed by the DeepMatching algorithm (Revaud et al., 2016), and correspondence computed by the Deformable Spatial Pyramid Matching algorithm (Kim et al., 2013) on the EuRoC MAV dataset (Burri et al., 2016) and the KITTI Visual Odometry dataset (Geiger et al., 2013). For our experiments, ground truth is the new pixel coordinate of a given point in the scene following camera motion. To calculate these new coordinates, we need (1) the depth of each point in the scene, (2) the 3D transformation between camera poses, and (3) the camera matrix and the transforms between each frame.

4.7.1 EuRoC Ground Truth Generation

We used the Kalibr-derived distortion parameters D and camera intrinsic parameters K to undistort each image and capture the new camera intrinsic matrix K_undistort. This yielded images of size 952x503 from images originally sized 752x480. To obtain depth estimates for each point in each grayscale image, we rendered range images from the ground truth point cloud of the Vicon Room. We converted each ply pointcloud into a pcd pointcloud and rendered range images using PCL according to the camera intrinsic matrix K_undistort and the ground truth position of the MAV transformed into the camera frame.

For each image and depth pair, we ray traced each pixel coordinate (u_t0, v_t0) using the horizontal and vertical fields of view calculated from the focal lengths in K_undistort, normalized the resulting [X, Y, 1]^T coordinates, and multiplied by the depth at each pixel location to generate coordinates in the camera frame [X_c^t0, Y_c^t0, Z_c^t0, 1]^T.

Then, for the 4x4 transformation matrix H_WC^t0 that transforms a vector from the camera frame C to the world frame W at time t0, and another transformation matrix H_WC^t1 that transforms a vector from the camera frame C to the world frame W at time t1, we calculated the 4x4 transform matrix H_{t0→t1} that maps points from the camera frame at t0 to the camera frame at t1 as

H_{t0→t1} = (H_WC^t1)^-1 H_WC^t0    (4.7)

and then projected points in the camera frame from t0 to t1:

[X_c^t1, Y_c^t1, Z_c^t1, 1]^T = H_{t0→t1} [X_c^t0, Y_c^t0, Z_c^t0, 1]^T    (4.8)

Finally, we applied the camera matrix K_undistort to project the points onto the imaging plane and recover ground-truth pixel coordinates:

s [u_t1, v_t1, 1]^T = K_undistort [X_c^t1, Y_c^t1, Z_c^t1]^T, with s = Z_c^t1    (4.9)

The recovered mapping from [u_t0, v_t0, 1]^T to [u_t1, v_t1, 1]^T allows us to project points in the visual scene from one camera position to another and thus provides ground truth correspondence between two image frames.

4.7.2 KITTI Ground Truth Generation

Ground truth for KITTI was calculated similarly to EuRoC (described above) with three notable exceptions:

1. KITTI images were already undistorted;

2. Depth was provided by Velodyne laser scans rather than range images rendered from a point cloud; and

3. The resulting correspondence map between pixel locations at different camera locations was only valid when depth was within the Velodyne laser scanner's range.
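For concreteness, the reprojection chain of Eqs. 4.7-4.9 can be sketched in NumPy as follows. The function name, argument layout, and homogeneous-coordinate handling are illustrative assumptions; the actual pipeline additionally performs undistortion, range-image rendering, and validity masking as described above.

```python
import numpy as np

def reproject_pixels(points_c0, H_wc_t0, H_wc_t1, K):
    """Project 3D points from the camera frame at t0 into pixel coordinates at t1.

    points_c0 : (N, 3) points in the camera frame at t0 (ray direction * depth).
    H_wc_t0   : (4, 4) camera-to-world transform at t0.
    H_wc_t1   : (4, 4) camera-to-world transform at t1.
    K         : (3, 3) camera intrinsic matrix (K_undistort for EuRoC).
    """
    # Eq. 4.7: transform mapping camera-frame points at t0 to the camera frame at t1.
    H_t0_to_t1 = np.linalg.inv(H_wc_t1) @ H_wc_t0
    # Eq. 4.8: apply the transform in homogeneous coordinates.
    homog = np.hstack([points_c0, np.ones((points_c0.shape[0], 1))])
    points_c1 = (H_t0_to_t1 @ homog.T).T[:, :3]
    # Eq. 4.9: perspective projection onto the t1 image plane.
    uv1 = (K @ points_c1.T).T
    return uv1[:, :2] / uv1[:, 2:3]
```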
Figure 4.3: Inverse pixel error (higher is better) for each condition for EuRoC and KITTI ((a) EuRoC, (b) KITTI). The KM condition for EuRoC is omitted for plotting convenience but included in the table below.

4.8 Results and Discussion

To test our first hypothesis, we looked at whether IMU data is a suitable surrogate for the ground truth transform used in previous incarnations of DeepEfference (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted). The networks are able to learn an appropriate global transform and, especially in the case of the KITTI dataset, perform approximately the same when using IMU data. For KITTI, there was only a 15% MSE performance decrease between the GP pose and the IMU conditions. For EuRoC, however, there was a 73% MSE performance decrease between the GP pose and IMU conditions.

To test our second hypothesis, we looked at whether augmenting IMU data with a surrogate feed-forward motor signal increased MSE performance compared to the IMU-only case. For EuRoC, there was a 13% MSE performance increase between the IMU and IMU+FFMS conditions. For KITTI, there was an 8% MSE performance increase between the IMU and IMU+FFMS conditions. This performance increase suggests that a network trained with a surrogate FFMS is able to mitigate the performance impacts caused by an IMU's inability to directly measure velocity.

Figure 4.4: Sample KITTI correspondence results ((a) source, (b) target, (c)-(e) IDE). Note that only every other keypoint (horizontally and vertically) is shown; the actual unaltered output is fully dense.

Figure 4.5: Sample KITTI correspondence results (presented as in Fig. 4.4).

Figure 4.6: Sample KITTI correspondence results (presented as in Fig. 4.4).

Figure 4.7: Sample EuRoC correspondence results (presented as in Fig. 4.4).

Figure 4.8: Sample EuRoC correspondence results (presented as in Fig. 4.4).

Figure 4.9: Sample EuRoC correspondence results (presented as in Fig. 4.4).

Table 4.1: Pixel MSE for the EuRoC dataset

              Mean    Standard Deviation   Median
DM            3.66    7.68                 1.93
DSP           4.22    8.07                 1.89
DE GP         2.63    2.52                 2.16
DE IMU        6.09    6.029                4.54
DE FFMS       13.19   5.60                 10.2
DE IMU+MF     6.05    5.56                 4.56
DE IMU+FFMS   5.39    4.808                4.20

Table 4.2: Mean Pixel Error for the KITTI dataset

              Mean    Standard Deviation   Median
DM            1.98    2.80                 1.49
DSP           2.49    4.34                 1.40
DE GP         2.23    2.03                 1.54
DE IMU        2.63    2.40                 1.93
DE FFMS       5.01    4.55                 3.58
DE IMU+FFMS   2.42    2.26                 1.81

The performance difference between the KITTI and EuRoC IMU conditions may also have to do with the differing quality of the IMUs used for the KITTI data collection versus the EuRoC data collection. For KITTI, the IMU velocity random walk was 0.005 m/s/√hr and the angular random walk was 0.2°/√hr. For EuRoC, the velocity random walk was 0.11 m/s/√hr and the angular random walk was 0.66°/√hr.

It is worth noting that for optimal performance, our approach requires sensor streams to be temporally synchronized. For the EuRoC dataset, we did not have hardware-synchronized motor feedback as we did for both image and IMU data. This may explain why the inclusion of motor data did not seem to significantly affect performance.

DM and DSP both also performed worse on EuRoC than IDE with ground truth pose information. It would seem that the network was able to effectively learn how to approximate the ground-truth-encoded 3D transform as a 2D transform, as well as to approximate differences in scene depth, so as to effectively apply the global transformation and non-linear corrections.
This is potentially because EuRoC has many low-texture regions that were difficult for DSP and DM.

Table 4.3: Average runtimes with equivalent frames per second (FPS)

              # Hypotheses   Mean        Standard Deviation   Median     FPS
DM            N/A            0.4115 s    1.3e-2               0.407 s    2.43
DSP           N/A            1.252 s     9.23e-2              0.0024 s   0.798
DE GP         1              0.00242 s   1.2e-3               0.0023 s   412.1
DE IMU        1              0.00240 s   1.29e-3              0.0023 s   416.6
DE FFMS       1              0.00244 s   1.3e-3               0.0023 s   409.1
DE IMU+FFMS   1              0.00245 s   1.2e-3               0.002 s    408.1

The motor feedback in the MF condition may have served to confuse the network. For example, current motor feedback may not reflect the current vehicle velocity and may instead be a better estimate of future acceleration. Thus, another possibility is that motor feedback encoded largely redundant information compared to IMU data.

4.9 Conclusions and Future Work

On the EuRoC MAV dataset (Burri et al., 2016) and on the KITTI Odometry dataset (Geiger et al., 2013), our results demonstrate that IDE performs dense correspondence 167x faster than DM and 516x faster than DSP.

One area of future interest is in training IDE networks for use on unseen datasets. In this case, because the two networks were trained with inputs of different dimensions (the IMU input was 50x6 for EuRoC and 20x6 for KITTI), we were unable to test the transferability of the architecture. Another challenge will arise from the varying coordinate reference frame conventions used by IMUs (and other extra-visual sensors). Ideally, we would use raw, unfiltered IMU data as input to the network. For example, it is generally common to calibrate an IMU (to correct for bias, noise, and mis-calibration offsets) and to designate an alignment of its coordinate system. This means that we may need to perform an additional operation at the beginning of the network to transform the IMU or other extra-visual data to the reference frame used for training (e.g., Z may be up for one dataset and forward for another, or the handedness of the coordinate systems may be reversed).

Future work should include additional experiments with larger datasets that at present do not exist. We are in the process of creating a far larger dataset captured using ground robots in a Vicon arena. Because the network is unsupervised and uses predictive errors as a learning signal, we aim to collect large amounts of data with which we can better test this network architecture.

In all experiments except those with the EuRoC dataset under the IMU-only or motor-feedback-only network conditions, ground truth was in some way coupled to network inputs. This means that it could conceivably be possible that the networks were able to learn an additional relationship that then unfairly biased performance results. However, the key results in this thesis are in the runtimes of SOA approaches compared to my DeepEfference networks. Exact RMSE performance metrics need only be approximate. However, future experiments still need to be carried out that completely decouple ground truth-related calculations from network inputs.

Bibliography

Abadi, Martín et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
Abrams, Richard A, David E Meyer, and Sylvan Kornblum (1989). "Speed and accuracy of saccadic eye movements: characteristics of impulse variability in the oculomotor system." In: Journal of Experimental Psychology: Human Perception and Performance 15.3, p. 529.
Agrawal, Motilal and Kurt Konolige (2006).
“Real-time localization in outdoor environ- ments using stereo vision and inexpensive GPS”. In: Proceedings - International Conference on Pattern Recognition 3, pp. 1063–1068. ISSN: 10514651. DOI: 10. 1109/ICPR.2006.962. Anwar, Sajid, Kyuyeon Hwang, and Wonyong Sung (2015). “Structured pruning of deep convolutional neural networks”. In: arXiv preprint arXiv:1512.08571. Behroozmand, Roozbeh et al. (2009). “Vocalization-Induced Enhancement of the Audi- tory Cortex”. In: Clinical Neurophysiology 120.7, pp. 1303–1312. DOI: 10.1016/ j.clinph.2009.04.022.Vocalization-Induced. Bhargava, Preeti et al. (2012). “The robot baby and massive metacognition: Future vi- sion”. In: Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on. IEEE, pp. 1–2. Bottou, Le´on (2012). “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer, pp. 421–436. Brachmann, Eric et al. (2014). “Learning 6d object pose estimation using 3d object coor- dinates”. In: European conference on computer vision. Springer, pp. 536–551. Bradley, David M (2010). Learning in modular systems. Carnegie Mellon University. Bremmer, Frank et al. (2009). “Neural dynamics of saccadic suppression”. In: Journal of Neuroscience 29.40, pp. 12374–12383. Bridgeman, Bruce (2007). “Efference copy and its limitations”. In: Computers in Bi- ology and Medicine 37.7, pp. 924–929. ISSN: 00104825. DOI: 10.1016/j. compbiomed.2006.07.001. Bridgeman, Bruce, AHC Van der Heijden, and Boris M Velichkovsky (1994). “A the- ory of visual stability across saccadic eye movements”. In: Behavioral and Brain Sciences 17.2, pp. 247–257. Bridgeman, Bruce, Derek Hendry, and Lawrence Stark (1975). “Failure to detect dis- placement of the visual world during saccadic eye movements”. In: Vision research 15.6, pp. 719–722. Brody, Justin, Don Perlis, and Jared Shamwell (2015). “Who’s Talking?Efference Copy and a Robot’s Sense of Agency”. In: 2015 AAAI Fall Symposium Series. 105 Brody, Justin et al. (2016). “Reasoning with Grounded Self-Symbols for Human-Robot Interaction”. In: 2016 AAAI Fall Symposium Series. Brox, Thomas, Jitendra Malik, and C Bregler (2009). “Large displacement optical flow”. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 41–48. ISSN: 1939-3539. DOI: 10.1109/TPAMI.2010.143. Burr, David C., Michael J. Morgan, and M. Concetta Morrone (1999). “Saccadic suppres- sion precedes visual motion analysis”. In: Current Biology 9.20, pp. 1207–1209. ISSN: 09609822. DOI: 10.1016/S0960-9822(00)80028-7. Burri, Michael et al. (2016). “The EuRoC micro aerial vehicle datasets”. In: The Interna- tional Journal of Robotics Research 35.10, pp. 1157–1163. Byravan, Arunkumar and Dieter Fox (2017). “Se3-nets: Learning rigid body motion using deep neural networks”. In: Robotics and Automation (ICRA), 2017 IEEE Interna- tional Conference on. IEEE, pp. 173–180. Carruthers, Peter (2015). The centered mind: what the science of working memory shows us about the nature of human thought. OUP Oxford. Ciliberto, Carlo et al. (2012). “A heteroscedastic approach to independent motion detec- tion for actuated visual sensors”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3907–3913. ISSN: 2153-0858. DOI: 10.1109/ IROS.2012.6385943. Clark, M and L Stark (1975). “Time optimal behavior of human saccadic eye movement”. In: IEEE Transactions on Automatic Control 20.3, pp. 345–348. Connor, Charles E (2001). “Shifting receptive fields”. In: Neuron 29.3, pp. 
548–549. Davison, A J (2003). “Real-time Simultaneous Localisation and Mapping with a Single Camera”. In: Iccv 2, pp. 1403–1410. ISSN: 87551209. DOI: 10.1109/ICCV. 2003.1238654. arXiv: arXiv:1407.5736v1. Deubel, Heiner, Bruce Bridgeman, and Werner X Schneider (2004). “Different effects of eyelid blinks and target blanking on saccadic suppression of displacement.” In: Perception & psychophysics 66.5, pp. 772–778. ISSN: 0031-5117. DOI: 10.3758/ BF03194971. Deubel, Heiner, Carmen Koch, and Bruce Bridgeman (2010). “Landmarks facilitate vi- sual space constancy across saccades and during fixation”. In: Vision Research 50.2, pp. 249–259. ISSN: 00426989. DOI: 10.1016/j.visres.2009.09.020. Deubel, Heiner, Werner X. Schneider, and Bruce Bridgeman (1996). “Postsaccadic tar- get blanking prevents saccadic suppression of image displacement”. In: Vision Re- search 36.7, pp. 985–996. ISSN: 00426989. DOI: 10.1016/0042-6989(95) 00203-0. Duhamel, Jean-Rene´, Carol L Colby, and Michael E Goldberg (1992). “The updating of the representation of visual space in parietal cortex by intended eye movements”. In: Science 255.5040, p. 90. Eliades, Steven J and Xiaoqin Wang (2003). “Sensory-Motor Interaction in the Primate Auditory Cortex During”. In: Journal of neurophysiology 89.2194, pp. 2194–2207. Eliades, Steven J. and Xiaoqin Wang (2008). “Neural substrates of vocalization feedback monitoring in primate auditory cortex”. In: Nature 453.7198, pp. 1102–1106. ISSN: 0028-0836. DOI: 10.1038/nature06910. 106 Enderle, John D and James W Wolfe (1987). “Time-optimal control of saccadic eye move- ments”. In: IEEE transactions on biomedical engineering 1, pp. 43–55. Enkelmann, Wilfried (1991). “Obstacle detection by evaluation of optical flow fields from image sequences”. In: Image and Vision Computing 9.3, pp. 160–168. ISSN: 02628856. DOI: 10.1016/0262-8856(91)90010-M. Feldman, Anatol G. (2009). “New insights into actionperception coupling”. In: Exper- imental Brain Research 194.1, pp. 39–58. ISSN: 0014-4819. DOI: 10.1007/ s00221-008-1667-3. — (2016). “Active sensing without efference copy: referent control of perception”. In: Journal of Neurophysiology 2111.514, jn.00016.2016. ISSN: 0022-3077. DOI: 10.1152/jn.00016.2016. Flinker, a. et al. (2010). “Single-Trial Speech Suppression of Auditory Cortex Activity in Humans”. In: Journal of Neuroscience 30.49, pp. 16643–16650. ISSN: 0270-6474. DOI: 10.1523/JNEUROSCI.1809-10.2010. Fraundorfer, Friedrich and Davide Scaramuzza (2012). “Visual odometry: Part II: Match- ing, robustness, optimization, and applications”. In: IEEE Robotics & Automation Magazine 19.2, pp. 78–90. Geiger, A et al. (2013). “Vision meets robotics: The KITTI dataset”. In: The International Journal of Robotics Research 32.11, pp. 1231–1237. ISSN: 0278-3649. DOI: 10. 1177/0278364913491297. Glorot, Xavier, Antoine Bordes, and Yoshua Bengio (2011). “Deep Sparse Rectifier Neu- ral Networks.” In: Aistats. Vol. 15. 106, p. 275. Greenlee, Jeremy D. W. et al. (2011). “Human Auditory Cortical Activation during Self- Vocalization”. In: PLoS ONE 6.3, e14744. ISSN: 1932-6203. DOI: 10.1371/ journal.pone.0014744. Gru¨sser, O. J., A. Krizicˇ, and L. R. Weiss (1987). “Afterimage movement during saccades in the dark”. In: Vision Research 27.2. ISSN: 00426989. DOI: 10.1016/0042- 6989(87)90184-2. Han, Song, Huizi Mao, and William J Dally (2015). “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding”. In: arXiv preprint arXiv:1510.00149. Hariharan, Bharath et al. (2015). 
“Hypercolumns for object segmentation and fine-grained localization”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456. Harris, Christopher M (1998). “On the optimal control of behaviour: a stochastic perspec- tive”. In: Journal of neuroscience methods 83.1, pp. 73–88. Harris, Christopher M and Daniel M Wolpert (1998). “Signal-dependent noise determines motor planning”. In: Nature 394.6695, p. 780. — (2006). “The main sequence of saccades optimizes speed-accuracy trade-off”. In: Biological cybernetics 95.1, pp. 21–29. Harwood, Mark R, Laura E Mezey, and Christopher M Harris (1999). “The spectral main sequence of human saccades”. In: Journal of Neuroscience 19.20, pp. 9098–9106. He, Kaiming et al. (2014). “Spatial pyramid pooling in deep convolutional networks for visual recognition”. In: European Conference on Computer Vision. Springer, pp. 346–361. 107 He, Kaiming et al. (2015). “Delving deep into rectifiers: Surpassing human-level per- formance on imagenet classification”. In: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Hinton, Geoffrey E., Alex Krizhevsky, and Sida D. Wang (2011). “Transforming auto- encoders”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6791 LNCS.PART 1, pp. 44–51. ISSN: 03029743. DOI: 10.1007/978-3-642-21735-7_6. arXiv: 9605103 [cs]. Holst, E and H Mittelstaedt (1950). “The principle of reafference: Interactions between the central nervous system and the peripheral organs”. In: PC Dodwell (Ed. and Trans.), Perceptual processing: Stimulus equivalence and pattern recognition 1950, pp. 41–72. Ibbotson, Michael R et al. (2008). “Saccadic modulation of neural responses: possible roles in saccadic suppression, enhancement, and time compression”. In: Journal of Neuroscience 28.43, pp. 10952–10960. Ibbotson, Michael and Bart Krekelberg (2011). “Visual perception and saccadic eye move- ments”. In: Current opinion in neurobiology 21.4, pp. 553–558. Irani, Michal and P Anandan (1998). “A unified approach to moving object detection in 2D and 3D scenes”. In: IEEE transactions on pattern analysis and machine intelli- gence 20.6, pp. 577–589. Jaderberg, Max et al. (2015). “Spatial Transformer Networks”. In: Nips, pp. 1–14. ISSN: 1087-0156. DOI: 10.1038/nbt.3343. arXiv: arXiv:1506.02025v1. Jia, Yangqing et al. (2014). “Caffe: Convolutional Architecture for Fast Feature Embed- ding”. In: arXiv preprint arXiv:1408.5093. Joiner, Wilsaan M, James Cavanaugh, and Robert H Wurtz (2013). “Compression and suppression of shifting receptive field activity in frontal eye field neurons.” In: The Journal of neuroscience : the official journal of the Society for Neuroscience 33.46, pp. 18259–69. ISSN: 1529-2401. DOI: 10.1523/JNEUROSCI.2964- 13. 2013. Kagan, Igor, Moshe Gur, and D Max Snodderly (2008). “Saccades and drifts differentially modulate neuronal activity in V1: effects of retinal image motion, position, and extraretinal influences”. In: Journal of Vision 8.14, pp. 19–19. Kayama, YUKIHIKO et al. (1979). “Luxotonic responses of units in macaque striate cortex”. In: Journal of Neurophysiology 42.6, pp. 1495–1517. Kim, Jaechul et al. (2013). “Deformable spatial pyramid matching for fast dense cor- respondences”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2307–2314. Kitt, Bernd, Frank Moosmann, and Christoph Stiller (2010). 
“Moving on to dynamic en- vironments: Visual odometry using feature classification”. In: IEEE/RSJ 2010 In- ternational Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings, pp. 5551–5556. ISSN: 2153-0858. DOI: 10.1109/IROS.2010. 5650517. Kivinen, Jyri J. and Christopher K I Williams (2011). “Transformation equivariant Boltz- mann machines”. In: Lecture Notes in Computer Science (including subseries Lec- ture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6791 108 LNCS.PART 1, pp. 1–9. ISSN: 03029743. DOI: 10 . 1007 / 978 - 3 - 642 - 21735-7_1. Kohashi, Tsunehiko and Yoichi Oda (2008). “Initiation of Mauthner-or non-Mauthner- mediated fast escape evoked by different modes of sensory input”. In: The Jour- nal of Neuroscience 28.42, pp. 10641–53. ISSN: 1529-2401. DOI: 10.1523/ JNEUROSCI.1435-08.2008. Korn, H and DS Faber (2005). “The Mauthner cell half a century later: a neurobiological model for decision-making?” In: Neuron 47.1, pp. 13–28. ISSN: 0896-6273. DOI: 10.1016/j.neuron.2005.05.019. Kumar, Sriram et al. (2015). “Object segmentation using independent motion detection”. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids) 1, pp. 94–100. ISSN: 21640580. DOI: 10.1109/HUMANOIDS.2015.7363537. Laidlaw, Kaitlin EW and Alan Kingstone (2010). “The time course of vertical, horizontal and oblique saccade trajectories: Evidence for greater distractor interference during vertical saccades”. In: Vision research 50.9, pp. 829–837. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444. Lefaix, G., T. Marchand, and P. Bouthemy (2002). “Motion-based obstacle detection and tracking for car driving assistance”. In: Object recognition supported by user inter- action for service robots 4.August, pp. 74–77. ISSN: 1051-4651. DOI: 10.1109/ ICPR.2002.1047403. Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). “Fully convolutional net- works for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Lucas, Bruce D, Takeo Kanade, et al. (1981). “An iterative image registration technique with an application to stereo vision”. In: Maas, Andrew L, Awni Y Hannun, and Andrew Y Ng (2013). “Rectifier nonlinearities improve neural network acoustic models”. In: Proc. ICML. Vol. 30. 1. Maimone, Mark, Yang Cheng, and Larry Matthies (2007). “Two years of visual odometry on the Mars Exploration Rovers”. In: Journal of Field Robotics 24.3, pp. 169–186. ISSN: 15564959. DOI: 10.1002/rob.20184. arXiv: 10.1.1.91.5767. Martinelli, Agostino (2012). “Vision and IMU data fusion: Closed-form solutions for attitude, speed, absolute scale, and bias determination”. In: IEEE Transactions on Robotics 28.1, pp. 44–60. Matin, E (1974). “Saccadic suppression: a review and an analysis.” In: Psychological bulletin 81.12, pp. 899–917. ISSN: 0033-2909. DOI: 10.1037/h0037368. McCormac, John et al. (2016). “SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth”. In: arXiv preprint arXiv:1612.05079. arXiv: 1612.05079. McFarland, James M et al. (2015). “Saccadic modulation of stimulus processing in pri- mary visual cortex”. In: Nature communications 6. Mehta, Biren and Stefan Schaal (2002). “Forward models in visuomotor control”. In: Journal of Neurophysiology 88.2, pp. 942–953. 109 Memisevic, R. and G. Hinton (2007). “Unsupervised Learning of Image Transforma- tions”. 
In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. ISSN: 1063-6919. DOI: 10.1109/CVPR.2007.383036. Memisevic, Roland (2013). “Learning to relate images”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8, pp. 1829–1846. ISSN: 01628828. DOI: 10. 1109/TPAMI.2013.53. arXiv: arXiv:1110.0107v2. Memisevic, Roland and Geoffrey E Hinton (2010). “Learning to represent spatial transfor- mations with factored higher-order Boltzmann machines.” In: Neural computation 22.6, pp. 1473–1492. ISSN: 0899-7667. DOI: 10.1162/neco.2010.01-09- 953. Miall, RC (1995). “Motor control, biological and theoretical”. In: The handbook of brain theory and neural networks, pp. 597–600. Nelson, Randal C (1991). “Qualitative detection of motion by a moving observer”. In: International journal of computer vision 7.1, pp. 33–46. Ni, Amy M, Scott O Murray, and Gregory D Horwitz (2014). “Object-centered shifts of receptive field positions in monkey primary visual cortex”. In: Current Biology 24.14, pp. 1653–1658. Nikolic, Janosch et al. (2014). “A synchronized visual-inertial sensor system with FPGA pre-processing for accurate real-time SLAM”. In: Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, pp. 431–437. Niziolek, C. a., S. S. Nagarajan, and J. F. Houde (2013). “What Does Motor Efference Copy Represent? Evidence from Speech Production”. In: Journal of Neuroscience 33.41, pp. 16110–16116. ISSN: 0270-6474. DOI: 10.1523/JNEUROSCI.2137- 13.2013. Noda, Hiroharu (1975). “Hiroharu noda”. In: pp. 579–595. Ostendorf, Florian and Raymond J Dolan (2015). “Integration of retinal and extraretinal information across eye movements”. In: PloS one 10.1, e0116810. Ostry, David J. and Anatol G. Feldman (2003). “A critical evaluation of the force control hypothesis in motor control”. In: Experimental Brain Research 153.3, pp. 275–288. ISSN: 00144819. DOI: 10.1007/s00221-003-1624-0. Panouille`res, Muriel TN et al. (2016). “Oculomotor adaptation elicited by intra-saccadic visual stimulation: Time-course of efficient visual target perturbation”. In: Frontiers in human neuroscience 10. Rajkai, Csaba et al. (2008). “Transient cortical excitation at the onset of visual fixa- tion”. In: Cerebral Cortex 18.1, pp. 200–209. ISSN: 10473211. DOI: 10.1093/ cercor/bhm046. Ranzato, Marc Aurelio and Geoffrey E Hinton (2010). “Factored 3-Way Restricted Boltz- mann Machines For Modeling Natural Images”. In: Artificial Intelligence 9, pp. 621– 628. ISSN: 10636919. DOI: 10.1109/CVPR.2010.5539962. Revaud, Jerome et al. (2015). “EpicFlow : Edge-Preserving Interpolation of Correspon- dences for Optical Flow”. In: Cvpr 2015. DOI: 10.1063/1.4905777. arXiv: arXiv:1501.0256. Revaud, Jerome et al. (2016). “DeepMatching : Hierarchical Deformable Dense Match- ing”. In: International Journal of Computer Vision 120.3, pp. 300–323. 110 Ross, John et al. (2001). “Changes in visual perception at the time of saccades”. In: Trends Neurosci. 24.2, pp. 113–121. Rosten, Edward, Reid Porter, and Tom Drummond (2010). “Faster and better: A machine learning approach to corner detection”. In: IEEE transactions on pattern analysis and machine intelligence 32.1, pp. 105–119. Scharstein, Daniel and Richard Szeliski (2002). “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms”. In: International journal of com- puter vision 47.1-3, pp. 7–42. Schmidt, Tanner, Richard Newcombe, and Dieter Fox (2017). “Self-supervised visual descriptor learning for dense correspondence”. 
In: IEEE Robotics and Automation Letters 2.2, pp. 420–427. Sermanet, Pierre et al. (2013). “Overfeat: Integrated recognition, localization and detec- tion using convolutional networks”. In: arXiv preprint arXiv:1312.6229. Shamwell, E. Jared, William D. Nothwang, and Donald Perlis (2017). “DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspon- dence”. In: Development and Learning and Epigenetic Robotics (ICDL), 2017 IEEE International Conference on. IEEE. — (Accepted). “A Deep Neural Network Approach to Fusing Vision and Heteroscedas- tic Motion Estimates for Low-SWaP Robotic Applications”. In: Multisensor Fusion and Integration for Intelligent Systems, 2017 International Conference on. IEEE. Shamwell, Jared et al. (2012). “The robot baby and massive metacognition: Early steps via growing neural gas”. In: Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on. IEEE, pp. 1–2. Shotton, Jamie et al. (2013). “Scene coordinate regression forests for camera relocal- ization in RGB-D images”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937. Simonyan, Karen and Andrew Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556. Sommer, M A and R H Wurtz (2006). “Influence of the thalamus on spatial visual pro- cessing in frontal cortex”. In: Nature 444.7117, pp. 374–377. ISSN: 0028-0836. DOI: 10.1038/nature05279. Sperry, R W (1950). “Neural Basis of the Spontaneous Optokinetic Response Produced By Visual Inversion”. In: Journal of Comparative and Physiological Psychology 43.6, pp. 482–489. Sylvester, Richard and Geraint Rees (2006). “Extraretinal saccadic signals in human LGN and early retinotopic cortex”. In: Neuroimage 30.1, pp. 214–219. Taylor, Jonathan et al. (2012). “The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, pp. 103–110. Thilo, Kai V et al. (2004). “The site of saccadic suppression.” In: Nature neuroscience 7.1, pp. 13–4. ISSN: 1097-6256. DOI: 10.1038/nn1171. Tomasi, Carlo and Takeo Kanade (1991). “Detection and tracking of point features”. In: Troncoso, Xoana G et al. (2015). “V1 neurons respond differently to object motion versus motion from eye movements.” In: Nature communications 6, p. 8114. ISSN: 2041- 1723. DOI: 10.1038/ncomms9114. 111 Volkmann, Frances C, Amy M Schick, and Lorrin A Riggs (1968). “Time course of visual inhibition during voluntary saccades.” In: Journal of the Optical Society of America 58.4, pp. 562–569. ISSN: 0030-3941. DOI: 10.1364/JOSA.58.000562. Walker, Robin, Eugene McSorley, and Patrick Haggard (2006). “The control of saccade trajectories: Direction of curvature depends on prior knowledge of target location and saccade latency”. In: Attention, Perception, & Psychophysics 68.1, pp. 129– 138. Wang, Xiaolong and Abhinav Gupta (2015). “Unsupervised learning of visual represen- tations using videos”. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Wen, Wei et al. (2016). “Learning structured sparsity in deep neural networks”. In: Ad- vances in Neural Information Processing Systems, pp. 2074–2082. Wiegner, Allen W and M Margaret Wierzbicka (1992). “Kinematic models and human elbow flexion movements: quantitative analysis”. In: Experimental Brain Research 88.3, pp. 665–673. 
Wohlhart, Paul and Vincent Lepetit (2015). “Learning descriptors for object recognition and 3d pose estimation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3109–3118. Wolpert, Daniel M, Zoubin Ghahramani, and Michael I Jordan (1994). “Perceptual distor- tion contributes to the curvature of human reaching movements”. In: Experimental brain research 98.1, pp. 153–156. — (1995). “An internal model for sensorimotor integration”. In: Science, pp. 1880– 1882. Zhou, Tinghui et al. (2016). “View Synthesis by Appearance Flow”. In: European Con- ference on Computer Vision 1, pp. 286–301. arXiv: arXiv:1605.03557v2. Zirnsak, Marc and Tirin Moore (2014). “Saccades and shifting receptive fields: Antici- pating consequences or selecting targets?” In: Trends in Cognitive Sciences 18.12, pp. 621–628. ISSN: 1879307X. DOI: 10.1016/j.tics.2014.10.002. 112