ABSTRACT Title of dissertation: NEURO-INSPIRED AUGMENTATIONS OF UNSUPERVISED DEEP NEURAL NETWORKS FOR LOW-SWAP EMBODIED ROBOTIC PERCEPTION E. Jared Shamwell Doctor of Philosophy, 2017 Dissertation directed by: Professor Donald Perlis Department of Computer Science Despite the 3-4 saccades the human eye undergoes per second, humans perceive a stable, visually constant world. During a saccade, the projection of the visual world shifts on the retina and the net displacement on the retina would be identical if the entire visual world were instead shifted. However, humans are able to perceptually distinguish between these two conditions and perceive a stable world in the first condition, and a moving world in the second. Through new analysis, I show how biological mechanisms theorized to enable vi- sual positional constancy implicitly contain rich, egocentric sensorimotor representations and with appropriate modeling and abstraction, artificial surrogates for these mechanisms can enhance the performance of robotic systems. In support of this view, I have developed a new class of neuro-inspired, unsuper- vised, heterogeneous, deep predictive neural networks that are approximately 5,000%- 22,000% faster (depending on the network configuration) than state-of-the-art (SOA) dense approaches and with comparable performance. Each model in this new family of network architectures, dubbed LightEfference (LE) (Chapter 2), DeepEfference (DE) (Chapter 2), Multi-Hypothesis DeepEfference (MHDE) (Chapter 3), and Inertial DeepEfference (IDE) (Chapter 4) respectively, achieves its substantial runtime performance increase by leveraging the embodied nature of mobile robotics and performing early fusion of freely available heterogeneous sensor and mo- tor/intentional information. With these architectures, I show how embedding extra-visual information meant to encode an estimate of an embodied agent’s immediate intention supports efficient computations of visual constancy and odometry and greatly increases computational efficiency compared to comparable single-modality SOA approaches. NEURO-INSPIRED AUGMENTATIONS OF UNSUPERVISED DEEP NEURAL NETWORKS FOR LOW-SWAP EMBODIED ROBOTIC PERCEPTION by Earl Jared Shamwell Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2017 Advisory Committee: Professor Don Perlis, Chair/Advisor Professor Dan Butts Professor Avis Cohen Professor Timothy Horiuchi Dr. William Nothwang Professor Peter Carruthers, Dean’s Representative c© Copyright by Earl Jared Shamwell 2017 Dedicated to my parents. ii Acknowledgments I owe my deepest gratitude to the many people who have made this dissertation possible. Looking back, it’s hard not to wonder at the multitude of fortuitous events that have culminated in this doctoral dissertation. I would like to thank my Advisor, Professor Don Perlis, for his tireless patience and faith in me as I explored my curiosity. I would also like to thank my Mentor, Dr. William Nothwang, for his guidance, support, and helping me to see things in myself that I did not always know were there. For support (financial and otherwise), I would like to thank the Sensors and Elec- tron Devices Directorate, Army Research Laboratory (ARL), and in particular, Dr. Brett Piekarski. Without Don, Will, and ARL, this dissertation would not have been possible. 
I would like to acknowledge Professors Dan Butts and Avis Cohen for the many conversations over the years that continue to shape my thinking as a scientist and engineer. I would also like to acknowledge my committee members Professors Timothy Horiuchi and Peter Carruthers for helping shape my dissertation. For instilling in me a stubbornness and belief that anything is possible and support- ing me in everything I have pursued, my parents, to whom this dissertation is dedicated, deserve a special thank you. Finally, I must acknowledge Lunet Luna with deep thanks for being a constant source of encouragement, even across three-time zones. iii Table of Contents Dedication ii Acknowledgements iii List of Tables vii List of Figures viii List of Abbreviations x 1 Introduction 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivations and Background . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Saccadic Suppression . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1.1 Mathematical Representation . . . . . . . . . . . . . . 4 1.2.1.2 Failures of Saccadic Suppression . . . . . . . . . . . . 4 1.2.2 Efference Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.2.1 Mathematical Representation . . . . . . . . . . . . . . 7 1.2.3 Failures of Efference Copy (EC) . . . . . . . . . . . . . . . . . . 8 1.2.4 Alternate Theories and Synthesis . . . . . . . . . . . . . . . . . . 12 1.2.4.1 Referent Control (RC) and High-Level Interactions . . 12 1.2.4.2 Landmark Theory . . . . . . . . . . . . . . . . . . . . 14 1.2.4.3 Mathematical Representation . . . . . . . . . . . . . . 15 1.2.5 Visual Odometry . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.2.6 Feed-Forward Feature Selection . . . . . . . . . . . . . . . . . . 21 1.2.7 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 22 1.3 Chapter Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.3.1 Chapter 2 - DeepEfference: Learning to Predict the Sensory Con- sequences of Action Through Deep Correspondence . . . . . . . 26 1.3.2 Chapter 3 - A Deep Neural Network Approach to Fusing Vision and Heteroscedastic Motion Estimates for Low-SWaP Robotic Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 iv 1.3.3 Chapter 4 - An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics . . . . . . 28 2 DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence 29 2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.1 Deep Approaches to Spatial Transformation Encoding and Learning 33 2.3.2 Extra-Visual Motion Estimates . . . . . . . . . . . . . . . . . . . 34 2.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4.1 Training and Loss Rule . . . . . . . . . . . . . . . . . . . . . . . 36 2.4.2 Pathway 1: Local, Pixel-Level Shifter . . . . . . . . . . . . . . . 36 2.4.3 Pathway 2: Global Spatial Transformer . . . . . . . . . . . . . . 37 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.5.1 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . 40 2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.7 Discussion . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . 45 3 A Deep Neural Network Approach to Fusing Vision and Heteroscedastic Motion Estimates for Low-SWaP Robotic Applications 50 3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.1 Visual Odometry and Multi-Sensor Fusion . . . . . . . . . . . . 54 3.3.2 Deep Spatial Transformations . . . . . . . . . . . . . . . . . . . 56 3.3.3 Extra-Modal Motion Estimates and Heteroscedastic Noise . . . . 58 3.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.1 Winner-Take-All (WTA) Loss Rule . . . . . . . . . . . . . . . . 59 3.4.2 Pathway 1: Global Spatial Transformer . . . . . . . . . . . . . . 61 3.4.3 Pathway 2: Local, Pixel-Level Shifter . . . . . . . . . . . . . . . 62 3.5 Experimental Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5.1 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.9.1 Extended Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 69 3.9.2 Extended Training Procedures . . . . . . . . . . . . . . . . . . . 70 4 An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics 75 4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 v 4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.5.1 Pathway 1: Global Shifter . . . . . . . . . . . . . . . . . . . . . 82 4.5.2 Pathway 2: Local, Pixel-Level Shifter . . . . . . . . . . . . . . . 83 4.5.3 Spatial Transformations . . . . . . . . . . . . . . . . . . . . . . 83 4.6 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.6.1 Datasets and Data Generation . . . . . . . . . . . . . . . . . . . 86 4.6.1.1 EuRoC MAV . . . . . . . . . . . . . . . . . . . . . . . 87 4.6.1.2 KITTI Odometry . . . . . . . . . . . . . . . . . . . . 89 4.6.2 Network Parameters and Training Procedures . . . . . . . . . . . 90 4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.7.1 EuRoC Ground Truth Generation . . . . . . . . . . . . . . . . . 91 4.7.2 KITTI Ground Truth Generation . . . . . . . . . . . . . . . . . . 92 4.8 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.9 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 102 Bibliography 105 vi List of Tables 2.1 Average runtimes for DM, LE, and DE . . . . . . . . . . . . . . . . . . . 43 2.2 Pixel errors for DM, LE, and DE on KITTI and SceneNet . . . . . . . . . 
48 3.1 Average runtimes for DeepMatching (DM) and Multi-Hypothesis Deep- Efference (MHDE) with equivalent frames per second (FPS) . . . . . . . 67 4.1 Pixel MSE for the EuRoC dataset . . . . . . . . . . . . . . . . . . . . . 100 4.2 Mean Pixel Error for the KITTI dataset . . . . . . . . . . . . . . . . . . 101 4.3 Average runtimes with equivalent frames per second (FPS) . . . . . . . . 102 vii List of Figures 2.1 Sample results from KITTI Odometry. A: Sample source image. B: Sam- ple target image. C: DeepEfference output reconstruction of the source image in A using pixel intensity values sampled from the target image in B. D and E: Source and target images with marked correspondence points computed by DeepEfference. . . . . . . . . . . . . . . . . . . . . . . . 30 2.2 DeepEfference network diagram showing the linked global and local learn- ers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.3 Pixel error boxplots for DM, LE, and DE using DM and FAST key- points. Y-axis is actual mean pixel error. Middle lines are the medians and whiskers indicate 1.5 interquartile of the lower and upper quartiles. Note outliers are not shown for clarity and instead minimum and maxi- mums are presented in the table below. . . . . . . . . . . . . . . . . . . 42 2.4 Unusual DE and LE sample results from KITTI and SceneNet. The red boxes highlight areas in the reconstructed image that were imagined by DeepEfference. Green boxes highlight instances where DE was able to better predict object positions in a scene with strong depth contrast while LE generated a poorer reconstruction. The last row shows a failure case where both LE and DE were unable to generate a reconstruction. This is most likely due to the extreme transformation between the camera at the time of source image capture versus target image capture. . . . . . . . . 49 3.1 Sample MHDE outputs from different hypothesis pathways. A-E: MHDE ouputs from 5 pathways. D shows the output from an inactive pathway (i.e. a pathway that the network did not optimize). F-E: Reconstruction error for the hypotheses shown in A-E. From F we can see that the recon- struction shown in A had the lowest error (yellow-dashed box). . . . . . 53 3.2 MHDE network diagram with two hypotheses shown for brevity. We experimented with up to 8 hypotheses in this work. . . . . . . . . . . . . 56 3.3 Heteroscedastic noise as a function of transform magnitude for the X and Y components of the transform input over the test set for a network with a noise parameter α = 0.25. . . . . . . . . . . . . . . . . . . . . . . . . 64 viii 3.4 Inverse mean pixel error (higher is better) for several noise conditions produced by MHDE networks trained to generate 1, 2, 4, or 8 maximum hypotheses. Dashed line is DM (SOA) error. . . . . . . . . . . . . . . . . 72 3.5 Inverse mean pixel error (higher is better) for several noise conditions pro- duced by MHDE networks. Results are from the same networks shown in Fig. 3.4 but are instead plotted as a function of active pathways learned by each network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.6 Activation by pathway for the different noise conditions. Only networks with maximum hypotheses of 4 or 8 are shown. . . . . . . . . . . . . . . 74 4.1 IDE network diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.2 Histogram of error between the ground truth and k-means cluster to which that exemplar was assigned. . . . . . . . . . . . . . . . . . . . . . . . . 
87 4.3 Inverse pixel error (higher is better) for each condition for EuRoC and KITTI. The KM condition for EuRoC is omitted for plotting convenience but included in the table below. . . . . . . . . . . . . . . . . . . . . . . 93 4.4 Sample KITTI correspondence results. Note that only every other key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.5 Sample KITTI correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.6 Sample KITTI correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.7 Sample EuRoc correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.8 Sample EuRoc correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4.9 Sample EuRoc correspondence results. Note that only every only key- point (horizontally and vertically) is shown and the actual unaltered out- put is fully dense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 ix List of Abbreviations A1 Primary Auditory Cortex BPTT Back Propagation Through Time CD Corollary Discharge CNN Convolutional Neural Network CNS Central Nervous System DE DeepEfference DM DeepMatching DNN Deconvolutional Neural Network DOF Degree of Freedom DSP Deformable Spatial Pyramids EC Efference Copy FAST Features from the Accelerated Segment Test FC Fully Connected FFMS Feed-Forward Motor Signal GPS Global Positioning System GPU Graphics Processing Unit HLEC High Level Efference Copy IDE Inertial DeepEfference IMU Inertial Measurement Unit INS Inertial Navigation System KLT Kanade-Lucas-Tomasi LIDAR Light Detection and Ranging LK Lucas-Kanade MAV Micro Aerial Vehicle MF Motor Feedback MHDE Multi-Hypothesis DeepEfference MLP Multi-Layer Perceptron PID Proportional-Integral-Derivative x RANSAC Random Sample Consensus RC Referent Control RCS Referent Command Signal ReLU Rectified Linear Unit RGB-D Red Green Blue Depth RMSE Root Mean Squared Error RTK Real-Time Kinematic SC Superior Colliculus SGD Stochastic Gradient Descent SOA State-of-the-Art ST Spatial Transformer SWaP Size Weight and Power V1 Primary Visual Cortex/Striate Cortex VI Visual Inertial VIO Visual Inertial Odometry VO Visual Odometry VSLAM Visual Simultaneous Localization and Mapping WTA Winner Take All xi Chapter 1: Introduction 1.1 Overview Despite the 3-4 saccades the human eye undergoes per second, humans perceive a stable, visually constant world. During a saccade, the projection of the visual world shifts on the retina and the net displacement on the retina would be identical if the entire visual world were instead shifted. However, humans are able to perceptually distinguish between these two conditions and perceive a stable world in the first condition (where visual positional constancy is maintained), and a moving world in the second (where visual positional constancy is not maintained). 
I argue that biological mechanisms supporting visual constancy contain a rich, egocentric representation of the environment and, with appropriate modeling and abstraction, can serve as prototypical models of both early heterogeneous sensor fusion and early sensorimotor fusion/interactions. Artificial surrogates of these prototypical mechanisms can enable enhanced:

• Prediction of the sensory consequences of self-induced actions;
• Action-based and perceptual anomaly detection; and
• Robotic visual odometry and localization.

In support of this view, I have developed a class of neuro-inspired, unsupervised deep predictive neural networks for robotic applications that are approximately 5,000%-22,000% faster (depending on the network configuration) than state-of-the-art (SOA) dense approaches, with comparable performance.

Each model in this new family of network architectures, dubbed LightEfference (LE) (Chapter 2), DeepEfference (DE) (Chapter 2), Multi-Hypothesis DeepEfference (MHDE) (Chapter 3), and Inertial DeepEfference (IDE) (Chapter 4) respectively, performs early fusion of heterogeneous information. With these architectures, I show how embedding extra-visual information meant to encode an estimate of an embodied agent's immediate intention supports efficient computations of visual constancy and odometry and greatly increases computational efficiency compared to comparable single-modality SOA approaches.

The remainder of this chapter is organized as follows: Section 1.2 describes the motivations and background for this dissertation, and Section 1.3 summarizes the remaining chapters.

1.2 Motivations and Background

In this section, I provide a limited discussion of the theoretical motivations for this work. Additional detail can be found in Chapter 2, Chapter 3, and Chapter 4.

As alluded to earlier, visual positional constancy is a well-studied phenomenon that grants us a window into the brain's ability to separate the self from the other in low-level, sensorimotor representations.

Experimental evidence and theoretical analysis have suggested that the maintenance of constancy is achieved through various physiological mechanisms including efference copy (or "effort of will") (Holst and Mittelstaedt, 1950; Sperry, 1950), saccadic suppression (Matin, 1974; Noda, 1975; Ross et al., 2001; Thilo et al., 2004; Volkmann, Schick, and Riggs, 1968), and low-latency matching of sequential visual sensory information (Deubel, Schneider, and Bridgeman, 1996; Deubel, Koch, and Bridgeman, 2010).

Despite conflicting experimental results and analysis on visual positional constancy in the neuroscience literature, I argue that a synthesis of the landmark theory (Deubel, Bridgeman, and Schneider, 2004; Deubel, Koch, and Bridgeman, 2010) and a relaxed version of classical efference copy (EC) (Holst and Mittelstaedt, 1950; Sperry, 1950) is better able to pose solutions to the problem of visual constancy and the larger question of how biological systems separate the self from the other.

1.2.1 Saccadic Suppression

Early theories of saccadic visual positional constancy relied on saccadic suppression, a phenomenon where visual perception seems to be inhibited during a saccade (Matin, 1974; Noda, 1975; Ross et al., 2001; Thilo et al., 2004; Volkmann, Schick, and Riggs, 1968).
Saccades are among the human body's fastest movements and can reach speeds of up to 1000 degrees per second (Bridgeman, Heijden, and Velichkovsky, 1994). A longstanding view of saccadic eye movements has held that saccades are ballistic, open-loop movements with time courses too fast to benefit from ongoing modulation by sensory feedback, and thus generally unable to use feedback control and sensory information.

1.2.1.1 Mathematical Representation

Using a theory relying on saccadic suppression for visual constancy, constancy could then be confirmed post-saccade by comparing a post-saccadic state to an estimate of a predicted post-saccadic state. This can be expressed as:

C = \left\| X^{t+1} - F[X^t] \right\|^2, \qquad Y = \begin{cases} \text{True}, & \text{if } C \le \phi \\ \text{False}, & \text{if } C > \phi \end{cases} \qquad (1.1)

where F[\cdot] represents the desired relationship that transforms the state X^t (pre-saccade) to the state X^{t+1} (post-saccade), and \phi is some threshold under which constancy is maintained. An appropriate choice of the function F[\cdot] would then produce a mapping that transforms the pre-saccadic state X^t to the post-saccadic state X^{t+1}, and values of C greater than \phi would signify failures of constancy.

1.2.1.2 Failures of Saccadic Suppression

In conflict with suppressive accounts of transsaccadic visual constancy, studies have found that subsets of V1 neurons fail to exhibit classical suppression during microsaccades and saccades (Troncoso et al., 2015; McFarland et al., 2015). Additionally, while perceptually humans report experiencing saccadic suppression, this does not preclude the possibility that perisaccadic visual information is available to the visual system. Indeed, recent work in oculomotor adaptation has shown the dependence of saccadic adaptation on visual information briefly presented perisaccadically (Panouillères et al., 2016). This finding suggests not only a link between perisaccadic visual information and sensorimotor control, but that perisaccadic visual information is actively used for sensorimotor control (additional detail on the limitations of saccadic suppression is discussed in Section 1.2.3).

1.2.2 Efference Copy

Classically, a forward model with an efference copy (EC) (Holst and Mittelstaedt, 1950; Sperry, 1950) has been proposed as the relational mechanism, and extra-visual signals have been extensively documented around the time of saccades (Bremmer et al., 2009; Duhamel, Colby, and Goldberg, 1992; Ibbotson and Krekelberg, 2011; Kagan, Gur, and Snodderly, 2008; Kayama et al., 1979; Rajkai et al., 2008; Sommer and Wurtz, 2006).

EC has been theorized as a critical component of biological sensorimotor control and has been used to explain a myriad of observed sensorimotor phenomena including visual positional constancy. The traditional theory of EC relies on the direct motor commands that produce actions being used to generate the EC and has generally supposed that the effects of ECs result in suppression of neural activity. For example, the vestibulo-ocular reflex (VOR) is a reflex arc that activates eye muscles to maintain an image in the center of view during head motions detected by the vestibular system. While the VOR triggers automatic compensatory movements, self-generated eye movements initiated during a head movement can prevent the VOR from controlling gaze. Without an EC or an EC-like mechanism, the VOR would instead override any intentional eye movement and leave us unable to self-direct gaze during head movements.
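To make the comparator in Eq. 1.1 concrete, the following is a minimal sketch of the thresholded constancy test for an arbitrary forward model. The identity forward model, the toy state vectors, and the threshold value are illustrative placeholders rather than quantities drawn from this dissertation's experiments.

```python
import numpy as np

def constancy_check(x_post, x_pre, forward_model, phi):
    """Comparator of Eq. 1.1: predict the post-saccadic state from the
    pre-saccadic state and test the squared error against a threshold phi.
    Returns (C, constancy_maintained)."""
    prediction = forward_model(x_pre)
    C = float(np.sum((x_post - prediction) ** 2))  # C = ||x_{t+1} - F[x_t]||^2
    return C, C <= phi

# Toy example: an identity forward model predicts "no change" across the saccade.
identity_model = lambda x: x
x_pre = np.array([1.0, 2.0])
x_post_stable = np.array([1.01, 1.98])   # small discrepancy: constancy maintained
x_post_moved = np.array([3.0, -1.0])     # large discrepancy: constancy violated

print(constancy_check(x_post_stable, x_pre, identity_model, phi=0.05))
print(constancy_check(x_post_moved, x_pre, identity_model, phi=0.05))
```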
In now-famous experiments with the fly Eristalis, von Holst and Mittelstaedt (Holst and Mittelstaedt, 1950) demonstrated a major shortcoming of classical reflex chain theory and introduced their principle of reafference. Describing the prevailing view of the central nervous system (CNS) as that of "a sort of automat, which reflexly delivers a given ticket when a particular coin is inserted in it", they were among the first to realize and demonstrate the biological necessity of separating reafferent (resulting from the organism's actions) from exafferent (resulting from the external world) sensory information and its consequences for sensorimotor processing.

Studying the optokinetic reflex of the fly Eristalis, von Holst and Mittelstaedt saw that reafferent and exafferent sensory information were quantitatively and qualitatively the same, and yet organisms are nonetheless able to distinguish between the two. A classical example of the optokinetic reflex can be demonstrated by placing the fly Eristalis inside of a hollow cylinder that is painted with vertical black and white stripes. With the fly hovering inside the cylinder, rotation of the cylinder causes the fly to rotate itself in the same direction to maintain the same view (i.e., it rotates its body so the projections of the stripes on its retina remain identically located). However, the fly is still able to initiate self-generated motion and independently rotate itself without snapping back to its starting point at the cessation of movement, somehow overriding the optokinetic reflex.

Von Holst and Mittelstaedt challenged the classical explanation that the optokinetic reflex is simply inactive during spontaneous movements by rotating the fly's head 180 degrees and repeating the experiments. By essentially creating a positive feedback loop, the voluntary movement behaviors previously seen in the normal fly were no longer seen: the rotated fly would spin in tight circles until eventually freezing. They thus concluded that the optokinetic reflex is not simply inactive during spontaneous movements but instead that the sensory consequences of an action are modulated based on very specific sensory feedback expectations in response to an action's associated motor commands.

They theorized that motor commands associated with a self-generated movement were provided to the CNS through what they termed an EC of the motor command. Their work suggested that the CNS anticipated self-generated sensory signals and led to a theory of EC that has been incorporated into forward models of sensorimotor processing in many systems and in many organisms. The theory broadly encompasses signals generated in motor regions that target circuits engaged in sensory processing. The advantage of the motor system providing the CNS with an EC is that the EC is available even before the movement begins, whereas sensory information is available only afterward.

1.2.2.1 Mathematical Representation

In a simplified linear case, as has often been used to model forward dynamics in the visuomotor system (Wolpert, Ghahramani, and Jordan, 1995; Mehta and Schaal, 2002; Miall, 1995; Wolpert, Ghahramani, and Jordan, 1994), F[\cdot] would incorporate motor information U^t and might take the form

F[X^t, U^t] = A X^t + B U^t \qquad (1.2)

where X^t is a representation of the current state, and A and B are constant state and input matrices, respectively.
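As a worked instance of the linear forward model in Eq. 1.2, the sketch below predicts a post-saccadic state from a current state and an efference-copy-like motor command. The matrices A and B and the example values are illustrative assumptions chosen only to show the structure of the computation, not parameters used anywhere in this dissertation.

```python
import numpy as np

def linear_forward_model(x_t, u_t, A, B):
    """Eq. 1.2: predict the next state from the current state x_t and an
    efference-copy-like motor signal u_t, F[x_t, u_t] = A x_t + B u_t."""
    return A @ x_t + B @ u_t

# Illustrative 2D retinal state (e.g., a target's position on the image plane)
# and a 1D motor command encoding a rightward saccade of 3 units.
A = np.eye(2)                      # the external scene is assumed static
B = np.array([[-1.0], [0.0]])      # a rightward saccade shifts the projection leftward
x_t = np.array([4.0, 1.0])
u_t = np.array([3.0])

x_pred = linear_forward_model(x_t, u_t, A, B)
print(x_pred)  # expected post-saccadic position: [1.0, 1.0]

# Because A x_t and B u_t are simply summed, the motor term is independent of the
# visual state -- the uncoupling noted in the paragraph that follows.
```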
However, experimental results and theoretical analysis suggest that a purely motor-based prediction is insufficient for perceptual visual constancy and fails to explain post-saccadic target blanking (Deubel, Schneider, and Bridgeman, 1996), afterimage effects following saccadic movements (Grüsser, Krizič, and Weiss, 1987), and, in general, the observed sensory dependence of saccades (Bridgeman, Hendry, and Stark, 1975; Deubel, Bridgeman, and Schneider, 2004; Deubel, Koch, and Bridgeman, 2010; Laidlaw and Kingstone, 2010; Walker, McSorley, and Haggard, 2006). The failure to incorporate sensory dependence can be predicted from Eq. 1.2, where X^t and U^t are uncoupled. This suggests a non-linear formulation of Eq. 1.2 would be needed to explain the myriad of sensorimotor dependencies observed experimentally to affect saccadic visual constancy.

1.2.3 Failures of Efference Copy (EC)

Recent experimental evidence and theoretical analysis have challenged EC as a mechanism of visual constancy. As mentioned earlier, the classical theory of EC relies on motor commands themselves being used to generate the copy.

In the case of vocalization, this suggests that the information provided to the auditory cortex should directly reflect the actual action being undertaken. However, results from the auditory cortex suggest that the EC signals may instead correspond to the expected results of an action (Niziolek, Nagarajan, and Houde, 2013).

An implication of EC is that the separation of reafferent from exafferent sensory feedback is performed by means of neural response depression, where primary sensory afferents that correspond to reafferent sensory information are depressed. However, subsequent experimental results showing mixed suppressive and enhancement sensory processing effects in the auditory cortex during self-induced vocalization (Behroozmand et al., 2009; Eliades and Wang, 2008; Flinker et al., 2010; Greenlee et al., 2011; Eliades and Wang, 2003) and new theoretical analysis on postural stability (Bridgeman, 2007; Feldman, 2009; Ostry and Feldman, 2003) have demonstrated that the physiological mechanisms presumably used to separate the reafferent from the exafferent are far more complex than simple subtractions. Recently, neurons in the primate striate cortex have been shown to exhibit biphasic response modulation during saccadic eye movements that results in a period of reduced activity followed by enhancement (McFarland et al., 2015). However, in McFarland et al., the authors also found a small subset of striate neurons (677) that exhibited a reversed response of enhanced excitation followed by inhibition. In Troncoso et al. (Troncoso et al., 2015), a subset of striate neurons (16145) also failed to exhibit the suppressive response modulation observed for microsaccades in the majority of neurons. Collectively, these studies point to more complex sensorimotor interactions and encodings than the simple shunting or linear subtraction previously imagined.

EC also fundamentally fails to explain phenomena associated with saccadic suppression. Studies on saccadic suppression have shown that when saccadic visual targets are artificially shifted during the saccade, under certain circumstances, the visual target can still appear stable while instability is perceived for surrounding objects (Deubel, Bridgeman, and Schneider, 2004; Deubel, Koch, and Bridgeman, 2010).
This is in conflict with theories of EC, as EC does not allow for differences in perception of objects within a single visual scene.

Investigations in retinal afterimage displacement during saccades present another line of evidence against EC as the sole mediator of visual constancy. Grüsser et al. (Grüsser, Krizič, and Weiss, 1987) had subjects perform auditory saccades in a dark room from a bright light that was turned off at saccadic onset to a new fixation location cued by an auditory tone. When subjects saccaded at rates below one saccade per second, afterimages were perceived to displace with an amplitude consistent with the saccadic amplitude. In other words, the afterimage appeared to shift across the retinal plane and was perceived at an egocentrically consistent retinal position. However, above frequencies of 1.5 saccades per second, the perception of afterimage displacement decreased with saccadic frequency. Failures in egocentric visual constancy at saccadic rates well within normal behavior further suggest that EC alone is unable to provide the perceptually consistent world we experience and that an additional mechanism dependent on the currently perceived visual scene is necessary. The dependence of displacements in retinal coordinates on saccadic frequency further suggests that efference alone is unable to provide an accurate forward model to predict displacements in the visual scene caused by saccades.

The inaccuracy of saccadic control (reviewed in (Bridgeman, 2007)) suggests that the brain does not have a sufficiently high-accuracy motor forward model to enable visual positional constancy based solely on an EC of a motor signal and instead, other supplementary mechanisms must at a minimum be used in conjunction with an EC-based forward model. For example, when a stimulus at the target location of a saccade is shifted perisaccadically, target displacement may not be detected up to a threshold of one-third of the total saccadic displacement (Bridgeman, Hendry, and Stark, 1975; Deubel, Schneider, and Bridgeman, 1996). If extra-visual forward modeling were the only mechanism by which visual constancy is determined, then the visual system would have no idea where targets in the visual world were located if positional errors of one-in-three could be tolerated. If constancy can be maintained in the face of such a large error, then the brain must use some additional mechanism to determine constancy and localization.

Investigations into saccadic drifts emphasize the dependence of saccadic trajectories on visual sensory information. Saccadic trajectories were originally thought to represent the maximal acceleration and velocity profiles physiologically achievable by the eye (Clark and Stark, 1975; Enderle and Wolfe, 1987), but subsequent experimental characterization of the structured error in saccadic end-points (e.g., the speed-accuracy trade-off) (Abrams, Meyer, and Kornblum, 1989) established alternate models of saccadic trajectories including minimum variance (Harris and Wolpert, 1998; Harwood, Mezey, and Harris, 1999), smoothness maximizing (Wiegner and Wierzbicka, 1992), and optimal control models (Harris, 1998; Harris and Wolpert, 2006). While saccadic trajectories generally exhibit a propensity to curve toward either the horizontal or vertical meridians, visual context influences both the magnitude and direction of trajectory curvature (Laidlaw and Kingstone, 2010).
Depending on the stimuli present at the time of the saccade, saccadic trajectories may curve toward or away from distracting stimuli, and the relationship between trajectory deviation and saccade latency shows an increasing linear trend (Walker, McSorley, and Haggard, 2006). The predictability of the target location has also been found to be important in modulating the direction of distractor-related trajectory deviations. If a target location is predictable, then saccades reliably deviate away from the distractor. If, however, the location is unpredictable, then trajectory deviation follows the pattern dictated by the latency of the saccade (Walker, McSorley, and Haggard, 2006). These interactions suggest a sensory contextual dependence of saccadic visual constancy.

1.2.4 Alternate Theories and Synthesis

Results from the experiments listed in the previous sub-sections have influenced new theoretical models of sensorimotor interactions to explain phenomena traditionally attributed solely to EC. For visual positional constancy, the landmark theory and active sensing/referent control (RC) (Feldman, 2009) provide alternatives to classical EC and are discussed below. I argue that a synthesis of a relaxed version of classical EC that inherits attributes from RC and the landmark theory is better able to pose solutions to the problem of visual constancy.

1.2.4.1 Referent Control (RC) and High-Level Interactions

In referent control and active sensing (Feldman, 2009), direct copies of motor commands are not theorized to be used; rather, new positions are set with referent command signals (RCS) that indicate a desired position for the body in some generalized coordinate system. The active sensing theory parallels EC in that visual information is not needed (Feldman, 2016). Additionally, the sensorimotor system would still need to pre-determine how the world would appear post-saccade, implying that both EC and active sensing necessitate highly accurate forward models to explain human behavioral data. While active sensing allows for part of the model to be abstracted to other areas of the brain (Feldman, 2016), it still requires an extremely accurate forward model to predict resulting sensory information following a referent shift. This implies that the brain contains a sufficiently high-accuracy model of eye movements and is able to carefully predict the outcomes of saccadic eye movements and the resulting sensory perceptions, which conflicts with findings to the contrary (see Section 1.2.3 for a review).

Extra-visual RCSs may share features of both decisions and intentions (as they are described in (Carruthers, 2015)). Decisions are momentary events that give rise to intentions. Intentions then initiate corresponding chains of actions to satisfy the original decision. Intentions exist throughout the entire process of acting. There are arguments for describing the extra-visual RCSs as either decisions or intentions: a decision in that information may not need to be retained throughout the entire saccade but only encoded at its beginning, and an intention in that intentions are presumably more specific as they are hierarchically closer and more similar to action and thus may include more accurate information on the spatial and temporal characteristics of the transformation from a pre-saccadic state to a post-saccadic state.
While decisions and intentions can be either for the here-and-now or for the future (Carruthers, 2015), it is generally thought that the extra-visual information being embedded through either ECs or RCs always represents what an agent is immediately about to attempt. Things may not go as planned (e.g., internally, the noisy motor commands may muck things up or, externally, an object may move), but it is certain that an attempt will at least be made. In this case, the resulting behavior can be thought of as reflexive from the time at which the extra-visual signal is generated. However, it remains unclear if contributions from higher-level mechanisms (e.g., working memory or some type of high-level efference copy (HLEC)-like mechanism containing higher-level cognitive or semantic information) are also included in the extra-visual signals that bias the early sensory system around the time of self-generated action. For example, saccadic target selection is tied to higher-level reasoning processes (e.g., attention, goal direction, etc.). While the specific mechanism addressed in the models is reflexive and of a character closer to that of the here-and-now¹, this does not preclude the possibility of using HLEC extra-visual information. For example, in Chapter 4, I experiment with using multiple extra-visual, heterogeneous information sources in a single model and hope to extend the architectures to use new sources of higher-level information that will provide the feed-forward network with context.

¹An implementation intention, or an intention that is moved immediately to the here-and-now when some implementing condition has been met (Carruthers, 2015), is perhaps closer in describing the extra-visual information primarily used in this dissertation.

1.2.4.2 Landmark Theory

The landmark theory (Deubel, Koch, and Bridgeman, 2010) proposes a solution to visual constancy emphasizing the importance of sensory context. The theory suggests that the brain uses visual landmarks to update a visual constancy map following a saccade. By incorporating stimulus context dependence, the theory posits that visual objects present pre-saccade and post-saccade can act as landmarks to localize objects across the visual scene.

The landmark theory suggests an alternative explanation for the sensory dependence of visual constancy and posits that, rather than using an efference copy, the visual system determines constancy by matching a sparse gist of landmarks from the previous fixation point to the new fixation point. Should the points in the first scene be inconsistently projected to the second (presumably according to some threshold), then consistency is not maintained. Here, the state representation is in the visual domain and thus theoretically able to provide the sufficiently complex state representation needed to explain observed experimental results.

1.2.4.3 Mathematical Representation

However, the landmark theory fails to adequately include the feed-forward (Rajkai et al., 2008; Zirnsak and Moore, 2014; Joiner, Cavanaugh, and Wurtz, 2013; Burr, Morgan, and Morrone, 1999) and extra-visual (Bremmer et al., 2009; Duhamel, Colby, and Goldberg, 1992; Ibbotson et al., 2008; Kagan, Gur, and Snodderly, 2008; Kayama et al., 1979; Rajkai et al., 2008; Sommer and Wurtz, 2006) interactions observed experimentally, as well as recent results suggesting visual space constancy maintenance through optimal fusion of visual and extra-visual information (Ostendorf and Dolan, 2015).
This can be emphasized by formalizing a landmark model as:

C = \min_{\theta} \sum_{i,j} \left\| p_j^{t+1} - F[p_i^t; \theta] \right\|^2, \qquad Y = \begin{cases} \text{True}, & \text{if } C \le \phi \\ \text{False}, & \text{if } C > \phi \end{cases} \qquad (1.3)

where p_j^{t+1} is a point in a retinal location post-saccade, p_i^t is a corresponding point in a retinal location pre-saccade, F[\cdot] is again a function that transforms a point p^t to a point p^{t+1} according to parameters \theta, and \phi is some threshold under which constancy is maintained. The landmark theory fails to offer a satisfactory characterization of the relational function F[\cdot] in Eq. 1.3 and fails to provide an explanation of the role of feed-forward extra-visual information.

The landmark theory has two large consequences for visual constancy and odometry: (1) only sparse information from previous fixation points is needed transsaccadically; and (2) motor-related information may not be needed for constancy.

First, very little visual information need be carried between saccades to effectively re-localize objects in the visual scene and maintain egocentric and exocentric visual constancy. A sparse set of landmarks, or key-points, should then be chosen that best allow matching between two scenes. While estimates of the amount of transsaccadic information needed to enable visual constancy remain elusive, the landmark theory assumes at most a sparse gist from previous visual locations is needed. This would mean that following a saccade, the brain must match the gist of previous features to the current scene and generate features for the current scene. Optimally, features should be selected in a frame that will not only remain present in the second frame, but will also be maximally discriminative in that second frame. If operating with limited memory, features should be selectively chosen to maximize inter-frame matching.

Second, feed-forward, intentional information can constrain the scene matching process to provide a better informed decision. The landmark theory discards feed-forward intentional input and generally fails to explain how feed-forward information is used to maintain constancy. This is contrary to the abundance of experimental evidence of extra-visual and feed-forward effects on visual processing during saccades in the early visual system (Burr, Morgan, and Morrone, 1999; Joiner, Cavanaugh, and Wurtz, 2013; Rajkai et al., 2008; Zirnsak and Moore, 2014; Bremmer et al., 2009; Duhamel, Colby, and Goldberg, 1992; Ibbotson et al., 2008; Kagan, Gur, and Snodderly, 2008; Sommer and Wurtz, 2006; Sylvester and Rees, 2006). An unconstrained key-point match between two images across a large temporal window and spatial extent is at least exponentially complex (Brox, Malik, and Bregler, 2009). If all keypoints correspond to static features in both scenes, then the matching problem can be greatly simplified by framing the 2D optical flow problem as a 1D disparity estimation problem, thus turning a 2D exponentially complex problem into a 1D polynomially complex problem (Brox, Malik, and Bregler, 2009). Alternatively, by constraining matching to a particular region of the visual scene, computational requirements can also be reduced. Previous work has applied extra-visual feedback signals from IMUs or GPS (Maimone, Cheng, and Matthies, 2007) and signals from an elementary motion model assuming constant velocities (Davison, 2003) to constrain the matching process. Either considering only static features or constraining the matching process to be consistent with a narrow range of transforms could lead to increased performance relative to computational requirements and processing time.
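The following sketch instantiates the landmark objective of Eq. 1.3 for the special case in which F[p; θ] is restricted to a 2D image-plane translation, so that θ has a closed-form least-squares solution. The landmark coordinates and threshold are invented for illustration and do not come from the datasets used later in this dissertation.

```python
import numpy as np

def landmark_constancy(p_pre, p_post, phi):
    """Sketch of Eq. 1.3 with F[p; theta] restricted to a 2D translation:
    fit theta by least squares over matched landmarks, then threshold the
    residual to decide whether constancy is maintained."""
    theta = np.mean(p_post - p_pre, axis=0)          # least-squares translation
    residuals = p_post - (p_pre + theta)
    C = float(np.sum(residuals ** 2))                # minimized matching cost
    return theta, C, C <= phi

# Illustrative landmarks: the whole scene shifts consistently by (5, -2) pixels,
# so the fitted translation explains the displacement and constancy holds.
p_pre = np.array([[10.0, 20.0], [40.0, 15.0], [25.0, 30.0]])
p_post = p_pre + np.array([5.0, -2.0])
print(landmark_constancy(p_pre, p_post, phi=1.0))

# If one landmark moves independently, it drags the fitted transform and inflates
# the residual, so the constancy test fails -- the bias discussed above.
p_post_moving = p_post.copy()
p_post_moving[2] += np.array([12.0, 9.0])
print(landmark_constancy(p_pre, p_post_moving, phi=1.0))
```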
To further explain the computational consequences of the landmark theory and begin to introduce potential solutions, the next subsection will discuss computer and machine vision approaches to solving the similar problem of visual odometry (VO) (for a more in-depth review, see Fraundorfer and Scaramuzza, 2012).

1.2.5 Visual Odometry

Visual odometry (VO) can be formalized by imagining a rigidly attached camera moving through an environment and recording images at discrete time instants k. The set of images taken by such a camera is then I_{0:n} = \{I_0, ..., I_n\}. Two sequential camera images taken at times k-1 and k are related by the transform T_{k,k-1} \in \mathbb{R}^{4 \times 4} of the form:

T_{k,k-1} = \begin{bmatrix} R_{k,k-1} & t_{k,k-1} \\ 0 & 1 \end{bmatrix} \qquad (1.4)

where R_{k,k-1} \in SO(3) is the rotation matrix and t_{k,k-1} \in \mathbb{R}^{3 \times 1} is the translation vector. The main task of VO is to compute the camera transformations T_{k,k-1} from the images I_k and I_{k-1} and then concatenate these transformations to recover the most recent camera pose C_n. A camera pose is determined by beginning with a known camera pose C_0, computing the next camera pose with C_1 = C_0 T_{1,0}, and proceeding iteratively until the current camera pose C_n = C_{n-1} T_{n,n-1} is found.

There are two general approaches to VO. Global methods use the entire image, and feature-based methods only use particular image keypoints or features. Feature-based approaches are both faster and more accurate and will be the focus of this discussion. Ego-motion can be estimated by matching points in a visual scene between two sequential frames and optimizing for a velocity that would result in the points in one scene projecting to their observed locations in the next scene. Approaches can be generalized as feature-tracking and feature-matching. Feature-matching approaches independently find features in all image frames and then match features between frames. Feature-tracking approaches find features in a frame and then detect their locations in subsequent frames. Feature-tracking approaches are best when each frame is spatially close to the previous frame, and feature-matching is best for large translations between frames.

A common feature-tracking approach is the Kanade-Lucas-Tomasi (KLT) tracker (Tomasi and Kanade, 1991) (which is based on the more general Lucas-Kanade (LK) method (Lucas, Kanade, et al., 1981)). Assuming brightness constancy and small motion between frames, optical flow can be calculated by solving for the velocities in the x and y directions (u and v, respectively) given the spatial image gradient at a point and the temporal derivative, using a truncated Taylor series expansion:

0 = I_t(p_i) + \nabla I(p_i) \cdot \begin{bmatrix} u \\ v \end{bmatrix} \qquad (1.5)

This equation is under-determined as there are two unknowns. By including the additional constraint that motion in a neighborhood of pixels should be approximately the same, an overdetermined system of equations can be found:

A d = b \qquad (1.6)

where

A = \begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_n) & I_y(p_n) \end{bmatrix}, \quad d = \begin{bmatrix} u \\ v \end{bmatrix}, \quad b = -\begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_n) \end{bmatrix} \qquad (1.7)

A least-squares minimization of d of the form A^T A d = A^T b will thus provide a reasonable estimate of u and v, i.e., the transform that best explains the movement of features from the first image to the second.
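The sketch below assembles the system of Eqs. 1.5-1.7 for a single image patch and solves the normal equations A^T A d = A^T b via a standard least-squares routine. The synthetic image pair and patch location are stand-ins for real data; a practical KLT implementation would additionally use image pyramids, iterative refinement, and feature selection.

```python
import numpy as np

def lk_patch_flow(img1, img2, y, x, half_win=7):
    """Solve the Lucas-Kanade normal equations (Eqs. 1.5-1.7) for a single
    patch centered at (y, x): stack spatial gradients into A, temporal
    differences into b, and solve A^T A d = A^T b for d = [u, v]."""
    Iy, Ix = np.gradient(img1)                 # spatial image gradients
    It = img2 - img1                           # temporal derivative
    sl = (slice(y - half_win, y + half_win + 1),
          slice(x - half_win, x + half_win + 1))
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)  # equivalent to solving A^T A d = A^T b
    return d                                   # estimated [u, v]

# Synthetic example: a smooth intensity pattern translated by (u, v) = (1, 2) pixels.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
img1 = np.sin(0.1 * xx) + np.cos(0.08 * yy)
img2 = np.sin(0.1 * (xx - 1.0)) + np.cos(0.08 * (yy - 2.0))
print(lk_patch_flow(img1, img2, y=32, x=32))   # approximately [1.0, 2.0]
```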
However, the solution to the above system of equations will only approach optimality in the presence of an otherwise static scene. In the presence of independent motion, optimization will be biased by the inconsistent transforms of independently moving feature points compared to the transforms of static points in the scene whose motion is wholly influenced by ego-motion. This is partially because the least-squares approach of the LK-based solutions (Lucas, Kanade, et al., 1981) allows only for zero-mean Gaussian noise in the underlying image data. Structured noise, such as errors from inconsistent motion profiles, violates this condition. Additionally, independently moving points in the scene move inconsistently with their statically moving neighbors, adding additional error to the least-squares fit. Outlier detection can help mitigate the effects from independently moving objects. However, the problem of outlier removal is equivalent to the NP-complete maximum clique problem and thus cannot be solved optimally. Approximate solutions are computationally expensive and can still fail with sufficiently large amounts of independent motion.

Feature-matching approaches similarly fall prey to independent motion. Solutions that best explain a transform from points in one scene to points in another scene will be biased by independent motion and cause any estimate of ego-motion to be biased by inconsistent movement profiles stemming from mixed static and dynamic objects in the visual scene. Furthermore, an accurate matching method must ensure that the same features are present pre- and post-saccade with a high probability. A moving object that may be out of the field of view post-saccade may then be a poor choice as a key-point. If using a sparse transsaccadic memory to transfer landmarks between saccades and aid in localization, then the presence of even a small number of landmarks that correspond to dynamic objects could greatly bias ego-motion estimates.

1.2.6 Feed-Forward Feature Selection

The early visual system could benefit from information on which areas of the visual scene correspond to static features and which to dynamic features. Problems of differentiating between areas of the image plane composed of static versus independent features and choosing features to compare in the future can both be solved with predictive forward modeling in a fused sensorimotor space based on vision and extra-visual inputs. Indeed, evidence from the early visual system suggests a feed-forward solution to this problem. While a data-driven approach is possible, we see changes in the early visual system before the onset of saccadic eye movements (Duhamel, Colby, and Goldberg, 1992; McFarland et al., 2015), suggesting a forward approach where extra-retinal signals (potentially from the superior colliculus (SC)) modulate neural activity in accordance with expected future stimuli. Similarly, results from V4 and IT in primates show a shift in receptive field location pre-saccade (Zirnsak and Moore, 2014; Connor, 2001; Ni, Murray, and Horwitz, 2014). Computationally, forward sensorimotor approaches could allow the early visual system to selectively determine features that will best enable global re-localization following a particular action.

While perceptually humans report experiencing saccadic suppression, this does not preclude the possibility that perisaccadic visual information is available to the visual system. Indeed, recent work in oculomotor adaptation has shown the dependence of saccadic adaptation on visual information briefly presented perisaccadically, demonstrating that not only is information available to the visual system, but that it is actively used (Panouillères et al., 2016).
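As an illustration of the kind of feed-forward feature selection described in this section, the sketch below uses a coarse, efference-like prediction of the global image shift caused by an intended movement to discard keypoints that are expected to leave the field of view. The keypoints, predicted shift, and margin are hypothetical values, and the simple in-view rule is a simplification of the idea rather than a mechanism implemented in this dissertation.

```python
import numpy as np

def select_keypoints(keypoints, predicted_shift, img_shape, margin=5):
    """Illustrative feed-forward feature selection: given keypoint locations
    (x, y) and a coarse prediction of the global image shift caused by an
    intended movement, keep only keypoints expected to remain inside the
    field of view (with a safety margin) after the movement."""
    h, w = img_shape
    predicted = keypoints + predicted_shift
    in_view = ((predicted[:, 0] >= margin) & (predicted[:, 0] < w - margin) &
               (predicted[:, 1] >= margin) & (predicted[:, 1] < h - margin))
    return keypoints[in_view], predicted[in_view]

# A planned leftward camera rotation is predicted to shift image content ~80 px
# to the right; keypoints near the right border would leave the frame and are dropped.
keypoints = np.array([[20.0, 60.0], [300.0, 120.0], [580.0, 200.0]])
kept, expected = select_keypoints(keypoints,
                                  predicted_shift=np.array([80.0, 0.0]),
                                  img_shape=(480, 640))
print(kept)      # the keypoint at x=580 is discarded
print(expected)  # predicted post-movement locations, usable to constrain matching
```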
1.2.7 Neural Networks

An important aspect of learning in a convolutional neural network (CNN) can be thought of as a sub-sampling or noise-reduction problem (Bradley, 2010). Modules in a network can remove signal components that are unrelated to the current task and enhance task-relevant components. One way this can be accomplished is by filtering a signal to reduce information along certain dimensions while preserving, or enhancing, information along others. In other words, by devoting additional bandwidth to certain dimensions of a signal and reducing bandwidth for other dimensions, a signal can be compressed by selectively preserving relevant information. In contrast with most uses of CNNs, the predictive task of this sensorimotor domain requires the network not just to learn a condensed representation of the sensory input space, but instead to selectively learn a compressed space that best allows the application of an efferent-related transform to a target location.

Complex learning machines can be built by stacking filtering modules on top of one another to form a network where each layer transforms an input into a representation at a higher and more abstract level by enhancing relevant information and attenuating irrelevant information. However, applying linear transforms from stacked linear filters is equivalent to a transform by a single linear filter, and a linear classifier can only carve its input space into very simple regions (namely half-spaces separated by a hyperplane (LeCun, Bengio, and Hinton, 2015)). As most real-world problems are non-linear and inadequately approximated with purely linear functions, including a non-linearity after each linear filter has become a common method to enhance a network's discriminative abilities and make its response simultaneously sensitive to minute details and insensitive to large irrelevant variations such as background noise.

A CNN is trained using stochastic gradient descent (SGD) and the back-propagation algorithm, where the loss E is computed and the error gradient is then computed at each layer i with respect to that layer's weight parameters W_i. Weights are updated according to

V_i^{t+1} = \mu V_i^t - \alpha \nabla E(W_i^t) \qquad (1.8)

W_i^{t+1} = W_i^t + V_i^{t+1} \qquad (1.9)

where V_i^{t+1} is the update applied to the weights W_i^t, and \mu and \alpha are hyperparameters representing the momentum and learning rate, respectively (Bottou, 2012).

Computing E at the final layer is relatively straightforward, as we know the true label of the target and can easily compute the error between the output layer's prediction and the true label. To illustrate this, if a network of stacked filters includes n layers, then the final layer of the network (the output layer) produces output activations:

X_n = F_n(X_{n-1}, W_n) \qquad (1.10)

where X_{n-1} is the input to layer n, W_n are the filter weights, and F_n is the function that filters X_{n-1} with W_n. For a dense output, the loss E of the output layer can be computed by taking the L2 error between the network output and the ground-truth training label L:

E = \| L - X_n \|^2 \qquad (1.11)

Then the remaining layers of the network can similarly be defined as

X_i = F_i(X_{i-1}, W_i), \quad \forall i \in [1, n] \qquad (1.12)
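As a concrete, toy instance of Eqs. 1.8-1.11, the sketch below trains a single linear layer with momentum SGD in NumPy. The layer sizes, learning rate, and momentum are arbitrary illustrative choices and are unrelated to the settings used for the networks in later chapters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single "layer" F(X, W) = W X trained to match a target mapping L = W_true X,
# using the L2 loss of Eq. 1.11 and the momentum SGD update of Eqs. 1.8-1.9.
W_true = rng.normal(size=(2, 3))
W = np.zeros((2, 3))          # weights W_i
V = np.zeros_like(W)          # momentum buffer V_i
mu, alpha = 0.9, 0.001        # momentum and learning rate hyperparameters

for step in range(500):
    X = rng.normal(size=(3, 8))           # a mini-batch of inputs
    L = W_true @ X                        # ground-truth training "labels"
    X_out = W @ X                         # forward pass, Eq. 1.10
    E = np.sum((L - X_out) ** 2)          # loss, Eq. 1.11
    grad_W = -2.0 * (L - X_out) @ X.T     # gradient of E with respect to W
    V = mu * V - alpha * grad_W           # Eq. 1.8
    W = W + V                             # Eq. 1.9

print(np.round(W - W_true, 3))            # should be close to zero after training
```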
At lower levels (i.e., every layer except the final output layer), we have no explicit training signal or label with which we can calculate intermediate errors. However, we do have the error from layer n and thus can compute \partial E / \partial X_i for i = n-1. Applying the chain rule, we can then compute \partial E / \partial W_i with:

\frac{\partial E}{\partial W_i} = \frac{\partial E}{\partial X_i} \frac{\partial F_i(X_{i-1}, W_i)}{\partial W_i} \qquad (1.13)

By again applying the chain rule, we can compute \partial E / \partial X_{i-1} as:

\frac{\partial E}{\partial X_{i-1}} = \frac{\partial E}{\partial X_i} \frac{\partial F_i(X_{i-1}, W_i)}{\partial X_{i-1}} \qquad (1.14)

We then have recurrence equations that allow computation of the gradient of the loss function with respect to each layer's weight parameters.

SGD relies on the ability to estimate the gradient of each model parameter and model input with respect to overall model error. However, this requires not only that each function implemented in the network be differentiable and at least nearly smooth; approaching optimality in a learning task also requires that these transforms preserve sufficient information about their inputs. At the very least, the network needs a mechanism to estimate the true generating input so as to better estimate the non-linear transformation function in order to optimize parameters with respect to that function. This can also be thought of in terms of identifiability, where the range of a function's identifiability can be considered a correlate of the function's ability to reconstruct its input. This partially explains the successes of the rectified linear unit (ReLU) in deep networks trained with SGD, where its inclusion has led to increased classification performance (He et al., 2014; He et al., 2015; Maas, Hannun, and Ng, 2013; Sermanet et al., 2013; Simonyan and Zisserman, 2014). The ReL function is generally implemented as:

Y = \max(0, WX) \qquad (1.15)

where W could be a scalar or vector of weights and X is some input. Taking \partial Y / \partial X yields

\frac{\partial Y}{\partial X} = \begin{cases} W, & \text{if } WX > 0 \\ 0, & \text{if } WX \le 0 \end{cases} \qquad (1.16)

where the derivative is W when WX is greater than zero and zero when WX is less than or equal to zero. Thus, when the output is greater than zero, the ReLU non-linearity preserves substantial information about its input through its gradient. Put another way, the linear operation with a subsequent ReL is identifiable when the unit is active, leading to enhanced differentiability. Additionally, with a random network initialization, only 50% of units in the network are active at a given time (Glorot, Bordes, and Bengio, 2011).

1.3 Chapter Summaries

1.3.1 Chapter 2 - DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence

Chapter 2, titled DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence, introduces the DeepEfference architecture. DeepEfference is a bio-inspired, unsupervised, deep sensorimotor network that learns to predict the sensory consequences of self-generated actions. DeepEfference computes dense image correspondences (Scharstein and Szeliski, 2002) at over 500 Hz and uses only a single monocular grayscale image and a low-dimensional extra-modal motion estimate as data inputs. Designed for robotic applications, DeepEfference employs multi-level fusion via two parallel pathways to learn dense, pixel-level predictions and correspondences between source and target images. Quantitative and qualitative results from the SceneNet RGBD (McCormac et al., 2016) and KITTI Odometry (Geiger et al., 2013) datasets are presented and demonstrate an approximate runtime decrease of over 20,000% with only a 12% increase in mean pixel matching error compared to DeepMatching (Revaud et al., 2016) on KITTI Odometry.
26 1.3.2 Chapter 3 - A Deep Neural Network Approach to Fusing Vision and Heteroscedastic Motion Estimates for Low-SWaP Robotic Ap- plications Chapter 3, titled A Deep Neural Network Approach to Fusing Vision and Het- eroscedastic Motion Estimates for Low-SWaP Robotic Applications, presents Multi- Hypothesis DeepEfference (MHDE) which is a multi-hypothesis extension of the DeepEf- ference architecture that learns to intelligently combine noisy heterogeneous sensor data to predict several probable hypotheses for the dense, pixel-level correspondence between a source image and an unseen target image. MHDE is augmented to handle dynamic, het- eroscedastic sensor and motion noise and compute hypothesis image mappings and pre- dictions at 150-400 Hz depending on the number of hypotheses being generated. MHDE fuses noisy, heterogeneous sensory inputs using two parallel architectural pathways and n (1, 2, 4, or 8 in this work) multi-hypothesis generation subpathways to generate n pixel- level predictions and correspondences between source and target images. I evaluated MHDE on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against DeepEfference (Shamwell, Nothwang, and Perlis, 2017) and DeepMatching (Revaud et al., 2016) by root mean squared (RMSE) pixel error and runtime. MHDE with 8 hypothe- ses outperformed DeepEfference in root mean squared (RMSE) pixel error by 103% in the maximum heteroscedastic noise condition and by 18% in the noise-free condition. MHDE with 8 hypotheses was over 5,000% faster than DeepMatching with only a 3% increase in RMSE. 27 1.3.3 Chapter 4 - An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics Chapter 4, titled An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics, introduces Inertial DeepEfference (IDE), which builds upon the DeepEfference and Multi-Hypothesis DeepEfference architectures by using raw sensor information. IDE is unique in that it uses real sensor data for an end- to-end trainable dense correspondence network that is orders of magnitude faster than other SOA deep approaches. We evaluated IDE on the EuRoC MAV dataset (Burri et al., 2016) and on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against the deformable spatial pyramid (DSP) (Kim et al., 2013) and DeepMatching (Re- vaud et al., 2016) dense correspondence approaches. The IMU+surrogate feed-forward motor signal (FFMS) IDE network was 167x faster than DM and 516x faster than DSP. 28 Chapter 2: DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspondence 2.1 Abstract As the human eyeball saccades across the visual scene, humans maintain egocentric visual positional constancy despite retinal motion identical to an egocentric shift of the scene. Characterizing the underlying biological computations enabling visual constancy can inform methods of robotic localization by serving as a model for intelligently inte- grating complimentary, heterogeneous information. Here we present DeepEfference, a bio-inspired, unsupervised, deep sensorimotor network that learns to predict the sensory consequences of self-generated actions. DeepEfference computes dense image correspon- dences (Scharstein and Szeliski, 2002) at over 500 Hz and uses only a single monocular grayscale image and a low-dimensional extra-modal motion estimate as data inputs. 
De- signed for robotic applications, DeepEfference employs multi-level fusion via two parallel pathways to learn dense, pixel-level predictions and correspondences between source and target images. We present quantitative and qualitative results from the SceneNet RGBD (McCormac et al., 2016) and KITTI Odometry (Geiger et al., 2013) datasets and demon- strate an approximate runtime decrease of over 20,000% with only a 12% increase in 29 mean pixel matching error compared to DeepMatching (Revaud et al., 2016) on KITTI Odometry. 2.2 Introduction (A) (B) (C) (D) (E) Figure 2.1: Sample results from KITTI Odometry. A: Sample source image. B: Sample target image. C: DeepEfference output reconstruction of the source image in A using pixel intensity values sampled from the target image in B. D and E: Source and target images with marked correspondence points computed by DeepEfference. For an autonomous agent (be it robotic or organic), understanding how self-produced 30 actions affect the environment is critically important to survival and successful operation in the real-world. Similarly important is an agent’s understanding of how its actions affect its sensory perceptions. In the case of visual positional constancy, this corresponds to an agent’s ability to separate motion across the retinal plane induced by self-motion (e.g., from a saccade) from motion induced externally (e.g., from a charging predator). A comparable understanding of the perceptual consequences of self-induced actions could be used by autonomous robots to measure action-based and perceptual anomalies (among others). Take for example the act of turning to the left. In the case of the former, this action should result in an object within the visual field-of-view (e.g., a soda can) shifting to the right on the imaging plane by a commensurate amount. If this shift does not occur, it could mean that the action was not properly performed and that there could be a problem with the system’s actuators. For the latter, the same expectation violation might mean that the soda can moved independently (e.g., was blown by the wind or kicked by a passerby) and subsequently, would be a poor choice as a landmark for visual dead-reckoning1. We argue that biological mechanisms supporting visual constancy contain a rich, egocentric representation of the environment and with appropriate models and compu- tational architectures, these representations can be extracted to enable enhanced robotic 1While the immediate focus of this paper is on sensorimotor modeling and prediction in the visual domain, this work is situated within a broader class of issues including that of how an agent can respond appropriately to anomalies in a complex world (see (Kohashi and Oda, 2008; Korn and Faber, 2005) for Mauthner Cell anomaly detectors in teleost fish; (Brody, Perlis, and Shamwell, 2015; Brody et al., 2016) for EC in the auditory domain for human-robot interaction; (Kumar et al., 2015; Irani and Anandan, 1998; Nelson, 1991) for independent motion detection). 31 visual navigation and localization (e.g., dead-reckoning). Humans maintain perceptual stability and visual constancy despite the 3-4 saccades the human eyeball undergoes per second. 
While the shift in the projection of the visual world on the retina elicited by a saccade is identical to the shift that alternatively would be elicited by a quick, external shift of the visual world, humans are able to perceptually distinguish between the two conditions and perceive a stable world in the first, and a moving world in the second. The apparent conundrums of human visual positional constancy can be resolved when considering humans as complex, embodied agents with access to information from multiple overlapping sensory modalities including vision, audition, proprioception, ‘thought perception’(Bhargava et al., 2012; Shamwell et al., 2012), and intentional/motor informa- tion (Holst and Mittelstaedt, 1950; Sperry, 1950; Brody, Perlis, and Shamwell, 2015). For example, Efference Copy (EC) (Holst and Mittelstaedt, 1950) and the closely related Corollary Discharge (CD) (Sperry, 1950) neural theories have long been implicated in the brain’s ability to maintain visual positional constancy trans-saccadically. EC and CD posit that early sensory centers access extra-modal information about intended actions to influence subsequent sensory processing by priming early sensory centers with a prior on expected incoming sensory signals. We drew inspiration from the theories of EC and CD in developing a computational solution to robotic visual localization. Similar to how EC can be used by biological systems to estimate expected sensory information, robotic systems often have access to information with which they can glean an estimate of ego-motion from a separate, non- visual modality. If intelligently integrated, this extra-visual estimate can serve as a prior on expected post-movement visual perceptions and improve visual motion estimates. 32 Paralleling the biological theory of EC where visual processing centers receive motion/intention-related information to aid in sensory processing, we have designed Deep- Efference as an unsupervised, feed-forward, heterogeneous, deep network that computes dense correspondence (Scharstein and Szeliski, 2002) and performs next-frame prediction at over 500 Hz. Critical to achieving this update rate, DeepEfference uses monocular images and only processes the source image from each pair. The network learns (x,y) pixel locations of where to sample in the target image to best reconstruct the source image. This translates to learning which pixels in the source image best correspond to the target image, and thus, a correspondence mapping between source and target images. The remainder of the paper is organized as follows: Section 2.3 describes the moti- vations for this work; Section 2.4 outlines the DeepEfference network architecture; Sec- tion 2.5 describes the datasets and experiments used for validation; Section 2.6 discusses results from the validation experiments; and Section 2.7 offers concluding thoughts and directions for future work. 2.3 Background 2.3.1 Deep Approaches to Spatial Transformation Encoding and Learn- ing Learning spatial transformations and relationships between successive images has been a topic of great interest both in computer vision and robotics and deep, bio-inspired, solutions have already begun to show promise for the correspondence problem (see (Scharstein 33 and Szeliski, 2002) for a review of the correspondence problem). 
In computer vision, multiplicative interactions have been used to great success for relationship learning be- tween images (Memisevic and Hinton, 2007; Ranzato and Hinton, 2010; Memisevic and Hinton, 2010; Hinton, Krizhevsky, and Wang, 2011; Kivinen and Williams, 2011; Memi- sevic, 2013). However, both the initial and transformed image are required as inputs and there is no readily-available means to provide the model extra information from another modality as a motion prior. Both points (but in particular the latter) have implications for the correspondence problem and image relationships for robotics. These and other deep approaches (Revaud et al., 2015; Revaud et al., 2016) to spatial transformation encoding rely on siamese-like networks where both source and target images are available and computed on. If we want to deploy deep approaches on SWaP-constrained systems, networks require significant size reductions. 2.3.2 Extra-Visual Motion Estimates For any two visual measurements taken successively, robots often have an indepen- dent measurement of self-motion between those two images. These measurements could come from IMUs, GPS, LIDAR, ultrasonic ranging sensor, or input motor commands. When estimating motion based on the movement of feature points on the visual imaging plane, additional non-visual motion estimates could be used as a prior for estimating cam- era motion. Similar in spirit to this work is (Ciliberto et al., 2012) where heteroscedastic models were learned for independent motion detection for an actuated camera. However, camera motions were limited to pure rotations, which are not affected by varying depths 34 within a scene. Additionally, (Ciliberto et al., 2012) was constrained to use Gaussian process models and was not end-to-end trainable. 2.4 Approach Deconvolutional decoderConvolutional encoder Source image Local Pixel Shifts Reconstructed source image Bilinear Sampler Transform input Localization pathway 2D Affine matrix Spatial transformer Global pixel shifts Target image for sampling Figure 2.2: DeepEfference network diagram showing the linked global and local learners DeepEfference is an unsupervised, deep heterogeneous neural network that learns to predict how source images correspond to unseen (i.e., unprocessed) target images. Rather than learning how to transform each pixel (e.g., via a fully-connected layer), we employ a trainable 2D spatial transformer to impose a global estimate of image motion. Inspired by the Landmark Theory of visual positional constancy (Bridgeman, Heijden, and Velichkovsky, 1994), DeepEfference carries only a sparse gist of the previous visual scene forward and instead uses the currently perceived image from which to sample. As shown in Fig. 2.2, DeepEfference has two interconnected pathways: one for de- 35 termining the global 2x3 affine 2D transformation matrix, and a second encoder-decoder pathway that predicts local, pixel-level shifts to be applied to the affine-transformed im- age. The network does not generate images from scratch, but rather learns how to sample from a target image to recreate the initial image. Given a source image and an estimated transform, DeepEfference learns coordinates (x, y) at which to sample in a target image to reconstruct the source image. The result is a correspondence map between pixel loca- tions in the source image and pixels in the target image (see Fig. 2.1 for example learned correspondences). 
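To make the sampling-as-correspondence idea concrete, the short NumPy sketch below uses a hand-built sampling map in place of the coordinates DeepEfference would predict; reconstructing the source by gathering target pixels at those coordinates and reading the same map as a displacement field are one and the same operation. The network itself uses differentiable bilinear sampling rather than the nearest-neighbour gather shown here.

# Illustrative only: 'sample_x'/'sample_y' stand in for the coordinates the
# network would learn; here they encode a hand-built 2-pixel horizontal shift.
import numpy as np

H, W = 64, 64
target = np.random.rand(H, W)              # target image I_t
ys, xs = np.mgrid[0:H, 0:W]                # pixel grid of the source image

# Dense sampling map: for each source pixel, where to look in the target.
sample_x = np.clip(xs + 2, 0, W - 1)
sample_y = ys

# Reconstruction of the source by gathering target pixels at those locations.
reconstruction = target[sample_y, sample_x]

# The same map, read the other way, is a dense correspondence field:
# source pixel (y, x)  <->  target pixel (sample_y[y, x], sample_x[y, x]).
flow_x = sample_x - xs                     # per-pixel displacement in x
flow_y = sample_y - ys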
2.4.1 Training and Loss Rule DeepEfference is trained to minimize reconstruction errors between a given source image and a reconstruction of that source image generated by selectively sampling from a target image. We compute Euclidean error and use it to train the network via backprop- agation. DeepEfference is trained by minimizing the following loss function: L(θ, It, Is) = argmin θ ‖Ir(θ, It)− Is‖2 (2.1) where Ir is an image reconstruction, It is the image target, and Is is the image source being reconstructed. 2.4.2 Pathway 1: Local, Pixel-Level Shifter DeepEfference’s first pathway provides localized, object-level shift information. We implemented this pathway as a convolutional-deconvolutional encoder-decoder. The 36 encoder compresses the source image through a series of convolutional filtering opera- tions and the decoder generates magnitudes of pixel shifts by expanding the compressed convolutional outputs using deconvolutions2. We used five convolutional layers followed by five deconvolutional layers. All convolutional and deconvolutional layers used filters of size 3, pad of 1, and stride of 2. The first convolutional layer outputted 32 feature maps and the number of output maps doubled for each subsequent convolutional layer with the fifth and final convolutional layer outputting 512 feature maps. The output sizes of the generative deconvolutional layers were arranged oppositely with the first layer outputting 512 maps and the final layer outputting 32 maps. The local, pixel-level shifts of DeepEfference are similar to mappings learned by the recent view synthesis method (Zhou et al., 2016) that has been used to render new, unseen views of objects and scenes. Besides different network structures and inclusion of a global spatial transformer module, the largest difference between their method and our own is that rather than learning to generate novel viewpoints of objects or scenes, we learn how to reconstruct a source image using pixel locations in a target image. 2.4.3 Pathway 2: Global Spatial Transformer While the output of the first encoder/decoder pathway provides estimates of lo- calized, pixel-level movement to account for depth, non-rigidity, etc., the second spa- tial transformer (ST) pathway provides an estimate of the global transformation between source and target images. 2We use the term ’deconvolution’ as is common in the deep learning literature but the operation we use is more properly referred to as a transposed convolution 37 ST modules (Jaderberg et al., 2015) enable parametrized geometric transformations to be applied to inputs or intermediate feature-maps in deep networks. While the param- eters for the geometric transformation can either be learned or provided to the network as an input, we provided the network with an estimate of the true 3D transformation between source and target images (δx, δy, δz, δα, δβ, δγ) and used four fully-connected layers (each followed by a rectified linear unit (ReLU) (Glorot, Bordes, and Bengio, 2011)) to approximate the true linear 3D warp matrix as a 2D affine transformation. However, the failure of a 2D ST-only approach is seen with translational camera movements in scenes with varying depth. Following a camera translation, the new loca- tion of an object in the image frame will depend on its distance from the camera: objects that are closer to the camera exhibit greater displacements on the imaging plane compared to objects further from the camera. 
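Before turning to the ST pathway, the encoder-decoder pathway of Section 2.4.2 can be sketched roughly as below using tf.keras; the actual implementation used Caffe, and the final two-channel output head standing in for the per-pixel (x, y) shifts is an assumption added here for illustration.

# Rough sketch (not the Caffe implementation used in this work).
import tensorflow as tf

def local_shift_pathway(input_shape=(224, 224, 1)):
    """Five stride-2 convolutions (32 -> 512 maps) mirrored by five
    transposed convolutions (512 -> 32 maps), filter size 3 (Section 2.4.2)."""
    x = inp = tf.keras.Input(shape=input_shape)
    for maps in (32, 64, 128, 256, 512):                       # encoder
        x = tf.keras.layers.Conv2D(maps, 3, strides=2, padding="same",
                                   activation="relu")(x)
    for maps in (512, 256, 128, 64, 32):                       # decoder
        x = tf.keras.layers.Conv2DTranspose(maps, 3, strides=2, padding="same",
                                            activation="relu")(x)
    # Two output channels standing in for per-pixel (dx, dy) shifts (assumed).
    shifts = tf.keras.layers.Conv2D(2, 3, padding="same")(x)
    return tf.keras.Model(inp, shifts)

model = local_shift_pathway()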
A purely 2D affine transformation in the absence of depth cannot accurately warp an image with varied scene depths and thus the localization pathway can at best learn parameters that correspond to a dominant plane of a fixed depth. As we show, ST modules can be used to efficiently embed action or motor in- formation in a standard deep network that may then be trained end-to-end with back- propagation. These motor-related signals can be derived from high-level actions, referent signals to PID controllers, GPS, or IMU measurements. We implemented an ST module in Caffe (Jia et al., 2014) using Nvidia CUDA Deep Neural Network library (cuDNN) primitives. We created two layers, one to perform the affine transformation and output target coordinates and a second layer that performs the bilinear sampling given coordinates and an image. Three fully connected layers take the input estimate 3D camera pose transformation and generate a 2x3 2D affine transforma- 38 tion matrix for the spatial transformer module. Although the sampling component of our ST module takes an input image as input, no learnable parameters are based on image content and thus our global pathway is a function only of the input transformation esti- mate and is image content-independent. 2.5 Experiments We primarily experimented with two different network architectures. The first ar- chitecture, LightEfference, only used the first global pathway. The second architecture, DeepEfference, implemented both the global pathway and the local pathway. LightEf- ference and DeepEfference were evaluated on the SceneNet RGB-D (McCormac et al., 2016) and KITTI Odometry (Geiger et al., 2013) datasets and compared against corre- spondence matching results from the SOA DeepMatching approach (Revaud et al., 2016). We also experimented with a third architecture that only used the local pathway. However, networks trained with this architecture failed to converge or decrease network loss (see Section 2.7 for additional discussion on this point). SceneNet RGB-D (McCormac et al., 2016) is a dataset of 5 million photo-realistically rendered images from a dynamically moving camera in a total of 15 different scenes. Im- ages in SceneNet are rendered at 1 Hz and groundtruth camera pose and depth are pro- vided at each camera exposure. All objects in the visual scenes are rigid, thus fulfilling the static scene assumption and allowing for ground truth to be computed from scene depth and camera position (described in Section 2.5.1). The KITTI Visual Odometry dataset (Geiger et al., 2013) is a benchmark dataset 39 for the evaluation of visual odometry and LIDAR-based navigation algorithms. Images in KITTI were captured at 10 Hz from a Volkswagen Passat B6 as it traversed city, resi- dential, road, and campus environments. Groundtruth poses at each camera exposure are provided by an RTK GPS solution and depth is provided with coincident data from a Velo- dyne laser scanner. Groundtruth pixel projections were calculated just as for SceneNet. 2.5.1 Experimental Methods For SceneNet and KITTI, data was separated into train (80%) and test (20%) sets. For SceneNet RGB-D, we used a total of 44, 850 image pairs with 80% (35, 880) for training and 20% (8, 970) for testing. For KITTI, we used a total of 23, 190 image pairs with 80% (18, 552) for training and 20% (4, 638) for testing. In all experiments, we randomly selected an image for the source, used the successive image for the target, and subtracted the two 6-DOF camera poses for the transform input. 
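Returning briefly to the ST module of Section 2.4.3, its two custom layers can be sketched in plain NumPy as follows: an affine layer that maps a normalized source-pixel grid through a 2x3 matrix to target sampling coordinates, and a bilinear sampler that reads the target image at those coordinates. The image and the near-identity transform below are placeholders, not values from the experiments.

import numpy as np

def affine_grid(theta, H, W):
    """Apply a 2x3 affine matrix to a normalized [-1, 1] pixel grid,
    returning (x, y) sampling coordinates for every output pixel."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])   # 3 x (H*W)
    xt, yt = theta @ grid                                       # 2 x (H*W)
    return xt.reshape(H, W), yt.reshape(H, W)

def bilinear_sample(img, xt, yt):
    """Bilinear sampling of img at normalized coordinates (xt, yt)."""
    H, W = img.shape
    x = (xt + 1) * 0.5 * (W - 1)            # back to pixel coordinates
    y = (yt + 1) * 0.5 * (H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2); x1 = x0 + 1
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2); y1 = y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])

img = np.random.rand(224, 224)
theta = np.array([[1.0, 0.0, 0.05],         # illustrative near-identity warp
                  [0.0, 1.0, 0.00]])
xt, yt = affine_grid(theta, 224, 224)
reconstruction = bilinear_sample(img, xt, yt)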
For each image in each dataset, we cropped the middle 224x224 pixel region for network inputs. Predicted pixel correspondences between source and target images were evalu- ated against groundtruth correspondence and SOA DeepMatching correspondence pre- dictions. With access to scene depth and true camera pose for both KITTI and SceneNet, groundtruth pixel shifts were calculated by applying a 3D warp to 3D pixel locations in the source images to generate the expected pixel locations in the target images. We projected each 3D point in the frame of camerat0 to the world frame using the derived projection matrix for camerat0 and then reprojected these points in the world frame to camerat1 using the inverse projection matrix for camerat1. Finally, we transformed points in the 40 frame of camerat1 to the image plane. This resulted in a correspondence map between pixel locations in camerat0 and camerat1 for each point where depth was available (e.g., where ray tracing did not go to infinity in the case of SceneNet or depth was outside of the Velodyne laser scanner’s range for KITTI). We evaluated DeepEfference and LightEfference using keypoints generated us- ing the feature points detected by DeepMatching and from the accelerated segment test (FAST) feature detector (Rosten, Porter, and Drummond, 2010). For each type of feature, we measured how the keypoints detected in the source images were projected into the target images. The projection errors were compared to groundtruth projections and were used to determine mean pixel errors for each method. We trained DeepEfference and LightEfference for 500, 000 iterations on KITTI Odometry scenes 1 − 11 and a subset of SceneNet RGBD (10 randomly selected tra- jectories of 300 image pairs for each of the 15 different scene types). We used the Adam solver with batch size=32, momentum=0.9, momentum=0.99, gamma=0.5, and a step learning rate policy of 100, 000 for all experiments. We used a Euclidean loss rule to train all networks. All experiments were performed with a Nvidia Titan X GPU and the Caffe deep learning framework (Jia et al., 2014). 2.6 Results Tab. 2.6 details the comparative runtimes between DeepMatching, LightEfference, and DeepEfference3. Fig. 2.3 and Tab. 2.2 detail the predictive error for LightEffer- 3We used a CPU version of DeepMatching for these comparisons. The latest available version of the GPU implementation of DeepMatching took over 7 seconds per image to run on our workstation on a 41 Figure 2.3: Pixel error boxplots for DM, LE, and DE using DM and FAST keypoints. Y-axis is actual mean pixel error. Middle lines are the medians and whiskers indicate 1.5 interquartile of the lower and upper quartiles. Note outliers are not shown for clarity and instead minimum and maximums are presented in the table below. ence, DeepEfference, and DeepMatching on the KITTI Odometry and SceneNet RGBD datasets. Using keypoints generated by DeepMatching on KITTI, DeepEfference shows a 1,100% performance increase in mean pixel error over LightEfference (significant with t(2.94e5) = 60.01, p < 1e−5)4 while DeepMatching showed a 12% increase over Deep- Efference (significant with t(5.33e5) = 31.33, p < 1e−5). When using FAST keypoints, DeepEfference outperformed LightEfference by 1,200% on KITTI odometry (significant with t(6.18e5) = 87.87, p < 1e−5). However, in runtime performance, LightEfference was 447% faster than DeepEfference, and DeepEfference was over 23,000% faster than DeepMatching. 
256x256 image so we elected to use the faster CPU version for all experiments 4The distributions of pixel errors appeared non-uniform but as we had greater than 200,000 samples to test in each condition, we elected to include t-test analysis. We used Welch’s t-test where the degrees of freedom are approximated by Satterthwaite’s method. 42 Table 2.1: Average runtimes for DM, LE, and DE Mean Standard Deviation Median Frames Per Second DM 0.35225 sec. 0.0094525 sec. 0.351561 2.8 LE 0.000332 sec. 2.95788e-05 sec. 0.000318 3012 DE 0.0014864 sec. 2.03606e-05 sec. 0.00148484 672 The performance gap between LightEfference and DeepEfference narrowed on the SceneNet dataset. DeepEfference outperformed LightEfference by 11% (significant with t(2.78e6) = 58.65, p < 1e−5) and was outperformed by DeepMatching by 240% (sig- nificant with t(2.70e6) = 382.56, p < 1e−5) using DeepMatching generated keypoints. With FAST keypoints, DeepEfference performed only 13% better than LightEfference (significant with t(4.76e6) = 68.64, p < 1e−5). The performance differences between DeepEfference and DeepMatching may be influenced by the differences in movement statistics between the SceneNet and KITTI datasets. For SceneNet, rotational speeds5 (in deg/s) were typically much larger (mean=5.14, std=9.82, median=2.843, min=0.039, max=177.84) while translational speeds (in m/s) were smaller (mean=0.16, std=0.091, median=0.14, min=0.003, max=0.66). Motions in KITTI followed reversed distributions where rotational speeds (in deg/s) were typically small (mean=0.997, std=6.65, median=0.086, min=0.0001, max=85.72) while transla- tional speeds (in m/s) were much larger (mean=9.22, std=4.20, median=9.04, min=0.005, 5SceneNet only contains renders of 1 in every 25 frames so these quantities are based on the differences in positions between successively rendered frames 43 max=26.41). The larger variety of movements in SceneNet may have proved too difficult for the current version of DeepEfference to learn a consistent motion model. Future work will include experiments on deeper versions of DeepEfference. The similar performance of LightEfference and DeepEfference on SceneNet may have been influenced by the depth of objects in the dataset. Image scenes in KITTI had both larger and more varied depths (mean=12.97, std=10.005, median=9.46, min=5.00, max=79.99) compared to SceneNet (mean=3.92, std=2.45, median=3.40, min=0.00, max=19.99). This might explain why LightEfference’s average pixel prediction error was within 11- 13% of DeepEfference as errors from translations of objects at different depths would have been smaller. For example, the green boxes in Fig. 2.4 highlight an unusual instance in SceneNet where LightEfference was unable to rectify the depth differences between the foreground features and background features. The poorer performance of DeepEfference and LightEfference on SceneNet may also be due to shifts between successive images resulting in objects in the source image no longer being present in the target image. In SceneNet, transformations between successive camera frames often resulted in occlusions of large areas of the field of view which may have led DeepEfference and LightEfference to incorrectly sample from the target images. We were surprised by the large performance difference between DeepEfference and LightEfference on the KITTI dataset. 
As LightEfference is composed entirely of fully-connected layers and neither network uses dropout, one possibility is that LightEfference overfit to the training dataset. This is unlikely, as DeepEfference's training Euclidean loss on KITTI was ≈50 at 500,000 training iterations while LightEfference's loss was ≈5x larger at ≈250. This suggests that LightEfference was also performing poorly on the training data and thus most likely not overfitting.

A second possibility is that the large range of depths in KITTI scenes prevented the limited, 2D-only transformations of LightEfference from learning a single coherent transformation model. This possibility is supported by the large number of outliers produced by LightEfference, which suggests that LightEfference was unable to successfully process the full range of input transforms. For DeepEfference, the mean pixel error scores at the 5th, 25th, 50th, and 99th percentiles were percentile(5) = 0.45, percentile(25) = 1.07, percentile(50) = 1.74, and percentile(99) = 11.69, while for LightEfference they were percentile(5) = 0.51, percentile(25) = 1.22, percentile(50) = 2.21, and percentile(99) = 1109.48. While LightEfference has higher error scores across percentiles, it is the score at the 99th percentile that demonstrates its large number of high-error outliers, which cause its mean pixel error to be an order of magnitude greater than that of DeepMatching and DeepEfference.

2.7 Discussion

We have shown that providing a network with heterogeneous inputs and combining a parametrized global transformation pathway with a pixel-level, local pathway allows for far more computationally efficient predictions with minimal degradation in predictive performance.

Agents must understand which elements in the environment their actions do and do not have the power to affect. A potentially powerful future use for DeepEfference lies in teaching systems what they can and cannot interact with. While deep learning approaches have traditionally been limited in their applications by their need for large, annotated training sets, DeepEfference's ability to learn without supervision can allow robots to learn meaningful sensorimotor relationships via bootstrapping, simply by operating in an environment.

While the aim of DeepEfference is to generate correspondences between pixels in the source image and pixels in the target image, an unintended side-effect of the predictive training is the generation of image areas where there is no actual overlap between source and target images. Several of these cases are shown in Fig. 2.4. In the first row of Fig. 2.4, DeepEfference learned to sample from different areas in the target image to imagine what the front of the van looked like despite it not being present in the target image. The same can be seen in the second row, where DeepEfference imagines what the left side of the house looked like.

Performance of the dual-pathway DeepEfference architecture surpassed the global-pathway-only LightEfference architecture in all experimental conditions. As mentioned briefly, we also attempted to use a local-pathway-only architecture, but networks trained with this local-only architecture failed to converge. Additional work is needed to determine why these networks failed to learn, but we suspect that it is due to the fully-connected layers attempting to learn a complex pixel-level transform beyond their capacity.

Currently, the actual displacements of the camera between source and target frames are used as transform inputs to DeepEfference.
One possible source for this information in real-world robotic applications is from IMUs. However, constant velocity motions may prove difficult for DeepEfference if the expected transforms are being generated from IMU signals. In these cases, it may instead be possible to use a motor command as a 46 surrogate transform signal, but this has yet to be investigated. Finally, the transform estimates fed to DeepEfference are computed from ground- truth camera poses and do not exhibit noise characteristics that will most likely be found in real-world applications where we will rarely, if ever, have access to a comparably clean extra-visual motion measurement. One possibility for overcoming measurement noise is to expand DeepEfference to produce n image reconstruction predictions and include an additional decision node that chooses the best reconstruction. DeepEfference’s runtime could allow for many possible reconstructions to be generated similar to learning and sampling from a noise distribution. 47 Table 2.2: Pixel errors for DM, LE, and DE on KITTI and SceneNet KITTI DeepMatching Keypoints Mean Standard Deviation Median Min Max DeepMatching 2.1 2.6 1.6 0.0 187.1 LightEfference 29.2 242.3 2.2 0.0 4661.4 DeepEfference 2.3 3.7 1.7 0.0 268.2 KITTI FAST Keypoints Mean Standard Deviation Median Min Max LightEfference 31.6 261.3 2.0 0.0 4791.2 DeepEfference 2.4 3.7 1.7 0.0 251.3 SN DeepMatching Keypoints Mean Standard Deviation Median Min Max DeepMatching 3.3 16.2 1.3 0.0 2620.9 LightEfference 12.9 19.8 7.8 0.0 2596.1 DeepEfference 11.5 19.2 5.9 0.0 2691.8 SN FAST Keypoints Mean Standard Deviation Median Min Max LightEfference 14.8 28.1 8.0 0.0 2957.8 DeepEfference 13.1 26.2 5.9 0.0 3055.4 48 Figure 2.4: Unusual DE and LE sample results from KITTI and SceneNet. The red boxes highlight areas in the reconstructed image that were imagined by DeepEfference. Green boxes highlight instances where DE was able to better predict object positions in a scene with strong depth contrast while LE generated a poorer reconstruction. The last row shows a failure case where both LE and DE were unable to generate a reconstruction. This is most likely due to the extreme transformation between the camera at the time of source image capture versus target image capture. 49 Chapter 3: A Deep Neural Network Approach to Fusing Vision and Het- eroscedastic Motion Estimates for Low-SWaP Robotic Ap- plications 3.1 Abstract Due both to the speed and quality of their sensors and restrictive on-board com- putational capabilities, current state-of-the-art (SOA) size, weight, and power (SWaP) constrained autonomous robotic systems are limited in their abilities to sample, fuse, and analyze sensory data for state estimation. Aimed at improving SWaP-constrained robotic state estimation, we present Multi-Hypothesis DeepEfference (MHDE) - an unsu- pervised, deep convolutional-deconvolutional sensor fusion network that learns to intel- ligently combine noisy heterogeneous sensor data to predict several probable hypotheses for the dense, pixel-level correspondence between a source image and an unseen tar- get image. This new multi-hypothesis formulation of our previous architecture, Deep- Efference (Shamwell, Nothwang, and Perlis, 2017), has been augmented to handle dy- namic heteroscedastic sensor and motion noise and computes hypothesis image mappings and predictions at 150-400 Hz depending on the number of hypotheses being generated. 
MHDE fuses noisy, heterogeneous sensory inputs using two parallel architectural path- 50 ways and n (1, 2, 4, or 8 in this work) multi-hypothesis generation subpathways to gener- ate n pixel-level predictions and correspondences between source and target images. We evaluated MHDE on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against DeepEfference (Shamwell, Nothwang, and Perlis, 2017) and DeepMatching (Revaud et al., 2016) by mean pixel error and runtime. MHDE with 8 hypotheses out- performed DeepEfference in root mean squared (RMSE) pixel error by 103% in the max- imum heteroscedastic noise condition and by 18% in the noise-free condition. MHDE with 8 hypotheses was over 5, 000% faster than DeepMatching with only a 3% increase in RMSE. 3.2 Introduction The sensing and processing pipelines of autonomous and semi-autonomous robotic systems pose a fundamental limit on how fast these systems may safely travel through an environment. For example, when moving at 20 m/s, a 30 Hz sensor-derived state estimate update rate means that a given robot will travel 0.66 meters between state updates. While traveling those 0.66 meters, the robot will effectively be blind to any unexpected changes in the environment (e.g., a tree branch blown by a wind gust or an unexpectedly opened door). As a result, current size, weight, and power (SWaP) constrained autonomous and semi-autonomous robotic systems are forced to move very slowly through their environ- ments. The slow operational speeds of SWaP-constrained autonomous systems are espe- cially pronounced for mobile robots operating in dynamic, gps-/communications-denied 51 environments where safe navigation must be performed only with on-board sensors and computational resources. For unmanned aerial vehicles (UAVs), navigation is typically performed through a fusion of visual odometry (VO) estimates, inertial measurements, and simplified predictive linear motion models in a Kalman filter framework. These SWaP-constrained VO-pipelines force the use of lightweight feature matching approaches for visual correspondence that are out-performed by computationally heavier SOA ap- proaches. For example, the visual matching algorithm DeepMatching has enabled SOA matching and optical flow (Revaud et al., 2015) but the correspondence-finding step alone can require from 16 seconds to 6.3 minutes per RGB image pair depending on the param- eter regime used for matching (Revaud et al., 2016). For real-time operation on SWaP- constrained systems, correspondence must be computed orders of magnitude faster (e.g., a minimum of 33 ms per matching pair for a 30 FPS camera commonly used for SWaP- constrained robotic applications). We argue that contextual information can greatly reduce the computational burden for image correspondence approaches and enable both higher-quality and lower-latency state estimation. One way to provide context is by fusing measurements from multiple sensory modalities. However, intelligently integrating multimodal information into low- level sensory processing pipelines remains challenging, especially in the case of SWaP- constrained robotic systems. We have previously shown that our architecture DeepEfference (Shamwell, Noth- wang, and Perlis, 2017) can efficiently fuse visual information with motion-related infor- mation to greatly increase runtime performance ( 20, 000%) with minimal performance degradation ( 12%) for dense image correspondence matching. 
However, in our previous 52 Figure 3.1: Sample MHDE outputs from different hypothesis pathways. A-E: MHDE ouputs from 5 pathways. D shows the output from an inactive pathway (i.e. a pathway that the network did not optimize). F-E: Reconstruction error for the hypotheses shown in A-E. From F we can see that the reconstruction shown in A had the lowest error (yellow- dashed box). work, we used motion estimates as inputs to DeepEfference that were accurate to within approximately 10 cm of actual pose. In the real-world, systems will rarely have access to comparatively clean signals. Additionally, real noise sources are often heteroscedastic and input-dependent. With the original DeepEfference’s fast runtime, we saw the possibility of generat- ing many different hypothetical outputs for each input image and then selecting the most accurate at execution time. By learning how to produce n image reconstruction predic- tions, the DeepEfference architecture could be expanded to better handle real-world noise sources. 53 In this work, we introduce Multi-Hypothesis DeepEfference (MHDE) which is an extension of DeepEfference (Shamwell, Nothwang, and Perlis, 2017) that mitigates per- formance impacts of noisy motion estimates. A side-effect of this multi-hypothesis ap- proach is enhanced performance even in the absence of added noise that achieves a mean pixel error within 3% of SOA approaches with an over 5, 000% decrease in runtime. By learning how to generate multiple hypothetical outputs, MHDE can effectively sample the space of possible image transformations. This is enabled by a multi-pathway net- work architecture and novel loss rule that enables the network to explicitly learn multiple, independent network pathways. The remainder of the paper is organized as follows: Section 2.3 describes the back- ground and motivations for MHDE; Section 2.4 outlines our deep network approach to fusing noisy heterogeneous sensory inputs and describes the MHDE architecture; Section 2.5 outlines our experimental and evaluation approaches; Section 2.6 presents our exper- imental results; Section 2.7 discusses the results from Section 2.6; and Section 3.8 offers a summary, concluding thoughts, and directions for future work. 3.3 Background 3.3.1 Visual Odometry and Multi-Sensor Fusion In VO as well as many other vision tasks such as motion understanding and stere- opsis, a key challenge is discovering quantitative relationships between temporally or spatially adjacent images. Within the last decade, bio-plausible approaches for the visual task of object recognition have set new benchmarks and are now the defacto standard. 54 We agree strongly with Memisevic that bio-plausible, local filtering-based approaches similarly hold promise for the correspondence problem (Memisevic, 2013). A known failure mode for visual odometry (VO) is in highly dynamic scenes. Most VO algorithms are subject to the static scene assumption whereby additional error is in- troduced when independently moving points in the scene move inconsistently with their dependently moving neighbors. Feedback outlier detection approaches based on algorithms such as RANSAC (Kitt, Moosmann, and Stiller, 2010) seek to discover the most likely motion that has caused a given transform. However an unconstrained key-point match between two images across a large temporal window and spatial extent is at least exponentially complex (Brox, Ma- lik, and Bregler, 2009). 
By fusing sensor information from separate modalities, we can effectively constrain the matching process. Constraining the matching process to be consistent with a narrow range of trans- forms gleamed from another modality can lead to increased VO performance relative to computational requirements and processing time. Previous work has applied extra-visual feedback signals from IMUs or GPS (Maimone, Cheng, and Matthies, 2007; Agrawal and Konolige, 2006) to constrain the matching process. Simple motion models (Enkelmann, 1991; Davison, 2003) have also been used to predict future images based on previously observed image motion. These approaches have been extended to use quadratic motion models (Lefaix, Marchand, and Bouthemy, 2002) which showed improved performance in specific environments (e.g., on flat roads). However, these models implicitly sacrifice responsiveness as they wait for changes in an underlying sensory distribution rather than detecting dominant motion from a separate extra-visual modality. 55 Deconvolutional decoderConvolutional encoder Source image Local pixel shifts Reconstructed source images Global pixel shifts Localization pathways 2D Affine matrices Spatial transformers Transform input Target image for sampling Sampler Hypothesis i Hypothesis i+i Sampler Figure 3.2: MHDE network diagram with two hypotheses shown for brevity. We experi- mented with up to 8 hypotheses in this work. 3.3.2 Deep Spatial Transformations The correspondence problem describes the challenge of determining how the pix- els in one image spatially correspond to the pixels in another image. Traditionally, the correspondence problem has been tackled with closed-form, analytical approaches (see (Scharstein and Szeliski, 2002) for a review) but recently, deep, bio-inspired, solutions have also begun to show promise. These deep approaches solve the correspondence prob- lem by learning to estimate the 3D spatial transformations between image pairs. In computer vision, siamese-like deep network architectures such as those based on 56 multiplicative interactions have been used successfully for relationship learning between images (Memisevic and Hinton, 2007; Ranzato and Hinton, 2010; Memisevic and Hinton, 2010; Hinton, Krizhevsky, and Wang, 2011; Kivinen and Williams, 2011; Memisevic, 2013). However,there are two problems with these and other deep approaches (e.g. the DeepMatching (Revaud et al., 2015; Revaud et al., 2016) algorithm described earlier) to image transformation learning. First, these approaches require expensive computation on both initial and target images. They employ siamese architectures that require parameter-heavy learning and expensive computations to be performed on both source and target images. For SWaP- constrained robots, the number of computational operations required by these siamese networks must be significantly reduced. Approaches such as L1 and group lasso-based pruning (Han, Mao, and Dally, 2015; Wen et al., 2016; Anwar, Hwang, and Sung, 2015) offer potential mechanisms to reduce the size of networks but fundamentally still require extensive computation on both source and target images. Second, these approaches do not provide a mechanism to include extra information from another modality as a motion prior while maintaining end-to-end trainability. For robotic applications, heterogeneous sensor information is often available that can be lever- aged and may allow for reduced computational constraints and increased performance (see Section 3.3.3). 
57 3.3.3 Extra-Modal Motion Estimates and Heteroscedastic Noise Unlike algorithms in pure computer vision domains, algorithms intended for robotic applications need not rely solely on vision. For example, when estimating a robotic sys- tem’s egomotion by tracking changes in feature point locations on a robot’s camera’s imaging plane, additional non-visual motion estimates can be fused with visual informa- tion(i.e. to bias or serve as a motion prior) to improve egomotion estimation. On real-world systems, additional non-visual motion estimates could be derived from measurements taken from IMUs, GPS, LIDARs, ultrasonic ranging sensors, or the actual input motor commands given to the system. Furthermore, motor errors exhibit heteroscedastic noise properties where larger movements generate larger sources of noise (Ciliberto et al., 2012). Any approach that seeks to leverage extra-modal motion estimates needs to be robust to real-world heteroscedastic noise. 3.4 Approach MHDE is an unsupervised deep heterogeneous neural network that employs multi- ple separable pathways to fuse noisy, heterogeneous sensory information and predict how source images correspond to unseen target images. MHDE effectively reverses the prediction pipeline - rather than using the previous image to reconstruct the future image, it uses the target image to reconstruct the source image. The network receives a noisy estimate of the change in 3D camera position be- tween source and target frame acquisitions and learns: 58 1. 2D affine transformation parameters that are applied as a global spatial transform; and 2. Local, pixel-level shifts that encapsulate aberrations due to varied scene depth, non- rigid scene objects, etc. The affine transformations and localized shifts are learned and applied via two inter- connected architectural pathways: one for determining global 2x3 affine 2D transforma- tion matrices, and a second encoder-decoder pathway that predicts localized, pixel-level shifts that are not captured by the global, approximated 2D affine transformation (see (Shamwell, Nothwang, and Perlis, 2017) for more information on the DeepEfference ar- chitecture). Unlike the original DeepEfference, MHDE generates several hypothetical recon- structions which enable increased robustness to noisy inputs. Thus, while DeepEfference only has two architectural pathways, MHDE has the same two architectural pathways plus n additional hypothesis generation pathways (2− 8 in this work). 3.4.1 Winner-Take-All (WTA) Loss Rule MHDE generates multiple hypothesis reconstructions to enable robustness to stochas- tic, heteroscedastic, input noise such as found in the real-world. The previous DeepEf- ference architecture that generated only a single predicted reconstruction used Euclidean error to train the network by minimizing the loss function L(θ, It, Is) = argmin θ ‖Ir(θ, It)− Is‖2 (3.1) 59 where Ir is an image reconstruction, It is the image target, and Is is the image source being reconstructed. If instead of generating a single reconstruction Ir, the network generated n recon- structions I ir, i ∈ N , the loss rule would need to be expanded to train across all hypothesis pathways in the new network. A naive way to compute error for such a multi-hypothesis network would be to simply sum the Euclidean error from all hypotheses and divide by the total number of hypotheses. 
Then, the network would be trained by minimizing the loss function

L(\theta, I_t, I_s) = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \| I_r^i(\theta, I_t) - I_s \|^2    (3.2)

where I_r^i is a hypothesis image reconstruction and the remaining terms are the same as before.

The naive multi-hypothesis loss rule of Eq. 3.2 would lead the network to optimize all pathways simultaneously with each update. However, this may not be optimal for increased robustness to noise. Effectively, we desire the network to generate distinct predictive hypotheses by sampling from a noise distribution that the network implicitly learns. For example, consider when the network has perfectly optimized the loss function of Eq. 3.2:

L(\theta, I_t, I_s) = \frac{1}{N} \sum_{i=1}^{N} \| I_r^i(\theta, I_t) - I_s \|^2 \approx 0    (3.3)

In this case, \| I_r^i(\theta, I_t) - I_s \|^2 \approx 0 for all i \in \{1, \dots, N\}, which means that the hypothesis reconstructions I_r^i(\theta, I_t) are all approximately equal. As the network is trained and converges to a local minimum, the loss will affect parameters in each pathway approximately equally and drive outputs from all pathways to a common approximate solution. This is the opposite of what we want from MHDE. Effectively, such a loss rule is equivalent to the standard Euclidean loss rule used in (Shamwell, Nothwang, and Perlis, 2017), where a single prediction is generated, and it fails to leverage the multiple outputs that can be generated by MHDE.

To leverage its multiple outputs, we train MHDE using what we call a winner-take-all (WTA) Euclidean loss rule:

I_r^{*}(\theta, I_t) \leftarrow \arg\min_{i} \| I_r^i(\theta, I_t) - I_s \|^2    (3.4)

L(\theta, I_t, I_s) = \| I_r^{*}(\theta, I_t) - I_s \|^2    (3.5)

where I_r^{*} is the lowest-error hypothesis. Loss is then computed only for this one hypothesis and error is backpropagated only to parameters in that one pathway. Now, only parameters that contributed to the winning hypothesis are updated and the remaining parameters are left untouched (a short illustrative sketch appears later in this section).

3.4.2 Pathway 1: Global Spatial Transformer

Spatial transformer (ST) modules (Jaderberg et al., 2015) apply parametrized geometric transformations to feature maps (either data inputs or intermediate outputs) in deep networks. The parameters for these transformations (2D affine transformations in our case) can be directly provided to the network as input or can be learned and optimized alongside the other network parameters (e.g., network weights and biases).

MHDE was provided with estimates of the true 3D transformation between source and target images (δx, δy, δz, δα, δβ, δγ). Note, however, that the visual input to MHDE was a single grayscale source image without any depth information. Even if the provided 3D transformation were noise-free and perfectly accurate, it is not possible to analytically perform a 3D warp (assuming translation) on a 2D image due to unknown scene depth at each pixel location. Thus, MHDE approximated 3D warps as 2D affine transformations through a linear-nonlinear optimization using four fully-connected layers, each followed by an additional rectified linear unit (ReLU) (Glorot, Bordes, and Bengio, 2011) non-linearity layer.

We modified the standard ST module in TensorFlow (Abadi et al., 2015) by splitting the layer into two layers: one to perform the affine transformation on grids of source pixel locations (x_s, y_s) and output target pixel coordinates (x_t, y_t), and a second layer to perform bilinear sampling given pixel coordinates and an image to sample from.
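A minimal NumPy sketch of the winner-take-all selection from Section 3.4.1 (Eqs. 3.4 and 3.5), using random stand-in reconstructions; in the trained network, the winning error is backpropagated only through the parameters of the sub-pathway that produced it.

import numpy as np

rng = np.random.default_rng(0)
H, W, N = 224, 224, 8
I_s = rng.random((H, W))                  # source image being reconstructed
hyps = rng.random((N, H, W))              # N hypothesis reconstructions I_r^i

# Per-hypothesis Euclidean errors ||I_r^i - I_s||^2  (Eq. 3.4)
errors = np.sum((hyps - I_s) ** 2, axis=(1, 2))
winner = int(np.argmin(errors))

# WTA loss: only the winning hypothesis contributes (Eq. 3.5), so gradients
# would flow only into the sub-pathway that produced hyps[winner].
loss = errors[winner]

Essentially the same behaviour can be obtained in TensorFlow by taking the minimum over the per-hypothesis losses, since the gradient of a minimum flows only to the minimizing branch.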
Although the sampling component of our ST module takes an input image as input, no learn-able parameters are based on input image content and thus our global pathway is a function only of the input transformation estimate and is image content-independent. 3.4.3 Pathway 2: Local, Pixel-Level Shifter The pixel-level encoder/decoder pathway refines the ST estimate from the first pathway and provides localized estimates of pixel movement to account for depth, non- rigidity, etc. We implemented this pathway as a convolutional-deconvolutional encoder-decoder. 62 First, the convolutional encoder compresses a source image through a cascade of convo- lutional filtering operations. The output of the convolutional encoder is concatenated with intermediate outputs from the fully-connected layers from the first, global pathway (the black and blue vertical lines in the center of Fig. 3.2). This concatenated representation is then expanded using a deconvolutional decoder to generate n pairs of (xt′ , yt′) pixel loca- tions that are summed with the target pixel coordinates (xt, yt) from the global pathway before bilinear sampling (see (Shamwell, Nothwang, and Perlis, 2017) for more details). 3.5 Experimental Methods We conducted experiments with MHDE using four different noise conditions and four different architectures. All architectures were based on DeepEfference (Shamwell, Nothwang, and Perlis, 2017) and implemented both global pathway and local pathways. MHDE was evaluated on the KITTI Odometry dataset (Geiger et al., 2013) and results were benchmarked against correspondence matching results from the SOA DeepMatch- ing approach (Revaud et al., 2016) (see (Shamwell, Nothwang, and Perlis, 2017) and the Appendix for more information). We experimented with four noise conditions where α was 0.0 , 0.1, 0.25, or 0.5. We trained networks with 1, 2, 4 or 8 hypothesis generation pathways. For each noise and and hypothesis combination, we trained three networks for a total of 48 different networks. 63 Figure 3.3: Heteroscedastic noise as a function of transform magnitude for the X and Y components of the transform input over the test set for a network with a noise parameter α = 0.25. 3.5.1 Noise As shown in Fig. 3.3, we simulated real-world noise conditions by applying het- eroscedastic noise to each transform input. For each transform T = (δx, δy, δz, δα, δβ, δγ), we introduced heteroscedastic noise to create network input T ∗ according to: T ∗ = T +N (0, α √ T ) (3.6) where α was a constant modifier that was either 0.0, 0.1, 0.25, or 0.5. 3.5.2 Evaluation We evaluated MHDE by measuring the mean pixel error of MHDE projections of DeepMatching keypoints from source images to target images. The projection errors for 64 each method compared to groundtruth projections were used to determine mean pixel errors for each method (see (Shamwell, Nothwang, and Perlis, 2017) and Appendix A. for a more thorough explanation of the experimental evaluation). 3.5.3 Training We trained MHDE for 200, 000 iterations on KITTI Odometry scenes 1− 11 for all experiments. We used the Adam solver with batch size=32, momentum1=0.9, momentum2=0.99, gamma=0.5, learning rate=1e − 4, and an exponential learning rate policy for all exper- iments. All networks were trained using our modified WTA loss rule. All experiments were performed with a Nvidia Titan X GPU and Tensorflow (see (Shamwell, Nothwang, and Perlis, 2017) and Appendix B. for a more thorough explanation of training proce- dures). 3.6 Results Fig. 
3.4 shows the performance of MHDE with various maximum hypotheses compared to DM. A network's maximum hypotheses is the maximum number of hypothesis generation pathways a given network was allowed to learn. Because of our WTA loss rule, this does not mean that the network effectively learned how to use all pathways. For example, in Fig. 3.6(d), a network with four maximum hypotheses predominantly trained and used a single pathway.

This can also be seen in Fig. 3.5, where the same results from Fig. 3.4 are plotted as a function of the total active hypotheses. Active hypotheses are hypothesis pathways that performed better than all other pathways for at least one testing exemplar (for reference, Fig. 3.1(d) shows the network output of an inactive pathway).

There is a positive relationship between performance and both maximum hypotheses and active hypotheses. This is true for all noise conditions as well as the no-noise condition. We also see that the rate of improvement when moving from one to six active hypotheses is greater for higher noise levels.

Fig. 3.6 shows the activations by pathway for networks trained with four or eight maximum hypotheses. Surprisingly, we see no strong relationship between active pathways (pathways that produced the best result for at least one test exemplar) and noise level.

Tab. 3.1 details the comparative runtimes between DeepMatching and MHDE with various numbers of hypotheses¹. MHDE runtime scales linearly with the number of hypotheses. Overall, the runtime gains of MHDE compared to DM show that providing a strong prior on camera motion allows for far more computationally efficient image predictions and matchings.

¹We used a CPU version of DeepMatching for these comparisons. The latest available version of the GPU implementation of DeepMatching took over 7 seconds per image on our workstation for a 256x256 image, so we elected to use the faster CPU version for all experiments.

Table 3.1: Average runtimes for DeepMatching (DM) and Multi-Hypothesis DeepEfference (MHDE) with equivalent frames per second (FPS)

         # Hypoth.   Mean         StDev.     Med.         FPS
DM       N/A         0.4115 s     0.00132    0.407 s      2.4
MHDE     1           0.0024 s     0.00008    0.00238 s    417.4
MHDE     2           0.00303 s    0.00009    0.00302 s    330
MHDE     4           0.00422 s    0.00010    0.00421 s    237.2
MHDE     8           0.00675 s    0.00016    0.00677 s    148.2

3.7 Discussion

We have shown the unsupervised learning of correspondence between static grayscale images in a deep sensorimotor fusion network with noisy sensor data.

We were concerned that MHDE networks might only optimize a single pathway. For example, if one pathway consistently produced the lowest estimate error at the beginning of training, then perhaps only that pathway would be updated and thus the network would not be used to its fullest potential. As seen in Fig. 3.6, this generally was not the case, as networks were able to learn to use multiple pathways without intervention outside of the WTA loss rule.

Future work will look at how to include pure sensor measurements (e.g., from an IMU) and how to encourage networks to train and use all available hypothesis pathways. Like its predecessor, MHDE only uses single grayscale images as inputs. Another possible avenue of research is to use multiple images as input, or an LSTM-like architecture to give the network additional temporal context.
Future work will look at how to include pure sensor measurements (e.g., from an IMU) and how to encourage networks to train and use all available hypothesis pathways. Like its predecessor, MHDE only uses single grayscale images as inputs. Another possible avenue of research is to use multiple images as input, or an LSTM-like architecture to give the network additional temporal context.

One of the more important aspects of this network is that it does not generate images from scratch and instead works mostly in the space of pixel locations rather than pixel intensities. Given that geometry is consistent across image domains even though image content varies, this network architecture is a promising candidate for leveraging transfer learning.

While we used noise-corrupted motion estimates derived from ground truth for the MHDE transform input, IMUs are a possible real-world source for this information. However, IMUs only measure accelerations, and thus we speculate that using raw IMU measurements as MHDE inputs will result in poor performance during constant velocity maneuvers. Additional work is needed to determine a suitable real-world analog for deriving the motion estimates needed by MHDE.

We hope to experiment with this architecture on other visual odometry datasets. Specifically, we seek a larger dataset with a wider range of movements. Without a wide range of movements, we speculate that trained networks will only be able to transfer to new, previously unseen datasets that follow similar movement statistics as the datasets on which they were trained. To overcome many of these limitations, we are currently working to collect a multi-modal dataset with stereo imagery, depth imagery, high-resolution IMU data, action commands, low-level motor commands, and ground-truth VICON poses. With this dataset, we will be able to better address limitations inherent in the current MHDE architecture.

3.8 Conclusion

While increased performance in the noise-free conditions was an unintended consequence of the multi-hypothesis formulation, the central contribution of this work is in the handling of noise-contaminated input data. In summary, we have shown the unsupervised learning of correspondence between static grayscale images in a deep sensorimotor fusion network with noisy sensor data. In this work, we have presented a multi-hypothesis formulation of our previous DeepEfference architecture. MHDE outperformed DE by 103% in RMSE in our maximum noise condition, by 18% in the noise-free condition, and was 181% slower (417 FPS vs 148 FPS). Compared to DM, MHDE was 5192% faster with 8 hypotheses (2.8 FPS vs 148 FPS) and was outperformed by 3% in the noise-free condition with 8 hypotheses and by 57% in the maximum noise condition with 8 hypotheses.

3.9 Appendix

The following methods are largely reproduced from (Shamwell, Nothwang, and Perlis, 2017) and are included here for completeness.

3.9.1 Extended Evaluation

As in (Shamwell, Nothwang, and Perlis, 2017), we evaluated MHDE on the KITTI Visual Odometry dataset (Geiger et al., 2013). KITTI is a benchmark dataset for the evaluation of visual odometry and LIDAR-based navigation algorithms. Images in KITTI were captured at 10 Hz from a Volkswagen Passat B6 as it traversed city, residential, road, and campus environments. Groundtruth poses at each camera exposure were provided by an RTK GPS solution and depth is provided with coincident data from a Velodyne laser scanner. All objects in the visual scenes are rigid, thus fulfilling the static scene assumption and allowing ground truth to be computed from scene depth and camera position. Predicted pixel correspondence between source and target images was evaluated against groundtruth correspondence and SOA DeepMatching correspondence predictions.
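The evaluation metric itself is straightforward; a minimal NumPy sketch of the mean pixel error computation is shown below. The array layout and the validity mask are my own conventions for illustration, not taken from the evaluation code used for the reported results.

```python
import numpy as np

def mean_pixel_error(predicted_xy, groundtruth_xy, valid):
    """Mean Euclidean pixel error between predicted and ground-truth
    target-image coordinates of the evaluated keypoints.

    predicted_xy   : (N, 2) predicted (x, y) locations in the target image.
    groundtruth_xy : (N, 2) ground-truth (x, y) locations.
    valid          : (N,) boolean mask, True where ground-truth depth (and
                     hence a ground-truth correspondence) exists.
    """
    errors = np.linalg.norm(predicted_xy - groundtruth_xy, axis=1)
    return float(errors[valid].mean())
```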
With access to scene depth and true camera pose for KITTI, groundtruth pixel shifts were calculated by applying a 3D warp to 3D pixel locations in the source images to generate the expected pixel locations in the target images. We projected each 3D point in the frame of camera_t0 to the world frame using the derived projection matrix for camera_t0 and then reprojected these points in the world frame to camera_t1 using the inverse projection matrix for camera_t1. Finally, we transformed points in the frame of camera_t1 to the image plane. This resulted in a correspondence map between pixel locations in camera_t0 and camera_t1 for each point where depth was available (i.e., when depth was within the Velodyne laser scanner's range).

3.9.2 Extended Training Procedures

For training and evaluation, data was separated into train (80%) and test (20%) sets. We used a total of 23,190 image pairs with 80% (18,552) for training and 20% (4,638) for testing. In all experiments, we randomly selected an image for the source, used the successive image for the target, and subtracted the two 6-DOF camera poses for the transform input. For each image in each dataset, we cropped the middle 224x224 pixel region for network inputs.

Figure 3.4: Inverse mean pixel error (higher is better) for several noise conditions produced by MHDE networks trained to generate 1, 2, 4, or 8 maximum hypotheses. Dashed line is DM (SOA) error.

Figure 3.5: Inverse mean pixel error (higher is better) for several noise conditions produced by MHDE networks. Results are from the same networks shown in Fig. 3.4 but are instead plotted as a function of active pathways learned by each network.

Figure 3.6: Activation by pathway for the different noise conditions ((a) no noise, (b) noise = 0.1, (c) noise = 0.25, (d) noise = 0.5). Only networks with maximum hypotheses of 4 or 8 are shown.

Chapter 4: An Embodied Deep Neural Network Approach to Dense Visual Correspondence for Low-SWaP Robotics

4.1 Abstract

Estimating the correspondences between pixels in sequences of images is a critical first step for many computer vision and robotics tasks such as visual odometry (VO) and visual simultaneous localization and mapping (VSLAM). While VO and VSLAM are used extensively for localization and navigation in low-SWaP robotic applications, they are usually forced to rely on computationally lightweight, sparse correspondence approaches. We contend that an earlier inclusion of extra-visual sensory information related to system motion will allow for computationally efficient dense image matching even in low-texture regions that cause issues for other, heavier dense correspondence approaches. We introduce Inertial DeepEfference (IDE) - an unsupervised deep matching network that learns to compute image mappings by fusing heterogeneous sensor streams. As an extension of our previous approaches (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted), IDE is unique in that it uses real sensor data for an end-to-end trainable dense correspondence network that is orders of magnitude faster than other SOA deep approaches. We evaluated IDE on the EuRoC MAV dataset (Burri et al., 2016) and on the KITTI Odometry dataset (Geiger et al., 2013) and benchmarked it against the deformable spatial pyramid (DSP) (Kim et al., 2013) and DeepMatching (Revaud et al., 2016) dense correspondence approaches. Our IMU+surrogate feed-forward motor signal (FFMS) network was 167x faster than DM and 516x faster than DSP.
4.2 Introduction

State estimation for size, weight, and power (SWaP) constrained autonomous robotic systems is limited by the lightweight and low-power sensing and computational hardware that they are forced to use. When viewed as complex, embodied agents, robotic systems can generate and access a wide variety of sensory information, and as such, a popular approach to mitigating the negative influences of noisy SWaP sensors is to fuse estimates from an array of heterogeneous sensors deployed on the robot.

SWaP-constrained, GPS-denied navigation has been greatly influenced by this sensor fusion philosophy, and visual-inertial odometry (VIO), where a sensor array will commonly consist of a camera and an inertial measurement unit (IMU), has been a topic that has seen particular success for GPS-denied, SWaP-constrained localization and navigation. However, solutions to the correspondence problem (see (Scharstein and Szeliski, 2002) for a review) to date remain dominated by bottom-up, vision-only approaches and neglect other available sources of complementary information that can increase efficiency. We propose to fully exploit the embodied nature of robotic systems and begin fusing heterogeneous sensor measurements as early as possible by learning a feed-forward model that takes heterogeneous sensor data as input to compute dense image correspondences.

Previously, we introduced DeepEfference (DE) (Shamwell, Nothwang, and Perlis, 2017) and Multi-Hypothesis DeepEfference (MHDE) (Shamwell, Nothwang, and Perlis, Accepted) to perform early fusion of vision data and extra-modal motion-related data. In both of these previous approaches, we used the ground truth pose to derive the extra-modal motion input to the network (in the case of MHDE, an artificially noise-corrupted ground truth pose). We showed that the DE and MHDE architectures could learn how to approximate a 3D transform given image data and an estimate of the 3D motion transform derived directly from ground truth.

In this work, we present results from experiments where networks were instead provided with raw sensor data in the form of a single grayscale image and either IMU measurements, motor feedback, a surrogate feed-forward motor command, or some combination thereof. We hypothesized that IMU data will be able to at least partially replace the ground-truth-derived motion estimates but will be more challenged during constant velocity maneuvers (e.g., IMU data recorded from a car) compared to high-agility maneuvers (e.g., a MAV flying aggressively in an indoor environment). In cases of constant velocity, we further hypothesized that additional state information conveying intention will help mitigate the effects of constant velocity motions, which are unmeasurable by an IMU's accelerometer, by providing the model with supplementary information not captured in an IMU signal stream. Taking note from biology, we believe that a surrogate motor-/intention-related signal is a prime candidate for such a role (see (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted) for more information).
The remainder of the chapter is organized as follows: Section 4.3 describes the background and motivations for the network; Section 4.4 presents related work; Section 4.5 outlines our deep network approach to fusing noisy heterogeneous sensory inputs and describes the network architecture; Section 4.6 describes our experimental approach; Section 4.7 describes our evaluation approaches; Section 4.8 presents and discusses our experimental results; and Section 4.9 offers concluding thoughts and directions for future work.

4.3 Background

When GPS is available, inertial navigation systems (INS) can leverage the high temporal resolution of IMU-derived measurements while periodically correcting integrated IMU drift error. In GPS-denied environments, vision has been successfully used to correct IMU error in place of GPS. By tracking the projected positions of visual scene elements on the imaging plane, inter-scene relative position differences can be derived. VO-based dead reckoning requires knowledge of how pixels in one image relate to pixels in another image. The problem of determining this relationship is known as the correspondence problem.

In VO and VSLAM for robotic applications, image correspondence is performed in isolation as it would be in a pure computer vision domain. However, unlike in pure computer vision domains, algorithms intended for robotic applications need not rely solely on vision. For example, when estimating a robotic system's egomotion by tracking changes in feature point locations on a robot's camera's imaging plane, additional non-visual motion estimates can be fused with visual information to improve egomotion estimation and overall system performance (see (Martinelli, 2012) for a review of VIO).

In VO as well as many other vision tasks such as motion understanding and stereopsis, a key challenge is discovering quantitative relationships between temporally or spatially adjacent images. Within the last decade, bio-plausible approaches for the visual task of object recognition have set new benchmarks and are now the de facto standard. Bio-plausible/local filtering-based approaches also hold promise for the correspondence problem and VO/VIO/VSLAM (Memisevic, 2013).

4.4 Related Work

Figure 4.1: IDE network diagram.

The correspondence problem can be viewed as part of the more general problem of determining how images relate to one another. While the correspondence problem has been traditionally addressed through closed-form, analytical approaches (for example, Deformable Spatial Pyramid Matching (Kim et al., 2013) and DeepMatching (Revaud et al., 2016); see (Scharstein and Szeliski, 2002) for a review of the correspondence problem), recent bio-inspired, deep neural network approaches that estimate the 3D spatial transformations between image pairs have begun to show increasing success.
Unsupervised, siamese-like deep network architectures such as those based on multiplicative interactions (Memisevic and Hinton, 2007; Ranzato and Hinton, 2010; Memisevic and Hinton, 2010; Hinton, Krizhevsky, and Wang, 2011; Kivinen and Williams, 2011; Memisevic, 2013; Wohlhart and Lepetit, 2015) and triplet learning rules (Wang and Gupta, 2015) have been used successfully for relationship learning between RGB images at the cost of computational runtime. These architectures require expensive computations to be performed on multiple images per run, which can greatly increase the number of needed computations and the model complexity due to the high-dimensional nature of image data.

Other learning approaches have relied on explicit supervised labeling, such as the random decision forest based approaches of (Taylor et al., 2012; Shotton et al., 2013; Brachmann et al., 2014) and the semantic segmentation approaches of (Long, Shelhamer, and Darrell, 2015; Hariharan et al., 2015). The supervised nature of these approaches requires expensive and time-consuming labeling that greatly limits the size of usable datasets.

While (Byravan and Fox, 2017) used depth data as input to their network, we do not provide IDE with depth. This is critical because, given the 3D locations of points in the image scene, a 3D affine transformation can be directly performed to project points on the image plane at time t_i to some time t_{i+1}. As IDE is designed for SWaP-constrained applications where only intensity information from a single imager might be available, our IDE network takes as input only a single grayscale intensity image and uses its local pathway (see Section 4.5 and (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted) for more information) to infer depth and non-rigidity from a single 2D grayscale image.

The self-supervised visual descriptor approach of (Schmidt, Newcombe, and Fox, 2017) used a learning rule that requires a priori labeling such that points in the source image (the image at t_i) and target image (the image at t_{i+1}) are already aligned, and thus that correspondence has already been solved for the training set. Our approach is instead unsupervised and requires only the raw source and target images for training - in fact, we only compute ground truth correspondence between source and target images to validate our results. Otherwise, no labeling is required.

The view synthesis approach of (Zhou et al., 2016) learned local, pixel-level shifts in order to render new, unseen views of objects and scenes. That approach is similar to the mappings that are learned by the second pathway of our DE architectures. Besides different network structures and the inclusion of a global spatial transformer module, the largest difference between their method and our own is that rather than learning to generate novel viewpoints of objects or scenes, we learn how to reconstruct a source image using pixel locations in a target image.

4.5 Approach

IDE is designed as an unsupervised dense correspondence network. It receives as input a single grayscale intensity image I_i taken at time t_i and an extra-visual estimate of camera motion M_{i→i+1} between time t_i and time t_{i+1}. The goal of the network is to use the motion estimate M_{i→i+1} and the grayscale image I_i to predict the new image coordinate in I_{i+1} for each scene element captured in I_i. In other words, IDE learns the correspondence between pixels in images I_i and I_{i+1}.
The network architecture can be thought of as an extension of an autoencoder. However, rather than learning features by minimizing the reconstruction error between an input projected into feature-space and then re-projected into an output-space, IDE is trained by minimizing the reconstruction error between an input and a reconstruction based on sampled values from a previously unseen target image I_{i+1}. Thus, IDE is trained according to the following loss rule:

L(θ, I_{i+1}, I_i) = argmin_θ ||I_r(θ, I_{i+1}) − I_i||²    (4.1)

where I_r is an image reconstruction, I_{i+1} is the image target, and I_i is the image source being reconstructed.

Like its predecessors, IDE learns:

1. 2D affine transformation parameters that are applied as a global spatial transform; and

2. Local, pixel-level shifts that encapsulate aberrations due to varied scene depth, non-rigid scene objects, etc., and that are applied following the spatial transformation and before bilinear sampling.

4.5.1 Pathway 1: Global Shifter

The first pathway of IDE (the top pathway shown in Fig. 4.1) is the Global Shifter. Given a motion estimate (e.g., IMU data), it uses several fully-connected (FC) layers to approximate a 3D transformation as a 2D transformation by learning to compute the parameters for a 2D affine transformation matrix. The Global Shifter then applies this 2D affine transformation to generate expected coordinate shifts in the form of an output of size HxWx2 that represents pixel locations at which to sample from the target image (additional detail in Section 4.5.3 below).

4.5.2 Pathway 2: Local, Pixel-Level Shifter

The second pathway of IDE (the other pathway shown in Fig. 4.1) is the Local Shifter pathway. It receives a source image as input and uses a convolutional-deconvolutional encoder-decoder to also generate an HxWx2 output of pixel shifts. However, these shifts are intended only to modify the coordinate shifts calculated by the Global Shifter pathway for varying scene depth, non-rigidity, etc. (in practice, the Local Shifter outputs are sparse).

4.5.3 Spatial Transformations

Spatial transformation in the form of a modified spatial transformer module (Jaderberg et al., 2015) is an integral component of the IDE network architecture, and an explanation helps to elucidate the workings of IDE. To perform a spatial transformation, we assume that output pixels are defined to lie on a regular grid G = {G_j} of pixels G_j = (x_j^t, y_j^t), forming an output feature map (here, the reconstruction I_r). Sampling coordinates in the target image are generated from this grid by the 2D affine transformation predicted by the Global Shifter,

[x_j^s, y_j^s]^T = [θ_11 θ_12 θ_13; θ_21 θ_22 θ_23] [x_j^t, y_j^t, 1]^T    (4.2)

and are then adjusted by the local, pixel-level shifts. Each output pixel is computed by bilinearly sampling the target image U at its sampling coordinates:

I_r,j^c = Σ_h^H Σ_w^W U_hw^c max(0, 1 − |x_j^s − w|) max(0, 1 − |y_j^s − h|)    (4.3)

This sampling is sub-differentiable, with partial derivatives

∂I_r,j^c / ∂U_hw^c = max(0, 1 − |x_j^s − w|) max(0, 1 − |y_j^s − h|)    (4.4)

∂I_r,j^c / ∂x_j^s = Σ_h^H Σ_w^W U_hw^c max(0, 1 − |y_j^s − h|) · { 0 if |w − x_j^s| ≥ 1; 1 if w ≥ x_j^s; −1 if w < x_j^s }    (4.5)

∂I_r,j^c / ∂y_j^s = Σ_h^H Σ_w^W U_hw^c max(0, 1 − |x_j^s − w|) · { 0 if |h − y_j^s| ≥ 1; 1 if h ≥ y_j^s; −1 if h < y_j^s }    (4.6)

which allow the reconstruction loss in Eq. 4.1 to be backpropagated through the sampling coordinates to both the global and local pathways.
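As an illustration of how the sampled reconstruction in Eq. 4.1 is formed from the predicted coordinates, the NumPy sketch below implements the bilinear sampling of Eq. 4.3 (in its equivalent four-neighbor form) and the resulting reconstruction error. It is a forward-only illustration with assumed clipping at the image borders; the actual network uses a differentiable spatial transformer sampler so that the gradients in Eqs. 4.5 and 4.6 can be backpropagated to both pathways.

```python
import numpy as np

def bilinear_sample(target, coords):
    """Reconstruct the source image by bilinearly sampling the target image
    at the predicted coordinates (sum of global and local shifts).

    target : (H, W) grayscale target image I_{i+1}.
    coords : (H, W, 2) sampling locations (x, y) in target-image pixels.
    """
    H, W = target.shape
    x = np.clip(coords[..., 0], 0.0, W - 1.0)
    y = np.clip(coords[..., 1], 0.0, H - 1.0)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Weighted sum of the four neighboring target pixels -- equivalent to the
    # max(0, 1 - |.|) kernel form of the spatial transformer sampler.
    return ((1 - wx) * (1 - wy) * target[y0, x0] + wx * (1 - wy) * target[y0, x1] +
            (1 - wx) * wy * target[y1, x0] + wx * wy * target[y1, x1])

def reconstruction_error(source, target, coords):
    """Mean squared reconstruction error, a stand-in for the squared L2 norm of Eq. 4.1."""
    return float(np.mean((bilinear_sample(target, coords) - source) ** 2))
```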
4.6 Methods

We tested networks with motion information taken from the following sources:

1. Ground Truth Pose (GP): Differences in ground truth camera pose at the capture times of the source and target images were input to the network as the extra-visual motion transform.

2. Raw Measurements from an Inertial Measurement Unit (IMU): The IMU data recorded between the capture times of the source and target images were input to the network as the extra-visual motion transform.

3. Surrogate Feed-forward Motor Signals (FFMS): Because direct motor-command inputs were not available, a k-means fitting of clusters was used on the ground truth position differences to generate noisy estimates of the approximate direction of motion (see Fig. 4.2 for the associated error). This is unlike (Shamwell, Nothwang, and Perlis, Accepted), where the continuous GP poses were contaminated with heteroscedastic noise. Here, cluster indices were encoded as one-hot vectors and had only an indirect relationship to the motion of the vehicle.

4. Motor Feedback (MF): The motor feedback data recorded between the capture times of the source and target images were input to the network as the extra-visual motion transform. This only applied to EuRoC, as this information was not available in KITTI.

5. IMU+FFMS: IMU and FFMS data as above were both input.

6. IMU+MF: IMU and MF data as above were both input.

4.6.1 Datasets and Data Generation

For training and evaluation, data was separated into train (80%) and test (20%) sets. For each image in each dataset, we cropped the middle 224x224 pixel region for network inputs. Details about each dataset and dataset-specific data generation follow below.

Figure 4.2: Histogram of error between the ground truth and the k-means cluster to which each exemplar was assigned ((a) EuRoC, (b) KITTI).

4.6.1.1 EuRoC MAV

The EuRoC MAV datasets are a collection of visual-inertial datasets collected on-board an AscTec Firefly hex-rotor helicopter while traversing two indoor environments (referred to as Machine Hall and Vicon Room). EuRoC contains data collected with a VI-Sensor (Nikolic et al., 2014), which captures stereo image data at 20 Hz and IMU data at 200 Hz. A pointcloud of the Vicon Room environment was generated using a Leica Multistation and is also included in the datasets. In six of twelve datasets, synchronized 6-DOF position and attitude groundtruth is provided from a Vicon motion capture system (the Vicon Room environment). In three of these six Vicon Room datasets, motor feedback from the AscTec hex-rotor is also provided at 100 Hz. Because we were interested in fusing these motor-feedback signals, we only used the V01_01_easy, V01_02_medium, and V01_03_difficult datasets from the Vicon Room environment.

The total number of usable exemplars in the V01_01_easy, V01_02_medium, and V01_03_difficult datasets (i.e., exemplars that did not include the first or last example in a dataset, provided sufficient IMU data before and after the example, etc.) was 6,494, which is significantly smaller than the KITTI odometry dataset. We augmented EuRoC by including not only pairs of sequential frames, but pairs separated by up to four frames. This resulted in 26,976 total examples, of which 80% (21,588) were used for training and 20% (5,388) were used for testing.

For the GP conditions, we generated the pose motion estimate input by decomposing the 4x4 transformation matrix into a translation and Euler rotation. For the IMU, MF, IMU+FFMS, and IMU+MF conditions, because the lookahead could be anywhere from one to four frames and thus anywhere between 50 ms and 200 ms, the IMU and motor feedback inputs for the EuRoC models used a vector of size 50x6. For all exemplars, regardless of lookahead size, the first 10 entries correspond to the 50 ms prior to the image capture and the next 10 entries correspond to the 50 ms following capture. For exemplars with a lookahead of one frame, the remainder of the vector was zeros; for lookaheads of two, the last 20 entries were zeros; for lookaheads of three, the last 10 entries were zeros; and for lookaheads of four, the vector was fully populated.
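The fixed-size, zero-padded inertial input described above can be illustrated with the following sketch. The helper name, the 200 Hz rate constant, and the exact selection of samples are assumptions made for illustration rather than the data pipeline actually used.

```python
import numpy as np

def build_imu_input(imu_before, imu_after, lookahead, rate_hz=200, window_ms=50):
    """Pack IMU samples into the fixed 50x6 network input used for EuRoC (sketch).

    imu_before : (>=10, 6) IMU samples covering the 50 ms before the source image.
    imu_after  : (>=10*lookahead, 6) IMU samples following the source image.
    lookahead  : 1-4, number of frames between source and target images
                 (50-200 ms at the 20 Hz camera rate).
    """
    samples_per_window = rate_hz * window_ms // 1000      # 10 samples per 50 ms
    packed = np.zeros((50, 6))
    # First 10 rows: the 50 ms before capture.
    packed[:samples_per_window] = imu_before[-samples_per_window:]
    # Next 10 * lookahead rows: IMU data up to the target image; the rest stays zero.
    n_after = samples_per_window * lookahead
    packed[samples_per_window:samples_per_window + n_after] = imu_after[:n_after]
    return packed
```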
For EuRoC, the motor feedback was not synchronized to the VI-Sensor camera and IMU streams. To temporally synchronize these streams, we leveraged the fact that the AscTec FCU also provided IMU data. We took the norm of the angular velocities recorded by the FCU IMU and the VI-Sensor IMU and then cross-correlated the two signals. The aligned FCU motor feedback was included similarly to the IMU data for the MF conditions.

To generate the surrogate FFMS, we performed k-means clustering on the ground truth position differences to generate 20 clusters. These were encoded as one-hot vectors.

4.6.1.2 KITTI Odometry

KITTI odometry is a benchmark dataset for the evaluation of visual odometry and LIDAR-based navigation algorithms. KITTI images were recorded at 10 Hz from cameras on-board a Volkswagen Passat B6 as it traversed city, residential, road, and campus environments. Groundtruth poses at each camera exposure were provided by an RTK/INS/GPS solution and depth is provided with coincident data from a Velodyne laser scanner.

For KITTI, we used sequences 00-10, excluding sequence 03 because the corresponding raw file 2011_09_26_drive_0067 was not online at the time of publication. This resulted in a total of 22,362 image pairs with 80% (17,878) for training and 20% (4,484) for testing. In all experiments, we randomly selected an image for the source, used the successive image for the target, and subtracted the two 6-DOF camera poses for the transform input. Corresponding 100 Hz IMU data was collected from the KITTI raw datasets, and the preceding 100 ms and following 100 ms of IMU data were included for each example, yielding a 20x6 vector.

To generate the surrogate FFMS, we performed k-means clustering on the ground truth position differences to generate 20 clusters. These were encoded as one-hot vectors. Motor-feedback data is not included in KITTI, and thus the MF conditions were not used with KITTI.

4.6.2 Network Parameters and Training Procedures

For the GP IDE network, four FC layers of size 512, 4096, 4096, and 512 were used to generate the 2x3 affine transformation matrix. The same architecture was also used for the IMU, FFMS, and MF network configurations. For the IMU+FFMS and IMU+MF configurations, which each had two sources of extra-visual motion estimates, each extra-visual estimate was processed through four FC layers of 512, 4096, 4096, and 512 before being concatenated into a vector of length 1024.

The convolutional-deconvolutional encoder-decoder that composed the Local Shifter pathway used 5x5 convolutional kernels with a stride of two. The encoder used five layers of 32, 64, 128, 256, and 512 filters, and the decoder was reversed, using 512, 256, 128, 64, and 32 filters. All results described in this chapter used a Local Shifter pathway with these parameters. As shown in Fig. 4.1, the output of the fifth convolutional layer is concatenated with the last FC layer of the Global Shifter pathway and is then fed into a single FC layer of size 4096 before being fed into the first deconvolutional decoder layer.

We trained three networks for each condition and dataset. All results presented are the averages of the three networks.
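For reference, the layer sizes listed above can be sketched with tf.keras as below. This is a structural sketch of the Global Shifter FC stack and the Local Shifter encoder only; the final projection to the six affine parameters, the ReLU activations, and the 'same' padding are assumptions, and the concatenation, decoder, and sampler are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_global_shifter(motion_dim):
    """FC stack mapping an extra-visual motion estimate to 2x3 affine parameters."""
    motion = layers.Input(shape=(motion_dim,))
    x = motion
    for width in (512, 4096, 4096, 512):
        x = layers.Dense(width, activation='relu')(x)
    theta = layers.Dense(6)(x)   # final projection to 6 affine parameters (assumed)
    return tf.keras.Model(motion, [x, theta])

def build_local_shifter_encoder():
    """Convolutional encoder of the Local Shifter pathway (5x5 kernels, stride 2)."""
    image = layers.Input(shape=(224, 224, 1))
    x = image
    for filters in (32, 64, 128, 256, 512):
        x = layers.Conv2D(filters, 5, strides=2, padding='same', activation='relu')(x)
    return tf.keras.Model(image, x)
```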
4.7 Evaluation

Predicted pixel correspondence between source and target images was evaluated against ground truth correspondence, correspondence computed by the DeepMatching algorithm (Revaud et al., 2016), and correspondence computed by the Deformable Spatial Pyramid Matching algorithm (Kim et al., 2013) on the EuRoC MAV dataset (Burri et al., 2016) and the KITTI Visual Odometry dataset (Geiger et al., 2013). For our experiments, ground truth is the new pixel coordinate of a given point in the scene following camera motion. To calculate these new coordinates, we need (1) the depth of each point in the scene, (2) the 3D transformation between camera poses, and (3) the camera matrix and the transforms between each frame.

4.7.1 EuRoC Ground Truth Generation

We used the Kalibr-derived distortion parameters D and camera intrinsic parameters K to undistort each image and capture the new camera intrinsic matrix K_undistort. This yielded images of size 952x503 from images originally sized 752x480. To obtain depth estimates for each point in each grayscale image, we rendered range images from the ground truth point cloud of the Vicon Room. We converted each ply pointcloud into a pcd pointcloud and rendered range images using PCL according to the camera intrinsic matrix K_undistort and the ground truth position of the MAV transformed into the camera frame.

For each image and depth pair, we ray traced each pixel coordinate (u_t0, v_t0) using the horizontal and vertical fields of view calculated from the focal lengths in K_undistort, normalized the resulting [X, Y, 1]^T coordinates, and multiplied by the depth at each pixel location to generate coordinates in the camera frame [X_c^t0, Y_c^t0, Z_c^t0, 1]^T.

Then, for the 4x4 transformation matrix H_WC^t0 that transforms a vector from the camera frame C to the world frame W at time t0, and another transformation matrix H_WC^t1 that transforms a vector from the camera frame C to the world frame W at time t1, we calculated the 4x4 transform matrix H_{t0→t1} that maps points from the camera frame at t0 to the camera frame at t1 as

H_{t0→t1} = (H_WC^t1)^-1 H_WC^t0    (4.7)

and then projected points in the camera frame from t0 to t1:

[X_c^t1, Y_c^t1, Z_c^t1, 1]^T = H_{t0→t1} [X_c^t0, Y_c^t0, Z_c^t0, 1]^T    (4.8)

Finally, we applied the camera matrix K_undistort to project the points onto the imaging plane and recover ground-truth pixel coordinates:

s [u_t1, v_t1, 1]^T = K_undistort [X_c^t1, Y_c^t1, Z_c^t1]^T, with s = Z_c^t1    (4.9)

The recovered mapping from [u_t0, v_t0, 1]^T to [u_t1, v_t1, 1]^T allows us to project points in the visual scene from one camera position to another and thus provides ground truth correspondence between two image frames.

4.7.2 KITTI Ground Truth Generation

Ground truth for KITTI was calculated similarly to EuRoC (described above) with three notable exceptions:

1. KITTI images were already undistorted;

2. Depth was provided by Velodyne laser scans rather than range images rendered from a point cloud; and

3. The resulting correspondence map between pixel locations at different camera locations was only valid when depth was within the Velodyne laser scanner's range.
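For concreteness, the reprojection chain of Eqs. 4.7-4.9 can be sketched in NumPy as follows. The function name, argument layout, and homogeneous-coordinate handling are illustrative assumptions; the actual pipeline additionally performs undistortion, range-image rendering, and validity masking as described above.

```python
import numpy as np

def reproject_pixels(points_c0, H_wc_t0, H_wc_t1, K):
    """Project 3D points from the camera frame at t0 into pixel coordinates at t1.

    points_c0 : (N, 3) points in the camera frame at t0 (ray direction * depth).
    H_wc_t0   : (4, 4) camera-to-world transform at t0.
    H_wc_t1   : (4, 4) camera-to-world transform at t1.
    K         : (3, 3) camera intrinsic matrix (K_undistort for EuRoC).
    """
    # Eq. 4.7: transform mapping camera-frame points at t0 to the camera frame at t1.
    H_t0_to_t1 = np.linalg.inv(H_wc_t1) @ H_wc_t0
    # Eq. 4.8: apply the transform in homogeneous coordinates.
    homog = np.hstack([points_c0, np.ones((points_c0.shape[0], 1))])
    points_c1 = (H_t0_to_t1 @ homog.T).T[:, :3]
    # Eq. 4.9: perspective projection onto the t1 image plane.
    uv1 = (K @ points_c1.T).T
    return uv1[:, :2] / uv1[:, 2:3]
```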
Figure 4.3: Inverse pixel error (higher is better) for each condition for EuRoC and KITTI ((a) EuRoC, (b) KITTI). The KM condition for EuRoC is omitted for plotting convenience but included in the table below.

4.8 Results and Discussion

To test our first hypothesis, we looked at whether IMU data is a suitable surrogate for the ground truth transform used in previous incarnations of DeepEfference (Shamwell, Nothwang, and Perlis, 2017; Shamwell, Nothwang, and Perlis, Accepted). The networks are able to learn an appropriate global transform and, especially in the case of the KITTI dataset, perform approximately the same when using IMU data. For KITTI, there was only a 15% MSE performance decrease between the GP pose and the IMU conditions. For EuRoC, however, there was a 73% MSE performance decrease between the GP pose and IMU conditions.

To test our second hypothesis, we looked at whether augmenting IMU data with a surrogate feed-forward motor signal increased MSE performance compared to the IMU-only case. For EuRoC, there was a 13% MSE performance increase between the IMU and IMU+FFMS conditions. For KITTI, there was an 8% MSE performance increase between the IMU and IMU+FFMS conditions. This performance increase suggests that a network trained with a surrogate FFMS is able to mitigate the performance impacts caused by an IMU's inability to directly measure velocity.

Figure 4.4: Sample KITTI correspondence results ((a) source, (b) target, (c)-(e) IDE). Note that only every other keypoint (horizontally and vertically) is shown; the actual unaltered output is fully dense.

Figure 4.5: Sample KITTI correspondence results (presented as in Fig. 4.4).

Figure 4.6: Sample KITTI correspondence results (presented as in Fig. 4.4).

Figure 4.7: Sample EuRoC correspondence results (presented as in Fig. 4.4).

Figure 4.8: Sample EuRoC correspondence results (presented as in Fig. 4.4).

Figure 4.9: Sample EuRoC correspondence results (presented as in Fig. 4.4).

Table 4.1: Pixel MSE for the EuRoC dataset

              Mean    Standard Deviation   Median
DM            3.66    7.68                 1.93
DSP           4.22    8.07                 1.89
DE GP         2.63    2.52                 2.16
DE IMU        6.09    6.029                4.54
DE FFMS       13.19   5.60                 10.2
DE IMU+MF     6.05    5.56                 4.56
DE IMU+FFMS   5.39    4.808                4.20

Table 4.2: Mean Pixel Error for the KITTI dataset

              Mean    Standard Deviation   Median
DM            1.98    2.80                 1.49
DSP           2.49    4.34                 1.40
DE GP         2.23    2.03                 1.54
DE IMU        2.63    2.40                 1.93
DE FFMS       5.01    4.55                 3.58
DE IMU+FFMS   2.42    2.26                 1.81

The performance difference between the KITTI and EuRoC IMU conditions may also have to do with the differing quality of the IMUs used for the KITTI data collection versus the EuRoC data collection. For KITTI, the IMU velocity random walk was 0.005 m/s/√hr and the angular random walk was 0.2°/√hr. For EuRoC, the velocity random walk was 0.11 m/s/√hr and the angular random walk was 0.66°/√hr.

It is worth noting that for optimal performance, our approach requires sensor streams to be temporally synchronized. For the EuRoC dataset, we did not have hardware-synchronized motor feedback as we did for both image and IMU data. This may explain why the inclusion of motor data did not seem to significantly affect performance.

DM and DSP both also performed worse on EuRoC than IDE with ground truth pose information. It would seem that the network was able to effectively learn how to approximate the ground-truth-encoded 3D transform as a 2D transform, as well as to approximate differences in scene depth, so as to effectively apply the global transformation and non-linear corrections.
This is potentially because EuRoC has many low-texture regions that were difficult for DSP and DM.

Table 4.3: Average runtimes with equivalent frames per second (FPS)

              # Hypotheses   Mean        Standard Deviation   Median     FPS
DM            N/A            0.4115 s    1.3e-2               0.407 s    2.43
DSP           N/A            1.252 s     9.23e-2              0.0024 s   0.798
DE GP         1              0.00242 s   1.2e-3               0.0023 s   412.1
DE IMU        1              0.00240 s   1.29e-3              0.0023 s   416.6
DE FFMS       1              0.00244 s   1.3e-3               0.0023 s   409.1
DE IMU+FFMS   1              0.00245 s   1.2e-3               0.002 s    408.1

The motor feedback in the MF condition may have served to confuse the network. For example, current motor feedback may not reflect the current vehicle velocity and may instead be a better estimate of future acceleration. Thus, another possibility is that motor feedback encoded largely redundant information compared to IMU data.

4.9 Conclusions and Future Work

On the EuRoC MAV dataset (Burri et al., 2016) and on the KITTI Odometry dataset (Geiger et al., 2013), our results demonstrate that IDE performs dense correspondence 167x faster than DM and 516x faster than DSP.

One area of future interest is in training IDE networks for use on unseen datasets. In this case, because the two networks were trained with inputs of different dimensions (the IMU input was 50x6 for EuRoC and 20x6 for KITTI), we were unable to test the transferability of the architecture. Another challenge will arise from the varying coordinate reference frame conventions used by IMUs (and other extra-visual sensors). Ideally, we would use raw, unfiltered IMU data as input to the network. For example, it is generally common to calibrate an IMU (to correct for bias, noise, and mis-calibration offsets) and to designate an alignment of its coordinate system. This means that we may need to perform an additional operation at the beginning of the network to transform the IMU or other extra-visual data to the reference frame used for training (e.g., Z may be up for one dataset and forward for another, or the handedness of the coordinate systems may be reversed).

Future work should include additional experiments with larger datasets that at present do not exist. We are in the process of creating a far larger dataset captured using ground robots in a Vicon arena. Because the network is unsupervised and uses predictive errors as a learning signal, we aim to collect large amounts of data with which we can better test this network architecture.

In all experiments except those with the EuRoC dataset under the IMU-only or motor-feedback-only network conditions, ground truth was in some way coupled to network inputs. This means that it could conceivably be possible that the networks were able to learn an additional relationship that then unfairly biased performance results. However, the key results in this thesis are in the runtimes of SOA approaches compared to my DeepEfference networks. Exact RMSE performance metrics need only be approximate. However, future experiments still need to be carried out that completely decouple ground truth-related calculations from network inputs.

Bibliography

Abadi, Martín et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
Abrams, Richard A, David E Meyer, and Sylvan Kornblum (1989). "Speed and accuracy of saccadic eye movements: characteristics of impulse variability in the oculomotor system." In: Journal of Experimental Psychology: Human Perception and Performance 15.3, p. 529.
Agrawal, Motilal and Kurt Konolige (2006).
“Real-time localization in outdoor environ- ments using stereo vision and inexpensive GPS”. In: Proceedings - International Conference on Pattern Recognition 3, pp. 1063–1068. ISSN: 10514651. DOI: 10. 1109/ICPR.2006.962. Anwar, Sajid, Kyuyeon Hwang, and Wonyong Sung (2015). “Structured pruning of deep convolutional neural networks”. In: arXiv preprint arXiv:1512.08571. Behroozmand, Roozbeh et al. (2009). “Vocalization-Induced Enhancement of the Audi- tory Cortex”. In: Clinical Neurophysiology 120.7, pp. 1303–1312. DOI: 10.1016/ j.clinph.2009.04.022.Vocalization-Induced. Bhargava, Preeti et al. (2012). “The robot baby and massive metacognition: Future vi- sion”. In: Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on. IEEE, pp. 1–2. Bottou, Le´on (2012). “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer, pp. 421–436. Brachmann, Eric et al. (2014). “Learning 6d object pose estimation using 3d object coor- dinates”. In: European conference on computer vision. Springer, pp. 536–551. Bradley, David M (2010). Learning in modular systems. Carnegie Mellon University. Bremmer, Frank et al. (2009). “Neural dynamics of saccadic suppression”. In: Journal of Neuroscience 29.40, pp. 12374–12383. Bridgeman, Bruce (2007). “Efference copy and its limitations”. In: Computers in Bi- ology and Medicine 37.7, pp. 924–929. ISSN: 00104825. DOI: 10.1016/j. compbiomed.2006.07.001. Bridgeman, Bruce, AHC Van der Heijden, and Boris M Velichkovsky (1994). “A the- ory of visual stability across saccadic eye movements”. In: Behavioral and Brain Sciences 17.2, pp. 247–257. Bridgeman, Bruce, Derek Hendry, and Lawrence Stark (1975). “Failure to detect dis- placement of the visual world during saccadic eye movements”. In: Vision research 15.6, pp. 719–722. Brody, Justin, Don Perlis, and Jared Shamwell (2015). “Who’s Talking?Efference Copy and a Robot’s Sense of Agency”. In: 2015 AAAI Fall Symposium Series. 105 Brody, Justin et al. (2016). “Reasoning with Grounded Self-Symbols for Human-Robot Interaction”. In: 2016 AAAI Fall Symposium Series. Brox, Thomas, Jitendra Malik, and C Bregler (2009). “Large displacement optical flow”. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 41–48. ISSN: 1939-3539. DOI: 10.1109/TPAMI.2010.143. Burr, David C., Michael J. Morgan, and M. Concetta Morrone (1999). “Saccadic suppres- sion precedes visual motion analysis”. In: Current Biology 9.20, pp. 1207–1209. ISSN: 09609822. DOI: 10.1016/S0960-9822(00)80028-7. Burri, Michael et al. (2016). “The EuRoC micro aerial vehicle datasets”. In: The Interna- tional Journal of Robotics Research 35.10, pp. 1157–1163. Byravan, Arunkumar and Dieter Fox (2017). “Se3-nets: Learning rigid body motion using deep neural networks”. In: Robotics and Automation (ICRA), 2017 IEEE Interna- tional Conference on. IEEE, pp. 173–180. Carruthers, Peter (2015). The centered mind: what the science of working memory shows us about the nature of human thought. OUP Oxford. Ciliberto, Carlo et al. (2012). “A heteroscedastic approach to independent motion detec- tion for actuated visual sensors”. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3907–3913. ISSN: 2153-0858. DOI: 10.1109/ IROS.2012.6385943. Clark, M and L Stark (1975). “Time optimal behavior of human saccadic eye movement”. In: IEEE Transactions on Automatic Control 20.3, pp. 345–348. Connor, Charles E (2001). “Shifting receptive fields”. In: Neuron 29.3, pp. 
548–549. Davison, A J (2003). “Real-time Simultaneous Localisation and Mapping with a Single Camera”. In: Iccv 2, pp. 1403–1410. ISSN: 87551209. DOI: 10.1109/ICCV. 2003.1238654. arXiv: arXiv:1407.5736v1. Deubel, Heiner, Bruce Bridgeman, and Werner X Schneider (2004). “Different effects of eyelid blinks and target blanking on saccadic suppression of displacement.” In: Perception & psychophysics 66.5, pp. 772–778. ISSN: 0031-5117. DOI: 10.3758/ BF03194971. Deubel, Heiner, Carmen Koch, and Bruce Bridgeman (2010). “Landmarks facilitate vi- sual space constancy across saccades and during fixation”. In: Vision Research 50.2, pp. 249–259. ISSN: 00426989. DOI: 10.1016/j.visres.2009.09.020. Deubel, Heiner, Werner X. Schneider, and Bruce Bridgeman (1996). “Postsaccadic tar- get blanking prevents saccadic suppression of image displacement”. In: Vision Re- search 36.7, pp. 985–996. ISSN: 00426989. DOI: 10.1016/0042-6989(95) 00203-0. Duhamel, Jean-Rene´, Carol L Colby, and Michael E Goldberg (1992). “The updating of the representation of visual space in parietal cortex by intended eye movements”. In: Science 255.5040, p. 90. Eliades, Steven J and Xiaoqin Wang (2003). “Sensory-Motor Interaction in the Primate Auditory Cortex During”. In: Journal of neurophysiology 89.2194, pp. 2194–2207. Eliades, Steven J. and Xiaoqin Wang (2008). “Neural substrates of vocalization feedback monitoring in primate auditory cortex”. In: Nature 453.7198, pp. 1102–1106. ISSN: 0028-0836. DOI: 10.1038/nature06910. 106 Enderle, John D and James W Wolfe (1987). “Time-optimal control of saccadic eye move- ments”. In: IEEE transactions on biomedical engineering 1, pp. 43–55. Enkelmann, Wilfried (1991). “Obstacle detection by evaluation of optical flow fields from image sequences”. In: Image and Vision Computing 9.3, pp. 160–168. ISSN: 02628856. DOI: 10.1016/0262-8856(91)90010-M. Feldman, Anatol G. (2009). “New insights into actionperception coupling”. In: Exper- imental Brain Research 194.1, pp. 39–58. ISSN: 0014-4819. DOI: 10.1007/ s00221-008-1667-3. — (2016). “Active sensing without efference copy: referent control of perception”. In: Journal of Neurophysiology 2111.514, jn.00016.2016. ISSN: 0022-3077. DOI: 10.1152/jn.00016.2016. Flinker, a. et al. (2010). “Single-Trial Speech Suppression of Auditory Cortex Activity in Humans”. In: Journal of Neuroscience 30.49, pp. 16643–16650. ISSN: 0270-6474. DOI: 10.1523/JNEUROSCI.1809-10.2010. Fraundorfer, Friedrich and Davide Scaramuzza (2012). “Visual odometry: Part II: Match- ing, robustness, optimization, and applications”. In: IEEE Robotics & Automation Magazine 19.2, pp. 78–90. Geiger, A et al. (2013). “Vision meets robotics: The KITTI dataset”. In: The International Journal of Robotics Research 32.11, pp. 1231–1237. ISSN: 0278-3649. DOI: 10. 1177/0278364913491297. Glorot, Xavier, Antoine Bordes, and Yoshua Bengio (2011). “Deep Sparse Rectifier Neu- ral Networks.” In: Aistats. Vol. 15. 106, p. 275. Greenlee, Jeremy D. W. et al. (2011). “Human Auditory Cortical Activation during Self- Vocalization”. In: PLoS ONE 6.3, e14744. ISSN: 1932-6203. DOI: 10.1371/ journal.pone.0014744. Gru¨sser, O. J., A. Krizicˇ, and L. R. Weiss (1987). “Afterimage movement during saccades in the dark”. In: Vision Research 27.2. ISSN: 00426989. DOI: 10.1016/0042- 6989(87)90184-2. Han, Song, Huizi Mao, and William J Dally (2015). “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding”. In: arXiv preprint arXiv:1510.00149. Hariharan, Bharath et al. (2015). 
“Hypercolumns for object segmentation and fine-grained localization”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456. Harris, Christopher M (1998). “On the optimal control of behaviour: a stochastic perspec- tive”. In: Journal of neuroscience methods 83.1, pp. 73–88. Harris, Christopher M and Daniel M Wolpert (1998). “Signal-dependent noise determines motor planning”. In: Nature 394.6695, p. 780. — (2006). “The main sequence of saccades optimizes speed-accuracy trade-off”. In: Biological cybernetics 95.1, pp. 21–29. Harwood, Mark R, Laura E Mezey, and Christopher M Harris (1999). “The spectral main sequence of human saccades”. In: Journal of Neuroscience 19.20, pp. 9098–9106. He, Kaiming et al. (2014). “Spatial pyramid pooling in deep convolutional networks for visual recognition”. In: European Conference on Computer Vision. Springer, pp. 346–361. 107 He, Kaiming et al. (2015). “Delving deep into rectifiers: Surpassing human-level per- formance on imagenet classification”. In: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Hinton, Geoffrey E., Alex Krizhevsky, and Sida D. Wang (2011). “Transforming auto- encoders”. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6791 LNCS.PART 1, pp. 44–51. ISSN: 03029743. DOI: 10.1007/978-3-642-21735-7_6. arXiv: 9605103 [cs]. Holst, E and H Mittelstaedt (1950). “The principle of reafference: Interactions between the central nervous system and the peripheral organs”. In: PC Dodwell (Ed. and Trans.), Perceptual processing: Stimulus equivalence and pattern recognition 1950, pp. 41–72. Ibbotson, Michael R et al. (2008). “Saccadic modulation of neural responses: possible roles in saccadic suppression, enhancement, and time compression”. In: Journal of Neuroscience 28.43, pp. 10952–10960. Ibbotson, Michael and Bart Krekelberg (2011). “Visual perception and saccadic eye move- ments”. In: Current opinion in neurobiology 21.4, pp. 553–558. Irani, Michal and P Anandan (1998). “A unified approach to moving object detection in 2D and 3D scenes”. In: IEEE transactions on pattern analysis and machine intelli- gence 20.6, pp. 577–589. Jaderberg, Max et al. (2015). “Spatial Transformer Networks”. In: Nips, pp. 1–14. ISSN: 1087-0156. DOI: 10.1038/nbt.3343. arXiv: arXiv:1506.02025v1. Jia, Yangqing et al. (2014). “Caffe: Convolutional Architecture for Fast Feature Embed- ding”. In: arXiv preprint arXiv:1408.5093. Joiner, Wilsaan M, James Cavanaugh, and Robert H Wurtz (2013). “Compression and suppression of shifting receptive field activity in frontal eye field neurons.” In: The Journal of neuroscience : the official journal of the Society for Neuroscience 33.46, pp. 18259–69. ISSN: 1529-2401. DOI: 10.1523/JNEUROSCI.2964- 13. 2013. Kagan, Igor, Moshe Gur, and D Max Snodderly (2008). “Saccades and drifts differentially modulate neuronal activity in V1: effects of retinal image motion, position, and extraretinal influences”. In: Journal of Vision 8.14, pp. 19–19. Kayama, YUKIHIKO et al. (1979). “Luxotonic responses of units in macaque striate cortex”. In: Journal of Neurophysiology 42.6, pp. 1495–1517. Kim, Jaechul et al. (2013). “Deformable spatial pyramid matching for fast dense cor- respondences”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2307–2314. Kitt, Bernd, Frank Moosmann, and Christoph Stiller (2010). 
“Moving on to dynamic en- vironments: Visual odometry using feature classification”. In: IEEE/RSJ 2010 In- ternational Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings, pp. 5551–5556. ISSN: 2153-0858. DOI: 10.1109/IROS.2010. 5650517. Kivinen, Jyri J. and Christopher K I Williams (2011). “Transformation equivariant Boltz- mann machines”. In: Lecture Notes in Computer Science (including subseries Lec- ture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6791 108 LNCS.PART 1, pp. 1–9. ISSN: 03029743. DOI: 10 . 1007 / 978 - 3 - 642 - 21735-7_1. Kohashi, Tsunehiko and Yoichi Oda (2008). “Initiation of Mauthner-or non-Mauthner- mediated fast escape evoked by different modes of sensory input”. In: The Jour- nal of Neuroscience 28.42, pp. 10641–53. ISSN: 1529-2401. DOI: 10.1523/ JNEUROSCI.1435-08.2008. Korn, H and DS Faber (2005). “The Mauthner cell half a century later: a neurobiological model for decision-making?” In: Neuron 47.1, pp. 13–28. ISSN: 0896-6273. DOI: 10.1016/j.neuron.2005.05.019. Kumar, Sriram et al. (2015). “Object segmentation using independent motion detection”. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids) 1, pp. 94–100. ISSN: 21640580. DOI: 10.1109/HUMANOIDS.2015.7363537. Laidlaw, Kaitlin EW and Alan Kingstone (2010). “The time course of vertical, horizontal and oblique saccade trajectories: Evidence for greater distractor interference during vertical saccades”. In: Vision research 50.9, pp. 829–837. LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444. Lefaix, G., T. Marchand, and P. Bouthemy (2002). “Motion-based obstacle detection and tracking for car driving assistance”. In: Object recognition supported by user inter- action for service robots 4.August, pp. 74–77. ISSN: 1051-4651. DOI: 10.1109/ ICPR.2002.1047403. Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). “Fully convolutional net- works for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Lucas, Bruce D, Takeo Kanade, et al. (1981). “An iterative image registration technique with an application to stereo vision”. In: Maas, Andrew L, Awni Y Hannun, and Andrew Y Ng (2013). “Rectifier nonlinearities improve neural network acoustic models”. In: Proc. ICML. Vol. 30. 1. Maimone, Mark, Yang Cheng, and Larry Matthies (2007). “Two years of visual odometry on the Mars Exploration Rovers”. In: Journal of Field Robotics 24.3, pp. 169–186. ISSN: 15564959. DOI: 10.1002/rob.20184. arXiv: 10.1.1.91.5767. Martinelli, Agostino (2012). “Vision and IMU data fusion: Closed-form solutions for attitude, speed, absolute scale, and bias determination”. In: IEEE Transactions on Robotics 28.1, pp. 44–60. Matin, E (1974). “Saccadic suppression: a review and an analysis.” In: Psychological bulletin 81.12, pp. 899–917. ISSN: 0033-2909. DOI: 10.1037/h0037368. McCormac, John et al. (2016). “SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth”. In: arXiv preprint arXiv:1612.05079. arXiv: 1612.05079. McFarland, James M et al. (2015). “Saccadic modulation of stimulus processing in pri- mary visual cortex”. In: Nature communications 6. Mehta, Biren and Stefan Schaal (2002). “Forward models in visuomotor control”. In: Journal of Neurophysiology 88.2, pp. 942–953. 109 Memisevic, R. and G. Hinton (2007). “Unsupervised Learning of Image Transforma- tions”. 
In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. ISSN: 1063-6919. DOI: 10.1109/CVPR.2007.383036. Memisevic, Roland (2013). “Learning to relate images”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8, pp. 1829–1846. ISSN: 01628828. DOI: 10. 1109/TPAMI.2013.53. arXiv: arXiv:1110.0107v2. Memisevic, Roland and Geoffrey E Hinton (2010). “Learning to represent spatial transfor- mations with factored higher-order Boltzmann machines.” In: Neural computation 22.6, pp. 1473–1492. ISSN: 0899-7667. DOI: 10.1162/neco.2010.01-09- 953. Miall, RC (1995). “Motor control, biological and theoretical”. In: The handbook of brain theory and neural networks, pp. 597–600. Nelson, Randal C (1991). “Qualitative detection of motion by a moving observer”. In: International journal of computer vision 7.1, pp. 33–46. Ni, Amy M, Scott O Murray, and Gregory D Horwitz (2014). “Object-centered shifts of receptive field positions in monkey primary visual cortex”. In: Current Biology 24.14, pp. 1653–1658. Nikolic, Janosch et al. (2014). “A synchronized visual-inertial sensor system with FPGA pre-processing for accurate real-time SLAM”. In: Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, pp. 431–437. Niziolek, C. a., S. S. Nagarajan, and J. F. Houde (2013). “What Does Motor Efference Copy Represent? Evidence from Speech Production”. In: Journal of Neuroscience 33.41, pp. 16110–16116. ISSN: 0270-6474. DOI: 10.1523/JNEUROSCI.2137- 13.2013. Noda, Hiroharu (1975). “Hiroharu noda”. In: pp. 579–595. Ostendorf, Florian and Raymond J Dolan (2015). “Integration of retinal and extraretinal information across eye movements”. In: PloS one 10.1, e0116810. Ostry, David J. and Anatol G. Feldman (2003). “A critical evaluation of the force control hypothesis in motor control”. In: Experimental Brain Research 153.3, pp. 275–288. ISSN: 00144819. DOI: 10.1007/s00221-003-1624-0. Panouille`res, Muriel TN et al. (2016). “Oculomotor adaptation elicited by intra-saccadic visual stimulation: Time-course of efficient visual target perturbation”. In: Frontiers in human neuroscience 10. Rajkai, Csaba et al. (2008). “Transient cortical excitation at the onset of visual fixa- tion”. In: Cerebral Cortex 18.1, pp. 200–209. ISSN: 10473211. DOI: 10.1093/ cercor/bhm046. Ranzato, Marc Aurelio and Geoffrey E Hinton (2010). “Factored 3-Way Restricted Boltz- mann Machines For Modeling Natural Images”. In: Artificial Intelligence 9, pp. 621– 628. ISSN: 10636919. DOI: 10.1109/CVPR.2010.5539962. Revaud, Jerome et al. (2015). “EpicFlow : Edge-Preserving Interpolation of Correspon- dences for Optical Flow”. In: Cvpr 2015. DOI: 10.1063/1.4905777. arXiv: arXiv:1501.0256. Revaud, Jerome et al. (2016). “DeepMatching : Hierarchical Deformable Dense Match- ing”. In: International Journal of Computer Vision 120.3, pp. 300–323. 110 Ross, John et al. (2001). “Changes in visual perception at the time of saccades”. In: Trends Neurosci. 24.2, pp. 113–121. Rosten, Edward, Reid Porter, and Tom Drummond (2010). “Faster and better: A machine learning approach to corner detection”. In: IEEE transactions on pattern analysis and machine intelligence 32.1, pp. 105–119. Scharstein, Daniel and Richard Szeliski (2002). “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms”. In: International journal of com- puter vision 47.1-3, pp. 7–42. Schmidt, Tanner, Richard Newcombe, and Dieter Fox (2017). “Self-supervised visual descriptor learning for dense correspondence”. 
In: IEEE Robotics and Automation Letters 2.2, pp. 420–427. Sermanet, Pierre et al. (2013). “Overfeat: Integrated recognition, localization and detec- tion using convolutional networks”. In: arXiv preprint arXiv:1312.6229. Shamwell, E. Jared, William D. Nothwang, and Donald Perlis (2017). “DeepEfference: Learning to Predict the Sensory Consequences of Action Through Deep Correspon- dence”. In: Development and Learning and Epigenetic Robotics (ICDL), 2017 IEEE International Conference on. IEEE. — (Accepted). “A Deep Neural Network Approach to Fusing Vision and Heteroscedas- tic Motion Estimates for Low-SWaP Robotic Applications”. In: Multisensor Fusion and Integration for Intelligent Systems, 2017 International Conference on. IEEE. Shamwell, Jared et al. (2012). “The robot baby and massive metacognition: Early steps via growing neural gas”. In: Development and Learning and Epigenetic Robotics (ICDL), 2012 IEEE International Conference on. IEEE, pp. 1–2. Shotton, Jamie et al. (2013). “Scene coordinate regression forests for camera relocal- ization in RGB-D images”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2930–2937. Simonyan, Karen and Andrew Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556. Sommer, M A and R H Wurtz (2006). “Influence of the thalamus on spatial visual pro- cessing in frontal cortex”. In: Nature 444.7117, pp. 374–377. ISSN: 0028-0836. DOI: 10.1038/nature05279. Sperry, R W (1950). “Neural Basis of the Spontaneous Optokinetic Response Produced By Visual Inversion”. In: Journal of Comparative and Physiological Psychology 43.6, pp. 482–489. Sylvester, Richard and Geraint Rees (2006). “Extraretinal saccadic signals in human LGN and early retinotopic cortex”. In: Neuroimage 30.1, pp. 214–219. Taylor, Jonathan et al. (2012). “The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, pp. 103–110. Thilo, Kai V et al. (2004). “The site of saccadic suppression.” In: Nature neuroscience 7.1, pp. 13–4. ISSN: 1097-6256. DOI: 10.1038/nn1171. Tomasi, Carlo and Takeo Kanade (1991). “Detection and tracking of point features”. In: Troncoso, Xoana G et al. (2015). “V1 neurons respond differently to object motion versus motion from eye movements.” In: Nature communications 6, p. 8114. ISSN: 2041- 1723. DOI: 10.1038/ncomms9114. 111 Volkmann, Frances C, Amy M Schick, and Lorrin A Riggs (1968). “Time course of visual inhibition during voluntary saccades.” In: Journal of the Optical Society of America 58.4, pp. 562–569. ISSN: 0030-3941. DOI: 10.1364/JOSA.58.000562. Walker, Robin, Eugene McSorley, and Patrick Haggard (2006). “The control of saccade trajectories: Direction of curvature depends on prior knowledge of target location and saccade latency”. In: Attention, Perception, & Psychophysics 68.1, pp. 129– 138. Wang, Xiaolong and Abhinav Gupta (2015). “Unsupervised learning of visual represen- tations using videos”. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802. Wen, Wei et al. (2016). “Learning structured sparsity in deep neural networks”. In: Ad- vances in Neural Information Processing Systems, pp. 2074–2082. Wiegner, Allen W and M Margaret Wierzbicka (1992). “Kinematic models and human elbow flexion movements: quantitative analysis”. In: Experimental Brain Research 88.3, pp. 665–673. 
Wohlhart, Paul and Vincent Lepetit (2015). “Learning descriptors for object recognition and 3d pose estimation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3109–3118. Wolpert, Daniel M, Zoubin Ghahramani, and Michael I Jordan (1994). “Perceptual distor- tion contributes to the curvature of human reaching movements”. In: Experimental brain research 98.1, pp. 153–156. — (1995). “An internal model for sensorimotor integration”. In: Science, pp. 1880– 1882. Zhou, Tinghui et al. (2016). “View Synthesis by Appearance Flow”. In: European Con- ference on Computer Vision 1, pp. 286–301. arXiv: arXiv:1605.03557v2. Zirnsak, Marc and Tirin Moore (2014). “Saccades and shifting receptive fields: Antici- pating consequences or selecting targets?” In: Trends in Cognitive Sciences 18.12, pp. 621–628. ISSN: 1879307X. DOI: 10.1016/j.tics.2014.10.002. 112