ABSTRACT

Title of Dissertation: SCENE AND ACTION UNDERSTANDING USING CONTEXT AND KNOWLEDGE SHARING

Pallabi Ghosh
Doctor of Philosophy, 2020

Dissertation Directed by: Professor Larry S. Davis
Professor Abhinav Shrivastava
Department of Computer Science

Complete scene understanding from video data involves spatio-temporal decision making over long sequences and utilization of world knowledge. We propose a method that captures the edge connections between these spatio-temporal components, or knowledge graphs, through a graph convolutional network (GCN). Our approach uses the GCN to fuse various sources of information in the video, such as detected objects, human pose, and scene information, for action segmentation. For tasks such as zero-shot and few-shot action recognition, we learn a classifier for unseen test classes through comparison with similar training classes. We provide information about the similarity between two classes through an explicit relationship map, i.e., the knowledge graph. We study different kinds of knowledge graphs, based on action phrases, verbs or nouns, and visual features, to demonstrate how they perform with respect to each other. We build an integrated approach for zero-shot and few-shot learning. We also show further improvements through adaptive learning of the input knowledge graphs and by using a triplet loss along with the task-specific loss while training. We add results for semi-supervised learning as well to understand the improvements from our graph learning technique.

For complete scene understanding, we also study depth completion using a deep depth prior based on the deep image prior (DIP) technique. DIP shows that the structure of convolutional neural networks (CNNs) induces a strong prior that favors natural images. Given color images and noisy or incomplete target depth maps, we optimize a randomly-initialized CNN model to reconstruct a restored depth map, using the CNN network structure as a prior combined with a view-constrained photo-consistency loss. This loss is computed using images from a geometrically calibrated camera at nearby viewpoints. The method is based on test-time optimization, so it is independent of training data distributions. We apply this deep depth prior for inpainting and refining incomplete and noisy depth maps within both binocular and multi-view stereo pipelines.

SCENE AND ACTION UNDERSTANDING USING CONTEXT AND KNOWLEDGE SHARING

by

Pallabi Ghosh

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2020

Advisory Committee:
Professor Larry S. Davis, Chair/Advisor
Professor Abhinav Shrivastava, Co-Advisor
Professor Behtash Babadi
Professor David Jacobs
Professor Soheil Feizi

© Copyright by Pallabi Ghosh 2020

Acknowledgments

I am thankful to everyone who has helped me in the course of my PhD. The last few years have been an amazing learning experience and I am very grateful to all the people who have made it possible. First and foremost I would like to thank both my advisors, Dr. Larry S. Davis and Dr. Abhinav Shrivastava, without whose guidance and support I would not have been able to complete this degree. Their expertise in both the actual research work as well as in publishing said work has been invaluable. I would like to thank Dr. Behtash Babadi, Dr. David Jacobs and Dr. Soheil Feizi for being a part of my thesis committee and for helping me make my final manuscript better. I would also like to thank Dr.
Tom Goldstein for being a co-author on my paper and helping direct it. Next I would like to acknowledge the help from the staff members of the Computer Science department and ISSS who made all the administrative work and maintenance of my international student status seem so easy. I would like to thank Tom Hurst, Jennifer Story, Janice M. Perrone who helped me with all form sub- missions, reimbursements, extension of my student status etc. Without their help I would not have been able to spend as much time on my research work as I have. My internship mentors and co-authors, Yi Yao and Ajay Divakaran at SRI International and Vibhav Vineet, Sudipta Sinha and Neel Joshi from Microsoft ii Research have also guided me extensively throughout those research projects and I am extremely grateful for their advice. My lab-mates, housemates and friends have also been extremely helpful in the course of my PhD. My co-authors Sohil Shah, Nirat Saini, Bor-Chun Chen and Vlad Morariu were instrumental in those works and labmates Kamal Gupta, Gaurav Shrivastava and Soumyadip Sengupta have helped whenever I was stuck at some point in my research. Also my friends Sudha Rao, Manaswi Saha, Kartik Nayak, Meethu Malu, Nidhi Shah and Gowthami Somepalli helped me in the ups and downs throughout my PhD career. Last but not the least, I would like to thank my family members. Both my parents Nivedita Ghosh and Dr. Kiriti Bhusan Ghosh have guided me at various stages of my education and development. My sister Dr. Shrutakirti Ghosh has also been instrumental in my success as a student and my husband Dr. Bhaskar Ramasubramanian has stood by me in all my decisions. I would like to acknowledge the financial support provided by DARPA MediFor program and Small Business Technology Transfer from the AirForce. Finally I would like to thank all the UMD staff and faculty who made work possible even during the COVID19 pandemic and helped create a safe environment for us to work. It is impossible to thank every person individually who have helped me during my PhD in this limited space. I am thankful for all their help. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables vi List of Figures ix Chapter 1: Introduction 1 1.1 Action Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Zero-shot and Few-shot Action Recognition . . . . . . . . . . . . . . 4 1.3 Learning Graphs for Knowledge Transfer . . . . . . . . . . . . . . . . 8 1.4 Depth Completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Chapter 2: Related Works 14 2.1 Action Recognition and Segmentation . . . . . . . . . . . . . . . . . . 14 2.2 Learning on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Zero-shot and Few-shot Learning . . . . . . . . . . . . . . . . . . . . 18 2.4 Stereo Matching and Deep Stereo . . . . . . . . . . . . . . . . . . . . 20 2.5 Depth Map Refinement/Completion. . . . . . . . . . . . . . . . . . . 21 2.6 Deep Prior for Color Images. . . . . . . . . . . . . . . . . . . . . . . . 22 Chapter 3: Stacked Spatio-Temporal Graph Convolutional Networks for Ac- tion Segmentation 23 3.1 Graph Convolutional Networks . . . . . . . . . . . . . . . . . . . . . 23 3.2 Spatio-Temporal GCNs . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 Stacking of hourglass STGCN . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4.1 CAD120 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
29 3.4.2 Charades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Chapter 4: All About Knowledge Graphs for Actions 39 4.1 Proposed Knowledge Graphs for Actions . . . . . . . . . . . . . . . . 40 4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 46 4.2.3 Our Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 iv 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Chapter 5: Learning Graphs for Knowledge Transfer with Limited Labels 59 5.1 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.1.1 Adaptively Updating the Adjacency Matrix . . . . . . . . . . 61 5.1.2 Training using Triplet Loss . . . . . . . . . . . . . . . . . . . . 62 5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.3 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3.1 Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . 65 5.3.2 Zero-shot/Few-shot Action Recognition . . . . . . . . . . . . . 66 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Chapter 6: Depth Completion Using a View-constrained Deep Prior 74 6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.1.1 Deep Image Prior . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.1.2 Deep Depth Prior . . . . . . . . . . . . . . . . . . . . . . . . . 76 6.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.2.1 Tanks and Temples . . . . . . . . . . . . . . . . . . . . . . . . 82 6.2.2 KITTI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.2.3 Our Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.2.4 NYU v2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.2.5 Middlebury . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Chapter 7: Conclusion 95 v List of Tables 3.1 Performance comparison based on the F1 score using the CAD120 dataset. Our STGCN improves the F1 score over the best reported result (i.e., S-RNN) by approximately 5.0%. . . . . . . . . . . . . . . 31 3.2 Features for the Charades dataset. . . . . . . . . . . . . . . . . . . . . 33 3.3 Comparison of our Stacked-STGCN (A7) with baseline (A1), STGCN without hourglass (A2), different temporal connections (A3-A5), and different input features (A6). Input features include VGG-RGB for scene, VGG-Flow for motion, Situation Recognition for action, and Faster RCNN for object. . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Performance comparison based on mAP between our Stacked-STGCN and the best reported results published in [1] using the Charades dataset. Our Stacked-STGCN yields an approximate 2.41% and 3.20% improvement in mAP using VGG features only and all four types of features, respectively. . . . . . . . . . . . . . . . . . . . . . . 37 3.5 Performance comparison based on mAP with previous works using the Charades dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1 ZSL results for all 3 datasets where we compare performances of A-KG, VN-KG and a combination of the two. A-KG+VN-KG always does the best. 
For UCF101 and HMDB51, the results are in mean accuracy whereas for Charades, we report mean average precision (mAP). . . . 48 4.2 ZSL results for all 3 datasets. The baselines are ESZSL, DEM, Ob- jects2Action, CEWGAN and TS-GCN. For UCF101 and HMDB51, the results are in mean accuracy whereas for Charades, we report mean average precision (mAP) since it is multi-label dataset. . . . . . 49 4.3 FSL results for the UCF101 and HMDB51 datasets. The baseline is nearest neighbor, given 5 videos for each test set. The combination of A-KG, VN-KG and V-KG does the best in both cases. . . . . . . . . . 49 4.4 Performance comparison between word2vec embedding and sentence2vec embedding based models. Both the models are trained on graphs con- sisting of class nodes from Kinetics and UCF101 (A-KG) with losses on both. Performance metric used is mean accuracy. . . . . . . . . . . 51 vi 4.5 Experiments with 3 different knowledge graph constructions. The variations are due to using only UCF101/HMDB51 classes for the knowledge graph or appending it with Kinetics classes and train- ing loss being calculated on UCF101/HMDB51 nodes only or both UCF101/HMDB51 and Kinetics nodes in the knowledge graphs (A-KG). Performance metric used is mean accuracy. . . . . . . . . . . . . . . . 52 4.6 Performance comparison for fully connected(FC) and bipartite graphs constructed with UCF101 or HMDB51 with Kinetics dataset nodes in A-KG. Both the models are trained on graphs consisting of class nodes from two datasets (UCF101 and Kinetics or HMDB51 and Kinetics) with losses on both. Performance metric used is mean accuracy. . . . 53 4.7 Performance comparison of using GCN (on UCF101 A-KG) vs a linear combination (using the adjacency matrix edge weights) of the top 4 closest training class weights to the test classes. Performance metric used is mean accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.8 Performance comparison of using an encoder-decode layer before the GCN layers on UCF101 A-KG vs not using one. Performance metric used is mean accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.9 Results on UCF101 A-KG with 10 randomly selected test classes leav- ing 91 classes to be used for training I3D and GCN. Mean accuracy is used for evaluation. The experiments are carried out 5 times and the final column provides the averaged mean accuracy scores. We compare our results to two previous work with similar settings. . . . . 56 5.1 We compare accuracy of our technique to various state-of-the-art techniques for semi-supervised learning for Cora, Citeseer, and Pubmed datasets; including two graph learning techniques, GLNN and GLCN. We also provide the GCN* baseline which is our implemetation in Py- Torch environment. The ? in Pubmed for GLNN stands for downsam- pled input data. We get the best performance for both Citeseer and Pubmed datasets. For Cora, our GCN baseline (80.0%) is worse than the GCN baseline for GLCN (82.9%) by 3.0%, so the improvement using our graph learning technique is higher. . . . . . . . . . . . . . . 65 5.2 Ablation comparing accuracy for Pubmed validation data for different values of weighted averaging between input and updated adjacency matrix, i.e., ? from equation 5.3. . . . . . . . . . . . . . . . . . . . . . 66 5.3 Comparison of our results using graph learning with the pipeline from Chapter 4 without graph learning for UCF101 and HMDB51 datasets. We do better for all input KG configurations: A-KG, V-KG, and A-KG+VN-KG+V-KG. 
The metric is mean accuracy (Higher is bet- ter). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.4 Improvements using triplet loss or updating adjacency matrix only on V-KG and then both together. Metric is mean accuracy (Higher is better). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 vii 5.5 Ablation showing performance of UCF101 A-KG with varying number of epochs per update of adjacency matrix. The metric used in mean accuracy (Higher is better). . . . . . . . . . . . . . . . . . . . . . . . 68 5.6 Ablation showing performance of UCF101 A-KG with different nega- tive set class index ranges for triplet loss. The metric used in mean accuracy (Higher is better). . . . . . . . . . . . . . . . . . . . . . . . 68 5.7 Comparison with State-of-the-art zero-shot action recognition results for both UCF101 and HMDB51 datasets. The results are in mean accuracy. Higher is better. We compare on the entire test set for both datasets. We also randomly choose 20 classes from UCF101 test set over 10 times and average the output to replicate the 80/20 split reported by previous work. . . . . . . . . . . . . . . . . . . . . . 68 6.1 Minimum and maximum depth clipping values and the constant depth value added per scene before doing DDP for 7 scenes in TnT dataset 77 6.2 We compare the f-score of DDP on datasets of different sizes. As the number of images become smaller, the holes increase and the most relative performance gain is at 22 images. We also compare to [2] applied to an data that is out of distribution and show we do better. (Higher is better) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.3 The results of comparing the DDP output using disparity, RGBD, and warping loss and using UNet or SkipNet as the network are shown here, P:precision, R:recall, F:f-Score. (Higher is better) . . . . . . . . 85 6.4 Quantitative results comparing 7 sequences for SGM based depths and applying the DDP on SGM depths. We combine DDP with SGM by replacing the depth values in the holes of SGM depth with DDP depth. Here the datasets are I:Ignatius, B: barn, T: truck, C1: caterpillar, MR: Meetingroom, CH: courthouse and C2: church. Also N: number of images in sequence, C: number of consistent views while constructing point cloud, D: disparity threshold, P:precision, R:recall, F:f-score. (Higher is better) . . . . . . . . . . . . . . . . . . . . . . . 86 6.5 We compare the reconstruction performance using depth maps gener- ated by MVSNet and by applying the DDP on MVSNet. I: Ignatius, P:precision, R:recall, F:f-score. (Higher is better) . . . . . . . . . . . 87 6.6 Results on KITTI Dataset using D1 error. Lower is better. . . . . . . 89 6.7 Comparison of our method with techniques mentioned in [3] on Mid- dlebury dataset using RMSE. (Lower is better) . . . . . . . . . . . . . 94 viii List of Figures 1.1 Constructing a spatio-temporal graph for video analysis. The nodes in the graph are features from different aspects of the scene like object descriptors for jug, chair, table and laptop and the human pose de- scriptor. We show the spatial and temporal connections (upto three time steps) in the graph. . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Zero-shot action recognition based on similar training action classes. The classifier needs external knowledge to be able to group unseen class labelled ?Playing field hockey? with seen classes labelled ?Play- ing ice hockey? and ?Playing soccer? and unseen class labelled ?Play- ing guitar? 
with seen class labelled ?Playing violin?. It should not put ?Playing basketball? with ?Playing guitar? in-spite of the term ?Playing? in both. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Different KGs for zero/few-shot action recognition. They are based on action class names, verbs and nouns related to these class names and visual features extracted from the videos belonging to each class. 5 1.4 We use a GCN to update the input graph connections and show re- sults for ?Mixing Batter? class in zero-shot action recognition. Lan- guage based models associates ?batter? to ?baseball? which is recti- fied in the updated graph. . . . . . . . . . . . . . . . . . . . . . . . . 9 1.5 (a) Input image from one viewpoint (b) Target depth map computed with SGM using multiple neighboring images[4] (c) The refined depth map generated using our deep depth prior (DDP) technique. Depth map (c) has the holes from depth map (b) (shown in white) filled. . . 11 3.1 System overview. Different from the original STGCN based on human skeleton [5], our graph allows nodes of various types (such as actors, objects, and scenes) and with varied feature length. Our graph also supports flexible temporal connections (green lines) that can span multiple time steps, for example the connections among the actor nodes (blue nodes). Note that other nodes can have such temporal connections but are not depicted to avoid congested illustration. This spatio-temporal graph is fed into a stack of hourglass STGCN blocks to output a sequence of predicted actions observed in the video. . . . 24 ix 3.2 An illustration of spatio-temporal graphs. Each node vi is represented by a feature vector denoted by fi. The edge between node i and j has a weight ei,j. These edge weights form the spatial and temporal adjacency matrices. Note that our spatio-temporal graph supports a large amount of deformation, such as missed detection (e.g., the actor node and the object 3 node) and emerging/disappearing nodes (e.g., the object 2 node). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Illustration of two STGCN implementations to support graph nodes with varied feature length. (a) Additional convolution layers to con- vert node features with varied length to a fixed length. (b) Multiple spatial GCNs each for one cluster of nodes (nodes with the same color) with a similar feature length. These spatial GCNs convert features with varied length to a fixed length. . . . . . . . . . . . . . . . . . . . 27 3.4 Illustration of stacked hourglass STGCN with two levels. . . . . . . . 29 3.5 Action segmentation results of our Stacked-STGCN on the CAD120 dataset. Green: correct detection and red: erroneous detection. . . . 32 4.1 System overview: We use knowledge graphs based on word embed- dings (action class names, and associated verbs and nouns) and vi- sual features for action recognition. With the word embeddings based knowledge graph, we propose a zero-shot learning approach and with visual features based knowledge graph we propose a few-shot learning approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 t-SNE visualization showing feature distribution of UCF101 video dataset. Sample images are added for our test classes. (Best viewed in digital format) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 (a) Sentence2Vec embedding space for Kinetics and UCF101 classes. The class ?uneven bars? and its neighbors are highlighted. (b) Class ?Pommel horse? 
and its neighboring classes in Kinetics dataset us- ing word2vec embedding. The embeddings of each individual word forming the phrase is also displayed. (Best viewed in digital format) . 51 4.4 This figure shows class-wise accuracy for different KGs and combi- nation of KGs for UCF101 and HMDB51. We added few words for better word embeddings in the labels (such as ?front crawl? becomes ?front crawl swimming?), which improves performance for language based KGs or their combinations, as shown here. Each color for bar represents a KG, blue is word based KG, orange is visual feature based KG and grey is combination of all three KGs (A-KG, VN-KG and V-KG). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 x 4.5 Heatmaps showing activations of various classes? classifier layers ob- tained from training on UCF101 A-KG on various class videos. (a) is the display of the activation from the ?playing sitar? class on a?playing sitar? video, (b) is the display of the activation from the ?playing guitar? class on a?playing guitar? video, (c) is the display of the activation from the ?playing sitar? class on a?playing guitar? video, (d) is the display of the activation from the ?biking? class on a?biking? video and (e) is the display of the activation from the ?play- ing sitar? class on a?biking? video. These heatmaps show that test class ?playing sitar? is correctly learning from training class ?playing guitar? instead of training class ?biking? . . . . . . . . . . . . . . . . 57 5.1 System overview for adaptive learning of graphs connections. The input graph is passed through a GCN layer and this intermediate output is used to update the graph as well as calculate a triplet loss between the current nodes and the positive and negative sets. This output is then passed through another GCN network that generates outputs specific to the task at hand. The final output is used to calculate the task specific loss like MSE loss for zero-shot learning. . . 60 5.2 Class-wise comparison of accuracy for 23 UCF101 test classes using A-KG and V-KG as input for zero- and few-shot learning, respectively, between current results after applying graph learning (blue) and the results without graph learning i.e the baseline (green). In both cases (A-KG and V-KG), for majority of classes, we either beat or maintain the baseline performance. Best viewed in digital. . . . . . . . . . . . . 69 5.3 We plot the adjacency matrix connections for UCF101+Kinetics A-KG input and show the following two updates. We plot only a sub-graph due to space complexity. We chose 8 test classes (class names shown in red) and display all their connections in the KG. The edge colors show the weight of the connection. There are multiple regions where we can see improvements after first and second update. Best viewed in digital. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.4 We show the connections of AL where L is the number of layers in the GCN (linear connectivity), as well as connections after passing through the non-linear GCN network (GCN-based connectivity) for ?Mixing Batter? and ?Still Rings? classes. For both, we show the top-K connections using fixed input A (adjacency matrix) as well as updated A. The edge color based on the color bar and the width of the connections represent edge weights. (larger width ? higher weights). For ?Mixing Batter? the performance becomes better while for ?Still Rings? the performance becomes worse after A is updated. . 
72 xi 6.1 (a) Input depth map with holes (b) DDP on just depth maps and (c) DDP on RGBD images. In the black box regions in (b), DDP is filling up the holes in the sky or background based on the depth from the house or radio because it has no edge information. RGBD input provides this edge information in (c). . . . . . . . . . . . . . . . . . . 75 6.2 Overview of the Deep Depth Prior (DDP): The DDP network is trained using a combination of L1 and SSIM reconstruction loss with respect to a target RGB-D image and a photoconsistency loss with respect to neighboring calibrated images. This network is used to refine a set of noisy depth maps and the refined depth maps are sub- sequently fused to obtain the final 3D point cloud model. Iout and Dout are the RGB and depth output of our network. Inbr is the RGB at a neighboring viewpoint. Iref and Dref are the input RGB and depth at the current (reference) viewpoint. . . . . . . . . . . . . . . . 76 6.3 (a) RGB image (b) Input disparity map (c) Disparity output from DDP trained with equal weight for RGB and depth loss. The RGB artifacts are evident in (c) through the vertical and horizontal lines representing the wooden planks in the wall (d) Disparity output from the DDP trained with lower weight for the RGB loss compared to depth loss. The artifacts disappear in (d). . . . . . . . . . . . . . . . 80 6.4 (a) Input RGB and (b) Input depth images and (c) Predicted depth at 16000 epochs for Ignatius, Meeting Room, Barn, Caterpillar and Truck.(Best viewed in digital. Please zoom in.) . . . . . . . . . . . . . 88 6.5 (a) Original image from the reference view point (b) Novel view syn- thesized from neighboring to reference viewpoint using the original SGM depth (c) Novel view synthesis from a neighboring to reference viewpoint using DDP depth. The holes that appear in (b) gets filled in (c) (Best viewed in digital. Please zoom in.) . . . . . . . . . . . . . 90 6.6 (a) Input RGB image. Reconstructed point-cloud from (b) SGM (c) DDP (Ours) and (d) MVSNet depth images for RedCouch, Guitar and Van. Our reconstructions are better and more complete. (Best viewed in digital. Please zoom in.) . . . . . . . . . . . . . . . . . . . 91 6.7 (a) Input RGB (b) Input depth (c) Depth completed using a cross- bilateral filter and (d) Depth completed using DDP (Best viewed in digital. Please zoom in.) . . . . . . . . . . . . . . . . . . . . . . . . . 93 xii Chapter 1: Introduction In this thesis we show the utilization of context and knowledge bases for overall scene understanding. We improve action recognition and depth perception using various levels of supervision like fully supervised or zero-shot and few-shot action recognition, graph based semi-supervised learning and test time optimization for depth completion. 1.1 Action Segmentation We construct a comprehensive spatio-temporal graph (STG) to jointly repre- sent an action along with its associated actors, objects, and other contextual cues [6]. Specifically, graph nodes represent actions, actors, objects, and scenes; spatial edges represent spatial (e.g., next to, on top of, etc.) and functional relationships (e.g., attribution, role, etc.) between two nodes with importance weights; and tem- poral edges represent temporal and causal relationships. We exploit a variety of descriptors in order to capture these rich contextual cues. In literature, there exist various networks for situation recognition, object detection, scene classification, and semantic segmentation. 
The outputs of these networks provide embeddings that can serve as the node features of the proposed STGs. We show a sample STG 1 constructed from video data in Figure 1.1. Figure 1.1: Constructing a spatio-temporal graph for video analysis. The nodes in the graph are features from different aspects of the scene like object descriptors for jug, chair, table and laptop and the human pose descriptor. We show the spatial and temporal connections (upto three time steps) in the graph. We perform action segmentation on top of this STG via a stacked spatio- temporal graph convolution network (Stacked-STGCN). Our STGCN stems from the networks originally proposed for skeleton-based action recognition [5] and in- troduces two major advancements as our innovations. First, to accommodate var- ious contextual cues, the nodes of our STG have a wide range of characteristics, leading to the need for using descriptors with varied length. Second, our STG al- lows arbitrary edge connections (even fully connected graph) to account for the large amount of graph deformation caused by missed detections, occlusions, and emerging/disappearing objects. These two advancements are achieved via enhanced designs with additional layers. 2 Another innovation we introduce is the extended use of stacked hourglass ar- chitecture on graph data. Stacked hourglass networks have been applied to grid-like data with regular connections (e.g., images using CNNs) and shown improved re- sults for a number of tasks such as human pose estimation [7], facial landmark localization [8], etc. They allow repeated upsampling and downsampling of features and combine these features at different scales, leading to better performance. We propose to extend this encoder decoder architecture to graph data with irregular connections. Different from CNN, STGCN (or more general GCN) employs adja- cency matrices to represent irregular connections among nodes. To address this fundamental difference, we adapt the hourglass networks by adding extra steps to down-sample the adjacency matrices at each encoder level to match the compressed dimensions of that level. To summarize, the proposed Stacked-STGCN offers the following innovations: ? Joint inference over a rich set of contextual cues. ? Flexible graph configuration to support a wide range of descriptors with varied feature length and to account for large amounts of graph deformation over long video sequences. ? Stacked hourglass architecture specifically designed for graph data with irreg- ular connections. These innovations improved recognition and localization accuracy, robustness, and generalization performance for action segmentation over long video sequences. 3 1.2 Zero-shot and Few-shot Action Recognition Action recognition has seen rapid progress in the past few years, including better datasets [9, 10] and stronger models [11, 12, 13, 14, 15, 16]. Despite this progress, it is not easy to train an action classifier for a new category. A potential solution is to leverage the knowledge from seen or familiar categories to recognize unseen or unfamiliar categories. This is the zero-shot learning paradigm, where we transfer or adapt classifiers of related, known, or seen categories to classify unseen ones (Figure 1.2). Similarly, for few-shot action recognition, instead of testing on completely unseen classes, we have only a few labeled samples from the test classes, which help in learning about the rest of the test samples. 
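As a minimal illustration of this transfer idea (not the GCN-based pipeline developed later in the thesis), the sketch below estimates a classifier for an unseen action class as a weighted combination of the classifiers of its most similar seen classes, where the similarities would come from an external knowledge source such as knowledge-graph edge weights. The function name and the top-k choice are illustrative placeholders.

```python
# Illustrative sketch of zero-shot classifier transfer: build a classifier for an
# unseen class from the classifiers of related seen classes, weighted by an
# external similarity (e.g., knowledge-graph edge weights). Names are placeholders;
# the approach in this thesis instead learns this mapping with a GCN over a KG.
import torch


def transfer_classifier(seen_weights, similarity, top_k=4):
    """seen_weights: (num_seen, feat_dim) classifier weights of the seen classes.
    similarity: (num_seen,) similarity of the unseen class to each seen class.
    Returns an estimated (feat_dim,) classifier for the unseen class."""
    vals, idx = similarity.topk(top_k)        # the k most related seen classes
    w = vals / vals.sum()                     # normalized combination weights
    return (w[:, None] * seen_weights[idx]).sum(dim=0)
```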
Figure 1.2: Zero-shot action recognition based on similar training action classes. The classifier needs external knowledge to be able to group unseen class labelled ?Playing field hockey? with seen classes labelled ?Playing ice hockey? and ?Playing soccer? and unseen class labelled ?Playing guitar? with seen class labelled ?Playing violin?. It should not put ?Playing basketball? with ?Playing guitar? in-spite of the term ?Playing? in both. Zero-shot and few-shot learning methods have been studied widely for image classification. A recent research [17] builds a knowledge graph (KG) representing 4 relationships between seen and unseen classes and then trains a graph convolutional network (GCN) on this KG. This helps to transfer classifier knowledge from seen to unseen classes. Using the same technique for action recognition is hard since, unlike objects, it is unclear what is the best knowledge representation for actions. One of the reasons as observed in [18] is that verbs have a broader definition and conflicting meaning compared to nouns and we will be giving some examples in the next paragraph where action definitions can be confusing. Figure 1.3: Different KGs for zero/few-shot action recognition. They are based on action class names, verbs and nouns related to these class names and visual features extracted from the videos belonging to each class. In this thesis, we study the performance improvements by using different types of KGs for zero-shot and few-shot action recognition (Figure 1.3) [19]. The primary step in building a KG is generating a good implicit representation for each action class. In image classification, standard word embeddings (word2vec [20, 21, 22], GloVe [23], ConceptNet [24], etc.) capture the semantic knowledge associated with well-defined class names. However, for action classification, class names vary from single words (?sit?, ?stand?, etc.) to phrases (?shooting ball (not playing baseball)?) 5 and there are multiple definitions of the same (or similar) action class(es); like, ?apply eye makeup? or ?put on eye-liner?. Such diversity is less pronounced in image classification tasks due to the simplicity of labels. Our first contribution is studying different implicit representations for action classes and showing the advantages of a sentence2vector model in capturing the semantics of word sequences for zero/few- shot action recognition. Our second contribution is building an explicit relationship map from these implicit representations of action classes. In image classification, the explicit rep- resentations for transferring knowledge from seen to unseen categories are using attributes or external KGs. Several datasets provide labeled class-attribute pairs (e.g., AwA [25] , aYahoo [26], COCO-Attributes [27], MITstates [28], etc.). Sim- ilarly, many KGs have nodes that correspond to image classification classes (e.g., WordNet [29], NELL, and NEIL [30, 31]). In contrast, such sources are scarce for action classes. Wordnet contains verbs, therefore, it can be used to construct a KG for verbs, but we cannot have a KG with nodes representing the entire phrase (eg., ?playing(verb) guitar(noun)?) for an action class. Instead, there will be separate nodes for verbs and objects with defined inter-relationships. ConceptNet [24] has some phrases, but the list is not exhaustive and a lot of label names in our datasets are not present in ConceptNet. On the other hand, we build a KG with an explicit relationship of the multi-word action phrases in any dataset. 
We append the dataset with action classes from other datasets and construct two KGs, one for nouns and the other for verbs, either by splitting the action phrase in cases like "playing (verb) guitar (noun)" or by using WordNet to get the nearest noun, e.g., "cake" (noun) for the action class named "baking" (verb). Further, we build a KG for few-shot learning using the mean features of the training data points per class. We use a combination of this KG with the two KGs defined previously and observe a performance improvement.

Finally, the majority of previous work on zero-shot action recognition uses image-based learning models to estimate actions in videos. Recent advances in action recognition have led to the use of a network trained on a video dataset as the feature extractor. Such a system requires an improved evaluation paradigm, since the action classes in the training set cannot be in the test set. We manually check for commonalities between the training dataset (Kinetics) and the testing datasets (UCF101, HMDB51, Charades), but could not resolve problems within Kinetics, which is a huge dataset and can have videos common across multiple classes. We keep all Kinetics classes in the training set and remove the classes that Kinetics shares with UCF101, HMDB51, and Charades from the test set. Our third contribution is the creation of this evaluation paradigm using the UCF101, HMDB51, Charades, and Kinetics datasets.

To summarize, our main three contributions are:
• Better implicit representation of action phrases (which are word sequences) using sentence2vec.
• A comparative study of different KGs for zero-shot/few-shot action learning.
• An improved evaluation paradigm for zero-shot/few-shot action recognition using networks trained on video datasets as feature extractors.

Together, these three contributions build an integrated approach for both zero-shot and few-shot learning.

1.3 Learning Graphs for Knowledge Transfer

One of the key limitations of the GCN-based techniques discussed thus far is that the input graph structure, as captured by the adjacency matrix, is fixed. By design, the GCN-based approaches rely heavily on the input graphs, and noisy or low-quality graphs have an outsized impact on performance. In this work, we explore the adaptive learning of the input adjacency matrix over time, in conjunction with the rest of the GCN training; i.e., the losses used to train the underlying tasks (e.g., zero-/few-shot learning) are also used to update the structure of the input adjacency matrix. This is in stark contrast with other related graph learning works [32, 33], which have a separate dedicated network and special loss functions to update the adjacency matrix. As we demonstrate empirically, the benefit of using the downstream tasks' losses is that the graph learned using our approach is better suited for the downstream task.

Our proposed approach is a straightforward algorithm to update the graph's structure by learning better node representations and using these to recompute the adjacency matrix. Since the learned node representations, via a GCN, capture better correlations with respect to the downstream task, the resulting graph tends to be better than the input graph from an external source. One such update is illustrated in Figure 1.4, where we learn better connections for the class "Mixing batter". A language-based KG associates "batter" with the verb "batting" (shown as "input"), and our approach rectifies this mistake across updates and results in more meaningful connections.
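The sketch below illustrates this update step under stated assumptions: intermediate node features from a GCN layer are used to recompute a similarity-based adjacency matrix, the strongest connections per node are kept, and the result is blended with the input graph. The blending weight `alpha`, the per-node degree `top_k`, and the function name are illustrative choices, not the exact settings used here.

```python
# Illustrative sketch of the graph update described above: recompute the adjacency
# matrix from intermediate GCN node embeddings and blend it with the input graph.
# `alpha` and `top_k` are placeholder hyper-parameters, not the thesis settings.
import torch
import torch.nn.functional as F


def update_adjacency(node_feats, a_input, alpha=0.5, top_k=5):
    """node_feats: (N, d) intermediate node embeddings; a_input: (N, N) input adjacency."""
    feats = F.normalize(node_feats, dim=1)
    sim = feats @ feats.t()                          # cosine similarity between all node pairs
    vals, idx = sim.topk(top_k, dim=1)               # keep only the strongest edges per node
    a_new = torch.zeros_like(a_input).scatter_(1, idx, vals)
    a_new = 0.5 * (a_new + a_new.t())                # keep the recomputed graph symmetric
    return alpha * a_input + (1.0 - alpha) * a_new   # weighted average with the input graph
```

In the full approach, this update is trained jointly with the downstream task losses and, as described next, a triplet loss on the intermediate node features constrains the node degrees and prevents degenerate solutions.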
Operationalizing the straightforward approach described above has two key issues. First, updating a densely or fully-connected graph, in the absence of any other constraints, often tends to provide arbitrary updates to the structure, eventually leading to degenerate solutions (e.g., the same weight for all edges). Second, if the graph connections are sparse (as is generally the case), there is no mechanism to learn to add or drop connections in the graph. Simple heuristics, such as a fixed degree for each node, tend to be sub-optimal, as different nodes might have a different number of related nodes that they should be connected to. In addition, each downstream task can have domain-specific constraints on the degree of the nodes; e.g., for zero-shot action recognition, we observed that a fully-connected graph is detrimental to performance, and we empirically determine a suitable degree. To address both of the drawbacks discussed above, while obeying the domain-specific constraints, we propose to utilize a triplet loss formulation on the intermediate output nodes, i.e., the node features after our graph-learning step but before the graph is passed to the GCN-based framework for the downstream task. Our formulation selects positive and negative neighbors for each node in the graph, and uses them to add constraints on its degree while avoiding degenerate solutions by ensuring that negative neighbors are farther than the positive ones. Therefore, the graph learning step is trained using both the downstream task losses and the triplet loss.

Figure 1.4: We use a GCN to update the input graph connections and show results for the "Mixing Batter" class in zero-shot action recognition. Language-based models associate "batter" with "baseball", which is rectified in the updated graph.

In summary, our contributions are:
• A simple learning approach that can update the input graphs for GCN-based knowledge transfer or aggregation frameworks.
• A triplet loss formulation that avoids degenerate solutions and allows the flexibility of degree constraints.

We demonstrate the effectiveness of our approach on semi-supervised, zero-shot, and few-shot learning setups. For semi-supervised learning, we use the generic framework [34] built on network datasets, like Cora, Citeseer, and Pubmed [35], with accompanying well-defined input graphs. Knowledge is transferred using a GCN from training samples to test samples; the nodes represent individual sample data points in the dataset, and the input graph represents how these samples are related. For zero-shot/few-shot learning, we focus on our action recognition pipeline with an input knowledge graph (KG) built from sentence2vec [36] embeddings.

1.4 Depth Completion

There are numerous approaches for estimating scene depth, such as using binocular [37] or multi-view [38, 39, 40] stereo, or directly measuring depth with depth cameras, e.g., LiDAR. These approaches suffer from artifacts, such as noise, inaccuracy, and incompleteness, due to various limitations. As depth estimation is an ill-posed problem, extensive research has been conducted to solve the problem using approximate inference and optimization techniques that employ appropriate priors and regularization [41, 42, 43, 44].

Figure 1.5: (a) Input image from one viewpoint (b) Target depth map computed with SGM using multiple neighboring images [4] (c) The refined depth map generated using our deep depth prior (DDP) technique. Depth map (c) has the holes from depth map (b) (shown in white) filled.
Supervised learning methods based on convolutional neural networks (CNNs) have shown promise in improving depth estimations, both in the binocular [45, 46, 47] and multi-view [48, 49, 50] stereo settings. However, these supervised methods rely on vast amounts of ground truth data to achieve proper generalization. While unsupervised learning approaches have been explored [51, 52, 53], their success ap- pears modest compared to supervised methods. In this thesis, we propose a new approach for improving depth measurements that is inspired by the recent work [54]. They demonstrated that the underlying structure of a encoder-decoder CNN induces a prior that favors natural images, a property they refer to as a ?deep image prior? (DIP). The work on DIP shows that the parameters of a randomly initialized encoder-decoder CNN can be optimized to 11 map a high-dimensional noise vector to a single image. When the image is corrupted and the optimization is stopped at an appropriate point before overfitting sets in, the network outputs a noise-free image. The DIP has since been used as a regularizer in a number of vision tasks such as image denoising and inpainting [54, 55, 56]. We use DIP-based regularization for refining and inpainting noisy and incom- plete depth maps obtained from a wide variety of sources [57]. Using a network similar to DIP, our approach generates a depth map by combining a depth recon- struction loss with a view-constrained photoconsistency loss. The latter loss term is computed by warping a color image into neighboring views using the generated depth map and then measuring the photometric discrepancy between the warped image and the original image. In this sense, our technique resembles direct methods proposed for image reg- istration problems, which all employ initialization and iterative optimization. How- ever, instead of using handcrafted regularizers in the optimization objective, we use the deep image prior as the regularizer. While the role of regularization in end-to- end trainable CNN architectures is gaining interest [58, 59], our method is quite different, because there is no training and the network parameters are optimized from scratch on each set of test images. Figure 1.5 shows the inpainting results of applying our technique (DDP) on an input depth map with holes. To the best of our knowledge, this is the first work to investigate deep image priors for completing depth images. We evaluate our approach using results from modern stereo pipelines and depth cameras and show that the refined depth maps are more accurate and complete, leading to more complete 3D models. 12 To summarize, our contributions are: ? A deep prior network for depth completion that allows test time optimization and is independent of training data distributions ? Using combination of reconstruction loss with view-constrained photo-consistency loss that works better than just the mean squared error loss used in DIP 13 Chapter 2: Related Works 2.1 Action Recognition and Segmentation Action recognition is an example of one of the classic computer vision prob- lems. In the early days, features like PCA-HOG, SIFT, dense trajectories, etc. were used in conjunction with optimization techniques like HMM, PCA, Markov models, SVM, etc. In 2014, Simonyan and Zisserman used spatial and temporal 2D CNNs [60]. That was followed by the development of 3D convolutions with combined spatial and temporal convolutional blocks. 
Since then a series of works following these two schemes, two-stream and 3D convolution, were studied including TSN [15], ST-ResNet [61], I3D [11], P3D [13], R(1+2)D [14], T3D [12], S3D [16], etc. Another popular type of deep neural networks used for action recognition is the Recurrent Neural Network (RNN) including Long Short-Term Memory networks (LSTM), which are designed to model sequential data. Particularly, RNNs/LSTMs operate on a sequence of per frame features and predict the action label for the whole video sequence (i.e., action recognition) or action of current frame/segment (i.e., action detection/segmentation). The structural-RNN (S-RNN) is one such method that uses RNNs on spatiotemporal graphs for action recognition [62]. The S-RNN relies on two independent RNNs, namely nodeRNN and edgeRNN, for it- 14 erative spatial and temporal inference. Recently, thanks to the rapid development in GNNs, graph-based representation became a popular option for action recogni- tion, for instance skeleton-based activity recognition using STGCN [5], Graph Edge Convolution Networks [63] and Videos as space time region graphs [64]. In [64], GCN is applied to space-time graphs extracted from the whole video segment to output an accumulative descriptor, which is later combined with the aggregated frame-level features to generate action predictions. STGCN [5] was originally pro- posed for skeleton-based activity recognition. The nodes of the original STGCN are the skeletal joints, spatial connections depend on physical adjacency of these joints in the human body, and temporal edges connect joints of the same type (e.g., right wrist to right wrist) across one consecutive time step. STGCN on skeleton graph achieves state-of-the art recognition performance on multiple datasets. However, the STG is constructed based on human skeletons, which is indeed an oversimplified structure compared to the variety and complexity that our STG needs to handle in order to perform action segmentation with contextual cues and large graph deforma- tion. Therefore, the original STGCN is not directly applicable. Instead, we use the original STGCN as our basis and introduce a significant amount of augmentation so that STGCN becomes generalizable to a wider variety of applications including action segmentation. Action segmentation presents a more challenging problem than action recogni- tion in the sense that it requires identifying a sequence of actions with semantic labels and temporally localized starting and ending points for each identified action. Con- ditional Random Fields (CRFs) are traditionally used for temporal inference [65]. 15 Recently, there has been substantial research interest in leveraging RNNs includ- ing LSTM and Gated Recurrent Unit (GRU) [66, 67]. Lea et al.propose temporal convolutional networks (TCNs) [68], which lay the foundation for an additional line of work for action segmentation. Later, a number of variations of TCNs were also studied [69, 70, 71]. 2.2 Learning on Graphs Knowledge graphs(KGs) have been used in improving text based search en- gines [72, 73] and question answering [74, 75, 76, 77]. Automatic construction of a large KG and relationship learning has captured a lot of attention in the past [78, 79, 80, 81]. ConceptNet [24], FrameNet [82] and WordNet [29], are some of the existing large scale KGs. KGs are also used to improve performance for image classification, represen- tation learning, visual question answering and object detection techniques, [83, 84, 85, 86, 87, 88]. 
Hierarchical structure of KGs have contributed significantly in ap- plications like learning similarity among classes [89] and for capturing hierarchical relationships between objects of distinct categories [90]. In recent years, there have been a number of research directions for applying neural networks on graphs. The original work by Scarselli et al., referred to as the GNN, is an extension of the recursive neural networks and is used for sub-graph detection [91]. Later, GNNs were extended and a mapping function was introduced to project a graph and its nodes to an Euclidean space with a fixed dimension [92]. 16 In 2016, Li et al.use gated recurrent units and better optimization techniques to de- velop the Gated Graph Neural Networks [93]. GNNs have been used in a number of different applications like situation recognition [94], human-object interaction [95], webpage ranking [91, 92], etc. The literature also mentions a number of techniques that apply convolutions on graphs. Duvenaud et al.were one of the first to develop convolution operations for graph propagation [96], whereas Atwood and Towsley developed their own technique independently [97]. Defferrard et al.use approxima- tion in spectral domain [98] based on spectral graph introduced by Hammond et al. [99]. In [34], Kipf and Welling propose GCNs for semi-supervised classification based on similar spectral convolutions, but with further simplifications that result in higher speed and accuracy. Other GCN based semi-supervised learning work in- clude [100, 101, 102, 103, 104]. Such works often use citation network datasets, like Citeseer, Cora, and PubMed [35], and protein-protein interaction dataset [105] for experimentation on semi-supervised learning. Some of our approaches utilize the GCN framework proposed by Kipf and Welling [34] as our GCN operator and the citation network datasets as our input. Neural Graph Matching Networks was developed for few-shot learning in 3D action recognition [106]. Wang et al. [17] uses Graph Convolution network on KG for zero-shot image classification. The KG was formed with NELL(Never Ending Lan- guage Learning) [30], NEIL(Never Ending Image Learning) [31] and WordNet [29]. We use a similar model based on GCN for zero-shot actions. Dedre Gentner [18] shows how verbs and nouns have different levels of complexities and usually an ac- tion phrase comprises of both or just the verb. We explore different KGs, including 17 one with verbs and nouns only, to understand how these KGs improve performance for action recognition in zero-shot and few-shot learning setup. Recently there has also been some work on graph learning networks for semi- supervised learning by [32, 33]. They develop a new loss to learn the edge weights in the graph. Instead of a separate network outputting the edge weights, we take the intermediate output of the original network and update the adjacency matrix. Our technique is more flexible, allowing for the update of node features as well as edge weights and connections when necessary. We also do not encounter computational complexity issues with increase in the length of the input node feature dimension unlike [32]. Kim et al.[107] applies a graph neural network model for learning the edge weights in the input graph for few-shot learning, predicting labels based on connectivity to other labelled nodes. In contrast, we build upon the GCN frame- work for zero/few-shot learning where nodes in the graph represents classes and not individual samples. 
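For reference, the snippet below is a minimal, illustrative sketch of the two-layer GCN of Kipf and Welling [34] for semi-supervised node classification, the operator that several of the approaches in this thesis build upon. Layer sizes and names are placeholders rather than the configuration used in our experiments.

```python
# Minimal sketch of a two-layer GCN for semi-supervised node classification in the
# style of Kipf and Welling [34]; layer sizes and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalize_adjacency(a):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = a + torch.eye(a.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-12).pow(-0.5)
    return d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]


class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, num_classes, bias=False)

    def forward(self, x, a_norm):
        h = F.relu(a_norm @ self.w1(x))  # first graph convolution + ReLU
        return a_norm @ self.w2(h)       # per-node class logits

# During training, a cross-entropy loss is evaluated only on the labelled nodes,
# while the graph propagation spreads information to the unlabelled ones.
```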
2.3 Zero-shot and Few-shot Learning Zero-shot learning (ZSL) refers to the task of learning to predict on classes that are excluded from the training set [108]. Various studies do ZSL for image classification and object detection [25, 109, 110, 111, 112, 113, 114], as well as for action recognition [115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126]. The other zero-shot action papers, to the best of our knowledge, mostly are not GCN based, which has been proven to do better than traditional zero-shot techniques 18 for image classification [17]. While [81] is GCN based, their KG is very different from the one we use. They construct a single KG with actions and objects using ConceptNet [24], where nodes are connected based on word embedding. They use visual object features as a second channel to improve zero-shot learning. The number of objects in their graph is not dependent on the number of action classes. They show their best result when selecting 2000 most common visible objects in their dataset to get their object nodes, meaning they need access to the unlabelled test data (transductive). We use separate KGs for action, verb and noun and fuse them at the end with a fusion layer. Our verbs and nouns are dependent only on the action label and uses no visual information (inductive). Some other zero-shot learning research which we use as baselines include [127] and [128], where [127] uses a two layer network for learning relationships between features, attributes, and classes; while [128] uses the image feature space to map the language embedding, and not an intermediate space. Few-shot learning (FSL) for image classification has been explored using meta- learning for learning distance of samples and decision boundary in the embedding space [129, 130, 131, 132], or by learning the optimization algorithm which can be generalized over different datasets [133, 134]. A benchmark for few shot image classification is created in [135]. For action recognition, studies propose embedding a video as a matrix [136, 137], using deep networks or generative models [138, 139] and using human-object interaction [140]. We tried GCN based FSL for action recognition, but our approach cannot be compared to many of these approaches due to two reasons ? 1) Each paper uses a different dataset split, and our splits are 19 different as well because we use a pre-trained network from Kinetics in our pipeline; 2) We do not evaluate the episodic learning formulation like several other papers. Our aim is to build an integrated approach for zero and few-shot learning and also improve few-shot using the KG constructed for the zero-shot setting (relationship of class names, etc.) which, to the best of our knowledge, is not explored in the past. 2.4 Stereo Matching and Deep Stereo Dense stereo matching is an extensively studied topic and there has been tremendous algorithmic progress both in the binocular setting [4, 45, 46, 51, 58] as well as in the multi-view setting [40, 49, 50, 141, 142], in conjunction with advances in benchmarking [143, 144, 145, 146, 147]. Traditionally, the best performing stereo methods were based on approximate MRF inference on pixel grids [41, 42, 43], where including suitable smoothness priors was considered quite crucial. However, such methods were usually computationally expensive. Hirschmuller proposed Semi- Global Matching (SGM) [4], a method that provides a trade-off between accuracy and efficiency by approximating a 2D MRF optimization problem with several 1D optimization problems. 
SGM has many recent extensions [148, 149, 150, 151, 152] and also works for multiple images [141]. Region growing methods have also shown promise and implicitly incorporate smoothness priors [40, 153, 154, 155]. In recent years, deep models for stereo have been proposed to compute better matching costs [45, 156, 157] or to directly regress disparity or depth [46, 47, 51, 158], and also for the multi-view setting [48, 49, 50]. Earlier on, end-to-end trainable CNN models did not employ any form of explicit regularization, but recently hybrid CNN-CRF methods have advocated using appropriate regularization based on conditional random fields (CRFs) [58, 59]. In contrast with these works, since we do not perform learning by fitting to training data, our approach is more generalizable: it does not fall prey to the tendency of deep approaches to overfit to their training data.

2.5 Depth Map Refinement/Completion.

The fast bilateral solver [159] is an optimization technique for refining disparity or depth maps; however, its objective is fully handcrafted. Knöbelreiter and Pock recently proposed a refinement scheme where the regularizer in the optimization objective is trained using ground truth disparity maps [160]. Their model learns to jointly reason about image color, stereo matching confidence, and disparity. Voynov et al. [161] use a deep prior for depth super-resolution, but they do not have a multi-view constraint, as we do, nor do they investigate refinement and hole-filling. Other recent disparity or depth map refinement techniques utilize trained CNN models [162]. Similarly, the depth map completion method of Zhang et al. [2] uses a learning-based technique. They complete depth for a single RGB-D image, whereas we use a multi-view photo-consistency loss to train our network. We also show in our results that, unlike their method, ours is not dependent on training data distributions. DepthComp [3] also performs depth completion, using semantic segmentation maps as prior knowledge.

2.6 Deep Prior for Color Images.

Beyond the previously discussed work of Ulyanov et al. [54], deep image priors have been extended to a number of diverse applications: neural inverse rendering [163], mesh reconstruction from 3D points [55], and layer-based image decomposition [56]. Recently, Cheng et al. [164] pointed out important connections between DIP and Gaussian processes. Our approach is in a similar vein: we modify the DIP for depth maps by combining the usual reconstruction loss with a second term, a photoconsistency loss, which ensures that when the reference image is warped into a neighboring view using our depth map, the discrepancy between the warped image and the original image is minimized.

Chapter 3: Stacked Spatio-Temporal Graph Convolutional Networks for Action Segmentation

In this chapter we describe our action segmentation pipeline using Stacked Spatio-Temporal Graph Convolutional Networks (Stacked-STGCNs). We provide the system overview in Figure 3.1. Each component in this figure is described in more detail below.

3.1 Graph Convolutional Networks

Let a graph be defined as $G(V, E)$ with vertices $V$ and edges $E$ (see Figure 3.2). Vertex features of length $d_0$ are denoted as $f_i$ for $i \in \{1, 2, \ldots, N\}$, where $N$ is the total number of nodes. Edge weights are given as $e_{ij}$, where $e_{ij} \geq 0$ and $i, j \in \{1, 2, \ldots, N\}$.
The graph operation at the $l$-th layer is defined as:

$$H^{l+1} = g(H^l, A) = \sigma\big(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^l W^l\big) \tag{3.1}$$

where $W^l$ and $H^l$ are the $d_l \times d_{l+1}$ weight matrix and the $N \times d_l$ input matrix of the $l$-th layer, respectively. $\tilde{A} = I + A$ where $A = [e_{i,j}]$, $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, and $\sigma$ represents a non-linear activation function (e.g., ReLU).

Figure 3.1: System overview. Different from the original STGCN based on human skeleton [5], our graph allows nodes of various types (such as actors, objects, and scenes) and with varied feature length. Our graph also supports flexible temporal connections (green lines) that can span multiple time steps, for example the connections among the actor nodes (blue nodes). Note that other nodes can have such temporal connections but are not depicted to avoid congested illustration. This spatio-temporal graph is fed into a stack of hourglass STGCN blocks to output a sequence of predicted actions observed in the video.

3.2 Spatio-Temporal GCNs

STGCN was originally designed for skeleton-based action recognition [5]. We apply STGCN to action segmentation of long video sequences using frame-based action graphs extracted via situation recognition [94]. To accommodate additional application requirements, our STG differs fundamentally in two aspects.

First, the original STGCN is based on the human skeletal system, with graph nodes corresponding to physical joints and spatial edges representing physical connectivity between these joints. Instead, we use human-object interactions to construct our spatial graph, where nodes represent actors, objects, scenes, and actions, whereas edges represent their spatial (e.g., next to) and/or functional (e.g., role) relationships. Various descriptors can be extracted either as the channels or nodes of the spatial graph to encode comprehensive contextual information about the actions. For example, we can use pose features to describe actor nodes, appearance features including attributes at high semantic levels for object nodes, and frame-level RGB/flow features for scene nodes.

Figure 3.2: An illustration of spatio-temporal graphs. Each node $v_i$ is represented by a feature vector denoted by $f_i$. The edge between nodes $i$ and $j$ has a weight $e_{i,j}$. These edge weights form the spatial and temporal adjacency matrices. Note that our spatio-temporal graph supports a large amount of deformation, such as missed detection (e.g., the actor node and the object 3 node) and emerging/disappearing nodes (e.g., the object 2 node).

Second, the original STGCN only connects physical joints of the same type across consecutive time stamps, which indeed reduces to a fixed and grid-like connectivity. As a result, the temporal GCN degrades to conventional convolution. To support flexible configurations and account for frequent graph deformation in complex activities (e.g., missed detections, emerging/disappearing objects, heavy occlusions, etc.), our graph allows arbitrary temporal connections. For example, an object node present at time $t_0$ can be connected to an object node of the same type at time $t_n$ with $n \ge 1$, in comparison to the original STGCN with $n = 1$.

Let $A_s$ and $A_t$ be the spatial and temporal adjacency matrices, respectively. Our proposed STGCN operation can be represented mathematically as follows:

$$H^{l+1} = g_t(H_s^l, A_t) = \sigma\big(\tilde{D}_t^{-1/2}\tilde{A}_t\tilde{D}_t^{-1/2} H_s^l W_t^l\big), \qquad H_s^l = g_s(H^l, A_s) = \tilde{D}_s^{-1/2}\tilde{A}_s\tilde{D}_s^{-1/2} H^l W_s^l \tag{3.2}$$

where $W_s^l$ and $W_t^l$ represent the spatial and temporal weight matrices of the $l$-th convolution layer, respectively.
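As a concrete illustration of equations 3.1 and 3.2, the following is a minimal PyTorch sketch of the factored spatial-then-temporal graph convolution. The class and variable names, the tensor layout, and the ReLU choice are assumptions made for illustration, not the exact implementation used in this dissertation.

```python
import torch
import torch.nn as nn


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix A."""
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)


class STGCNLayer(nn.Module):
    """One STGCN layer in the spirit of equation 3.2: a spatial graph convolution
    over the N_s spatial nodes, followed by a temporal graph convolution over
    the N_t time steps."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_s = nn.Linear(d_in, d_out, bias=False)   # spatial weights W_s^l
        self.W_t = nn.Linear(d_out, d_out, bias=False)  # temporal weights W_t^l

    def forward(self, H, A_s_hat, A_t_hat):
        # H: (N_t, N_s, d_in); A_s_hat: (N_s, N_s); A_t_hat: (N_t, N_t),
        # both already normalized with normalize_adjacency.
        H_s = torch.einsum('ij,tjd->tid', A_s_hat, self.W_s(H))      # spatial pass
        H_out = torch.einsum('st,tid->sid', A_t_hat, self.W_t(H_s))  # temporal pass
        return torch.relu(H_out)
```

A single layer would then be applied as `STGCNLayer(d_in, d_out)(H, A_s_hat, A_t_hat)` with pre-normalized spatial and temporal adjacency matrices.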
In comparison, the original STGCN reduces to

$$H^{l+1} = g(H^l, A_s) = \sigma\big(\tilde{D}_s^{-1/2}\tilde{A}_s\tilde{D}_s^{-1/2} H^l W_s^l W_t^l\big) \tag{3.3}$$

due to the fixed grid-like temporal connections.

Note that the original STGCN requires a fixed feature length across all graph nodes, which may not hold for our applications, where nodes of different types may require different feature vectors to characterize (e.g., features from Situation Recognition are of length 1024 while appearance features from Faster-RCNN [165] are of length 2048). To address the problem of varied feature length, one easy solution is to include an additional convolutional layer to convert features of varied length to a fixed length (see Figure 3.3(a)). However, we argue that nodes of different types may require different lengths to embed different amounts of information. Converting features to a fixed length may decrease the amount of information they can carry. Therefore, we group nodes into clusters based on their feature length and design multiple spatial GCNs, each corresponding to one of the node clusters. These spatial GCNs convert features to a fixed length (see Figure 3.3(b)).

Figure 3.3: Illustration of two STGCN implementations to support graph nodes with varied feature length. (a) Additional convolution layers to convert node features with varied length to a fixed length. (b) Multiple spatial GCNs, each for one cluster of nodes (nodes with the same color) with a similar feature length. These spatial GCNs convert features with varied length to a fixed length.

Notably, the S-RNN is developed for action recognition in [62], where a node RNN and an edge RNN are used iteratively to process graph-like input. In comparison, our model features a single graph network that can jointly process node features and edge connectivity in an interconnected manner. This, therefore, leads to improved performance and robustness.

3.3 Stacking of hourglass STGCN

Hourglass networks consist of a series of downsampling and upsampling operations with skip connections. They follow the principles of the information bottleneck approach to deep learning models [166] for improved performance. They have also been shown to work well for tasks such as human pose estimation [7], facial landmark localization [8], etc. In this work, we incorporate the hourglass architecture with STGCN so as to leverage the encoder-decoder structure for action segmentation with improved accuracy.

Our Stacked-STGCN extends and adapts the hourglass structure, commonly applied to data with regular grids (e.g., images), to data with irregular connections (e.g., graphs). This entails the development of new techniques: 1) non-symmetric encoding and decoding, since feature pooling on graphs is only required in the encoding stage, and 2) adjustment of the dimensions of the spatial and temporal adjacency matrices accordingly. Our deliberate design of Stacked-STGCN stemming from 1) and 2) above tackles the difficulties in adapting the traditional hourglass to data with irregular connections and produces consistent performance improvement. To the best of our knowledge, extending/adapting the hourglass structure to spatio-temporal graphs at multiple spatial and temporal resolutions has not been attempted before. Particularly, our GCN hourglass network contains a series of STGCN layers, each followed by a strided convolution layer, as the basic building block for the encoding process.
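To make the encoding path concrete, here is a rough sketch of one encoder level of the hourglass: an STGCN layer followed by a temporal convolution of stride two, reusing `STGCNLayer` and `normalize_adjacency` from the sketch above. Only temporal pooling is shown for brevity (the design described in this chapter also sub-samples the spatial dimension), and the kernel size and the way the adjacency is coarsened are assumptions.

```python
import torch.nn as nn


class HourglassEncoderLevel(nn.Module):
    """One encoder level: an STGCN layer followed by a temporal convolution
    with stride two, which halves N_t. Only temporal pooling is sketched here;
    the actual design also sub-samples the spatial dimension."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.stgcn = STGCNLayer(d_in, d_out)  # from the sketch above
        self.pool = nn.Conv1d(d_out, d_out, kernel_size=3, stride=2, padding=1)

    def forward(self, H, A_s, A_t):
        # H: (N_t, N_s, d_in); A_s, A_t: raw adjacency matrices at this level.
        H = self.stgcn(H, normalize_adjacency(A_s), normalize_adjacency(A_t))
        H = self.pool(H.permute(1, 2, 0)).permute(2, 0, 1)  # (N_t/2, N_s, d_out)
        # Sub-sample the temporal adjacency by two so the next level matches.
        return H, A_s, A_t[::2, ::2]
```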
Conventional deconvolution layers comprise the basic unit of the decoding process, bringing the spatial and temporal dimensions back to the original size. Figure 3.4 depicts an example with two levels. Note that, at each layer of STGCN, the dimensions of the spatial and temporal adjacency matrices, $A_s$ and $A_t$, need to be adjusted accordingly to reflect the downsampling operation. Take the illustrative example in Figure 3.4 for instance and assume that the adjacency matrices $A_t$ and $A_s$ are of size $N_t \times N_t$ and $N_s \times N_s$, respectively, at level 1 and that a stride of two is used. At level 2, both $A_t$ and $A_s$ are sub-sampled by two and their dimensions become $N_t/2 \times N_t/2$ and $N_s/2 \times N_s/2$, respectively. Due to the information compression enabled by the encoder-decoder structure, using hourglass networks leads to a performance gain compared to using the same number of STGCN layers one after another.

Figure 3.4: Illustration of stacked hourglass STGCN with two levels.

3.4 Experiments

3.4.1 CAD120

Dataset. The CAD120 dataset is one of the more simplistic datasets available for activity recognition [167]. It provides RGBD data for 120 videos of 4 subjects as well as skeletal data. We use the 10 action classes as our model labels, including reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing and null. The CAD120 dataset splits each video into segments of the above mentioned actions. For each segment, it provides features for object nodes, skeleton features for actor nodes, and spatial weights for object-object and skeleton-object edges. Across segments, it also provides temporal weights for object-object and actor-actor edges. The object node feature captures information about the object's location in the scene and the way it changes. OpenNI's skeleton tracker [168] is applied to the RGBD videos, producing skeleton features for actor nodes. The spatial edge weights are based on the relative geometric features among the objects or between an object and the actor. The temporal edge weights capture the changes from one temporal segment to another.

Implementation. We exploited all the node features and edge weights provided by the CAD120 dataset. The skeleton feature of an actor node is of length 630 and the feature of an object node is of length 180. We pass each of these descriptors through convolution layers to convert them to a fixed length of 512. The initial learning rate is 0.00035 and the learning rate scheduler has a drop rate of 0.9 with a step size of 1. During experimentation, four-fold cross-validation is carried out, where the videos from 1 of the 4 people are used for testing and the videos from the remaining three for training.

Results. For the CAD120 dataset, the node features and edge weights are provided by the dataset itself. The same set of features is used by S-RNN [62] and Koppula et al. [167, 169], who used a spatio-temporal CRF to solve the problem. The S-RNN trains two separate RNN models, one for nodes (i.e., nodeRNN) and the other for edges (i.e., edgeRNN). The edgeRNN is a single-layer LSTM of size 128 and the nodeRNN uses an LSTM of size 256. The actor nodeRNN outputs an action label at each time step.

Method                       F1-score (%)
Koppula et al. [167, 169]    80.4
S-RNN w/o edge-RNN [62]      82.2
S-RNN [62]                   83.2
S-RNN (multitask) [62]       82.4
Ours (STGCN)                 88.5

Table 3.1: Performance comparison based on the F1 score using the CAD120 dataset. Our STGCN improves the F1 score over the best reported result (i.e., S-RNN) by approximately 5.3%.
In Table 3.1, we show some of the previous results, including the best reported one from S-RNN, as well as the result of our STGCN. The F1 score is used as the evaluation metric. Our STGCN outperforms the S-RNN by about 5.3% in F1 score. Instead of using two independent RNNs to model interactions among edges and nodes, our STGCN collectively performs joint inference over these inherently interconnected features. This, therefore, leads to the observed performance improvement.

In Figure 3.5, we can see a couple of errors in the second and third examples. The third prediction is "opening" instead of "moving" in the second example. The previous action is "reaching", which is generally what precedes "opening" when the actor is standing in front of a microwave and looking at it. That is probably the reason for the observed erroneous detection. Also, the ninth frame is classified as "reaching" instead of "moving". If we look at the ninth frame and the eleventh frame, everything appears the same except for the blue cloth in the actor's hand. Our STGCN failed to capture such subtle changes and therefore predicted the wrong action label.

Figure 3.5: Action segmentation results of our Stacked-STGCN on the CAD120 dataset. Green: correct detection and red: erroneous detection.

3.4.2 Charades

Dataset. Charades is a recent real-world activity recognition/segmentation dataset including 9848 videos with 157 action classes, 38 object classes, and 33 verb classes [170, 171]. It contains both RGB and flow streams at a frame rate of 24 fps. It poses a multi-label, multi-class problem in the sense that at each time step there can be more than one action label. The dataset provides ground-truth object and verb labels as well as FC7 features for every 4th frame obtained from a two-stream network trained on Charades. The entire dataset is split into 7985 training videos and 1863 testing videos.

Description
Scene Features
  N1. FC7 layer output of VGG network trained on RGB frames
Motion Features
  N2. FC7 layer output of VGG network trained on flow frames
Segment Features
  N3. I3D pre-final layer output trained on RGB frames
  N4. I3D pre-final layer output trained on flow frames
Actor Features
  N5. GNN-based Situation Recognition trained on the ImSitu dataset
Object Features
  N6. Top 5 object detection features from Faster-RCNN

Table 3.2: Features for the Charades dataset.

Implementation. For the Charades dataset, we explored two types of features, one based on VGG [172] and the other based on I3D [11], for the scene nodes in our spatio-temporal graph. Further, we used the GNN-based situation recognition technique [94] trained on the ImSitu dataset [173] to generate the verb feature for the actor nodes. The top five object features of the Faster-RCNN network [165] trained on MSCOCO are used as descriptors of the object nodes. In total, the spatial dimension of our STG is eight. The VGG features are of length 4096, the verb features 1024, and the object features 2048. Each of these channels is individually processed using graph convolution layers to convert them to a fixed length (e.g., we used 512). Table 3.2 summarizes these features.

In this experiment, spatial nodes are fully connected and temporal edges allow connections across three time steps, i.e., at the $t$-th step there are edges from $t$ to $t+1$, $t+2$ and $t+3$. The spatial edges between nodes are given a much smaller weight than self-connections. We used a stack of three hourglass STGCN blocks.
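The graph connectivity just described can be sketched as follows. The exact edge-weight value for non-self spatial connections is not specified above, so `spatial_weight` below is an assumed placeholder, and the function name is illustrative.

```python
import torch


def build_charades_adjacency(num_steps, num_spatial, spatial_weight=0.1):
    """Spatial and temporal adjacency for the Charades graph: fully connected
    spatial edges (down-weighted relative to self-connections) and temporal
    edges linking step t to t+1, t+2 and t+3."""
    A_s = spatial_weight * torch.ones(num_spatial, num_spatial)
    A_s.fill_diagonal_(1.0)                   # self-connections dominate
    A_t = torch.zeros(num_steps, num_steps)
    for offset in (1, 2, 3):                  # temporal edges across three steps
        idx = torch.arange(num_steps - offset)
        A_t[idx, idx + offset] = 1.0
        A_t[idx + offset, idx] = 1.0          # keep the matrix symmetric
    return A_s, A_t


# Example: a training window of 50 time steps and 8 spatial nodes.
A_s, A_t = build_charades_adjacency(num_steps=50, num_spatial=8)
```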
Before applying the normalized adjacency matrix, the input is also normalized by subtracting the mean. The output of the final Stacked-STGCN block is spatially pooled and passed through a fully connected layer to generate the probability scores of all possible classes. Since Charades is a multi-label, multi-class dataset, the binary cross-entropy loss was used. We used an initial learning rate of 0.001 and a learning rate scheduler with a step size of 10 and a drop rate of 0.999.

To further improve action segmentation performance on Charades, we have also used an I3D model trained on Charades to generate descriptors for the scene nodes, replacing the VGG features. These feature descriptors are of length 1024. Since I3D already represents short-term temporal dependencies, one block of hourglass STGCN is sufficient for capturing long-term temporal dependencies. The initial learning rate was 0.0005 and the learning rate scheduler was fixed at a drop rate of 0.995 with a step size of 10.

During training, we chose our maximum temporal dimension to be 50. If the length of a video segment is less than 50, we zero-pad the rest of the positions, but these positions are not used for loss or score computation. If the length of a video segment is greater than 50, we randomly select a starting point and use the 50 consecutive frames as the input to our graph. At test time, we used a sliding window of length 50. Based on overlapping ratios, we applied a weighted average over these windowed scores to produce the final score. We used an overlap of 40 time steps. Following the instructions of the Charades dataset, we selected 25 equally spaced points from the available time steps in the video to generate the final score vectors.

Ablation Studies. For the Charades dataset, the mean average precision (mAP) is used as the evaluation metric. For fair comparison, we have used the scripts provided by the Charades dataset to generate mAP scores. We examined the performance of Stacked-STGCN using two types of descriptors for the scene nodes, namely frame-based VGG features and segment-based I3D features (see Table 3.2). We summarize our ablation studies in Table 3.3.

(A1) All Features; Baseline                       7.67
(A2) All Features; STGCN                          9.22
(A3) VGG-RGB; STGCN; 1 time step                  6.33
(A4) VGG-RGB; STGCN                               6.54
(A5) All Features; Stacked-STGCN; 1 time step    10.93
(A6) VGG-RGB; Stacked-STGCN                       7.91
(A6) VGG-RGB+VGG-Flow; Stacked-STGCN             10.94
(A7) All Features; Stacked-STGCN                 11.73

Table 3.3: Comparison of our Stacked-STGCN (A7) with the baseline (A1), STGCN without hourglass (A2), different temporal connections (A3-A5), and different input features (A6). Input features include VGG-RGB for scene, VGG-Flow for motion, Situation Recognition for action, and Faster RCNN for object.

We first examine the performance improvement introduced by structured inference over the contextual information represented in spatio-temporal graphs. We implemented a baseline method, (A1) in Table 3.3, which employs a fully connected layer for joint inference over multiple types of features. We compare our Stacked-STGCN (A7) with this baseline (A1) and demonstrate an improvement of 4.06%. We also compare our Stacked-STGCN (A7) with an implementation without the hourglass structure (A2) and demonstrate an improvement of 2.51% in Table 3.3. For a fair comparison in this experiment, we design a network (A2) with the same number of convolutional layers as the encoder of our Stacked-STGCN.
To maintain the same temporal resolution, these convolution layers have a stride of one, compared to a stride of two in the Stacked-STGCN.

We further implement a network that closely resembles the original STGCN: 1) nodes are represented by the same type of features (i.e., VGG-RGB); 2) pure graph convolutional operations (i.e., without hourglass); and 3) temporal connections across one time step. Compared to this vanilla implementation (A3), our Stacked-STGCN (A7) produces an improvement of 5.40% in Table 3.3.

Next, we conduct a study on the performance of Stacked-STGCN with different input features. With one, two and four types of features, the performances are 7.91, 10.94, and 11.73, respectively, in Table 3.3 (A6, A7). This steady improvement is due to more context gained from enriched input features.

Finally, we study the performance of our Stacked-STGCN with different temporal connections. Comparing (A7) vs. (A5) in Table 3.3, temporal connections with three time steps demonstrate an improvement of 0.80%. With a simpler network (i.e., without hourglass), we observe an improvement of 0.21%, (A4) vs. (A3). The optimal number of time steps can vary depending on networks and applications. The empirical optimal number for our Stacked-STGCN on Charades is three.

Comparison with State-of-the-art. In Table 3.4, the performance of Stacked-STGCN is compared with a baseline, which uses two-stream VGG or I3D features directly for per-frame action label prediction, an LSTM-based method, and the Super-Events approach proposed in [1]. Our Stacked-STGCN yields an approximate 2.41% and 3.20% improvement in mAP using VGG features only and all four types of features, respectively. Using I3D features, our Stacked-STGCN ranks second.

Method                          VGG mAP    I3D mAP
Baseline [1]                     6.56      17.22
LSTM [1]                         7.85      18.12
Super-Events [1]                 8.53      19.41
Stacked-STGCN (VGG only)        10.94        -
Stacked-STGCN (all features)    11.73        -
Stacked-STGCN (I3D)               -        19.09

Table 3.4: Performance comparison based on mAP between our Stacked-STGCN and the best reported results published in [1] using the Charades dataset. Our Stacked-STGCN yields an approximate 2.41% and 3.20% improvement in mAP using VGG features only and all four types of features, respectively.

Method                                      mAP
Random [174]                                2.42
RGB [174]                                   7.89
Predictive-corrective [175]                 8.90
Two-Stream [174]                            8.94
Two-Stream + LSTM [174]                     9.60
Sigurdsson et al. standard [174]            9.69
Sigurdsson et al. post-processing [174]    12.80
R-C3D [176]                                12.70
I3D [11]                                   17.22
I3D + LSTM [1]                             18.10
I3D + Temporal Pyramid [1]                 18.20
I3D + Super-events [1]                     19.41
I3D + Stacked-STGCN (Ours)                 19.09

Table 3.5: Performance comparison based on mAP with previous works using the Charades dataset.

In Table 3.5, we compare the performance of Stacked-STGCN against selected works on Charades. We can see that our Stacked-STGCN outperforms all the methods except for I3D+Super-events [1], which employs an attention mechanism to learn a proper temporal span per class. We believe that incorporating such an attention mechanism could further improve the performance of our Stacked-STGCN. Furthermore, our method provides a principled way of structured inference over heterogeneous features, which most of the listed methods are incapable of. Another set of results on Charades is from the workshop held in conjunction with CVPR 2017. The results in that competition appear better. However, as mentioned in [1], that competition used a test set that is different from the validation set we used for performance evaluation.
Besides, those techniques could have used both the training and validation sets for training.

Chapter 4: All About Knowledge Graphs for Actions

After successfully using GCNs for action segmentation, we also apply them to zero/few-shot action recognition. This chapter contains a detailed description of our system from Figure 4.1. We use the GCN layer described in equation 3.1 from [34] and the pipeline from the zero-shot learning technique by Wang et al. [17]. It consists of training and testing phases as described next.

Training: Initially, a model pre-trained on Kinetics is fine-tuned using the training classes of UCF101, HMDB51, or Charades, followed by extraction of the final classifier layer weights to be used for training the GCN. The constructed KG, along with the adjacency matrix, are inputs to the GCN. The output of each node of the GCN has the same dimensions as the trained classifier layer filter size (1024 in our case). The GCN is trained such that its output for the training classes matches the classifier layer weights of the trained I3D model. The loss used is the mean squared error (MSE) loss. If there are $C_{train}$ training classes, $C_{test}$ test classes, and the output feature dimension of each class is $d$, then the output of the GCN, $H_{GCN}$, is of size $(C_{train}+C_{test}) \times d$. From $H_{GCN}$, the output dimensions corresponding to the training nodes are selected, denoted by $H_{GCN}^{train}$ with size $C_{train} \times d$. This feature is of the same dimension as the weights of the I3D classifier layer trained or fine-tuned on the training classes of the dataset, $W_{cls}$. The MSE loss that is back-propagated is given by $\|H_{GCN}^{train} - W_{cls}\|_2$.

Figure 4.1: System overview: We use knowledge graphs based on word embeddings (action class names, and associated verbs and nouns) and visual features for action recognition. With the word embeddings based knowledge graph, we propose a zero-shot learning approach, and with the visual features based knowledge graph we propose a few-shot learning approach.

Testing: During test time, the penultimate layer of the I3D model is used to extract the features of the test images, $f_{test}$, with dimensions $N \times d$. The output of the test nodes of the GCN, with dimension $C_{test} \times d$, is extracted from $H_{GCN}$ and denoted by $H_{GCN}^{test}$. The output class probabilities for the test images ($P_{test}$) are obtained as $P_{test} = f_{test}\,(H_{GCN}^{test})^T$.

4.1 Proposed Knowledge Graphs for Actions

In this section, we describe the construction of different knowledge graphs (KGs) for actions. Wang et al. [17] use WordNet [29] and NELL [30] embeddings to construct the KG for zero-shot learning (ZSL) on image classification. Compared to [17], our action label classes are sentences or phrases instead of words, which is why using WordNet or word2vec does not provide distributive and coherent embeddings for action labels. Moreover, getting a semantically correlated embedding space for words and visual features for a good KG is another challenge. We describe these challenges and how we tackle them while constructing three different versions of KGs for actions.
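Before describing the individual graphs, the training objective and test-time scoring outlined above can be summarized in a short sketch. The `gcn` module, the index tensors, and the optimizer setup are placeholders; this is an illustration of the described procedure under those assumptions, not the released code.

```python
import torch
import torch.nn.functional as F


def zsl_train_step(gcn, node_features, A_hat, train_idx, W_cls, optimizer):
    """One training step: regress the GCN output at the training-class nodes
    onto the fine-tuned I3D classifier weights with an MSE loss."""
    H = gcn(node_features, A_hat)               # (C_train + C_test, d)
    loss = F.mse_loss(H[train_idx], W_cls)      # match classifier weights W_cls
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def zsl_test_scores(gcn, node_features, A_hat, test_idx, f_test):
    """Test-time scoring: the GCN outputs at the test nodes act as classifiers
    for the unseen classes and are applied to the I3D video features f_test."""
    with torch.no_grad():
        H_test = gcn(node_features, A_hat)[test_idx]   # (C_test, d)
        return f_test @ H_test.T                       # (N_videos, C_test)
```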
A-KG: The first KG is based on word descriptors of the action class names, hence called the action KG or A-KG. Since our action classes are composed of multiple words, like a sentence or phrase, averaging the word2vec embeddings of all words in the sentence does not provide a cohesive embedding space. We discuss the experimental results for word2vec embeddings in Section 4.4. To overcome this challenge, we use the sentence2vec model described in [36], which is an unsupervised learning method to learn embeddings for whole sentences. We use the unigram model trained on Wikipedia to generate our sentence embeddings. The node features in A-KG are the sentence embeddings.

Nodes from the Kinetics action classes are added to the A-KG corresponding to each dataset (UCF101, HMDB51, and Charades). This is inspired by [115, 116], where they show distinct advantages of adding classes and images from other datasets in ZSL. Although we cannot directly add images due to the way our model is constructed, we add new activity classes from the Kinetics dataset to increase the size of our KGs. Appending the 400 Kinetics classes to UCF101 results in a total of 501 nodes in the A-KG. Similarly, appending the nodes to HMDB51 and Charades results in a total of 451 nodes and 557 nodes, respectively. We show more results on performance comparison with and without adding Kinetics nodes in Section 4.4.

With the sentence2vec node features, we construct the A-KG where node $i$ is connected to another node $j$ in the combined dataset based on edge weights $A_{ij}$ from the cosine similarity of the node features. Here, $A$ is the adjacency matrix for A-KG. We sort the edge weights in descending order to get the top $N$ closest neighbors per node. $N$ is a hyperparameter that is determined experimentally and is dependent on the dataset; it is 5 for HMDB51 and UCF101 and 20 for Charades. Note that $j$ being one of the top $N$ neighbors of $i$ does not imply that $i$ is one of the top $N$ neighbors of $j$. To make the adjacency matrix symmetric, we fill $A_{ji}$ with the same value as $A_{ij}$, so the number of connections per node is at least $N$.

VN-KG: The second graph is constructed with verbs and nouns associated with each action class, hence it is called the verb-noun KG or VN-KG. This graph is inspired by multiple works on zero-shot action recognition using human-object interaction, where the detected objects in the scene are used to draw the relationships between seen and unseen action classes [81, 119]. In [81], object detection is carried out in the visual domain as well and then mapped to the word domain for ZSL. We do not map object features from the visual to the word domain. Instead, we simply take the output of the verb and noun graphs (VN-KG) and pass it through the fusion layer to get the visual action classifier weights.

To construct VN-KG, we use a standard language lemmatizer [177] to break up a phrase describing an action and convert each word to its root form. Then, we use a part-of-speech (pos) tagger [178] to label each word as a noun or a verb. Still, a lot of action class names do not have a noun in the phrase, for example "beatboxing". For such classes the pos tagger gives a noun label of "unknown", and if WordNet can return a noun that is related to that word, we replace the "unknown" by the noun. For action classes like "archery", which do not have a specific verb associated with them, we replace the verb with "doing". For node features, we compute sentence2vec embeddings as above for the verbs and nouns. Hence, we get a set of graphs with only verbs and only nouns.
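The neighbor-based adjacency construction described above, which is shared by A-KG and the verb and noun graphs, can be sketched as follows (`top_n` is 5 for UCF101/HMDB51 and 20 for Charades; the function name and looping style are illustrative).

```python
import torch
import torch.nn.functional as F


def build_kg_adjacency(node_embeddings, top_n=5):
    """Adjacency construction shared by A-KG and VN-KG: cosine similarity
    between node embeddings, keep the top-N neighbours per node, then fill
    A_ji = A_ij so every node has at least N connections."""
    X = F.normalize(node_embeddings, dim=1)
    sim = X @ X.T                             # cosine similarity matrix
    A = torch.zeros_like(sim)
    top = sim.topk(top_n + 1, dim=1).indices  # +1 since each node is its own best match
    for i, neighbours in enumerate(top):
        for j in neighbours.tolist():
            if j != i:
                A[i, j] = sim[i, j]
                A[j, i] = sim[i, j]           # symmetrize the adjacency
    return A
```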
These graphs also have the same number of nodes as A-KG. Moreover, they are used and categorized together as VN-KG, since individually they provide only partial information about an action class (either verb or noun). A-KG and VN-KG can be used to define the ZSL setup.

V-KG: The third graph is developed to see relative performance improvements from incorporating only a few labelled videos per test class. We use averaged visual features as nodes in this graph, hence it is called the visual feature based graph or V-KG. In the visual feature space, we see implicit clustering of similar actions, which is sometimes not captured in the word embedding space. For example, "pommel horse" and "horse walking" are considered similar in word embedding space, but these are very different activities, which is captured in the visual embedding space, shown for the UCF101 dataset in Figure 4.2. We randomly pick 5 videos from each test class and use I3D to generate video features as described in Section 4.2.2. Then, taking the mean of these features, we get the graph node descriptors and take their cosine similarity to generate the adjacency matrix, as we do for A-KG and VN-KG. This generates a graph based on visual features. V-KG is used to replicate a few-shot learning (FSL) setup using KGs, since we use 5 visual features of each test class to construct the nodes. In the FSL setup, we can combine V-KG with A-KG and VN-KG to improve results.

4.2 Experimental Setup

4.2.1 Datasets

We use the following four datasets, where Kinetics is used only for pre-training the model and the rest are used for experiments:

Kinetics [10]: Kinetics is a large dataset with 400 classes and about $3 \times 10^6$ videos. We do not actually need access to the Kinetics videos, only the class names and an I3D model pre-trained on Kinetics, available in [11]. Since we use Kinetics for pre-training I3D and for data augmentation while training the GCN, we cannot keep classes common to Kinetics and UCF101, HMDB51 or Charades in the test set while doing ZSL. So, we use the classes in UCF101, HMDB51 and Charades that are also present in Kinetics as the training set.

UCF101 [179]: UCF101 has 13320 videos from 101 classes. After removing common classes with Kinetics, we get 23 classes with 3004 videos in the test set for UCF101, and the remaining 78 classes are used for training. Some test class labels do not have semantically correlated neighbors, so we appended these class names with extra words; for example, "front crawl" in UCF101 becomes "front crawl swimming". We discuss class-wise accuracy for the test classes in Figure 4.4.

HMDB51 [180]: HMDB51 has 6849 videos from 51 classes. Similar to UCF101, we remove common classes with Kinetics and get 12 classes with 1541 videos for HMDB51's test set and the remaining 39 classes for training. Additionally, to encourage correlation with action classes in Kinetics, we convert the class labels to continuous tenses. For example, a class like "eat" uses the sentence2vec embedding corresponding to "eating".

Charades [181]: Charades has 9848 videos from 157 classes and is also a multi-label dataset, meaning each video can have multiple action labels. Charades has noun and verb labels associated with each action class, which we use directly without labelling ourselves. After removing all videos which have at least one label in common with Kinetics, we are left with 110 possible test classes. Each video can have both training and test labels in Charades; we cannot separate the training and test videos, only the classes.
We split the classes into a 50-50 train-test split, meaning there are 79 training and 78 test classes, respectively. The 78 test classes come from the 110 classes not in common with Kinetics. All videos with at least one training class are kept in the training set, and we remove the test class labels from them. The rest of the videos are test videos, and the training class labels are removed from them.

Figure 4.2: t-SNE visualization showing the feature distribution of the UCF101 video dataset. Sample images are added for our test classes. (Best viewed in digital format)

4.2.2 Feature Extraction

To extract video features, we use the initial I3D model trained on Kinetics data and fine-tune the last layer on the training classes of either UCF101 or HMDB51. For Charades, just fine-tuning the last layer did not yield good classification performance, so we fine-tune the whole network. This means that while training, we cannot compute the loss on the Kinetics nodes in the KG for Charades. Even after fine-tuning the complete network for Charades we did not achieve significant performance for ZSL, so we use the inverse of the cross-correlation of the training features added to a weighted identity matrix of the same size, multiplied by the product of the training features and training labels. This is used as the last-layer weights to train the GCN, inspired by [127]. We visualize the video feature space distribution of the UCF101 classes in Figure 4.2 with some example images for the test classes. As we can see in Figure 4.2, similar classes are grouped together, forming clusters.

4.2.3 Our Pipeline

Our GCN consists of 6 layers with filter dimensions of 600 × 512 × 1024 × 1024 × 1024 × 1024. We choose 6 layers empirically. Our hypothesis is that a lower depth might reduce the field of view necessary for information transfer, whereas higher depths might result in over-smoothing. The convolution kernel is of size 1. For training/fine-tuning both the I3D model and the GCN model, we use the ADAM optimizer with an initial learning rate of 0.001. A stepwise scheduler with a drop rate of 0.99 after every 100 epochs is used for I3D training. For the GCN, the stepwise scheduler drop rate is 0.999 after every 100 epochs. Class-wise mean accuracy is used as the evaluation metric for UCF101 and HMDB51, and mean average precision (mAP) scores for Charades. Most of the training parameters are the same for the FSL setup as well, except that we use a smaller learning rate of 0.00005 for UCF101.

To fuse the outputs of the different KGs, we concatenate along the channel dimension and then pass them through a GCN layer. This fusion GCN layer uses the adjacency matrix of A-KG for zero-shot and of V-KG for few-shot. For A-KG+VN-KG in UCF101 and Charades, this fusion technique did not give good performance, so we use the weighted sum of the outputs of A-KG and VN-KG with weights of 0.9 for A-KG and 0.05 each for the verb and noun outputs from VN-KG.

4.3 Results

Dataset     A-KG     VN-KG    A-KG+VN-KG
UCF101      49.14    45.47    50.13
HMDB51      38.01    31.57    40.77
Charades    15.81    12.48    18.21

Table 4.1: ZSL results for all 3 datasets, where we compare the performance of A-KG, VN-KG and a combination of the two. A-KG+VN-KG always does the best. For UCF101 and HMDB51, the results are in mean accuracy, whereas for Charades we report mean average precision (mAP).

The results for ZSL on all 23 test classes for UCF101, 12 test classes for HMDB51 and 78 test classes for Charades are in Table 4.1. These results are based on the KGs A-KG and VN-KG and the combination of both.
The combination of A-KG and VN-KG is done through the fusion process described in Section 4.2.3. Since all datasets have many action classes without any nouns, VN-KG alone does not give good performance, but the combination A-KG+VN-KG works well.

Method                 UCF101                        HMDB51                        Charades
                       23-78 split   50-51 split     12-39 split   25-26 split     78-79 split
ESZSL [127]            35.27         15.0            34.16         18.5            17.21
DEM [128]              34.26         -               35.26         -               -
Objects2Action [119]   -             30.3            -             15.6            -
CEWGAN [125]           -             26.9            -             30.2            -
TS-GCN [81]            44.5          34.2            -             23.2            -
Ours                   50.13         -               40.19         -               18.21

Table 4.2: ZSL results for all 3 datasets. The baselines are ESZSL, DEM, Objects2Action, CEWGAN and TS-GCN. For UCF101 and HMDB51, the results are in mean accuracy, whereas for Charades we report mean average precision (mAP) since it is a multi-label dataset.

We also provide a comparison with the state-of-the-art in Table 4.2. For our data split, we have compared our results with three previous works carried out under similar ZSL settings: ESZSL [127], DEM [128] and TS-GCN [81]. We could not apply the DEM baseline to Charades, since it is a multi-label dataset. Also, TS-GCN only released code for the transductive setup for UCF101; we have implemented the inductive version and compared to it. We have also added some of the recent results for ZSL. Either their splits are different, or they do not provide code, or an essential part of their framework is missing. However, note that the recent work of [81] outperforms these other approaches on their splits, and we outperform [81] on our splits.

Dataset   Baseline   V-KG    V-KG+A-KG   V-KG+VN-KG   V-KG+A-KG+VN-KG
UCF101    52.7       57.04   62.10       59.92        64.24
HMDB      30.2       45.07   45.67       47.61        47.69

Table 4.3: FSL results for the UCF101 and HMDB51 datasets. The baseline is nearest neighbor, given 5 videos for each test class. The combination of A-KG, VN-KG and V-KG does the best in both cases.

We report results for combining V-KG with A-KG and VN-KG in Table 4.3. Since we are using V-KG, these experiments can be considered as few-shot learning. To create a baseline, we used nearest neighbor search to get the class label for the test videos. Based on the 5 labelled videos provided, we calculate the mean feature for each class and then use the cosine distances between the rest of the test videos and these class centers to sort them into the corresponding classes. We use the same train-test class splits for UCF101 and HMDB51 as used in ZSL. For both UCF101 and HMDB51, we get the best results if we use all 3 KGs. We do not conduct this experiment for Charades since each video has multiple labels; hence, each video data point would update multiple class centers, resulting in overlapping class distributions.

4.4 Analysis

Word embeddings for action labels: For constructing node features from action labels, we used the word2vec embeddings trained on Google News [20, 21, 22]. For all words in each class name, the word2vec embeddings were averaged to give a resultant embedding for the whole phrase, which serves as the feature of the node in the KG. In Figure 4.3(b), we show the word2vec embedding space of the node "Pommel Horse" and its nearest neighbor class nodes. Averaging word2vec embeddings for all words in an action class label phrase works in some cases, but it cannot always capture the meaning or correct relationships between the action classes. Hence, for a class like "riding or walking with horse" in the Kinetics dataset, the embeddings of the individual words are located far apart from each other, as displayed in Figure 4.3(b).
The mean of these individual words does not lie close to related words in the embedding space and hence does not capture meaningful information.

Figure 4.3: (a) Sentence2Vec embedding space for Kinetics and UCF101 classes. The class "uneven bars" and its neighbors are highlighted. (b) Class "Pommel horse" and its neighboring classes in the Kinetics dataset using word2vec embeddings. The embeddings of each individual word forming the phrase are also displayed. (Best viewed in digital format)

To solve this problem we use the sentence2vec model from [36], which captures the semantic meaning of sequences of words. Using this embedding space, the closest match to a class like "uneven bars" is "gymnastics tumbling". The embedding space for all the classes in UCF101 and Kinetics is displayed in Figure 4.3(a), with the class "uneven bars" and its neighbors emphasized.

Method         Mean Accuracy
Word2Vec       38.02
Sentence2vec   49.14

Table 4.4: Performance comparison between word2vec embedding and sentence2vec embedding based models. Both models are trained on graphs consisting of class nodes from Kinetics and UCF101 (A-KG) with losses on both. The performance metric used is mean accuracy.

We run experiments with both word2vec embeddings trained on Google News [20, 21, 22] and sentence2vec embeddings based on the unigram model trained on Wikipedia [36]. The results on the UCF101 A-KG are shown in Table 4.4. These results show a significant improvement from using sentence2vec over word2vec.

Knowledge Graph   Nodes for Loss Computation   Mean Accuracy
UCF only          UCF                          27.72
UCF+Kinetics      UCF                          32.85
UCF+Kinetics      UCF+Kinetics                 49.14
HMDB only         HMDB                         31.09
HMDB+Kinetics     HMDB                         29.22
HMDB+Kinetics     HMDB+Kinetics                38.01

Table 4.5: Experiments with 3 different knowledge graph constructions. The variations are due to using only UCF101/HMDB51 classes for the knowledge graph or appending it with Kinetics classes, and the training loss being calculated on UCF101/HMDB51 nodes only or on both UCF101/HMDB51 and Kinetics nodes in the knowledge graphs (A-KG). The performance metric used is mean accuracy.

Appending Knowledge Graphs with more action classes: We augment the UCF101 and HMDB51 action-class-name based KGs with Kinetics class labels in three different ways. In the first configuration, either the UCF101 nodes or the HMDB51 nodes are used in the KG (101/51 nodes), out of which 78 and 39 are training nodes, respectively. The loss is computed by comparing the output of the GCN on these classes to the weights in the final classifier layer of the fine-tuned I3D network. The second configuration uses the same KG as A-KG, explained in Section 4.1. The loss is computed by comparing the output of only the UCF101 or HMDB51 training nodes (78/39 nodes) to the final classifier layer of the fine-tuned I3D network. In the third configuration, A-KG is again used, but now the loss is computed by summing two MSE losses: (a) Loss 1, comparing the output of only the UCF101 or HMDB51 training nodes (78/39 nodes) to the final classifier layer of the fine-tuned I3D network; and (b) Loss 2, comparing the output of the Kinetics nodes (400 nodes) to the classifier layer weights of the I3D pre-trained on Kinetics. The results of these three experiments are shown in Table 4.5. For both UCF101 and HMDB51, the third configuration works best.

Types of connections in Knowledge Graphs: While constructing the A-KG with both UCF101 or HMDB51 and Kinetics dataset nodes, we used two types of graph connections.
In fully-connected graphs, all nodes can be connected to all other nodes, out of which we select the top 5 connections. In the bipartite setting, for every node in the UCF101 or HMDB51 dataset, we find the top 5 connections to the Kinetics dataset nodes, and vice versa. The fully connected (FC) graph works better than the bipartite graph (Table 4.6).

Method      Mean accuracy for UCF   Mean accuracy for HMDB
FC          49.14                   38.01
Bipartite   33.11                   28.49

Table 4.6: Performance comparison of fully connected (FC) and bipartite graphs constructed with UCF101 or HMDB51 and Kinetics dataset nodes in A-KG. Both models are trained on graphs consisting of class nodes from two datasets (UCF101 and Kinetics, or HMDB51 and Kinetics) with losses on both. The performance metric used is mean accuracy.

Figure 4.4: This figure shows class-wise accuracy for different KGs and combinations of KGs for UCF101 and HMDB51. We added a few words for better word embeddings in the labels (such as "front crawl" becomes "front crawl swimming"), which improves performance for language based KGs or their combinations, as shown here. Each bar color represents a KG: blue is the word based KG, orange is the visual feature based KG and grey is the combination of all three KGs (A-KG, VN-KG and V-KG).

Analysis of Class-wise Accuracy using different Knowledge Graphs: To understand the impact of using A-KG, VN-KG and V-KG for learning each test class, we plot the class-wise accuracy for UCF101 and HMDB51 in Figure 4.4. Each color of the bar represents a different KG: blue is for the word based A-KG, orange is for the visual feature based V-KG and grey is the combination of A-KG, VN-KG and V-KG. As observed in Figure 4.4, for a few classes such as "billiards" and "playing tabla", A-KG performs the best. These classes innately have many neighbors in the word embedding space, which helps in learning them from the given training classes. A few other classes, such as "front crawl swimming", "chew food" and "pour liquid", perform well with just A-KG as well, since we add the extra words "swimming", "food" and "liquid", respectively, to enforce good neighbors in the language domain. Intuitively, V-KG does well for "uneven bars", "fall floor", "smile" and "shoot gun", since these have distinct visual features. The combination KG works well for "parallel bars", "jumping jack", "playing daf", "playing dhol", "climb stairs", "talk" and "wave".

Ablation for Network Architecture: We experiment with different numbers of GCN layers (2, 4, 6, 8 and 10) to explore the influence of GCN depth on performance for both UCF101 and HMDB51. Increasing the number of GCN layers increases smoothing, while decreasing the number of layers reduces information propagation. We found that 6 layers gives us the best performance.

Method               Mean accuracy
GCN                  49.14
Linear Combination   42.57

Table 4.7: Performance comparison of using a GCN (on UCF101 A-KG) vs a linear combination (using the adjacency matrix edge weights) of the top 4 closest training class weights for the test classes. The performance metric used is mean accuracy.

Usefulness of GCN vs a linear combination of training class weights: To show the performance improvement on UCF101 due to the GCN on A-KG compared to just linear combinations, we perform an ablation study. For each test class, we find the top 4 neighbors in the training set. Then, using the adjacency edge connection weights, the classifier layer weight for the test class is computed as a weighted average of the classifier layer weights of its neighbors. The performance is in Table 4.7.
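A small sketch of this linear-combination baseline, under the assumption that the A-KG adjacency matrix and the training-class classifier weights are available as tensors (the function and variable names are illustrative):

```python
import torch


def linear_combination_classifiers(A, test_idx, train_idx, W_cls, top_k=4):
    """Baseline in Table 4.7: the classifier weight of each test class is a
    weighted average of the classifier weights of its top-4 closest training
    classes, weighted by the A-KG edge weights."""
    W_test = []
    for t in test_idx:
        edge_weights = A[t, train_idx]        # edge weights to the training classes
        top = edge_weights.topk(top_k)
        w = top.values / top.values.sum()     # normalized combination weights
        W_test.append((w.unsqueeze(1) * W_cls[top.indices]).sum(dim=0))
    return torch.stack(W_test)                # (C_test, d)
```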
Method                    Mean accuracy
without encoder-decoder   49.14
with encoder-decoder      47.72

Table 4.8: Performance comparison of using an encoder-decoder layer before the GCN layers on UCF101 A-KG vs not using one. The performance metric used is mean accuracy.

Use of an encoder-decoder before the GCN: We run another set of experiments where a 2-layer encoder-decoder network is added before the GCN on UCF101 A-KG, to improve the encoding of the sentence embedding features. The results do not show any promise, as seen in Table 4.8.

Method   Nodes for Loss Computation   Split 1   Split 2   Split 3   Split 4   Split 5   Mean
ESZSL    -                            61.25     60.30     53.68     64.81     60.56     60.12
DEM      -                            60.87     65.88     41.89     61.90     52.11     56.53
Ours     UCF101                       59.68     48.51     42.18     49.86     43.12     48.67
Ours     UCF101+Kinetics              83.62     72.60     71.57     70.85     49.39     69.61

Table 4.9: Results on UCF101 A-KG with 10 randomly selected test classes, leaving 91 classes to be used for training I3D and the GCN. Mean accuracy is used for evaluation. The experiments are carried out 5 times and the final column provides the averaged mean accuracy scores. We compare our results to two previous works with similar settings.

Random test-train splits: Some of the experiments are done on a random sub-sample of the test-set classes. For UCF101 A-KG, we choose 10 out of the 23 classes 5 times, so that for each random sample of 10 test classes, the remaining 91 classes form the training set. The mean accuracy score is calculated after each run and the results of all 5 runs are averaged to get the final mean accuracy score. The results for each of these splits are in Table 4.9.

Figure 4.5: Heatmaps showing activations of various classes' classifier layers, obtained from training on UCF101 A-KG, on various class videos. (a) is the activation of the "playing sitar" class on a "playing sitar" video, (b) is the activation of the "playing guitar" class on a "playing guitar" video, (c) is the activation of the "playing sitar" class on a "playing guitar" video, (d) is the activation of the "biking" class on a "biking" video and (e) is the activation of the "playing sitar" class on a "biking" video. These heatmaps show that the test class "playing sitar" correctly learns from the training class "playing guitar" instead of the training class "biking".

Learning classifiers for unknown classes from related classes in the Knowledge Graph: The heatmaps in Figure 4.5 depict the test nodes learning from the interconnections to the train nodes in A-KG. They are based on CAM [182]. Consider the test class "playing sitar" in UCF101: one of its top 5 nearest train classes in UCF101 is "playing guitar", and one of the random classes that has no relation to it is "biking". Among the five sub-figures in Figure 4.5, (a) is the activation of the "playing sitar" class on a "playing sitar" video, (b) is the activation of the "playing guitar" class on a "playing guitar" video, (c) is the activation of the "playing sitar" class on a "playing guitar" video, (d) is the activation of the "biking" class on a "biking" video and (e) is the activation of the "playing sitar" class on a "biking" video. What we show here is that the "playing sitar" classifier is similar to the "playing guitar" classifier, and hence the heatmaps from both are similar. This is not the case between "playing sitar" and "biking".
Chapter 5: Learning Graphs for Knowledge Transfer with Limited Labels

We further improve GCN based semi-supervised classification and zero/few-shot action recognition performance by learning and updating the input knowledge graph over time and by using an additional constraint via a triplet loss. The GCN network for semi-supervised learning is a 2-layer network based on the spectral GCN form introduced by [34] and given in equation 3.1. We use the same GCN framework as described in Chapter 4 for zero/few-shot action recognition. The system overview for learning the graph structure while using the triplet loss is in Figure 5.1, and the algorithm specifically for zero/few-shot learning is summarized in Algorithm 1.

5.1 Our Approach

The transfer of knowledge from training to test nodes relies heavily on the quality of the input graph. Better input inter-relationships among nodes lead to a better output of the GCN-based framework. All GCN-based frameworks, with a few exceptions for both semi-supervised learning and zero/few-shot learning, use a fixed adjacency matrix throughout the GCN network. However, as discussed earlier, being able to learn the adjacency matrix is desirable and challenging.

Figure 5.1: System overview for adaptive learning of graph connections. The input graph is passed through a GCN layer and this intermediate output is used to update the graph as well as to calculate a triplet loss between the current nodes and the positive and negative sets. This output is then passed through another GCN network that generates outputs specific to the task at hand. The final output is used to calculate the task-specific loss, such as the MSE loss for zero-shot learning.

Algorithm 1: System overview for zero/few-shot learning
Input: Input KG with node features ($H^{feat}$) and adjacency matrix ($A^{in}$); pre-trained I3D network for test video feature extraction ($f^{test}$) and final classifier layer weights for the training classes ($W^{cls}$); number of epochs per update ($n$)
Output: Classification probability scores for all test classes ($P^{test}$)
Networks: GCN1 and GCN2 are two GCN networks
 1: procedure GCN training and testing
 2:   $A \leftarrow A^{in}$
 3:   $ref \leftarrow$ example reference node
 4:   $P \leftarrow$ positive neighboring set for $ref$ based on $A^{in}$
 5:   $N \leftarrow$ negative neighboring set for $ref$ based on $A^{in}$
 6:   while not converged do
 7:     $H^{inter} \leftarrow$ GCN1($H^{feat}$, $A^{in}$);  $H^{out} \leftarrow$ GCN2($H^{inter}$, $A$)
 8:     $H^{train} \leftarrow H^{out}$ for training classes;  $H^{ref} \leftarrow H^{inter}$ for the $ref$ node;  $H^{P} \leftarrow$ mean($H^{inter}$ for positive neighbors in $P$);  $H^{N} \leftarrow$ mean($H^{inter}$ for negative neighbors in $N$)
 9:     $d^{P} = \|H^{ref} - H^{P}\|_2$,  $d^{N} = \|H^{ref} - H^{N}\|_2$
10:     Loss $\leftarrow L_{MSE} + L_{triplet} = \|W^{cls} - H^{train}\|_2 + \max(d^{P} - d^{N} + \gamma, 0)$, where $\gamma$ is the margin
11:     if epoch mod $n$ = 0 then
12:       $A^{updated}_{ij} = \dfrac{H^{inter}_i \cdot H^{inter}_j}{\|H^{inter}_i\|\,\|H^{inter}_j\|}$,  $A =$ Normalize($A^{updated}$)
13:   $H^{out*} \leftarrow$ output of the optimized network;  $H^{test} \leftarrow H^{out*}$ for the testing classes;  $P^{test} = f^{test} (H^{test})^T$

5.1.1 Adaptively Updating the Adjacency Matrix

Let GCN1 be the part of the original network that gives an intermediate output, and let the rest of the original GCN be GCN2. The output of GCN1 is used to recalculate the adjacency matrix, where the edge weights are the cosine similarities of the output node values of GCN1. Then, we use the new adjacency matrix as the input to GCN2, starting from the next epoch.

More formally, let $h_k^{l-1}$ be the output of the $k$-th node at the $(l-1)$-th layer.
This passes through the $l$-th convolution layer with weights $W^l$. Then, for each node, there is a weighted aggregation over its neighbors, $N_i$, where the edge weight connecting nodes $i$ and $k$ is represented by $c_{ik}$. So the $l$-th layer output of the $i$-th node in GCN1, $h_i^l$, is given by equation 5.1, where $\sigma$ is the non-linearity following the aggregation. Similarly, $h_j^l$ is the output of the $j$-th node at the $l$-th layer. Then, the new edge weight connecting nodes $i$ and $j$ is given by the cosine similarity of $h_i^l$ and $h_j^l$, as shown in equation 5.2.

$$h_i^l = \sigma\Big(\sum_{k \in N_i} c_{ik}\, h_k^{l-1} W^l\Big) \tag{5.1}$$

$$c_{ij} = \mathrm{Normalize}\left(\frac{h_i^l \cdot h_j^l}{\|h_i^l\|\,\|h_j^l\|}\right) \tag{5.2}$$

We denote the original adjacency matrix by $A$ and the updated one by $A^{new}$. In equation 5.2 we normalize the adjacency matrix with the node degree matrix $D$, using the $D^{-1/2} A D^{-1/2}$ operation from equation 3.1. GCN1 always operates on $A$, whereas GCN2 operates on $A^{new}$. To aid the optimization, we update $A^{new}$ every $n$ epochs, so that GCN2 can adapt to the new input graph. Finally, the graph adjacency is updated by taking a weighted average with the original input graph (equation 5.3):

$$A^{new} = \lambda \cdot A^{new} + (1 - \lambda) \cdot A \tag{5.3}$$

When we have good quality input graphs (e.g., those in semi-supervised learning benchmarks, whose connections are based on dataset labels), we determine $\lambda$ empirically. However, in cases where the input graphs are noisy (e.g., those computed algorithmically in Chapter 4 for action recognition), we often set $\lambda = 1$, i.e., we do not rely on the input graph for GCN2. Details for all setups are provided in Section 5.2.

5.1.2 Training using Triplet Loss

The original network without graph learning uses a classification loss and an MSE (mean squared error) loss for training the semi-supervised and zero-shot learning networks, respectively. To aid in updating the graph structure, we add a triplet loss. Therefore, the final framework is trained with a weighted sum of the triplet loss and the task-specific loss for increased supervision.

For the triplet loss, we need positive and negative sets for each node. For semi-supervised learning, each training node in the graph is a data sample and has a class label associated with it. So we can use the soft-triple loss [183], which requires the number of clusters per class as a hyperparameter. We determine this empirically on the validation set and the values are provided in Section 5.2.

On the other hand, the positive and negative neighbors for the class nodes in zero/few-shot learning for actions need to be explicitly defined. We rely on the neighborhood of each class in the graph to initialize these sets as follows. For the positive set, we simply use the top-N (=2) neighbors closest to each node in the input KG. However, with triplet losses, defining the negative set is more challenging. If we only use the farthest neighbors, the downstream task MSE network already achieves good separation between positives and negatives, and the triplet loss contribution is negligible. This implies that the triplet loss has no effect on training, and the adjacency matrix can get arbitrary updates and lead to degenerate solutions. On the other hand, if the negative set is too close to the positive set, some nodes in the negative set may be overly constrictive and lead to a large penalty, which is detrimental to the adjacency matrix updates. Therefore, we use the validation set to empirically select the range of the negative set classes (details in Section 5.3).
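Putting Sections 5.1.1 and 5.1.2 together, the graph update and the triplet loss can be sketched as below, following Algorithm 1. The function names, the use of mean-pooled positive/negative sets as single vectors, and the default values are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


def update_adjacency(H_inter, A_in, lam=1.0):
    """Graph update of Section 5.1.1: cosine similarity between GCN1 outputs
    (equation 5.2), blended with the input graph via the weighting of
    equation 5.3 (lam plays the role of the mixing parameter)."""
    X = F.normalize(H_inter, dim=1)
    A_new = X @ X.T
    return lam * A_new + (1.0 - lam) * A_in


def triplet_loss(H_inter, ref_idx, pos_idx, neg_idx, margin=0.1):
    """Triplet loss of Section 5.1.2 on the GCN1 output: the reference node
    should be closer to the mean of its positive set than to the mean of its
    negative set by at least the margin (Algorithm 1, line 10)."""
    h_ref = H_inter[ref_idx]
    d_pos = torch.norm(h_ref - H_inter[pos_idx].mean(dim=0))
    d_neg = torch.norm(h_ref - H_inter[neg_idx].mean(dim=0))
    return torch.clamp(d_pos - d_neg + margin, min=0.0)
```

In a training loop, `update_adjacency` would be applied to the GCN1 output every n epochs to refresh the adjacency seen by GCN2, while the triplet term is added to the task-specific MSE loss at every iteration.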
Finally, we take the average of the GCN1 node outputs for the positive and negative sets to get positive and negative vectors. Then, the triplet loss is zero only when the distance between the positive vector and the current node is smaller than the distance between the current node and the negative vector by a certain margin ? (= 0.1). Mathematically, Href be the output of the current reference node, HP and HN are the averaged output vectors for the nodes in the positive and negative sets, respectively. Then, the distance between Href and HP (or HN) is represented as dP and dN (equation 5.4); and the triplet loss, Ltriplet is calculated using equation 5.5. dP = ?Href ?HP? N2, d = ?Href ?HN?2 (5.4) Ltriplet = max(d P ? dN + ?, 0) (5.5) 63 5.2 Experiments Datasets: We use Citeseer, Cora and Pubmed datasets [35, 184] for experiments on semi-supervised learning where nodes are documents and edges are citations. There are 6 classes in Citeseer, 7 in Cora and 3 in Pubmed. We use the same train, test and validation splits as [34, 185]. For zero/few-shot action recognition, we use Kinetics [10], UCF101 [179] and HMDB51 [180] as our datasets. Kinetics has 400 classes, UCF101 has 101 classes out of which 23 are are for test and 78 for training and HMDB51 has 51 classes out of which 12 are for test and 39 for training. We have described these datasets in Chapter 4. We make 10 random selections of c classes among test classes and we average the performance on all 10 selections for validation purposes. We then select the model with best performance on this validation set and report results on the entire test set. c is 20 for UCF101 dataset and 10 for HMDB51 dataset. Pipeline: For semi-supervised learning we use a 2 layer network where the inter- mediate output is used to update the graph connections. The learning rate is 0.005 for all 3 datasets. We experimentally determined the number of cluster per class for soft-triple loss and they are 2 for Pubmed and Cora and 10 for Citeseer. The rest of the hyperparameters for Soft-triple loss are the same as their paper [183]. The ? parameter in equation 5.3 is 0.8 for all datasets. For zero/few-shot action recognition, we use an I3D [11] pre-trained on Ki- netics and only finetune the last classifier layer on UCF101 and HMDB51 train- 64 ing classes respectively until convergence. We use 1 layer for GCN1 and 5 layers in GCN2 with 1 of these 5 layers belonging to fusion GCN for systems based on multiple KGs. We use the same hyperparameters as Chapter4 for our baseline net- work. ? from equation 5.3 is 1.0 for all zero-shot/few-shot KGs except for HMDB51 A-KG+V-KG+VN-KG where it is 0.5. For HMDB51 A-KG, we use the final output of GCN2 to calculate Anew and we do not use triplet loss. 5.3 Quantitative Results 5.3.1 Semi-supervised Learning Method Cora Citeseer Pubmed SemiEmb [186] 59.0% 59.6% 71.7% DeepWalk [187] 67.2% 43.2% 65.3% ICA [188] 75.1% 69.1% 73.9% Planetoid [185] 75.7% 64.7% 77.2% Chebyshev [98] 81.2% 69.8% 74.4% GCN [34] 81.5% 70.3% 79.0% MoNet [189] 81.7% - 78.8% GAT [104] 83.0% 72.5% 79.0% GLNN [33] 83.4% 72.4% 76.7%? GCN+GDC [101] 83.6% 73.4% 78.7% H-GCN [100] 84.5% 72.8% 79.8% GLCN [32] 85.5% 72.0% 78.3% GCN* 80.0% 72.0% 77.8% Ours 83.6% 74.3% 79.8% Table 5.1: We compare accuracy of our technique to various state-of-the-art tech- niques for semi-supervised learning for Cora, Citeseer, and Pubmed datasets; includ- ing two graph learning techniques, GLNN and GLCN. We also provide the GCN* baseline which is our implemetation in PyTorch environment. 
We show results for semi-supervised learning on the Cora, Citeseer, and Pubmed datasets in Table 5.1. We compare against multiple state-of-the-art methods, including graph learning methods like GLCN [32] and GLNN [33]. GCN* is our implementation of GCN [34] in the PyTorch environment with 256 intermediate channels, and we get slightly differing results. Since our approach builds on this baseline, we report these results for a direct comparison. Our approach outperforms all others on both the Citeseer and Pubmed datasets. GLCN does best on the Cora dataset, but their GCN baseline is 82.9% (about 3.0% higher than our baseline at 80.0%).

\alpha      1.0      0.8      0.6      0.4      0.2
Pubmed      76.2%    80.6%    79.8%    79.4%    79.0%

Table 5.2: Ablation comparing accuracy on the Pubmed validation data for different values of the weighted averaging between the input and updated adjacency matrices, i.e., \alpha from equation 5.3.

Ablation analysis. We experiment with different values of \alpha from equation 5.3 on the Pubmed validation set and report the results in Table 5.2. We observe that \alpha = 0.8 achieves the best performance and use this value in all semi-supervised experiments.

5.3.2 Zero-shot/Few-shot Action Recognition

In Table 5.3, we compare with the results without graph learning from Chapter 4 for zero- and few-shot action recognition. These results are for both UCF101 and HMDB51, using three different input graph configurations: A-KG, V-KG, and A-KG+VN-KG+V-KG. For both UCF101 and HMDB51, the metric is mean accuracy, which averages the class-wise accuracy over all classes. As can be seen, our approach of updating the graph structure during training significantly outperforms our results from Chapter 4.

               UCF101                            HMDB51
Input KG       Ours     Ours+Learning KG         Ours     Ours+Learning KG
A-KG           49.14    53.27                    38.01    41.05
V-KG           57.04    60.57                    45.07    48.07
{A+VN+V}-KG    64.24    65.49                    47.69    49.17

Table 5.3: Comparison of our results using graph learning with the pipeline from Chapter 4 without graph learning for the UCF101 and HMDB51 datasets. We do better for all input KG configurations: A-KG, V-KG, and A-KG+VN-KG+V-KG. The metric is mean accuracy (higher is better).

KG (UCF101)    triplet loss    update A    mean accuracy
V-KG                                        57.04
V-KG           ✓                            58.57
V-KG                           ✓            59.39
V-KG           ✓               ✓            60.57

Table 5.4: Improvements from using the triplet loss or updating the adjacency matrix individually on V-KG, and then both together. The metric is mean accuracy (higher is better).

Ablation analysis. We first analyze the contribution of our approach to updating A and of the triplet loss formulation in Table 5.4 (UCF101 using V-KG). We show that both contributions help individually and are complementary to each other. Next, we study the two hyperparameters associated with these two proposals: (a) the number of epochs between updates of the adjacency matrix (n), and (b) different ordinal ranges for the negative classes in the triplet loss. The results are presented in Table 5.5 and Table 5.6, respectively. We get the best performance at 30 epochs per update and a negative set range of [9, 14]. For this ablation, we use the mean of 10 runs of randomly chosen subsets of 20 test classes.

# epochs per update    10       20       30       40       50
UCF101 A-KG            52.89    50.17    54.41    50.72    48.71

Table 5.5: Ablation showing the performance of UCF101 A-KG with a varying number of epochs per update of the adjacency matrix. The metric used is mean accuracy (higher is better).
Triplet loss negative set range    5-10     15-20    9-11     9-14     9-19
UCF101 A-KG                        49.22    48.74    51.51    54.41    49.27

Table 5.6: Ablation showing the performance of UCF101 A-KG with different negative set class index ranges for the triplet loss. The metric used is mean accuracy (higher is better).

Method           UCF101 (23-78 split)   HMDB51 (12-39 split)  |  Method             UCF101 (20-81 split)
ESZSL [127]      35.27                  34.16                 |  Action2vec [118]   36.5
DEM [128]        34.26                  35.26                 |  TARN [126]         42.7
TS-GCN [81]      44.5                   -                     |  SAOE [121]         51.2
Ours             50.13                  40.77                 |  UR [124]           53.8
Ours+learn KG    53.28                  41.05                 |  Ours+learn KG      54.4

Table 5.7: Comparison with state-of-the-art zero-shot action recognition results for both the UCF101 and HMDB51 datasets. The results are in mean accuracy (higher is better). We compare on the entire test set for both datasets. We also randomly choose 20 classes from the UCF101 test set 10 times and average the output to replicate the 80/20 split reported by previous work.

Comparison with state-of-the-art zero-shot learning. Finally, we compare against state-of-the-art approaches for zero-shot learning. Note that we cannot do a similar comparison for few-shot learning because we do not follow the episodic learning pipeline used by the other papers. In particular, we compare against ESZSL [127], DEM [128], TS-GCN [81], our own baseline without graph learning from Chapter 4, SAOE [121], UR [124], Action2vec [118], and TARN [126]. We evaluate on both the UCF101 and HMDB51 datasets and report mean accuracy. In Table 5.7, we provide results for the entire test sets of both UCF101 and HMDB51, and on the 80/20 split of UCF101 used by previous papers. For the latter, we randomly choose 20 classes from the UCF101 test classes 10 times and report the average score over all runs. We outperform the state-of-the-art techniques in all three cases, further emphasizing the importance of updating the graph structure for zero-shot approaches.

5.4 Discussion

Figure 5.2: Class-wise comparison of accuracy for the 23 UCF101 test classes using A-KG and V-KG as input for zero- and few-shot learning, respectively, between the current results after applying graph learning (blue) and the results without graph learning, i.e., the baseline (green). In both cases (A-KG and V-KG), for the majority of classes we either beat or maintain the baseline performance. Best viewed in digital.

Class-wise performance. In Figure 5.2, we do a class-wise performance comparison between our output from Chapter 4 without graph learning and the output after learning the input KG for the UCF101 test classes, with A-KG and V-KG as input. For zero-shot learning with V-KG as input, our technique beats the baseline for most classes (12 out of 23), such as "Apply eye makeup", "Apply lipstick", "Billiards", "Nunchucks", and "Playing Daf". In some cases (7 out of 23), like "Still rings", "table tennis shot", and "uneven bars", we do worse. We provide an explanation for the "Still rings" class in Figure 5.4, discussed later. For few-shot learning using V-KG, we do better on 12 classes, worse on 6, and similar to the fixed input graph on 5 classes.

Figure 5.3: We plot the adjacency matrix connections for the UCF101+Kinetics A-KG input and show the following two updates. We plot only a sub-graph due to space constraints. We chose 8 test classes (class names shown in red) and display all their connections in the KG. The edge colors show the weight of the connection. There are multiple regions where we can see improvements after the first and second updates. Best viewed in digital.
Qualitative results for graph updates. In Figure 5.3, we show the graph connections among 57 selected nodes for UCF101 and Kinetics based on the A-KG. These nodes are the neighbors of the selected 8 test classes (class names shown in red). The edge weights are represented by the colors shown in the color bar, with blue representing lower edge weights and red representing higher ones. The visualization on the left is for the input adjacency matrix, the center is after the first update at the 30th epoch, and the right is after the second update at the 60th epoch. There are many examples where the update improves the input KG, but due to space constraints we only discuss one specific node here. Looking at "Pommel horse" (a gymnastics action) and the input KG, we see multiple mistakes because this KG is based on word embeddings. Due to the presence of the term "horse" in the name, it associates "Pommel horse" with "Grooming horse" and "Horse riding". After the first update, these connections are removed, but it creates connections to classes like "Archery" and "Fencing", which are not correct. It has some correct connections, like the ones to "Vault" and "Uneven bars", but their weights are low due to normalization over too many connections. After the second update, many of these connections (like "Archery") are removed and the weights increase on connections like "Floor Gymnastics" and "Pole Vault". So overall the KG improves after each update.

Visualizing important connections. Next, we display the important graph connections with respect to the GCN network in Figure 5.4. A GCN has multiple layers, and each layer involves convolution, adjacency matrix multiplication, and a non-linearity. The linear equivalent of this system is A^L, where L is the number of layers in the GCN. We display the top-N neighbors in A^L, with A taken from the input and the updated adjacency matrices, for two test classes, "Mixing batter" and "Still Rings", in Figure 5.4, labeled as linear connectivity.

Figure 5.4: We show the connections of A^L, where L is the number of layers in the GCN (linear connectivity), as well as the connections after passing through the non-linear GCN network (GCN-based connectivity) for the "Mixing Batter" and "Still Rings" classes. For both, we show the top-K connections using the fixed input A (adjacency matrix) as well as the updated A. The edge color (based on the color bar) and the width of the connections represent the edge weights (larger width implies higher weight). For "Mixing Batter" the performance becomes better, while for "Still Rings" the performance becomes worse, after A is updated.

We also develop a way to display the closest neighbors after the GCN operation, which differ from those of the input adjacency matrix. To do this, we follow the technique used to understand traditional ConvNets by blocking out portions of the input images [190]. If the GCN operation is represented by G and the input to the GCN is the KG K, the original output probability is given by O = G(K) \cdot f_vid, where f_vid is the feature vector of a video in class C. Next, we modify K to K - n_i by removing the connections to one input node n_i, and the new output is given by O_new = G(K - n_i) \cdot f_vid. The impact of the connectivity between node n_i and the correct output class node C is then given by equation 5.6, where a higher change means a more important connection.

|O - O_new| = |(G(K) - G(K - n_i)) \cdot f_vid|    (5.6)
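A minimal sketch of this occlusion-style analysis is given below, under the assumption that the GCN G maps the KG adjacency matrix to one classifier vector per class node and that the video feature f_vid is a plain vector; all names and the interface are illustrative.

```python
import torch

def connection_importance(gcn, K, f_vid, class_idx, node_i):
    """Impact of node_i's connections on the score for class_idx (Eq. 5.6):
    rerun the GCN with node_i's edges removed from the input KG adjacency K
    and measure the change in the output score for the video feature."""
    with torch.no_grad():
        out_full = gcn(K)                  # per-class classifier vectors, shape (num_nodes, d)
        K_drop = K.clone()
        K_drop[node_i, :] = 0.0            # drop all edges touching node_i
        K_drop[:, node_i] = 0.0
        out_drop = gcn(K_drop)
        score_full = out_full[class_idx] @ f_vid
        score_drop = out_drop[class_idx] @ f_vid
    return (score_full - score_drop).abs()
```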
We show the GCN-based connectivity, extracted using this approach, in Figure 5.4 for the two classes, "Mixing batter" and "Still Rings", using the input and updated adjacency matrices. The edge color (depending on the given color bar) and the widths represent the importance of the connectivity (higher width implies higher edge weight). The connectivity based on the updated adjacency matrix becomes better for "Mixing batter" and worse for "Still rings". For "Mixing batter", the word-embedding-based KG makes a mistake by associating "batter" with "baseball" classes like "Throwing Ball", "Baseball Pitch", etc., whereas our updated KG correctly associates "Mixing batter" with cooking classes like "Cooking Egg" and "Making a cake". On the other hand, for "Still rings" the original KG has "Pole Vault" and "Gymnastics tumbling" as some of the top neighbors, whereas the updated KG has "Balance beam", "Uneven bars", and "Parallel bars" as the top neighbors. The problem is that these are more similar to the "Pommel horse" test class, and so most "Still rings" videos are predicted as "Pommel horse" after the update.

Chapter 6: Depth Completion Using a View-constrained Deep Prior

In this chapter we describe in detail the technique of using the deep image prior for depth completion in a stereo pipeline.

6.1 Method

Given an RGBD image with I^in as the RGB component and D^in as the noisy depth component, our goal is to generate a denoised and inpainted depth image D^*. We leverage the recently proposed Deep Image Prior (DIP) [54] to solve this problem. We first briefly describe the DIP approach.

6.1.1 Deep Image Prior

The DIP method proposed a deep-network-based technique for solving low-level vision problems such as image denoising, restoration, and inpainting. At the core of their method lies the idea that deep networks can serve as a prior for such inverse problems. If x is the input image, n is the input noise, and x_o is the denoised output of the network f_\theta, then the optimization problem of the DIP method takes the following form:

\theta^* = \arg\min_\theta L(f_\theta(n); x), \quad x_o^* = f_{\theta^*}(n).    (6.1)

The task of finding the optimal neural network parameters \theta^* and the optimal denoised image x_o^* is solved using standard backpropagation.

A simple approach to address depth denoising and inpainting would be to use a DIP-like encoder-decoder architecture to improve the depth images, where depth images replace RGB as inputs in the original DIP framework. However, this fails to fill the holes with correct depth values. Some of the results are shown in Figure 6.1, and more quantitative results are provided in Table 6.3.

Figure 6.1: (a) Input depth map with holes, (b) DDP on just depth maps, and (c) DDP on RGBD images. In the black box regions in (b), DDP is filling up the holes in the sky or background based on the depth from the house or radio because it has no edge information. The RGBD input provides this edge information in (c).

We hypothesize three reasons for this failure. First, holes near object boundaries can cause incorrect depth filling. Second, depth images have more diverse values than RGB images, which leads to large quantization errors. Finally, the absolute error for far objects may dominate the DIP optimization over important nearby objects.
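As a reference point for the rest of this section, the following is a minimal sketch of a DIP-style fit (equation 6.1) in PyTorch; the network, the 16-channel noise input, the masked L2 loss, and the iteration count are illustrative assumptions rather than the exact setup used here.

```python
import torch

def fit_deep_prior(net, target, mask=None, n_iters=3000, lr=1e-4):
    """Optimize a randomly initialised CNN `net` so that it maps a fixed noise
    tensor to the (possibly incomplete) `target` image; the loss is evaluated
    only on the valid pixels given by `mask` (Eq. 6.1 with a masked L2 loss)."""
    noise = torch.randn(1, 16, target.shape[-2], target.shape[-1])
    if mask is None:
        mask = torch.ones_like(target)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iters):
        out = net(noise)
        loss = ((out - target) ** 2 * mask).sum() / mask.sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return net(noise).detach()    # restored image; holes are filled by the network prior
```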
Figure 6.2: Overview of the Deep Depth Prior (DDP). The DDP network is trained using a combination of L1 and SSIM reconstruction losses with respect to a target RGB-D image and a photoconsistency loss with respect to neighboring calibrated images. This network is used to refine a set of noisy depth maps, and the refined depth maps are subsequently fused to obtain the final 3D point cloud model. I^out and D^out are the RGB and depth outputs of our network. I^nbr is the RGB at a neighboring viewpoint. I^ref and D^ref are the input RGB and depth at the current (reference) viewpoint.

6.1.2 Deep Depth Prior

In this work, we propose the Deep Depth Prior (DDP) and introduce three losses to solve the issues discussed above. Our approach is built on top of the inpainting task in Ulyanov et al. [54], where we create a mask for the holes in the depth map and calculate the loss over the non-masked regions. Figure 6.2 gives an overview of our system.

To solve the issue of the absolute error for far objects dominating the DIP optimizer, inverted depth (disparity) images are used. We also add a constant value to the depth image, which reduces the ratio between the maximum and minimum depth values. Further, all far-away objects beyond a certain depth are masked out by clipping to a predefined maximum depth value. We also clip depth to a minimum value so that the maximum disparity value does not go to infinity. We provide these values in Table 6.1.

               min depth   max depth   constant
Ignatius       2.0         7.5         0.0
Barn           2.0         16.5        2.0
Caterpillar    2.0         7.5         0.0
Meetingroom    0.2         25.0        4.0
Truck          0.5         10.0        2.0
Courtroom      0.2         46.0        4.0
Church         0.2         16.0        4.0

Table 6.1: Minimum and maximum depth clipping values and the constant depth value added per scene before running DDP for 7 scenes in the TnT dataset.

Let D^out be the desired depth output from our network and let f_\theta denote the generator network. The input to the network is noise n^in, and the input depth map is inverted to get Z^in as the noisy disparity map. We represent the output of the network as Z^out, where Z^out = f_\theta(n^in; Z^in). On convergence, the optimal D^* is obtained by inverting Z^*. We use three different losses to optimize our network. The total loss is defined as follows.

L^total = \lambda_1 L^disp + \lambda_2 L^RGB + (1 - \lambda_1 - \lambda_2) L^warp.    (6.2)

Disparity-based loss (L^disp). The simplest technique to obtain Z^* is to optimize only on disparity. The disparity-based loss L^disp is a weighted combination of the Mean Absolute Error (MAE), or L1 loss, and the Structural Similarity metric (SSIM), or L^SSIM loss [191], and takes the following form:

L^disp = \tau_z L^1(Z^in, Z^out) + (1 - \tau_z) L^SSIM(Z^in, Z^out).    (6.3)

We use the L1 loss instead of the Mean Squared Error (L2) loss to prevent very high-valued noise from having a major effect on the optimization. It takes the form L^1(Z^in, Z^out) = |Z^in - Z^out|. The structural similarity loss L^SSIM measures the similarity between the input Z^in and the reconstructed disparity map Z^out. Here similarity is defined at the block level, where each block is of size 11x11, which provides consistency at the region level. The loss takes the form L^SSIM = 1 - SSIM(Z^in, Z^out), where the structural similarity index (SSIM) [191] is defined by equation 6.4,

SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}    (6.4)

where \mu_x and \mu_y are the averages of x and y, \sigma_x^2 and \sigma_y^2 are the variances of x and y, \sigma_{xy} is the covariance of x and y, and c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2, where L is the dynamic range of the pixel values and k_1 and k_2 are constants.
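A minimal sketch of the disparity term in equations 6.3 and 6.4 follows, assuming 4D tensors of shape (N, 1, H, W), a hole mask with 1 at valid pixels, a uniform 11x11 window in place of the more common Gaussian window, and the standard constants k_1 = 0.01, k_2 = 0.03 with data scaled to [0, 1]; the weight value and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, win=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Block-wise SSIM (Eq. 6.4) computed with a uniform win x win window."""
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, win, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, win, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, win, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def disparity_loss(z_in, z_out, mask, tau_z=0.5):
    """Eq. 6.3: weighted L1 + SSIM between the input and reconstructed disparity
    maps, with the L1 term restricted to non-hole pixels."""
    l1 = (torch.abs(z_in - z_out) * mask).sum() / mask.sum().clamp(min=1)
    l_ssim = 1.0 - ssim(z_in * mask, z_out * mask)
    return tau_z * l1 + (1.0 - tau_z) * l_ssim
```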
RGB-based loss (L^RGB). We observed that under certain situations the disparity-based loss leads to blurred edges in the final generated depth map. This happens when there is a hole in the depth map near an object boundary: the generator network produces a depth map that fuses the depths of the different objects appearing around the hole. For example, consider the regions belonging to the sky in the top row of Figure 6.1. Due to the homogeneous nature of the sky pixels, standard disparity/depth estimation methods fail to produce any valid values for such regions, whereas the pixels corresponding to the house region do have depth values. When the image is passed to a DIP generator, the edge between the house and the sky gets blurred because the network is trying to fill up the space without any additional knowledge, e.g., boundary information. It simply bases its decision on the depth of the neighboring space to fill up the incomplete regions, as seen in the black box region in the top row of Figure 6.1(b).

To solve this problem, we also pass the color RGB image along with the disparity image. The encoder-decoder-based DDP architecture is now trained on the 4-channel RGBD image. The network weights are updated not only on the masked disparity map but also on the full RGB image. This helps the network leverage the edge and texture information of object boundaries in the RGB image to fill the holes in the disparity (and therefore depth) image. This important edge information helps the network generate crisp depth images, as seen in Figure 6.1(c). Let I^out be the output corresponding to the input data I^in using the noise n^in and generator model f_\theta; it takes the form I^out = f_\theta(n^in; I^in). The loss L^RGB is also a weighted combination of L1 and SSIM losses, and is defined as:

L^RGB = \tau_I L^1(I^in, I^out) + (1 - \tau_I) L^SSIM(I^in, I^out).    (6.5)

The RGB-based loss helps resolve the blurring observed around edges near object boundaries. However, putting equal weights on the disparity (L^disp) and RGB (L^RGB) components of the total loss leads to artifacts in the disparity image, and hence in the output depth images as well. In particular, these artifacts are due to textures and edges in an object's appearance that are unrelated to depth boundaries. For example, in Figure 6.3 the wall of the house is one surface and should have a smooth depth map. However, the DDP network trained on RGBD data generates vertical textures in the depth images that appear due to the vertical wooden planks in the RGB image.

Figure 6.3: (a) RGB image. (b) Input disparity map. (c) Disparity output from DDP trained with equal weights for the RGB and depth losses; the RGB artifacts are evident in (c) through the vertical and horizontal lines representing the wooden planks in the wall. (d) Disparity output from DDP trained with a lower weight for the RGB loss compared to the depth loss; the artifacts disappear in (d).

Warping-based loss (L^warp). Lastly, we include a warping loss. Before defining it, let us first define the warping function T^ref_nbr. Given the camera poses C_ref and C_nbr of the reference and neighboring views, the function T^ref_nbr warps the neighboring view to the reference view. We are trying to generate the denoised output Z^out for the reference view. We first find the top N neighboring views of the reference view using the method from MVSNet [49]. Let nbr denote one of these N views and let W^ref_nbr be the image warped from the neighboring view to the reference view. Let D^out_ref be the predicted depth (inverted Z^out_ref) for the reference view and I^in_nbr the input RGB for the neighboring view; then the warped image is W^ref_nbr = T^ref_nbr(D^out_ref, I^in_nbr; C_nbr, C_ref). Further, we use bilinear interpolation while warping. Now, given I^in_ref, the input RGB for the reference view, we can compute the warping loss as:

L^warp_nbr-ref = \tau_w L^1(I^in_ref, W^ref_nbr) + (1 - \tau_w) L^SSIM(I^in_ref, W^ref_nbr).    (6.6)
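The warp T^ref_nbr in equation 6.6 can be sketched as follows, assuming a shared pinhole intrinsic matrix K, a 4x4 relative pose T that maps reference-frame coordinates into the neighboring camera's frame, and bilinear sampling; this is a simplified illustration rather than the exact implementation, which follows the view selection of MVSNet [49].

```python
import torch
import torch.nn.functional as F

def warp_neighbor_to_ref(depth_ref, img_nbr, K, T):
    """Inverse-warp a neighboring RGB image into the reference view given the
    reference depth map (B x 1 x H x W), camera intrinsics K (3 x 3), and the
    relative pose T (4 x 4) from the reference to the neighboring camera."""
    b, _, h, w = depth_ref.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)   # homogeneous pixel grid
    rays = torch.linalg.inv(K) @ pix                                         # back-projection directions
    pts_ref = rays.unsqueeze(0) * depth_ref.reshape(b, 1, -1)                # 3D points in reference frame
    pts_ref = torch.cat([pts_ref, torch.ones(b, 1, h * w)], dim=1)
    pts_nbr = (T @ pts_ref)[:, :3]                                           # into the neighboring frame
    proj = K @ pts_nbr
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                          # perspective division
    grid_x = 2.0 * uv[:, 0] / (w - 1) - 1.0                                  # normalise for grid_sample
    grid_y = 2.0 * uv[:, 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(img_nbr, grid, mode="bilinear", align_corners=True)

# Per-neighbor photoconsistency (the L1 term of Eq. 6.6); the SSIM term of
# Eq. 6.6 can reuse the ssim helper sketched earlier.
def warp_l1(depth_ref, img_ref, img_nbr, K, T):
    return torch.abs(img_ref - warp_neighbor_to_ref(depth_ref, img_nbr, K, T)).mean()
```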
When there are multiple neighboring views, the loss is averaged over them as

L^warp_ref = \frac{1}{N} \sum_{nbr=1}^{N} L^warp_nbr-ref.

The warping loss not only resolves the issue of artifacts appearing around edges within objects, it also helps improve the accuracy of the disparity (and depth) values in other regions. The importance of each loss term is explored in Section 6.2.

Optimization. All three losses are differentiable with respect to the network parameters, so the network is optimized using standard backpropagation.

6.2 Experiments

We demonstrate the effectiveness of our proposed approach on two different tasks: 1) depth completion and 2) depth refinement. We also show the generalization ability of our technique on new datasets with unseen statistical distributions. We evaluate results on five different datasets: 1) Tanks and Temples (TnT) [146], 2) the KITTI stereo benchmark [144], 3) our own collected videos, 4) NYU Depth V2 [192], and 5) the Middlebury dataset [193, 194].

6.2.1 Tanks and Temples

We first evaluate the effectiveness of the presented approach on the multi-view reconstruction task. We demonstrate the qualitative and quantitative improvement on seven sequences from the Tanks and Temples dataset (TnT dataset) [195]: Ignatius, Caterpillar, Truck, Meetingroom, Barn, Courthouse, and Church.

Implementation. In this work, we apply the deep depth prior to a sequence of depth images of a static scene. In all our experiments, the base network is primarily an encoder-decoder UNet architecture [196]. The UNet encoder consists of 5 convolution blocks with 32, 64, 128, 256, and 512 channels, and each convolution operation uses 3x3 kernels. We have also conducted experiments using skip networks [54]. The input noise to the network is of size m x n x 16, where m and n are the dimensions of the input depth images. Training is performed using the Adam optimizer. The initial learning rate is set to 0.00005 and is reduced by a factor of 0.01 after 12000 and 15000 epochs. The model is trained for 16000 epochs. The hyperparameters of the loss function are set by searching on a small subset (10 randomly chosen images) from each of the Ignatius and Barn sequences in TnT.

The depth images are fused using the approach proposed by Galliani et al. [197] (Fusibile) to reconstruct the final 3D point cloud. Fusibile has hyperparameters that determine the precision and recall values of the resulting point cloud. These parameters include the disparity threshold and the number of consistent views. The disparity threshold determines how much two points from two different depth maps may differ in disparity and still be merged. The number of consistent images is the number of images in which a point has to have consistent depth values for it to appear in the merged point cloud.
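Putting the pieces together, the test-time optimization described above can be sketched as follows; `ddp_net`, the `rgb_loss` and `warping_loss` helpers, the channel layout of the RGBD tensor, and the loss weights `lam1`/`lam2` are placeholders (the actual weights are found by the validation search just described, and `disparity_loss` is the helper sketched earlier), while the optimizer and learning-rate schedule follow the settings listed above.

```python
import torch

def run_ddp(ddp_net, rgbd_target, mask, neighbors, lam1=0.4, lam2=0.3, n_iters=16000):
    """Test-time optimization of the DDP network with the total loss of Eq. 6.2.
    `rgbd_target` is the 4-channel RGB + inverted-depth target, `mask` marks
    non-hole pixels, and `neighbors` holds calibrated nearby views for L^warp."""
    h, w = rgbd_target.shape[-2:]
    noise = torch.randn(1, 16, h, w)                        # fixed m x n x 16 noise input
    opt = torch.optim.Adam(ddp_net.parameters(), lr=5e-5)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[12000, 15000], gamma=0.01)
    for _ in range(n_iters):
        out = ddp_net(noise)                                # 4-channel RGB + disparity output (assumed layout)
        loss = (lam1 * disparity_loss(rgbd_target[:, 3:], out[:, 3:], mask)
                + lam2 * rgb_loss(rgbd_target[:, :3], out[:, :3])
                + (1 - lam1 - lam2) * warping_loss(out[:, 3:], neighbors))
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
    return ddp_net(noise).detach()
```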
Quantitative results. Our primary comparison is with the popular semi-global matching (SGM) method [4] for depth image estimation. SGM is an optimization-based method that does not need any training data. We also compare with a state-of-the-art learning-based method, MVSNet [49]. Further, it should be noted that our approach is agnostic to the depth estimation method, i.e., it can be used to improve depth maps from any source.

We compare our reconstructed point clouds to the ground-truth point clouds for all 7 sequences in the TnT dataset [195], using the benchmarking code included with the dataset, which returns the precision P = TP / (TP + FP), recall R = TP / (TP + FN), and f-score F = 2 P R / (P + R) for each scene, given the reconstructed point cloud model and a file containing the estimated camera poses used for that reconstruction. Here, TP, FP, and FN are the true positives, false positives, and false negatives, respectively. For each of the sequences we report values at the same points of the precision-recall curve as specified by the TnT dataset.

Ablation studies. We conduct a series of experiments on the Ignatius dataset to study the impact of each parameter and component of the proposed method. We first study the effect of the number of depth images on the final reconstruction. For this purpose, we skip a constant number of images in the dataset. For example, a skip size of 2 means that only every second image is used for reconstruction, giving 87 images in total for Ignatius. Such a reduction in the data size increases the number of holes in the SGM-based reconstruction. The same reduced dataset is used for the baseline and our approach. One of the goals of this work is for our accurate hole-filling to allow fewer images to be captured and used for reconstruction. We also conducted experiments with skip values of 4 and 8, giving 44 and 22 images.

Dataset                SGM     SGM + Zhang et al.    SGM + DDP (Ours)
Ignatius (87 images)   45.3    45.2                  45.6
Ignatius (44 images)   44.0    44.0                  44.2
Ignatius (22 images)   30.7    30.7                  31.1

Table 6.2: We compare the f-score of DDP on datasets of different sizes. As the number of images becomes smaller, the holes increase, and the largest relative performance gain is at 22 images. We also compare to [2] applied to data that is out of distribution and show that we do better. (Higher is better)

Table 6.2 provides details about the impact of this skip size on the relative improvement. For SGM+DDP, we keep the SGM values where there are no holes and replace the holes with the DDP output values in those regions. It can be observed that at skip sizes of 2, 4, and 8 we see improvements of 0.3, 0.2, and 0.4 percentage points in f-score over the baseline, respectively. This suggests that at higher skip sizes our approach provides the necessary prior information to fill holes. Please note that we have not included experiments with a skip size of one: running Fusibile on all of the Ignatius images produces a dense reconstruction without holes, and running DDP on this dense reconstruction has no effect, so we have not included results with a skip size of one in the table. Table 6.2 also contains results in comparison to Zhang and Funkhouser [2], which we discuss later.

Network + Loss           P       R       F
UNet + D                 34.3    37.2    35.7
UNet + RGBD              39.2    49.0    43.6
UNet + RGBD + warp       40.0    50.6    44.7
SkipNet + RGBD + warp    40.2    50.3    44.7

Table 6.3: Results comparing the DDP output using the disparity, RGBD, and warping losses, and using UNet or SkipNet as the network. P: precision, R: recall, F: f-score. (Higher is better)
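For reference, a simplified version of this point-cloud evaluation can be sketched as below; the actual TnT benchmark additionally aligns the reconstruction to the ground truth and uses per-scene thresholds, so this only illustrates the precision/recall/f-score definitions above.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_cloud_prf(pred_pts, gt_pts, dist_thresh):
    """Precision, recall, and f-score between a reconstructed point cloud and
    the ground truth: a predicted point counts as a true positive if it lies
    within dist_thresh of some ground-truth point, and recall is measured in
    the opposite direction."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)    # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)    # nearest prediction per GT point
    precision = float(np.mean(d_pred_to_gt < dist_thresh))
    recall = float(np.mean(d_gt_to_pred < dist_thresh))
    f_score = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f_score
```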
Next, we show the advantage of using the RGBD and warping losses over the disparity loss alone. Running the optimization for too many epochs with the depth-only DDP can cause the holes to reappear in the result, so we run the depth-only networks for 6000 epochs instead of 16000. The results are in Table 6.3, showing a 7.9% improvement in f-measure from using RGBD. Table 6.3 also shows the benefit of the photo-consistency-based warping loss: we observe a relative improvement of 1.1% in f-measure after incorporating it. Finally, we also conducted experiments with Skip-Net [54] to show the impact of using a network different from UNet. Table 6.3 shows that the two give comparable results, and we chose UNet for our other experiments since it is the more commonly used network.

Comparison to prior work. A quantitative comparison with the SGM baseline on the 7 TnT sequences is shown in Table 6.4. The SGM+DDP (Ours) columns show the results of using DDP to improve the depth maps from SGM. We combine the DDP depth maps with the SGM depth maps, keeping the SGM values everywhere there are no holes and replacing the holes with the DDP depth values. The table includes the precision, recall, and f-scores on each dataset with sub-sampled numbers of images, using the technique described in the ablation studies sub-section. We observe improvements in both the recall and f-score values of the presented approach over the SGM method, which suggests that the method helps in hole filling. The f-score improves on 6 of the 7 sequences. Overall, SGM+DDP (Ours) improves recall by 1.5 percentage points and f-score by 0.2 percentage points. In particular, we see a significant improvement of 1.8 in recall and 0.8 in f-score on the Meetingroom sequence.

                            SGM                     SGM+DDP (Ours)
Data   N     C    D         P      R      F         P      R      F
I      87    5    1.0       41.7   49.5   45.3      41.2   51.1   45.6
I      22    2    2.0       32.7   29.0   30.7      32.2   30.1   31.1
B      180   2    0.5       23.3   27.8   25.4      22.8   29.3   25.6
B      45    1    4.0       19.1   21.4   20.2      18.1   22.8   20.2
T      99    4    1.0       35.8   38.8   37.2      34.5   41.4   37.6
T      25    1    2.0       29.4   33.8   31.5      27.7   36.7   31.6
C1     156   4    1.0       24.9   41.5   31.1      24.0   42.9   30.8
C1     39    1    2.0       17.3   36.5   23.5      16.2   37.9   22.7
MR     152   4    1.0       27.5   13.4   18.1      25.2   15.2   19.0
MR     38    1    4.0       17.8   9.0    12.0      15.7   10.7   12.7
CH     110   2    1.0       1.8    0.8    1.1       3.2    1.2    1.7
C2     86    4    1.0       8.9    8.5    8.7       8.9    8.6    8.8

Table 6.4: Quantitative results on 7 sequences comparing SGM-based depths and applying DDP on the SGM depths. We combine DDP with SGM by replacing the depth values in the holes of the SGM depth with the DDP depth. The datasets are I: Ignatius, B: Barn, T: Truck, C1: Caterpillar, MR: Meetingroom, CH: Courthouse, and C2: Church. N: number of images in the sequence, C: number of consistent views while constructing the point cloud, D: disparity threshold, P: precision, R: recall, F: f-score. (Higher is better)

We also tested using the learning-based MVSNet method as an input to our method. While the results from MVSNet do not contain any holes, as it predicts a value at every pixel, just as we do, they do have areas where the predictions have low confidence. We use its output probability map, which gives the confidence of the depth prediction, and remove depths at places with confidence below a certain threshold (0.1) to see if we can fill those areas more accurately than MVSNet did. We then fill these holes using DDP and compare the results with the original. The results for Ignatius are in Table 6.5. We see a 0.7% improvement in f-score and 3.0% in recall.

                  MVSNet                  MVSNet + DDP
Dataset           P      R      F         P      R      F
I (126 images)    45.9   52.2   48.8      45.2   54.9   49.6
I (32 images)     40.3   46.8   43.3      38.9   50.0   43.8

Table 6.5: We compare the reconstruction performance using depth maps generated by MVSNet and by applying DDP on MVSNet. I: Ignatius, P: precision, R: recall, F: f-score. (Higher is better)

Finally, we also compare to Zhang and Funkhouser [2] in Table 6.2.
We used the trained networks provided by Zhang and Funkhouser, trained on the SUNCG-RGBD [198] and ScanNet [199] datasets, for inpainting on the Ignatius dataset. As we can see, this method does not work well because it is a learning-based system: Ignatius is out-of-distribution test data, and the model would require finetuning. This emphasizes the usefulness of our network being optimized on each image separately and being independent of training data statistics.

Qualitative results. Next, we provide visual results on the TnT dataset to highlight the impact of our approach in achieving high-quality reconstruction. In Figure 6.4, we show the output disparity images at 16000 epochs (column c) of our proposed approach on 5 TnT sequences. Note that the holes in the input disparity images (marked as white in column b) are filled in the output images (c).

Figure 6.4: (a) Input RGB, (b) input depth images, and (c) predicted depth at 16000 epochs for Ignatius, Meeting Room, Barn, Caterpillar, and Truck. (Best viewed in digital. Please zoom in.)

6.2.2 KITTI

Next, we show the performance of the proposed approach on the KITTI stereo benchmark (2015) [200] for ELAS [201] based disparity maps in Table 6.6. Traditional stereo methods like ELAS work without dependence on training data distributions, unlike state-of-the-art deep-learning-based systems. The evaluation metric used in this benchmark is D1, which measures the percentage of pixels whose predicted disparity differs from the ground truth by at least 3 px and at least 5% of the true value. This error is measured separately for background, foreground, and all regions together. The two approaches we used for generating input depths for DDP are as follows:

1. We use a first set of parameters for ELAS to generate disparity maps without any holes and apply DDP for refinement and completion. We observe a small performance improvement (about 0.1% reduction in D1-all error).

2. We use a second set of parameters for ELAS to generate depth images with holes. In this case, DDP applied over the ELAS depth helps to inpaint and complete it, with an 8.4% reduction in D1-all error.

          Method        D1-bg   D1-fg   D1-all
params1   ELAS          7.5     21.1    9.8
          ELAS + DDP    7.5     21.1    9.7
params2   ELAS          20.6    28.7    22.0
          ELAS + DDP    12.3    20.2    13.6

Table 6.6: Results on the KITTI dataset using the D1 error. (Lower is better)
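A short sketch of the D1 computation used in Table 6.6, assuming dense NumPy disparity arrays and a boolean mask of pixels with valid ground truth; per the benchmark definition referenced above, a pixel counts as an outlier when its error is at least 3 px and at least 5% of the true disparity.

```python
import numpy as np

def kitti_d1(pred_disp, gt_disp, valid):
    """D1 error (in percent) over the pixels marked valid."""
    err = np.abs(pred_disp - gt_disp)
    outlier = (err >= 3.0) & (err >= 0.05 * gt_disp)
    return 100.0 * np.mean(outlier[valid])
```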
6.2.3 Our Dataset

We used an off-the-shelf consumer camera to capture 5 scenes, both indoor and outdoor. The datasets are monocular image sequences, named Guitar, Shed, Stones, Red Couch, and Van.

To better understand the quality of the depth images generated by the proposed method, we warp RGB images using the original and the proposed DDP-based depth images for Shed and Stones from our video sequences and Truck from the TnT dataset. The warped RGB images are shown in Figure 6.5. Holes can be observed in the RGB images warped using the original depth maps, for example along the roof of the shed, on some parts of the truck, and along the base of the stone wall. The RGB images warped using our depth images remove large portions of these holes and are far smoother.

Figure 6.5: (a) Original image from the reference viewpoint. (b) Novel view synthesized from a neighboring to the reference viewpoint using the original SGM depth. (c) Novel view synthesized from a neighboring to the reference viewpoint using the DDP depth. The holes that appear in (b) get filled in (c). (Best viewed in digital. Please zoom in.)

We show reconstructed point clouds for RedCouch, Guitar, and Van in Figure 6.6. The first row shows one of the input RGB images used for the reconstruction, the second row is the reconstruction from the original SGM output, the third is from SGM + DDP, and the fourth row is the reconstructed result from MVSNet. The number of views used for these reconstructions is small (about 10). We chose these datasets to show specific ways in which our reconstructions improve on the original SGM and even MVSNet outputs. As can be seen from our results, there are far fewer holes in the reconstructions computed from our depth maps compared to the original SGM and MVSNet. For example, from the top view of the RedCouch in the first column, we can see the relatively obscured region behind the pillow; DDP successfully fills up a big portion of this hole. Next, the reflective surfaces of the guitar in the second column and of the van in the third column also get completely, or at least partially, filled depending on how big the original hole was. For example, note how the text "FREE" on the van is more readable in the DDP result.

Figure 6.6: (a) Input RGB image. Reconstructed point clouds from (b) SGM, (c) DDP (Ours), and (d) MVSNet depth images for RedCouch, Guitar, and Van. Our reconstructions are better and more complete. (Best viewed in digital. Please zoom in.)

6.2.4 NYU v2

We demonstrate the performance of our approach on hole-filling Kinect depth data. For this, we use data from the popular NYU v2 dataset [192]. Qualitative experiments on the NYU v2 depth images are shown in Figure 6.7.

Figure 6.7: (a) Input RGB, (b) input depth, (c) depth completed using a cross-bilateral filter, and (d) depth completed using DDP. (Best viewed in digital. Please zoom in.)

We cannot directly use the test data from NYU, since we need at least two views as input. We therefore downloaded 3 video scenes and used a structure-from-motion (SfM) pipeline [202] to get the extrinsic camera pose information. Using this, we apply DDP to remove holes in the Kinect depth maps provided by the dataset. We directly refine the Kinect depth images, and since the originals are incomplete, we cannot use them for performance analysis. Instead, we use the depth maps to project one RGB view to another in the video sequences and compute an RGB re-projection error, which we quantify with the peak signal-to-noise ratio (PSNR), defined as PSNR = 20 \log_{10}(MAX_I / \sqrt{MSE}), where MAX_I is the maximum value of the image and MSE is the mean squared error of the image. As we can see in Figure 6.7, DDP fills up holes and improves the PSNR. We also use the cross-bilateral hole-filled depth maps included in the NYU v2 dataset as a baseline. We observe consistent qualitative and quantitative improvements using DDP compared to the cross-bilateral method.
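A small helper matching the PSNR definition above, assuming 8-bit images stored as NumPy arrays (so MAX_I = 255); adjust max_val for other value ranges.

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between a re-projected view and the reference
    image: PSNR = 20 * log10(MAX_I / sqrt(MSE))."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 20.0 * np.log10(max_val / np.sqrt(mse))
```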
6.2.5 Middlebury

Finally, we compare to the baseline techniques described in DepthComp [3] on scenes from the Middlebury dataset [193, 194] in Table 6.7. We took the input images and disparity maps from DepthComp [3], with the holes they created artificially in the disparity maps, and compared to the baseline techniques they mention. The metric is the Root Mean Square Error (RMSE), \lVert D^out - D^GT \rVert_2, where D^out is the output disparity and D^GT is the ground-truth disparity. Our average result is better than all of the other techniques in Table 6.7 for depth completion.

Method           Plastic   Baby      Bowling   Average
SSI [203]        1.7573    2.9638    6.4936    3.74
Linear Inter.    1.3432    1.3473    1.4503    1.38
Cubic Inter.     1.2661    1.3384    1.4460    1.35
FMM [204]        0.9580    0.8349    1.2422    1.01
GIF [205]        0.7947    0.6008    0.9436    0.78
FBF [206]        0.8643    0.6238    0.5918    0.69
EBI [207]        0.6952    0.6755    0.4857    0.62
DepthComp [3]    0.6618    0.3697    0.4292    0.49
DDP (Ours)       0.4951    0.3232    0.5743    0.46

Table 6.7: Comparison of our method with the techniques mentioned in [3] on the Middlebury dataset using RMSE. (Lower is better)

Chapter 7: Conclusion

The proposed Stacked-STGCN in Chapter 3 introduces a stacked hourglass architecture to STGCN for improved generalization performance and localization accuracy. Its building block, STGCN, is generic enough to take in a variety of nodes/edges and to support flexible graph configurations. We applied our Stacked-STGCN to action segmentation and demonstrated improved performance on the CAD120 and Charades datasets. We also note that adding spatial edge connections between nodes from the same model, rather than across different feature nodes, leads to a performance improvement on Charades. This is mainly due to the oversimplified edge model (i.e., with fixed weights); instead of using a binary function to decide on the correlation between these nodes, more sophisticated weights could be explored. We leave this as future work. Furthermore, a graph representation based on actor, action, object, and scene provides inherent explanations for the corresponding detections of action categories. Such explanation requires visualizing the traces of the most activated nodes/edges, which we also leave as future work. Finally, we anticipate that, due to its generic design, Stacked-STGCN can be applied to a wider range of applications that require inference over a sequence of graphs with heterogeneous data types and varied temporal extents.

In Chapter 4 we investigate different combinations of knowledge graphs (KGs) for actions that give better performance for zero-shot and few-shot action recognition. We show significant improvement in zero-shot learning by using a network that models a sequence of words instead of traditional single-word-based models. Moreover, extending the KG using other action classes leads to better results. We observe that combining word-based knowledge graphs with visual knowledge graphs helps in few-shot learning. Also, combining verb- and noun-based KGs improves both zero-shot and few-shot learning.

In Chapter 5 we experiment with adaptive learning of the adjacency matrix and with constraining neighbors in a KG through triplet-loss-based training in addition to a task-specific loss such as the MSE loss. We show visually how the graph connectivity changes with each update. We use previous research on convolutional networks to develop an understanding of how the GCN itself modifies the input connectivity, as opposed to just displaying the input adjacency matrix connections. Our results beat the state of the art on many standard datasets for semi-supervised learning and zero/few-shot learning by a significant margin.

In Chapter 6 we have presented an approach to reconstruct depth maps from incomplete ones. We leverage the recently proposed idea of utilizing a neural network as a prior for natural color images, and introduced three new loss terms for depth map completion. Extensive qualitative and quantitative experiments on sequences from the Tanks and Temples, KITTI, NYU v2, Middlebury, and our own datasets demonstrated that the depth maps generated by our method are more accurate. An important future extension is improving the speed of the method, where an efficient version of the presented approach could be used in a real-time depth enhancement and 3D reconstruction pipeline.
Further, the presented method could benefit from deeper understanding of the convergence properties of training deep image priors. Overall we have investigated different aspects of holistic scene understand- ing including action recognition and depth perception. Future work can consider combination of these field like using improved depth perception for a simultaneous localization and mapping (SLAM) system that can help tracking action over longer periods of time. 97 Bibliography [1] AJ Piergiovanni and Michael S Ryoo. Learning latent super-events to detect multiple activities in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 4, 2018. [2] Yinda Zhang and Thomas Funkhouser. Deep depth completion of a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 175?185, 2018. [3] Amir Atapour-Abarghouei and Toby P Breckon. Depthcomp: real-time depth image completion based on prior semantic scene segmentation. 2017. [4] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual informa- tion. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328? 341, 2007. [5] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018. [6] Pallabi Ghosh, Yi Yao, Larry Davis, and Ajay Divakaran. Stacked spatio-temporal graph convolutional networks for action segmentation. In The IEEE Winter Con- ference on Applications of Computer Vision, pages 576?585, 2020. [7] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for hu- man pose estimation. In European Conference on Computer Vision, pages 483?499. Springer, 2016. [8] Jing Yang, Qingshan Liu, and Kaihua Zhang. Stacked hourglass network for robust facial landmark localisation. In Computer Vision and Pattern Recognition Work- shops (CVPRW), 2017 IEEE Conference on, pages 2025?2033. IEEE, 2017. [9] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Suk- thankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6047?6056, 2018. 98 [10] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. [11] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299?6308, 2017. [12] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, and Luc Van Gool. Temporal 3d convnets: New architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200, 2017. [13] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pages 5533?5541, 2017. [14] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450?6459, 2018. 
[15] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision, pages 20?36. Springer, 2016. [16] Xiang Xiang, Ye Tian, Austin Reiter, Gregory D Hager, and Trac D Tran. S3d: Stacking segmental p3d for action quality assessment. In 2018 25th IEEE Interna- tional Conference on Image Processing (ICIP), pages 928?932. IEEE, 2018. [17] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via seman- tic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6857?6866, 2018. [18] Dedre Gentner. Some interesting differences between verbs and nouns. In Cognition and brain theory, volume 4, pages 161?178, 1981. [19] Pallabi Ghosh, Nirat Saini, Larry S. Davis, and Abhinav Shrivastava. All about knowledge graphs for actions, 2020. [20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. [21] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Ad- vances in neural information processing systems, pages 3111?3119, 2013. [22] Toma?s? Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in contin- uous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 746?751, 2013. 99 [23] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532?1543, 2014. [24] Robyn Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: an open multi- lingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Con- ference on Artificial Intelligence, pages 4444?4451, 2017. [25] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951?958. IEEE, 2009. [26] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778?1785. IEEE, 2009. [27] Genevieve Patterson and James Hays. Coco attributes: Attributes for people, an- imals, and objects. In European Conference on Computer Vision, pages 85?100. Springer, 2016. [28] Phillip Isola, Joseph J Lim, and Edward H Adelson. Discovering states and trans- formations in image collections. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1383?1391, 2015. [29] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, (11):39?41, 1995. [30] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010. [31] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. Neil: Extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 1409?1416, 2013. 
[32] Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, and Bin Luo. Semi-supervised learning with graph learning-convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11313?11320, 2019. [33] Xiang Gao, Wei Hu, and Zongming Guo. Exploring structure-adaptive graph learn- ing for robust semi-supervised classification. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1?6. IEEE, 2020. [34] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convo- lutional networks. In International Conference on Learning Representations, 2017. [35] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93? 93, 2008. 100 [36] Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sen- tence embeddings using compositional n-gram features. In NAACL 2018-Conference of the North American Chapter of the Association for Computational Linguistics, 2018. [37] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two- frame stereo correspondence algorithms. International journal of computer vision, 47(1-3):7?42, 2002. [38] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In 2010 IEEE computer society conference on com- puter vision and pattern recognition, pages 1434?1441. IEEE, 2010. [39] Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M Seitz. Multi-view stereo for community photo collections. In 2007 IEEE 11th International Conference on Computer Vision, pages 1?8. IEEE, 2007. [40] Johannes Lutz Scho?nberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016. [41] Oliver Woodford, Philip Torr, Ian Reid, and Andrew Fitzgibbon. Global stereo reconstruction under second-order smoothness priors. IEEE transactions on pattern analysis and machine intelligence, 31(12):2115?2128, 2009. [42] Michael Bleyer, Carsten Rother, Pushmeet Kohli, Daniel Scharstein, and Sudipta Sinha. Object stereo-joint stereo matching and object segmentation. In CVPR 2011, pages 3081?3088. IEEE, 2011. [43] Tatsunori Taniai, Yasuyuki Matsushita, and Takeshi Naemura. Graph cut based continuous stereo matching using locally shared labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1613?1620, 2014. [44] Tatsunori Taniai, Yasuyuki Matsushita, Yoichi Sato, and Takeshi Naemura. Con- tinuous 3d label stereo matching using local expansion moves. IEEE transactions on pattern analysis and machine intelligence, 40(11):2725?2739, 2017. [45] Jure Zbontar, Yann LeCun, et al. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016. [46] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040?4048, 2016. [47] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. 
In Proceedings of the IEEE International Conference on Computer Vision, pages 66?75, 2017. 101 [48] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. Surfacenet: An end-to-end 3d neural network for multiview stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2307?2315, 2017. [49] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767?783, 2018. [50] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2821?2830, 2018. [51] C. Zhou, H. Zhang, X. Shen, and J. Jia. Unsupervised learning of stereo matching. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1576? 1584, Oct 2017. [52] Alessio Tonioni, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Unsuper- vised adaptation for deep stereo. In Proceedings of the IEEE International Confer- ence on Computer Vision, pages 1605?1613, 2017. [53] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In Thirty-First AAAI Con- ference on Artificial Intelligence, 2017. [54] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In CVPR, pages 9446?9454, 2018. [55] Francis Williams, Teseo Schneider, Claudio Silva, Denis Zorin, Joan Bruna, and Daniele Panozzo. Deep geometric prior for surface reconstruction. In CVPR, pages 10130?10139, 2019. [56] Yossi Gandelsman, Assaf Shocher, and Michal Irani. ? double-dip?: Unsupervised image decomposition via coupled deep-image-priors. In CVPR, 2019. [57] Pallabi Ghosh, Vibhav Vineet, Larry S. Davis, Abhinav Shrivastava, Sudipta Sinha, and Neel Joshi. Depth completion using a view constrained deep prior, 2020. [58] Patrick Knobelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. End-to-end training of hybrid cnn-crf models for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2339?2348, 2017. [59] Youze Xue, Jiansheng Chen, Weitao Wan, Yiqing Huang, Cheng Yu, Tianpeng Li, and Jiayu Bao. Mvscrf: Learning multi-view stereo with conditional random fields. In Proceedings of the IEEE International Conference on Computer Vision, pages 4312?4321, 2019. [60] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568?576, 2014. 102 [61] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. [62] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308?5317, 2016. [63] Xikun Zhang, Chang Xu, and Dacheng Tao. Graph edge convolutional neural net- works for skeleton based action recognition. arXiv preprint arXiv:1805.06184, 2018. [64] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. arXiv preprint arXiv:1806.01810, 2018. [65] Effrosyni Mavroudi, Divya Bhaskara, Shahin Sefati, Haider Ali, and Rene? Vidal. 
End-to-end fine-grained action segmentation and recognition using conditional ran- dom field models and discriminative sparse coding. arXiv preprint arXiv:1801.09571, 2018. [66] Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. A multi- stream bi-directional recurrent neural network for fine-grained action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1961?1970, 2016. [67] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision, 126(2-4):375?389, 2018. [68] Colin Lea Michael D Flynn Rene? and Vidal Austin Reiter Gregory D Hager. Tem- poral convolutional networks for action segmentation and detection. In IEEE Inter- national Conference on Computer Vision (ICCV), 2017. [69] Li Ding and Chenliang Xu. Tricornet: A hybrid temporal convolutional and re- current network for video action segmentation. arXiv preprint arXiv:1705.07818, 2017. [70] Li Ding and Chenliang Xu. Weakly-supervised action segmentation with iterative soft boundary assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6508?6516, 2018. [71] Peng Lei and Sinisa Todorovic. Temporal deformable residual networks for action segmentation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6742?6751, 2018. [72] Earl Rennison. Constructing a search query to execute a contextual personalized search of a knowledge base, January 11 2011. US Patent 7,870,117. [73] David L Gilmour and Eric Wang. Method and apparatus for constructing and maintaining a user knowledge profile, July 16 2002. US Patent 6,421,669. 103 [74] Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. 2015. [75] Xuchen Yao and Benjamin Van Durme. Information extraction over structured data: Question answering with freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 956?966, 2014. [76] Denis Lukovnikov, Asja Fischer, Jens Lehmann, and So?ren Auer. Neural network- based question answering over knowledge graphs on word and character level. In Proceedings of the 26th international conference on World Wide Web, pages 1211? 1220. International World Wide Web Conferences Steering Committee, 2017. [77] Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. Learning knowledge graphs for question answering through conversational dialog. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 851?861, 2015. [78] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Twenty-ninth AAAI conference on artificial intelligence, 2015. [79] Antoine Bordes and Evgeniy Gabrilovich. Constructing and mining web-scale knowl- edge graphs: Kdd 2014 tutorial. In Proceedings of the 20th ACM SIGKDD interna- tional conference on Knowledge discovery and data mining, pages 1967?1967. ACM, 2014. [80] Sutanay Choudhury, Khushbu Agarwal, Sumit Purohit, Baichuan Zhang, Meg Pir- rung, Will Smith, and Mathew Thomas. Nous: Construction and querying of dy- namic knowledge graphs. 
[81] Junyu Gao, Tianzhu Zhang, and Changsheng Xu. I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8303–8311, 2019.
[82] Collin F Baker, Charles J Fillmore, and John B Lowe. The berkeley framenet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, 1998.
[83] Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844, 2016.
[84] Fereshteh Sadeghi, Santosh K Kumar Divvala, and Ali Farhadi. Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1456–1464, 2015.
[85] Yuan Fang, Kingsley Kuan, Jie Lin, Cheston Tan, and Vijay Chandrasekhar. Object detection meets knowledge graphs. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1661–1667, 2017.
[86] Zhanglin Peng, Lingyun Wu, Jiamin Ren, Ruimao Zhang, and Ping Luo. Cuimage: A neverending learning platform on a convolutional knowledge graph of billion web images. In 2018 IEEE International Conference on Big Data (Big Data), pages 1787–1796. IEEE, 2018.
[87] Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. Enriching visual knowledge bases via object discovery and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2027–2034, 2014.
[88] Yuke Zhu, Alireza Fathi, and Li Fei-Fei. Reasoning about object affordances in a knowledge base representation. In European conference on computer vision, pages 408–424. Springer, 2014.
[89] Rob Fergus, Hector Bernal, Yair Weiss, and Antonio Torralba. Semantic label sharing for learning with many categories. In European Conference on Computer Vision, pages 762–775. Springer, 2010.
[90] Ruslan Salakhutdinov, Antonio Torralba, and Josh Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pages 1481–1488. IEEE, 2011.
[91] Franco Scarselli, Sweah Liang Yong, Marco Gori, Markus Hagenbuchner, Ah Chung Tsoi, and Marco Maggini. Graph neural networks for ranking web pages. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 666–672. IEEE Computer Society, 2005.
[92] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, (1):61–80, 2009.
[93] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[94] Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, and Sanja Fidler. Situation recognition with graph neural networks. arXiv preprint arXiv:1708.04320, 2017.
[95] Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. Factorizable net: an efficient subgraph-based framework for scene graph generation. In European Conference on Computer Vision, pages 346–363. Springer, 2018.
[96] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
[97] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
[98] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
[99] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[100] Fenyu Hu, Yanqiao Zhu, Shu Wu, Liang Wang, and Tieniu Tan. Hierarchical graph convolutional networks for semi-supervised node classification. arXiv preprint arXiv:1902.06667, 2019.
[101] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In Advances in Neural Information Processing Systems, pages 13354–13366, 2019.
[102] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv preprint arXiv:1801.07606, 2018.
[103] Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. Semi-supervised user geolocation via graph convolutional networks. arXiv preprint arXiv:1804.08049, 2018.
[104] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[105] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
[106] Michelle Guo, Edward Chou, De-An Huang, Shuran Song, Serena Yeung, and Li Fei-Fei. Neural graph matching networks for fewshot 3d action recognition. In European Conference on Computer Vision, pages 673–689. Springer, 2018.
[107] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D Yoo. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2019.
[108] Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. Zero-shot learning with semantic output codes. In Advances in neural information processing systems, pages 1410–1418, 2009.
[109] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pages 935–943, 2013.
[110] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, (3):453–465, 2014.
[111] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[112] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR 2011, pages 1641–1648. IEEE, 2011.
[113] Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE international conference on computer vision, pages 2452–2460, 2015.
[114] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5327–5336, 2016.
[115] Xun Xu, Timothy Hospedales, and Shaogang Gong. Semantic embedding space for zero-shot action recognition. In 2015 IEEE International Conference on Image Processing (ICIP), pages 63–67. IEEE, 2015.
[116] Xun Xu, Timothy M Hospedales, and Shaogang Gong. Multi-task zero-shot action recognition with prioritised data augmentation. In European Conference on Computer Vision, pages 343–359. Springer, 2016.
[117] Xun Xu, Timothy Hospedales, and Shaogang Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision, (3):309–333, 2017.
[118] Meera Hahn, Andrew Silva, and James M Rehg. Action2vec: A crossmodal embedding approach to action learning. arXiv preprint arXiv:1901.00484, 2019.
[119] Mihir Jain, Jan C van Gemert, Thomas Mensink, and Cees GM Snoek. Objects2action: Classifying and localizing actions without any video example. In Proceedings of the IEEE international conference on computer vision, pages 4588–4596, 2015.
[120] Mihir Jain, Jan C. van Gemert, and Cees G. M. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 46–55, 2015.
[121] Pascal Mettes and Cees G. M. Snoek. Spatial-aware object embeddings for zero-shot localization and classification of actions. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4453–4462, 2017.
[122] Chuang Gan, Yi Yang, Linchao Zhu, Deli Zhao, and Yueting Zhuang. Recognizing an action using its name: A knowledge-based approach. International Journal of Computer Vision, (1):61–77, 2016.
[123] Ioannis Alexiou, Tao Xiang, and Shaogang Gong. Exploring synonyms as context in zero-shot action recognition. In 2016 IEEE International Conference on Image Processing (ICIP), pages 4190–4194. IEEE, 2016.
[124] Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and Ling Shao. Towards universal representation for unseen action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9436–9445, 2018.
[125] Devraj Mandal, Sanath Narayan, Sai Kumar Dwivedi, Vikram Gupta, Shuaib Ahmed, Fahad Shahbaz Khan, and Ling Shao. Out-of-distribution detection for generalized zero-shot action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9985–9993, 2019.
[126] Mina Bishay, Georgios Zoumpourlis, and Ioannis Patras. Tarn: Temporal attentive relation network for few-shot and zero-shot action recognition, 2019.
[127] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
[128] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2021–2030, 2017.
[129] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised few-shot classification. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
[130] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4077–4087. Curran Associates, Inc., 2017.
[131] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning, 2016.
[132] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
[133] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
[134] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
[135] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3018–3027, 2017.
[136] Linchao Zhu and Yi Yang. Compound memory networks for few-shot video classification. In The European Conference on Computer Vision (ECCV), September 2018.
[137] Hongtao Yang, Xuming He, and Fatih Porikli. One-shot action localization by learning sequence matching network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[138] Ashish Mishra, Vinay Kumar Verma, M Shiva Krishna Reddy, S Arulkumar, Piyush Rai, and Anurag Mittal. A generative approach to zero-shot and few-shot action recognition. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Mar 2018.
[139] Sai Kumar Dwivedi, Vikram Gupta, Rahul Mitra, Shuaib Ahmed, and Arjun Jain. Protogan: Towards few shot learning for action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[140] Keizo Kato, Yin Li, and Abhinav Gupta. Compositional learning for human object interaction. In The European Conference on Computer Vision (ECCV), September 2018.
[141] Norbert Haala and Mathias Rothermel. Dense multi-stereo matching for high quality digital elevation models. Photogrammetrie-Fernerkundung-Geoinformation, 2012(4):331–343, 2012.
[142] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 873–881, Dec 2015.
[143] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German conference on pattern recognition, pages 31–42. Springer, 2014.
[144] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[145] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017.
[146] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph., 36(4):78:1–78:13, July 2017.
[147] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, pages 1–16, 2016.
[148] Stefan K Gehrig, Felix Eberli, and Thomas Meyer. A real-time low-power stereo vision engine using semi-global matching. In International Conference on Computer Vision Systems, pages 134–143. Springer, 2009.
[149] Christian Banz, Holger Blume, and Peter Pirsch. Real-time semi-global matching disparity estimation on the gpu. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 514–521. IEEE, 2011.
[150] Sudipta N Sinha, Daniel Scharstein, and Richard Szeliski. Efficient high-resolution stereo matching using local plane sweeps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1582–1589, 2014.
[151] Daniel Scharstein, Tatsunori Taniai, and Sudipta N. Sinha. Semi-global stereo matching with surface orientation priors. 2017 International Conference on 3D Vision (3DV), pages 215–224, 2017.
[152] Akihito Seki and Marc Pollefeys. Sgm-nets: Semi-global matching with neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 231–240, 2017.
[153] Michael Bleyer, Christoph Rhemann, and Carsten Rother. Patchmatch stereo - stereo matching with slanted support windows. In Bmvc, volume 11, pages 1–11, 2011.
[154] Jiangbo Lu, Hongsheng Yang, Dongbo Min, and Minh N Do. Patch match filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1854–1861, 2013.
[155] Philipp Heise, Sebastian Klose, Brian Jensen, and Alois Knoll. Pm-huber: Patchmatch with huber regularization for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 2360–2367, 2013.
[156] Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016.
[157] Zhuoyuan Chen, Xun Sun, Liang Wang, Yinan Yu, and Chang Huang. A deep visual correspondence embedding model for stereo matching costs. In Proceedings of the IEEE International Conference on Computer Vision, pages 972–980, 2015.
[158] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
[159] Jonathan T Barron and Ben Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer, 2016.
[160] Patrick Knöbelreiter and Thomas Pock. Learned collaborative stereo refinement. In German Conference on Pattern Recognition, pages 3–17. Springer, 2019.
[161] Oleg Voynov, Alexey Artemov, Vage Egiazarian, Alexander Notchenko, Gleb Bobrovskikh, Evgeny Burnaev, and Denis Zorin. Perceptual deep depth super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 5653–5663, 2019.
[162] Jiahao Pang, Wenxiu Sun, Jimmy SJ Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pages 887–895, 2017.
[163] Tatsunori Taniai and Takanori Maehara. Neural inverse rendering for general reflectance photometric stereo. In ICML, 2018.
[164] Zezhou Cheng, Matheus Gadelha, Subhransu Maji, and Daniel Sheldon. A bayesian perspective on the deep image prior. In CVPR, pages 5443–5451, 2019.
[165] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[166] Léonard Blier and Yann Ollivier. The description length of deep learning models. In NIPS, 2018.
[167] Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research, 32(8):951–970, 2013.
[168] http://openni.org.
[169] Hema S Koppula and Ashutosh Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE transactions on pattern analysis and machine intelligence, 38(1):14–29, 2016.
[170] Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Actor and observer: Joint modeling of first and third-person videos. In CVPR, 2018.
[171] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, 2016.
[172] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[173] Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5534–5542, 2016.
[174] Gunnar A Sigurdsson, Santosh Kumar Divvala, Ali Farhadi, and Abhinav Gupta. Asynchronous temporal fields for action recognition. In CVPR, volume 5, page 7, 2017.
[175] Achal Dave, Olga Russakovsky, and Deva Ramanan. Predictive-corrective networks for action detection. In Proceedings of the Computer Vision and Pattern Recognition, 2017.
[176] Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: region convolutional 3d network for temporal activity detection. In IEEE Int. Conf. on Computer Vision (ICCV), pages 5794–5803, 2017.
[177] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with python, 2009.
[178] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[179] Khurram Soomro, Amir Roshan Zamir, and M Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2(11), 2012.
[180] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
[181] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
[182] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
[183] Qi Qian, Lei Shang, Baigui Sun, Juhua Hu, Hao Li, and Rong Jin. Softtriple loss: Deep metric learning without triplet sampling. In IEEE International Conference on Computer Vision, ICCV 2019, 2019.
[184] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, 2012.
[185] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pages 40–48. PMLR, 2016.
[186] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural networks: Tricks of the trade, pages 639–655. Springer, 2012.
[187] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
[188] Qing Lu and Lise Getoor. Link-based classification. In Proceedings of the 2003 International Conference on Machine Learning, Washington DC, pages 496–503, 2003.
[189] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.
[190] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
[191] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[192] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V, volume 7576 of Lecture Notes in Computer Science, pages 746–760. Springer, 2012.
[193] Daniel Scharstein and Chris Pal. Learning conditional random fields for stereo. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[194] Heiko Hirschmuller and Daniel Scharstein. Evaluation of cost functions for stereo matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[195] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):78, 2017.
[196] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[197] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multi-view stereopsis by surface normal diffusion. In Proceedings of the IEEE International Conference on Computer Vision, pages 873–881, 2015.
[198] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 567–576, 2015.
[199] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
[200] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[201] Andreas Geiger, Martin Roser, and Raquel Urtasun. Efficient large-scale stereo matching. In Asian conference on computer vision, pages 25–38. Springer, 2010.
[202] Shimon Ullman. The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences, 203(1153):405–426, 1979.
[203] Daniel Herrera, Juho Kannala, Janne Heikkilä, et al. Depth map inpainting under a second-order smoothness prior. In Scandinavian Conference on Image Analysis, pages 555–566. Springer, 2013.
[204] Alexandru Telea. An image inpainting technique based on the fast marching method. Journal of graphics tools, 9(1):23–34, 2004.
[205] Junyi Liu, Xiaojin Gong, and Jilin Liu. Guided inpainting and filtering for kinect depth maps. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 2055–2058. IEEE, 2012.
[206] Amir Atapour-Abarghouei, Gregoire Payen de La Garanderie, and Toby P Breckon. Back to butterworth - a fourier basis for 3d surface relief hole filling within rgb-d imagery. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2813–2818. IEEE, 2016.
[207] Pablo Arias, Gabriele Facciolo, Vicent Caselles, and Guillermo Sapiro. A variational framework for exemplar-based image inpainting. International journal of computer vision, 93(3):319–347, 2011.