ABSTRACT

Title of Dissertation: EFFICIENT IMAGE AND VIDEO REPRESENTATIONS FOR RETRIEVAL

Sravanthi Bondugula, Doctor of Philosophy, 2016

Dissertation directed by: Professor Larry S. Davis, Department of Computer Science

Image (video) retrieval is the problem of retrieving images (videos) similar to a query. Images (videos) are represented in an input (feature) space, and similar images (videos) are obtained by finding nearest neighbors in that representation space. Numerous input representations, in both real-valued and binary spaces, have been proposed for conducting faster retrieval. In this thesis, we present techniques that obtain improved input representations for retrieval in both supervised and unsupervised settings for images and videos.

Supervised retrieval is the well-known problem of retrieving images of the same class as the query. In the first part, we address the practical aspects of achieving faster retrieval with binary codes as input representations for the supervised setting, where binary codes are used as addresses into hash tables. In practice, using binary codes as addresses does not guarantee fast retrieval, as similar images are not mapped to the same binary code (address). We address this problem by presenting an efficient supervised hashing (binary encoding) method that aims to explicitly map all images of the same class to a unique binary code. We refer to the binary codes of the images as 'Semantic Binary Codes' and to the unique code for all same-class images as the 'Class Binary Code'. We also propose a new class-based Hamming metric that dramatically reduces retrieval times for larger databases, where the Hamming distance is computed only to the class binary codes. We further propose a deep semantic binary code model, obtained by replacing the output layer of a popular convolutional neural network (AlexNet) with the class binary codes, and show that the hashing functions learned in this way outperform the state of the art while providing fast retrieval times.

In the second part, we address the problem of supervised retrieval by taking into account the relationships between classes. For a given query image, we want to retrieve images that preserve a relative order: all same-class images first, then images of related classes, before images of unrelated classes. We learn such relationship-aware binary codes by minimizing the difference between the inner product of the binary codes and the similarity between the classes. We compute the similarity between classes using output embedding vectors, which are vector representations of classes. Our method deviates from other supervised binary encoding schemes as it is the first to use output embeddings for learning hashing functions. We also introduce new performance metrics that take related-class retrieval results into account and show significant gains over the state of the art.

High-dimensional descriptors like Fisher Vectors or Vectors of Locally Aggregated Descriptors have been shown to improve the performance of many computer vision applications, including retrieval. In the third part, we discuss an unsupervised technique for compressing high-dimensional vectors into high-dimensional binary codes to reduce storage complexity. In this approach, we deviate from the traditional hyperplane hashing functions and instead learn hyperspherical hashing functions.
The proposed method overcomes the computational challenges of directly applying the spherical hashing algorithm, which is intractable for compressing high-dimensional vectors. A practical hierarchical model is presented that uses divide-and-conquer techniques, via the Random Select and Adjust (RSA) procedure, to compress such high-dimensional vectors. We show that our proposed high-dimensional binary codes outperform the binary codes obtained using traditional hyperplane methods at higher compression ratios.

In the last part of the thesis, we propose a retrieval-based solution to the zero-shot event classification problem, a setting where no training videos are available for the event. To do this, we learn a generic set of concept detectors and represent both videos and query events in the concept space. We then compute the similarity between the query event and each video in the concept space, and videos similar to the query event are classified as belonging to the event. We show that we significantly boost performance by using concept features from other modalities.

EFFICIENT IMAGE AND VIDEO REPRESENTATIONS FOR RETRIEVAL

by Sravanthi Bondugula

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2016

Advisory Committee:
Professor Larry S. Davis, Chair/Advisor
Professor Ramani Duraiswami
Professor David Mount
Professor V.S. Subrahmanian
Professor Louiqa Raschid, Dean's Representative

© Copyright by Sravanthi Bondugula 2016

Acknowledgments

I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been a learning and life-changing experience.

First and foremost, I'd like to thank my advisor, Professor Larry S. Davis, for training me to become a better person and for giving me the independence to choose the problems that I am comfortable with. He has always been patient in editing my drafts, helped me improve my presentation skills and, more importantly, solved problems with me and pointed me in the right direction when I was stuck.

I owe my deepest thanks to my family - my mother, father, sister and husband - who have always stood by me and without whom I would not have done this. I am indebted to them for all the support, strength and courage they have provided me to complete this journey.

I would also like to thank my roommates and friends - Manisha Ganeshan, Shivangini Pachauri and Soumya Rastogi - who have made the graduate life experience and the career path comfortable and welcoming. I cannot give enough credit to my close friends - Pooja Dasari, Aswani Dhulipalla and Priyanka Tangudu - for bearing with me, listening to the graduate success and failure stories, and patiently waiting for and forgiving me for all the meetings that I missed.

I also would like to thank all my co-authors - Varun Manjunatha, David Doermann, Pradeep Natarajan, Shuang Wu and Florian Luisier; working with them was truly a learning experience.

I would like to acknowledge financial support from the NSF and MURI projects for all the projects discussed herein.

Lastly, thank you all and thank you GOD.

Contents

List of Figures

1 Introduction
  1.1 Outline

2 Semantic Binary Codes
  2.1 Introduction
  2.2 Related Work
  2.3 Semantic Binary Codes
    2.3.1 Class Indicator Matrix Y
    2.3.2 Initialization of V
    2.3.3 Class Hamming Distance (CHD)
    2.3.4 Deep Semantic Binary Codes
  2.4 Experiments
    2.4.1 Train/Test Partitions
    2.4.2 Evaluation metrics
    2.4.3 Comparison with other methods
    2.4.4 Comparison Methods for SBC-D
    2.4.5 Experiments for SBC-D
  2.5 Conclusion

3 SHOE: Sibling Hashing with Output Embeddings
  3.1 Introduction
  3.2 Related Work
  3.3 Method
    3.3.1 Preliminaries
    3.3.2 Evaluation Criteria
    3.3.3 Preliminary Experiments
    3.3.4 Analysis
    3.3.5 SHOE Revisited
    3.3.6 Supervised Dimensionality Reduction
  3.4 Experiments
  3.5 Fine-grained Category Classification
  3.6 Conclusion

4 Hierarchical Spherical Hashing for Compressing High Dimensional Vectors
  4.1 Overview
  4.2 Spherical Hashing
    4.2.1 Computation Challenges (d is large and k ∼ d)
  4.3 Hierarchical Spherical Hashing
    4.3.1 Learning Sub-Hypersphere functions
    4.3.2 Cartesian product of pivots/centers
    4.3.3 Random-Select and Adjust (RSA)
    4.3.4 Divide and Conquer
    4.3.5 Computation time
  4.4 Experiments
    4.4.1 Datasets, Evaluation Protocol
    4.4.2 Comparison methods
    4.4.3 Results
  4.5 Conclusion

5 Zero-shot Event Detection using Multi-modal Fusion of Weakly Supervised Concepts
  5.1 Overview
  5.2 Related Work
  5.3 Zero-shot Learning Framework
    5.3.1 Basic Similarity Computation
    5.3.2 Expansion-based Similarity Computation
  5.4 Video Feature Extraction
    5.4.1 Weakly Supervised Concepts (WSC)
      5.4.1.1 Data Collection and Concept Discovery
      5.4.1.2 Low-level feature extraction
      5.4.1.3 Classifier Training
      5.4.1.4 Weakly Supervised Concept Feature
    5.4.2 Concept Training using Web Data
    5.4.3 Concept Distance Features
    5.4.4 Off-the-shelf Concept Detectors
    5.4.5 Automatic Speech Recognition (ASR)
    5.4.6 Optical Character Recognition (OCR)
  5.5 Fusion
  5.6 Experiments
    5.6.1 Comparison of Similarity Computation
    5.6.2 Comparison of Visual Features
    5.6.3 Comparison of Audio Features
    5.6.4 Comparison of Language Features
    5.6.5 Comparison of Fusion Systems
    5.6.6 TRECVID Performance
  5.7 Discussion and Conclusion

Bibliography

List of Figures

2.1 Illustration of the Semantic Binary Codes idea and retrieval using Class Binary Codes. Here, instead of searching 9 image binary codes, we search only 3 class binary codes.
2.2 The top row and the bottom row show the performance of various metrics for varying numbers of bits on the Cifar-100 and Caltech-256 datasets respectively.
2.3 Neural network architecture for the Deep Semantic Binary Code model.
2.4 Plots showing Precision@radius 0, Precision@radius 2 and mAP@10K for the ILSVRC2010 Train dataset.
2.5 Images of the Caltech-256 dataset that have the same semantic binary code given by the corresponding 128-bit SBC. For each row, we mention the most common category of the images. False positives are indicated by red boundaries.
3.1 We prefer results II over I because they tend to retrieve images of classes (jaguar and tiger) related to the class label of the query (leopard), rather than unrelated classes (sharks and dolphins).
3.2 We perform k-means clustering on Word2Vec embeddings [1] of ImageNet classes. The principle behind SHOE is that images belonging to related classes (like leopard or tiger, which are nearby in the output embedding space) are mapped to nearby binary codes (represented by points on a binary hypercube). Images belonging to unrelated classes (like leopards and aircraft) are mapped to distant binary codes. This figure was created using [2] and is for illustrative purposes.
3.3 Retrieval on the CUB dataset comparing our method SHOE(E) with state-of-the-art hashing techniques. The plots report precision, sibling precision and weighted sibling precision for the top 5 sibling classes for bits c = {16, 32, 64, 128, 256}.
3.4 Retrieval on the CUB dataset evaluating the performance of our method (SHOE) for varying θ values and p = 1000. The left and right y-axes show the standard metrics and sibling metrics respectively.
3.5 Retrieval on the CUB-2011 (first and second rows) and SUN (third and fourth rows) datasets comparing our methods SHOE(E) and SHOE(L) with state-of-the-art hashing techniques. The plots report mAP, Sibling and Weighted Sibling mAP for the top 5 sibling classes. For the CUB dataset, we used 2000 training samples and 1000 anchor points, while for the SUN attribute dataset, we used 3585 training samples and 1434 anchor points.
3.6 Retrieval on the ILSVRC2010 dataset comparing SHOE with state-of-the-art hashing techniques. We use 5K training samples and CNN+K+CCA as features for all the binary encoding schemes. The plots report recall, sibling recall and weighted sibling recall @10K for the top 5 sibling classes for bits c = {32, 64, 128, 256}.
3.7 The first query is of an ovenbird. SHOE retrieves more ovenbirds than KSH. The second query is of a Brewer black-bird. Neither SHOE nor KSH retrieve Brewer black-birds. However, SHOE returns ravens, which are sibling classes of Brewer black-birds, whereas KSH retrieves pileated woodpeckers, which are unrelated to black-birds. Here, blue borders represent sibling classes.
4.1 Performance of our method (SpH-RSA) and SpH-Concat on a subset of the ILSVRC2010 Train dataset with 25600 VLAD vectors for varying partition size m = 3200 to 10. Plots (left, middle) show recall for 10 and 50 ground-truth neighbors while comparing our method with the concatenated bit vectors from SpH-Concat (indicated by C). Notice the poor performance of the concatenated bit vectors. Plot (right) evaluates the results of our method for varying sub-vector sizes.
4.2 Left: RSA technique that randomly selects hyperspheres from the Cartesian product of the subsets and adjusts the hyperspheres to satisfy the hashing properties. Right: RSA method applied in a divide-and-conquer fashion to obtain the k full hyperspheres.
4.3 Recall plots showing the performance of ours (red) and BPBC (learned and random) methods on the ILSVRC2010 Validation dataset for ground truth defined at 10 nearest neighbors. Our method performs better for 8000, 16000 and 32000 bits for all the retrieved images in both distance settings.
4.4 Recall plots showing the performance of ours (red) and several state-of-the-art methods on the Holidays+Flickr 1M dataset for ground truth defined at 10 NNs. The top row shows the performance of the methods using asymmetric distance and the bottom row using symmetric distance. Our method consistently performs better for 1600 and 3200 bits at all the retrieved images in both distance settings.
4.5 Recall plots showing the performance of ours (red), BPBC and PQ-based methods on the ILSVRC2010 Train dataset for ground truth defined at 10 NNs. The top row shows the performance of the methods using asymmetric distance and the bottom row using symmetric distance. Our method consistently performs better for 3200 and 6400 bits at all the retrieved images in both distance settings.
5.1 Overview of the proposed multi-modal zero-shot learning approach.

Chapter 1: Introduction

Image (video) retrieval is the problem of retrieving images (videos) similar to a query. Images (videos) are represented in an input (feature) space, and similar images (videos) are obtained by finding nearest neighbors in that representation space. This has been a challenging task over the decades due to the proliferation of available visual data on the web. To alleviate this problem, efficient input representations have been proposed for visual data in both real-valued and binary spaces to facilitate faster retrieval and provide efficient storage. However, input representations learned for one retrieval setting may not be suitable for others, so these representations need to be tailored to the different objectives of retrieval. For example, binary codes that perform best in an unsupervised setting need not show a similar trend in a supervised setting. In an unsupervised setting, binary codes preserve similarity in the Euclidean space of the original features, while in a supervised setting, binary codes preserve similarity in the semantic space (given by the class labels) of images. The performance gap arises because Euclidean neighbors need not be semantic neighbors. Similarly, encoding schemes suitable for compressing low-dimensional features cannot be readily extended to high-dimensional features due to computational challenges. In this thesis, we address the problem of learning efficient input representations for some of the less explored retrieval settings and their practical aspects, and show that we remarkably improve performance over existing state-of-the-art retrieval methods.

Specifically, we want to improve retrieval of images and videos by learning efficient binary and concept representations respectively. In particular, we focus on learning improved binary codes in the first three parts of the thesis, while we learn concept-based representations for videos in the last part. In the first part, we address the practical aspects of achieving faster retrieval with binary codes in a supervised setting, where the task is to retrieve images of the same class as a given query image. We also propose a deep model and show superior results over the state of the art, and we show that we achieve faster retrieval times with the proposed class Hamming distance metric. In the second part, we propose a supervised learning method that takes into account the relationships between classes using output embeddings, which are vector representations of classes. Our hashing scheme learns such relationship-aware binary codes by minimizing the difference between the inner product of the binary codes and the similarity of the classes. We also introduce new evaluation metrics that consider retrieved images of related classes and show significant gains over the state of the art. In the third part, we learn binary codes for large-dimensional vectors like Fisher Vectors and Vectors of Locally Aggregated Descriptors (VLAD) using hyperspherical hashing functions rather than the traditional hyperplane hashing functions. We show that the codes learned using hyperspherical hashing functions are compact and yield better performance than traditional hyperplane-based hashing methods. The last part differs from the previous parts, but embodies the retrieval principles for classification.
This work specifically addresses event classification of videos in the zero-shot setting, where no training examples associated with the event are known prior to the classification task. We use a concept-based video representation to solve this. Further, we show that multimodal fusion of the scores from visual, audio and text concepts remarkably boosts the performance of our system. The outline is given below.

1.1 Outline

In Chapter 2, we address the practical aspects of achieving faster retrieval with supervised hashing by mapping all images of the same class to a unique binary code. In Chapter 3, we learn relationship-aware binary codes that rank related-class images before unrelated-class images when same-class images of the query are not retrieved. In Chapter 4, we present a hierarchical approach to compress large-dimensional vectors using hyperspherical hashing functions. In Chapter 5, we present a complete zero-shot classification system for videos, solved in a retrieval setting.

• Semantic Binary Codes: Fast image retrieval is required for many applications like image search and shopping, especially for large datasets. Hashing addresses this problem by learning compact binary codes for images and using them as direct addresses into hash tables. In practice, using binary codes as addresses does not guarantee fast retrieval, as similar images are not mapped to the same binary code (address). We address this problem by presenting an efficient supervised hashing method that aims to explicitly map all images of the same class to a unique binary code. We refer to the binary codes of the images as 'Semantic Binary Codes' and the unique code for all same-class images as the 'Class Binary Code'. We formulate this intuitive objective 'directly' by minimizing a squared error criterion between the semantic binary codes and the corresponding class binary codes. We further propose a Deep Semantic Binary Code model that utilizes the class binary codes and show that we significantly outperform the state of the art. We also propose a class-based Hamming metric that dramatically reduces retrieval times for larger databases and also improves the performance of the method by large margins on the Cifar-100, Caltech-256 and ILSVRC2010 datasets.

• SHOE: Supervised Hashing with Output Embeddings: In this work, we present a supervised binary encoding scheme for image retrieval that learns projections by taking into account the similarity between classes obtained from output embeddings. Our motivation is that binary hash codes learned in this way improve both the visual quality of retrieval results and existing supervised hashing schemes. We employ a sequential greedy optimization that learns relationship-aware projections by minimizing the difference between inner products of binary codes and output embedding vectors. We develop a joint optimization framework to learn projections which improve the accuracy of supervised hashing over the current state of the art with respect to standard and sibling evaluation metrics. We further boost performance by applying a supervised dimensionality reduction technique to kernelized input CNN features. Experiments are performed on three datasets: CUB-2011, SUN-Attribute and ImageNet ILSVRC 2010.
As a by-product of our method, we show that using a simple k-NN pooling classifier with our discriminative codes improves over complex classification models on fine-grained datasets like CUB and offers an impressive compression ratio of 1024 on CNN features.

• Hierarchical Spherical Hashing for Compressing High Dimensional Vectors: High-dimensional vectors like Fisher Vectors and Vectors of Locally Aggregated Descriptors (VLAD) have been shown to be effective for many computer vision applications like classification, detection and retrieval. However, due to their storage complexity, it becomes intractable to use these features for retrieval. In this work, we address the problem of compressing such high-dimensional vectors to high-dimensional binary codes in an unsupervised setting. We present a hierarchical approach to compress large-dimensional vectors using hyperspherical hashing functions. We provide a practical solution for learning hyperspherical hashing functions by partitioning the vectors and learning hyperspheres in subspaces. Our method is an efficient way to preserve the hashing properties of the subspace hashing functions while generating the full hashing functions in a divide-and-conquer fashion. We demonstrate the performance of our approach on the ILSVRC2010 Validation dataset and two large-scale datasets, ILSVRC2010 Train and Holidays+Flickr 1M, with high-dimensional representations of size 128000, 25600 and 12800 respectively. Our results highlight the compact nature of hyperspherical hashing functions, which significantly outperform state-of-the-art methods at compression ratios of 512, 256 and 128. Furthermore, we boost the retrieval performance by introducing an asymmetric distance for spherical hashing functions.

• Zero-shot Event Detection using Multi-modal Fusion of Weakly Supervised Concepts: Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this work, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This task goes beyond the traditional zero-shot framework of adapting a given set of classes with training data to unseen classes. We leverage video and image collections with free-form text descriptions from widely available web sources to learn a large bank of concepts, in addition to using several off-the-shelf concept detectors, speech, and video text for representing videos. We utilize natural language processing technologies to generate event description features. The extracted features are then projected to a common high-dimensional space using text expansion, and similarity is computed in this space. We present extensive experimental results on the large TRECVID MED [3] corpus to demonstrate our approach. Our results show that the proposed concept detection methods significantly outperform current attribute classifiers such as Classemes [4], ObjectBank [5] and SUN attributes [43]. Further, we find that fusion, both within as well as between modalities, is crucial for optimal performance.

Chapter 2: Semantic Binary Codes

2.1 Introduction

Approximate Nearest Neighbor (ANN) search for image retrieval using compact binary codes has been extensively investigated for faster retrieval and efficient storage. Encoding images with short bit vectors offers great compression. Retrieval is faster predominantly because binary codes can be used as direct addresses into hash tables, and ANNs in principle can be identified with just a few lookups, compared to an exhaustive linear scan [6]. In practice, using binary codes as addresses does not guarantee fast ANN search, as similar images are not mapped to the same binary code (address). Instead, nearest neighbors must be discovered by examining binary codes within some Hamming radius around the query code, and the number of lookups required grows exponentially with the radius, as the sketch below illustrates. State-of-the-art hashing algorithms then resort to a linear scan for large radii.
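To make this growth concrete, the number of hash-table buckets that must be probed to cover a Hamming ball of radius r around an l-bit query code is the sum of binomial coefficients C(l, i) for i = 0, ..., r. The following is a minimal Python sketch; the bit length and radii are example values and are not tied to any experiment in this thesis.

```python
from math import comb

def buckets_in_hamming_ball(l, r):
    """Number of distinct l-bit codes within Hamming distance r of a query code."""
    return sum(comb(l, i) for i in range(r + 1))

# Example: lookups needed for a 64-bit code at increasing search radii.
for r in [0, 1, 2, 4, 8]:
    print(f"radius {r}: {buckets_in_hamming_ball(64, r):,} buckets to probe")
```

Even at radius 2 this is already a few thousand lookups for 64-bit codes, and by radius 8 the count exceeds the size of most databases, which is why a linear scan becomes preferable.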
Therefore, there is an explicit need for a hashing algorithm that learns to map similar images to the same binary code. There exist supervised [7–10] and unsupervised [11–14] hashing algorithms for learning similarity-preserving binary codes. Unsupervised hashing algorithms try to preserve Euclidean neighbors while supervised hashing algorithms try to preserve semantic neighbors.

Figure 2.1: Illustration of the Semantic Binary Codes idea and retrieval using Class Binary Codes. Here, instead of searching 9 image binary codes, we search only 3 class binary codes.

The notion of similarity is better defined for semantic neighbors than for Euclidean neighbors, as similarity values are discrete (1 for same, 0 for different) for the former and continuous for the latter. In this work, we develop a new approach to supervised hashing, where we want to learn both image and class binary codes such that all images from the same class are ideally assigned a unique binary code. We refer to the image binary codes as 'Semantic Binary Codes' (SBC) and the unique binary code for a class as the 'Class Binary Code' (CBC). Figure 2.1 illustrates the idea of our method. By learning both these codes, the performance of a hashing algorithm is naturally improved by retrieving more relevant neighbors within a smaller Hamming ball around the query code.

Only a handful of approaches [7–9, 15, 16] have been developed to learn a unique binary code for all images from the same class, but they do not formulate this directly in the objective. Instead, they learn the same binary codes for same-class images and different binary codes for different-class images. Further, their optimization is complex due to the pair-wise formulation, which forces them to resort to a greedy iterative procedure of learning bit by bit and leads to a slow training process. In contrast to these approaches, we propose a simple and efficient supervised hashing algorithm with a point-wise formulation of the objective. The first objective eliminates noise by learning projections of the input features such that the projected vectors are highly correlated with class vectors. The second objective then minimizes the quantization error between semantic and class binary codes. We jointly optimize these two objectives to obtain the hash functions that meet our goal. Further, we propose a Deep Semantic Binary Code model that utilizes the CBCs.

The contributions of our work are as follows: 1. To the best of our knowledge, our approach is the first 'direct' attempt to learn Semantic Binary Codes (a unique binary code for all semantically similar images) and Class Binary Codes by formulating an intuitive objective function. 2. We present a novel joint learning framework for simultaneous dimensionality reduction and learning of semantic binary codes.
Using a least squares formulation for both objectives, we obtain a simple objective function which has an efficient closed-form solution. 3. We also propose a new metric: the Class Hamming Distance (CHD). Using CHD, we retrieve results for a query by calculating the Hamming distance only to the class binary codes, rather than to all the database codes, leading to a large decrease in retrieval times. 4. We also demonstrate superior performance and achieve faster training and retrieval times compared to state-of-the-art shallow supervised hashing techniques on the Caltech-256, Cifar-100 and ILSVRC2010 Train datasets. The proposed Deep Semantic Binary Code model also outperforms state-of-the-art deep supervised hashing techniques.

The remainder of this chapter is organized as follows. We first discuss related work in Section 2.2. We propose a novel joint framework for dimensionality reduction and learning semantic and class binary codes, along with its deep learning model, in Section 2.3. Retrieval experiments are reported in Section 2.4. Finally, we conclude in Section 2.5.

2.2 Related Work

Supervised hashing methods in [7–9, 15, 17–19] use pair-wise formulations that lead to complex optimization procedures. Liu et al. [7] proposed the Supervised Hashing with Kernels (KSH) method, which minimizes the difference between the inner products of the binary codes and the similarity for pairs of images; similarity is 1 (-1) for same (dissimilar) class pairs. They obtain a simple, clean objective function, unlike the complicated objective functions in MLH [15] and BRE [20]. FastHash [8] applies a two-step procedure [17] to solve the KSH loss function: first, they learn the binary codes using graph cuts, and then they learn the parameters of the hash function with boosted decision trees. They accommodate training with large numbers of samples and high-dimensional features and show better performance than KSH [7]. Recently, Ge et al. [9] introduced the Graph Cuts Coding (GCC) method, which also adopts a two-step approach [17], but iteratively: at each iteration, they first learn the binary codes using graph cuts and then learn the hash functions using linear SVMs. Rastegari et al. [18] propose learning predictable and discriminative hash functions using a linear or kernel SVM formulation with bits as labels. For each hash function, they minimize (maximize) the distance between the codes of same (dissimilar) class images. All these hashing methods obtain similar codes for same-class images by minimizing (maximizing) intra (inter) class distances between image codes, but not between class and image codes. Recently, Wang et al. [19] construct compact binary codes for both images and tags such that the observed tags are consistent with the constructed binary codes. They minimize three criteria: the difference between similar tag and image codes, the difference between similar image codes, and the difference between similar tag codes. This is highly suitable for datasets with weakly labeled information which require fast image tagging; note that binary codes are learned for tags but not for classes. Very recently, Supervised Discrete Hashing (SDH) [10] learns binary codes that are optimal for classification. It employs a joint optimization framework which jointly learns a binary embedding and a linear classifier. SDH also obtains a simple objective function that yields a closed-form solution by incorporating a discrete coordinate descent method to learn one bit at a time.
While our end goal is not specifically to learn better classification codes, we intend to learn semantic binary codes and binary representations for classes.

We also review several deep supervised hashing methods [16, 21–25] that have been proposed and have shown superior performance over shallow methods. Semantic Hashing (SH) [6], one of the earliest works, uses Restricted Boltzmann Machines (RBMs) to train binary autoencoders; however, SH did not do well due to the limited computational power available at the time. The following deep supervised hashing (DSH) models have since been proposed: Sparse Similarity-Preserving Hashing (SSPH) [21], Deep Hashing (DH) [22], Convolutional Neural Network Hashing (CNNH) [16], Deep Semantic Ranking Hashing (DSRH) [23], Deep Neural Network Hashing (DNNH) [24] and Deep Learning of Binary Hash Codes for Fast Image Retrieval (DLH) [25]. These networks yield a binary code by applying either a hyperbolic tangent or a sigmoid function to the output, followed by a thresholding operation that produces the final binary hash code. Some networks, like SSPH and DH, assume a hand-crafted visual representation and learn binary codes that minimize the reconstruction error; these methods are limited by the quality of the assumed hand-crafted representation. To overcome this, the CNNH, DSRH, DNNH and DLH networks do not assume a specific feature representation and instead integrate both feature learning and hash value learning in the optimization. These latter methods have subsequently shown better performance. For a comprehensive review of DSH schemes, we refer the reader to the short survey by Liu (http://www.ee.columbia.edu/~wliu/WeiLiu_DLHash.pdf). We also present a deep extension of the SBC model, the Deep Semantic Binary Code model, in which we learn a deep neural network that utilizes the class binary codes.

2.3 Semantic Binary Codes

Input features/labels: We are given a data set of n samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d is the input feature and y_i ∈ {1, 2, ..., k} denotes the class label of the i-th sample. Let Y ∈ R^{k×n} denote the class indicator matrix with one-of-k embeddings, i.e., Y_{ij} = 1 if the class of the j-th sample is i and −1 otherwise. We apply radial basis kernel features with m random sample points to obtain φ(x) : x → x̄ ∈ R^m, as is done commonly in [7, 10, 26].

Objective: We aim to jointly minimize both the dimensionality reduction error and the quantization error between the image binary codes and the corresponding class binary codes.

Input projections/hash projections: We want to learn semantic binary codes b_i ∈ {−1, 1}^l and the corresponding class binary codes c_{y_i} ∈ {−1, 1}^l, both of length l, for the i-th sample, such that a unique binary code is learned for all same-class images. We utilize the hash functions {h_j(φ(x)) = sgn(u_j^T φ(x))}_{j=1}^l to obtain l bits, where sgn(.) is the sign function, which outputs +1 for a positive number and −1 otherwise, and u_j ∈ R^m is a column vector that gives bit b_{ij} for the i-th sample. Let U = [u_1, u_2, ..., u_l] ∈ R^{m×l}, which can be decomposed into two projection matrices V and W. The corresponding hash functions are then {h_j(φ(x)) = sgn(w_j^T V^T φ(x))}_{j=1}^l, where V ∈ R^{m×r} contains the input projections and w_j ∈ R^r are the hash projections on the reduced features V^T φ(x). We solve the following joint formulation of the objective:

\[
\min_{W,V,C} \sum_{i=1}^{n} \left\| V^T \phi(x_i) - Y_i \right\|_2^2 + \sum_{i=1}^{n} \left\| b_i - c_{y_i} \right\|_2^2, \qquad b_i = \operatorname{sgn}\!\left(W^T V^T \phi(x_i)\right) \tag{2.1}
\]

where ||.||_2^2 is the squared L2 norm.
The first term of the objective eliminates noise by learning projections of the input features such that the projected vectors are highly correlated with the class vectors. This is a least squares error formulation for dimensionality reduction and has been shown to be equivalent to CCA [27] with an appropriate form of the class indicator matrix, (Y Y^T)^{-1/2} Y. Similarly, the second term of the objective decreases the quantization error between image and class binary codes. Relaxing the sign function in the objective, we get:

\[
\min_{W,V,C} \left\| V^T \phi(X) - Y \right\|_F^2 + \left\| W^T V^T \phi(X) - C \right\|_F^2, \qquad B = \operatorname{sgn}\!\left(W^T V^T \phi(X)\right) \tag{2.2}
\]

where ||.||_F^2 is the squared Frobenius norm and C ∈ {−1, 1}^{l×n} with C_i = c_{y_i} denotes the corresponding class codes for all the samples. Solving this is an NP-hard problem as there are many variables to learn. We therefore employ an iterative procedure, solving for one variable at a time.

Solving for V: Given the binary codes of the classes C and the hash projections W, we can solve for the input projections V. Let S = V^T φ(X); we first learn S and then learn V. Expanding (2.2), we obtain

\[
\operatorname{tr}\!\left(S^T (I + W W^T) S\right) - 2\operatorname{tr}\!\left(S^T (W C + Y)\right) + nl + \text{const}
\]

where tr(.) is the trace function. Taking the derivative of this expression with respect to S and equating it to zero yields:

\[
S = (W W^T + I)^{-1} (W C + Y) \tag{2.3}
\]

Given the reduced features S, we then obtain the input projections V as:

\[
V = \left(\phi(X)\phi(X)^T\right)^{-1} \phi(X) S^T \tag{2.4}
\]

Now, we can learn the binary codes of images B and classes C independently of the dimensionality reduction step by solving:

\[
\min_{W,C} \left\| W^T S - C \right\|_F^2 \quad \text{s.t.} \quad B = \operatorname{sgn}(W^T S) \tag{2.5}
\]

Solving for W: Given the binary codes of the classes C, we obtain the hash projections W on the reduced features S by taking the derivative of (2.5) and setting it to zero. We obtain the following closed-form solution for W:

\[
W = (S S^T)^{-1} S C^T \tag{2.6}
\]

Utilizing the learned hash projections, we obtain the binary codes for the images B by simply taking the sign of the projected vectors, B = sgn(W^T S).

Solving for C: Given the binary codes of the images B, we can learn the class binary codes without the relaxed formulation. We rewrite the objective in (2.5) with discrete constraints on the binary codes, for each sample, as:

\[
\min_{C} \sum_{i=1}^{n} \left\| b_i - c_{y_i} \right\|_2^2 \quad \text{s.t.} \quad c_{y_i} \in \{-1, 1\}^l \tag{2.7}
\]

Table 2.1: Training time Tr(t) (in seconds) and retrieval time R(t) (in milliseconds) on Caltech-256 (128 bits), Cifar-100 (96 bits) and ILSVRC2010 (256 bits). (C) denotes that CHD is used.

    Tr(t) in sec    SBC(C)   SBC    SDH   ITQ   FastH   KSH
    Cal256          31       31     408   10    2413    2e+5
    Cifar100        44       44     378   8     739     5800
    Ilsvrc10        116      116    283   81    8580    3e+4

    R(t) in ms      SBC(C)   SBC    SDH   ITQ   FastH   KSH
    Cal256          0.25     1.4    1.4   1.4   2.6     1.4
    Cifar100        0.13     2.9    2.9   2.7   4.8     2.9
    Ilsvrc10        0.94     351    351   350   354     351

It is easy to see that the objective is minimized when each class code is obtained by simply taking the sign of the sum of the bit vectors of its samples. To see this, let Y_κ contain the indices of the samples belonging to class κ and let n_κ = |Y_κ| be the number of samples in that class.
We can now rewrite the objective function independently for each class:

\[
\min_{C} \sum_{j \in Y_1} \left\| b_j - c_1 \right\|_2^2 + \sum_{j \in Y_2} \left\| b_j - c_2 \right\|_2^2 + \ldots + \sum_{j \in Y_k} \left\| b_j - c_k \right\|_2^2 \tag{2.8}
\]

Expanding the objective for a single class, we get:

\[
\min_{c_\kappa} \sum_{j \in Y_\kappa} \left\| b_j - c_\kappa \right\|_2^2 = \Big(\sum_{j \in Y_\kappa} \left\| b_j \right\|_2^2\Big) + n_\kappa \left\| c_\kappa \right\|_2^2 - 2\Big(\sum_{j \in Y_\kappa} b_j^T\Big) c_\kappa = n_\kappa l + n_\kappa l - 2\,\tilde{B}_\kappa^T c_\kappa \tag{2.9}
\]

where B̃_κ = Σ_{j ∈ Y_κ} b_j. As c_κ is a binary code in {−1, 1}^l, it is easy to infer that (2.9) is minimized when the product B̃_κ^T c_κ is maximized. This happens only when the class code c_κ takes the same sign as the summed bit vector B̃_κ. In other words, each bit of the class code is given by the maximum-voted bit of all the samples of the class or, equivalently, by the centroid of the binary codes of the class samples. Given the binary code vectors, we have:

\[
c_\kappa = \operatorname{sgn}\Big(\sum_{j \in Y_\kappa} b_j\Big), \quad \forall \kappa \tag{2.10}
\]

Table 2.2: Comparison of various metrics and methods for 128 bits on Caltech-256, except for BRE, where we report performance with 64 bits as the method did not converge in 8 hours for 128 bits.

    Method      #Train   #anchor   mAP     preH@0   pre@1
    SBC(C)      2560     512       40.2    17.5     54.49
    SBC(C)      6400     1280      54.7    47.8     64.02
    SBC(C)      6400     2560      62.7    60.11    66.03
    SBC(C)      12800    2560      63.64   59.73    66.56
    SBC         12800    2560      59.63   10.59    65.33
    SDH         12800    2560      55.4    18.3     65.13
    ITQ+CCA     12800    -         49.44   4.7      59.7
    FastHash    12800    -         55.85   0.1      59.8
    KSH         6400     1280      46.29   11.6     59.8
    BRE         6400     -         15.98   0.1      40.4

Figure 2.2: The top row and the bottom row show the performance of various metrics for varying numbers of bits on the Cifar-100 and Caltech-256 datasets respectively.

We have now learned all the variables, one at a time. We do this iteratively until the objective converges or we reach a maximum number of iterations. We empirically observed that the whole algorithm converges very quickly, typically in fewer than 3 iterations; in either case, we set the maximum number of iterations to 5. We use initialization from CCA to learn the initial W and then obtain an initial B and C, using the initial solution of V as (φ(X)φ(X)^T)^{-1} φ(X) Y^T. The detailed SBC algorithm is given in Algorithm 1.

2.3.1 Class Indicator Matrix Y

Given a mean-centered Y matrix, let H = (Y Y^T)^{-1/2} Y. [27] shows that when Y = H, solving equation (2.11) is equivalent to solving the CCA problem or an Orthonormal Partial Least Squares problem under the following criteria: 1) the minor condition that rank(X) = n-1 or rank(Y) = k is satisfied, which generally holds true for high-dimensional data, and 2) the final goal is to do nearest neighbor search or train SVMs on the projected features. Since these criteria hold in our case, we use the class indicator matrix Y = H and refer to the DR step as LS-CCA.

2.3.2 Initialization of V

We obtain the initial solution for V by solving the LS-CCA problem (2.11):

\[
\min_{V} \left\| V^T \phi(X) - Y \right\|_F^2 \tag{2.11}
\]

where ||.||_F^2 is the squared Frobenius norm and Y = H, as given in Section 2.3.1. Solutions of equation (2.11) are given by ridge regression as:

\[
V = \left(\phi(X)\phi(X)^T\right)^{-1} \phi(X) Y^T \tag{2.12}
\]
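Putting the closed-form updates (2.3), (2.4), (2.6) and (2.10) together with the initialization (2.12), the following is a minimal NumPy sketch of the alternating optimization described above. It is an illustrative sketch only: the variable names, the random stand-in for the CCA initialization of W, the small ridge terms and the fixed reduced dimension r = k (so that V^T φ(X) and Y have matching shapes) are assumptions, not the reference implementation.

```python
import numpy as np

def learn_sbc(phi_X, labels, num_classes, num_bits, iters=5, seed=0):
    """Sketch of the SBC alternating optimization (eqs. 2.3, 2.4, 2.6, 2.10).

    phi_X  : (m, n) kernelized features phi(X)
    labels : (n,) integer class ids in {0, ..., k-1}
    """
    rng = np.random.default_rng(seed)
    m, n = phi_X.shape
    k, l = num_classes, num_bits

    # One-of-k class indicator in {-1, +1}, mean centered
    # (the (YY^T)^(-1/2) scaling of Section 2.3.1 is omitted in this sketch).
    Y = -np.ones((k, n))
    Y[labels, np.arange(n)] = 1.0
    Y -= Y.mean(axis=1, keepdims=True)

    # Initialization of V by ridge regression, eq. (2.12); W is a random
    # stand-in for the CCA initialization used in Algorithm 1.
    G = phi_X @ phi_X.T + 1e-6 * np.eye(m)
    V = np.linalg.solve(G, phi_X @ Y.T)                 # (m, k)
    S = V.T @ phi_X                                     # reduced features, (k, n)
    W = rng.standard_normal((k, l))

    for _ in range(iters):
        B = np.sign(W.T @ S)                            # image codes, (l, n)
        C_class = np.sign(np.stack([B[:, labels == c].sum(axis=1)
                                    for c in range(k)], axis=1))   # class codes, eq. (2.10)
        C = C_class[:, labels]                          # per-sample class codes, (l, n)
        S = np.linalg.solve(W @ W.T + np.eye(k), W @ C + Y)        # eq. (2.3)
        V = np.linalg.solve(G, phi_X @ S.T)                        # eq. (2.4)
        W = np.linalg.solve(S @ S.T + 1e-6 * np.eye(k), S @ C.T)   # eq. (2.6)

    U = V @ W   # final hash projection; a query q is encoded as sgn(U^T phi(q))
    return U, C_class
```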
2.3.3 Class Hamming Distance (CHD)

Most hashing methods that employ the hyperplane hashing function form h(x) = sgn(U^T x) use the Hamming distance (HD) between the query code b_q and the database codes b_d to compute nearest neighbors in the database for a query q. The Hamming distance for binary codes b_q, b_d ∈ {0, 1}^l is given by HD(b_q, b_d) = |b_q ⊕ b_d|, where ⊕ is the XOR operation and |.| denotes the number of +1 bits in a binary code. Since we learn class binary codes, we propose a suitable metric: the class-based Hamming distance (CHD). Given the binary codes for each class, CHD retrieves nearest neighbors for a query q in two steps. First, we compute the nearest class binary code c, i.e., c = c_κ with κ = arg min_j HD(c_j, b_q). Second, we retrieve the nearest neighbors in the database to the class binary code c (instead of the test binary code b_q). Note that we do not use any class information when we rank the nearest neighbors in the database to the class binary code. The advantages of doing this are twofold: 1) For a correctly assigned class binary code, there is a high probability of always obtaining same-class images as the top retrievals, but the same cannot be said for an incorrect assignment; we expect to reduce the overall error using this approximation. 2) Retrieval is much faster, as we only need to compute Hamming distances to the class codes. For each class code, the nearest neighbors in the database can be computed offline once and the ranking stored, so nearest neighbors can then be returned with a single lookup. The total retrieval time is only O(k) instead of O(n), with k << n (assuming linear search).
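A minimal sketch of the two-step CHD retrieval just described, assuming the class codes and a precomputed per-class ranking of the database are available; the array names and the 0/1 code layout are illustrative assumptions.

```python
import numpy as np

def hamming(a, b):
    """HD(a, b) = |a XOR b| for binary codes stored as 0/1 arrays."""
    return int(np.count_nonzero(a != b))

def chd_retrieve(query_code, class_codes, class_rankings):
    """Two-step class Hamming distance retrieval.

    query_code     : (l,) binary code of the query in {0, 1}
    class_codes    : (k, l) class binary codes
    class_rankings : list of k database rankings, computed offline by ranking
                     every database code by Hamming distance to each class code
    """
    # Step 1: O(k) scan over the class codes only.
    nearest_class = min(range(len(class_codes)),
                        key=lambda j: hamming(class_codes[j], query_code))
    # Step 2: a single lookup of the stored ranking for that class.
    return class_rankings[nearest_class]
```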
2.3.4 Deep Semantic Binary Codes

Recently, deep supervised hashing (DSH) methods [16, 24, 25] have been shown to improve hashing performance by large margins over shallow methods [7, 28]. The SBC model is a single-layer neural network and does not take advantage of deep training. We present a deep neural network method for SBC that also aims to learn a unique binary code for all same-class images. We do so by training a deep network in which the learned Class Binary Codes form the output layer, as shown in Figure 2.3.

Figure 2.3: Neural network architecture for the Deep Semantic Binary Code model.

We retrain the AlexNet [29] model by replacing the 1000-dimensional output layer (i.e., fc8) with the Class Binary Codes (Equation (2.10)) of the images and use the sigmoid cross-entropy loss as the loss function for the output layer. For a new image, hash codes are obtained by thresholding the sigmoid activations of the output (thr = 0.5). We refer to this model as the Deep Semantic Binary Code model and to the binary codes learned from the network as Deep Semantic Binary Codes (SBC-D). We expect the SBC-D model to preserve the semantic similarity between same-class images by learning non-linear mappings between the image pixels and the class binary codes.

2.4 Experiments

We evaluate our proposed supervised hashing method (SBC) and its deep model on three datasets containing large numbers of categories: Caltech-256 [30], Cifar-100 [31] and ILSVRC2010 Train [32]. We compare our approach to the state-of-the-art supervised hashing techniques SDH [10], FastHash [8], KSH [7], ITQ with CCA (ITQ+CCA) [28] and BRE [20]. We use their publicly available implementations and the suggested parameters to obtain their best performance.

Figure 2.4: Plots showing Precision@radius 0, Precision@radius 2 and mAP@10K for the ILSVRC2010 Train dataset.

The Caltech-256 dataset consists of 30607 images from 256 categories and 1 background category. We compute 4096-dimensional Convolutional Neural Network (CNN) features from the fc7 layer of AlexNet [29] for each image using the Caffe library developed by [33]. The Cifar-100 dataset contains a subset of 60K images from 100 categories of the Tiny 80M dataset [34]. These images are only 32x32 pixels, so we compute only 512-dimensional GIST features. For the ILSVRC2010 Train dataset, containing 1.2 million images of 1000 categories, we also compute fc7 CNN features. We kernelize the features, which take the form {κ(x, a_1), κ(x, a_2), ..., κ(x, a_m)}, where κ is a radial basis kernel and m is the cardinality of a subset of the training samples designated as 'anchor points', as done in [7, 10, 26]. All the features are unit normalized.

2.4.1 Train/Test Partitions

The Caltech-256 dataset contains around 30607 images from 256 categories and 1 background category. In this dataset, the number of images per category varies a lot, with a minimum of 80 images and a maximum of around 800 images in a category. To weigh the performance of each class equally, we follow the general testing protocol for Caltech-256, i.e., we choose an equal number of test images (N = 25) per category as queries and 50 images per category to form the retrieval set. We choose three subsets of the retrieval set for training, with {10, 25, 50} images per category, to form train sets of 2560, 6400 and 12800 images. We compute the Radial Basis Function (RBF) features φ(x) with varying numbers of anchor points: m = {512, 1280, 2560}. Cifar-100 consists of images of 100 classes, with each class containing 600 samples. The entire dataset is split into a test set with 1000 samples (10 per class) and a retrieval set with all remaining samples. We report performance for methods trained with subsets of size 5000 and with the whole retrieval set, and construct kernelized features with 1000 and 3000 anchor points. For the ILSVRC2010 Train dataset containing 1.2M images of 1000 categories, we randomly select 1000 queries and use the rest as the retrieval dataset. For training, we chose 50K samples (50 per class) from the retrieval set and again compute RBF features with 3000 anchor points. For all the datasets, we construct the ground truth from class label information, which is required for supervised training. During retrieval, we do not use any class label information.

2.4.2 Evaluation metrics

We report the standard retrieval performance metrics: Precision@K (pre@K), Mean Average Precision (mAP) and Precision@Hamming radius r (preH@r). pre@K gives the precision over the top-K nearest neighbors retrieved; here, we report K = 10 for both the Caltech-256 and Cifar-100 datasets. mAP computes the area under the precision-recall curve, thereby evaluating the performance of the method over all the ranked images. For Caltech-256 and Cifar-100 we report mAP, and for ILSVRC2010 Train we report mAP@10K. preH@r is the percentage of images within a Hamming radius r of the query code that belong to the same class as the query. For r = 0, these are exact collisions, i.e., images mapped to the same code as the query code. We penalize the query when there are no retrievals within Hamming radius r. We report Precision@Hamming radius 2 for the Caltech-256 and Cifar-100 datasets; for ILSVRC2010 Train, we report Precision@Hamming radius 0 and 2. We show the performance of our method with both the Hamming distance and the class-based Hamming distance (CHD), and append the notation (C) when CHD-based performance is reported. For the Caltech-256, Cifar-100 and ImageNet datasets, we learn {16, 32, 64, 128, 192}, {8, 16, 32, 64, 96} and {32, 64, 128, 256} bits respectively. The sketch below makes the pre@K and preH@r conventions precise.
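A small sketch of pre@K and preH@r as defined above, assuming per-query boolean relevance (same class as the query) and precomputed Hamming distances; the function and array names are illustrative only.

```python
import numpy as np

def precision_at_k(relevant_sorted, k=10):
    """pre@K: fraction of the top-K retrieved items that share the query's class.
    relevant_sorted: boolean array over database items, sorted by Hamming distance to the query."""
    return float(relevant_sorted[:k].mean())

def precision_at_hamming_radius(relevant, distances, r=2):
    """preH@r: precision among items within Hamming radius r of the query code.
    A query with no retrievals inside the radius is penalized, i.e. scored 0."""
    inside = distances <= r
    if not inside.any():
        return 0.0
    return float(relevant[inside].mean())
```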
2.4.3 Comparison with other methods

BRE, KSH and FastHash incur high training times for large numbers of training samples. In our experiments, for efficiency reasons, we show the performance of these methods with only 6400 training samples for the Caltech-256 dataset and 5000 training samples for the Cifar-100 dataset. SDH [10] can be solved with either discrete or relaxed constraints on the binary codes when learning the classifier; its authors show that the discrete binary constraints are critical to obtaining maximum performance, so we only show the results of SDH with discrete constraints. Both SDH and ITQ+CCA are scalable and efficient; they obtain their maximum performance with larger numbers of training samples and, therefore, we show their performance when trained on the larger training set.

Table 2.2 compares the performance of our methods learned with various numbers of training samples for 128 bits on the Caltech-256 dataset. We vary the number of anchor points up to 2560 and the number of training samples up to 12800 and show that performance improves with more anchor points and larger training sets. For all the reported metrics, we outperform the state-of-the-art methods by a large gap, and we do this with very small training and retrieval times, as indicated in Table 2.1.

The top row of Figure 2.2 shows performance on the Cifar-100 dataset. Cifar-100 has a smaller number of classes (100) than Caltech-256 but many samples per class. The results show that SBC(C) generalizes better than the second-best method, SDH, on all metrics. Here, we see that as we learn larger numbers of bits, SDH performs better than FastHash, KSH, ITQ+CCA and BRE but worse than both SBC and SBC(C). We also observe that the performance of SBC (without the class Hamming distance) and SDH converges for precision@radius 2 at 96 bits. The bottom row of Figure 2.2 compares the performance on the Caltech-256 dataset for various numbers of bits. We observe that SBC(C) shows superior performance consistently over all metrics: preH@2, mAP and pre@10. We report the performance of BRE only up to 64 bits as it did not converge within 8 hours for more bits.

We also compare our method on the large dataset (ILSVRC2010 Train) to demonstrate the scalability and efficiency of our method in Figure 2.4. Both SBC and SBC(C) perform significantly better than the other methods for varying numbers of bits on all the performance measures. In particular, there is a larger gap with the second-best method, SDH, for accurately finding nearest neighbors with exact collisions, even for longer bit lengths.
For KSH, we chose 10,000 training samples with 1000 anchor points to keep the training time under 8 hours. We do not show the results of BRE due to its high training times. For FastHash, we observe that the performance is very low with 5K training samples, so we use 50K samples at the expense of training time. We still see that it does not perform well on preH@0 and preH@2, as also shown in [10]. Table 2.1 reports the training and retrieval times of our method and the competing methods on the Caltech-256, Cifar-100 and ILSVRC2010 Train datasets. For all the datasets, SBC is efficient and scalable in training, and so is ITQ. The retrieval time of SBC with CHD is significantly lower than with HD, as we only compute Hamming distances to the class binary codes. Figure 2.5 shows qualitative results.

Table 2.3: Performance comparison of DSH methods for various metrics and numbers of bits on the Cifar-100 dataset.

    Method        64 bits                    96 bits
                  mAP    pre@1   preH@0      mAP    pre@1   preH@0
    SBC-D         57.9   57.4    52.4        49.7   52.5    45.4
    DLH [25]      14.2   26.6    5.8         26.9   42.4    2.7

2.4.4 Comparison Methods for SBC-D

It has been shown that DLH [25] outperforms the CNNH model and its variants proposed by Xia et al. [16], as well as other deep supervised hashing methods that preserve semantic similarities (SSPH [21], DH [22]). As training deep neural networks is computationally expensive, we compare SBC-D only with the best baseline: DLH [25]. We show the results for Cifar-100 with 100 classes and use the same train/test partitions described above.

2.4.5 Experiments for SBC-D

For training our model, we use the AlexNet [29] architecture and replace the output layer with the Class Binary Codes learned in Section 2.3. For training the model of DLH, we use their publicly available code (https://github.com/kevinlin311tw/caffe-cvprw15). For Cifar-100, we use a batch size of 100 and train two models, for learning 64 and 96 bits. We train for up to 50000 iterations, so that training is completed in a reasonable amount of time.

Table 2.4: Performance (mAP) comparison of DSH methods for various numbers of iterations on the Cifar-100 dataset.

    Iterations   64 bits            96 bits
                 SBC-D    DLH       SBC-D    DLH
    10,000       32.5     1.0       23.2     1.3
    20,000       37.3     4.2       25.3     8.5
    30,000       51.1     9.8       38.9     20.2
    40,000       55.4     12        45.1     23.4
    50,000       57.9     14.2      49.7     26.8

We report the mean average precision (mAP), precision@1 (pre@1) and precision@radius 0 (preH@0) for both methods in Table 2.3 at 50K iterations. Table 2.4 compares the performance of SBC-D and DLH across training iterations for both learned numbers of bits. We see that our model SBC-D outperforms DLH on all metrics for both 64 and 96 bits and improves significantly over SBC. The high precision of exact-collision binary codes (preH@0) of our method indicates that we achieve the goal of mapping all same-class images to a unique binary code. We report gains of 43.7% and 23% in mAP over DLH and gains of 42% and 30.7% over SBC for 64 and 96 bits respectively. We notice that the mAP of SBC-D is lower for 96 bits than for 64 bits. We expect that this is because of learning from an overcomplete representation, which resulted in overfitting. However, the performance of DLH improved from 64 to 96 bits, yet it is still significantly lower than that of our proposed deep model. Table 2.4 also suggests that the initial SBC-D models (i.e., at few iterations) are already efficient and do better than the best models of DLH for both 64 and 96 bits. Both methods improve performance with more iterations.
It is surprising that the DLH model did not do well on the Cifar-100 dataset, in contrast to the competitive results reported for the Cifar-10 dataset in its original work. We suspect that this may be because the number of latent functions (i.e. bits) learned is smaller than the number of classes. We have shown that by transforming the labels to the class binary code space and replacing the output layer with the binary codes, an efficient supervised deep hashing method can be learned. We emphasize, however, that the improvement in performance with the deep model comes at the expense of training time (10 hours to train each of the 64- and 96-bit models), compared to only a few minutes for the SBC model.

2.5 Conclusion

The key idea in our work is to address the practical aspects of faster retrieval using image binary codes as addresses. Our goal was to learn supervised hashing functions such that all same class images are mapped to a unique binary code. We proposed a new supervised hashing algorithm that jointly minimizes the dimensionality reduction error and the quantization error between the image and class binary codes, by formulating independent objectives with a least squares criterion. A deep extension of the SBC model is also presented to take advantage of the deep training procedure. We also introduced the Class Hamming Distance, which utilizes the class binary codes and shows significant improvement in performance and retrieval times compared to the instance-based Hamming Distance. Our experiments demonstrate the superiority of both the shallow and deep proposed methods over the state-of-the-art.

Figure 2.5: Images of the Caltech-256 dataset that have the same semantic binary code given by the corresponding 128-bit SBC. For each row, we mention the most common category of the images. False positives are indicated by red boundaries.

Algorithm 1 Learning Semantic Binary Codes (SBC)
Input: Given {(x_i, y_i)}_{i=1}^{n} as (input, label) pairs, y_i ∈ {1, .., k}. Let X ∈ R^{d×n} denote the input feature matrix.
Preprocessing: Randomly choose m anchor points and construct φ(X) ∈ R^{m×n} using an RBF kernel. Also construct a binary one-of-k class indicator matrix Y ∈ {−1, 1}^{k×n} with Y_{ij} = 1 if the class of the j-th sample is i and −1 otherwise. Mean-center the matrix Y.
Equivalent class indicator matrix for LS-CCA: Construct H = (Y Y^T)^{−1/2} Y.
Goal: Learn l bits, i.e. l and r projections W ∈ R^{r×l} and V ∈ R^{m×r}, and class binary codes C.
Initialization:
  V: Solve for V using the ridge regression solution in equation (2.12), given Y = H and φ(X).
  W: Given V, apply the DR-mapping S = V^T φ(X). Then learn W using CCA with S and Y = H as input and output modalities.
  C: Obtain B = sgn(W^T S) and then C using (2.10).
Optimization: repeat
  - Solve for S and then V using equations (2.3) and (2.4), given Y = H and φ(X).
  - Solve for W using equation (2.6).
  - Obtain B = sgn(W^T S) and then c_κ using equation (2.10), ∀κ. Set C_i = c_{y_i}; for each bit, the most frequently occurring bit value is assigned.
until the convergence criterion is met or the maximum number of iterations is reached.
Final hash projection: U = V W.
Query (q) binary code: binarycode(q) = sgn(U^T φ(q)).
Class binary codes: given by C.

Chapter 3: SHOE: Sibling Hashing with Output Embeddings

3.1 Introduction

The presence of social networks, image hosting websites like Imgur and Flickr, and the ubiquity of mobile phones with cameras have resulted in a large-scale proliferation of images on the internet. Given a database of images, image retrieval seeks to return images from the database that are most similar to a query.
Recent applications of content based image retrieval (CBIR) include Google's "Search by Image" feature [35] and TinEye [36], which enable users to search for an image using a closely related image, rather than a keyword. Performing image retrieval on databases with billions of images is challenging due to the linear time complexity of nearest neighbor retrieval algorithms. Image hashing [7,8,11,12,15,37] addresses this problem by obtaining similarity preserving binary codes which represent high dimensional floating point image descriptors and offer efficient storage and scalable retrieval with sub-linear search times. These binary hash-codes can be learned in unsupervised or supervised settings. Unsupervised hashing algorithms map nearby points in a metric space to similar binary codes. Supervised hashing algorithms try to preserve semantic label information in the Hamming space: images that belong to the same class are mapped to similar binary codes.

In this work, we develop a new approach to supervised hashing, which we motivate with the example shown in Figure 3.1. Consider an image retrieval problem that involves a database of animals and a query image of a leopard. Now consider the following three scenarios:

1. If the retrieval algorithm returns images of leopards, we can deem the result to be absolutely satisfactory.

2. If the retrieval algorithm returns images of dolphins, whales or sharks, we consider the results to be absolutely unsatisfactory because not only are dolphins not leopards, they do not look anything at all like leopards.

3. If the retrieval algorithm returns the image of a jaguar or a tiger, we would be reasonably satisfied with the results. Although a jaguar is not the same as a leopard, it is semantically similar to one.

In this example, leopards, dolphins, whales, sharks, jaguars and tigers all belong to different categories. However, some of these categories are more closely related to each other than to other categories. Animals which fall under the "big cat" (Panthera) genus are related to each other, as are large aquatic vertebrates like dolphins, sharks and whales. We designate the related categories as "siblings". The question now becomes: how can we construct hash functions that rank sibling class images ahead of unrelated class images? Traditional supervised hashing algorithms like [7,8] use discrete binary labels to compute similarity between classes. We, on the other hand, consider information outside of the image domain to define inter-class relationships.

Figure 3.1: We prefer results II over I because they tend to retrieve images of classes (jaguar and tiger) related to the class label of the query (leopard), rather than unrelated classes (sharks and dolphins).

To study the relationships between categories, Weinberger et al. [38] suggested the concept of "output embeddings" - vector representations of category information in Euclidean space. There has been extensive work on "input embeddings", which are vector-space representations of images [39–41], but less work has been done on output embeddings, which map similar category labels to similar vectors in Euclidean space. In an output embedding space of animals, we would expect to have embeddings for labels so that chimpanzees, orangutans and gorillas are near each other, as are leopards, cheetahs, tigers and jaguars. In Section 3.2, we describe various methods from the literature to obtain output embeddings.
Our method, which uses output embeddings to construct hash codes in a supervised framework, is called SHOE: Sibling Hashing with Output Embeddings (Figure 3.2). Our motivation is the following: it is our belief that in the case of supervised hashing, a more desirable algorithm will retrieve images of sibling classes ahead of images of unrelated classes, while maintaining the retrieval performance for images of the same class. This improves the overall visual quality of retrieved results. We perform extensive retrieval experiments on the Caltech-UCSD Birds (CUB) dataset [42], the SUN Attribute dataset [43] and ImageNet [44]. Our hash-codes can also be used for classification, and we report accuracy on the CUB dataset using a nearest-neighbor classifier that is better than R-CNN and its variants [45,46]. The contributions of our work are as follows:

1. To the best of our knowledge, our approach is the first to introduce the problem of learning supervised hash functions using the modality of output embeddings.

2. We propose a joint learning method to solve the above problem, and perform retrieval and classification experiments to experimentally validate our method.

3. We propose two new evaluation criteria - "sibling metrics" and "weighted sibling metrics" - for gauging the efficacy of our method.

4. We significantly boost retrieval and classification performance by applying Canonical Correlation Analysis [47] on input features, and learn hash functions using output embeddings on these features.

The remainder of this work is arranged as follows. Section 3.2 describes related work. In Section 3.3 we describe our hashing framework, and we carry out experiments in Sections 3.4 and 3.5. We conclude in Section 3.6.

Figure 3.2: We perform k-means clustering on Word2Vec embeddings [1] of ImageNet classes. The principle behind SHOE is that images belonging to related classes (like leopard or tiger, which are nearby in the output embedding space) are mapped to nearby binary codes (represented by points on a binary hypercube). Images belonging to unrelated classes (like leopards and aircraft) are mapped to distant binary codes. This figure was created using [2] and is for illustrative purposes.

3.2 Related Work

Work on image hashing can be divided into unsupervised and supervised methods. For brevity, we only consider supervised methods. Supervised hashing algorithms are based on the objective of minimizing the difference between Hamming distances and the similarity of pairs of data points. Supervised Hashing with Kernels (KSH) [7] uses class labels to determine the similarity: points are considered 'similar' (value '1') if they belong to the same class and 'dissimilar' (value '-1') otherwise. They utilize a simplified objective function using the relation between the Hamming distance and the inner products of the binary codes. A sequential greedy optimization is adopted to obtain supervised projections. FastHash [8] also uses a KSH objective function but employs decision trees as hash functions and utilizes a GraphCut based method for binary code inference. Minimal loss hashing [15] uses a structured SVM framework [48] to generate binary codes with an online learning algorithm. All these methods except KSH categorize the input pairs as either similar or dissimilar. KSH allows a similarity 0 for related pairs, but the authors only use it to define metric neighbors and not semantic neighbors.
FastHash entirely ignores the related pairs, as it weighs the KSH loss function by the absolute value of the label. Furthermore, their work does not support a similarity value other than 1 and -1, as it violates the submodularity property - a crucial property required to solve the problem in parts. The idea of ordering binary codes has recently been proposed in Zhang et al. [49]; however, this has been done in an unsupervised setting. To the best of our knowledge, we are the first to use similarity information of sibling classes in a supervised hashing framework to obtain binary codes that respect the relation between classes. Our work is also different from others as we compute similarity from output embeddings and use a joint learning framework to learn the sibling similarity.

We now discuss related work on output embeddings. Output embeddings can be defined as vector representations of class labels. We use these vector representations to compute the distance between categories in the label space, in the belief that sibling classes are nearby in this space compared to unrelated ones. Output embeddings can be thought of as a modality exclusive of the pixel domain (or "input embeddings"), and can be divided into two types: data-independent and data-dependent. Some of the data-independent embeddings include [42,43,50–52]. Langford et al. [50] constructed output embeddings randomly from the rows of a Hadamard matrix, where each embedding was a random vector of 1 or -1. Data-dependent embeddings, on the other hand, can be constructed from side information about classes such as attributes or a Linnean hierarchy, and are available with datasets such as [42,43,51]. In WSABIE [53], the authors jointly learn the input embeddings and the output embeddings to maximize classification accuracy in a structural SVM setting. Akata et al. [54] use the WSABIE framework to learn fine grained classification models by mapping the output embeddings to attributes, taxonomies and their combination. The binary hash-codes that SHOE learns on the CUB and SUN datasets use attributes as output embeddings, just like [54]. These attributes are provided either by expert oracles or Amazon Mechanical Turk - hence, output embeddings can be considered as a separate modality of information. For the ILSVRC2010 experiments, we use a taxonomy derived embedding similar to [52]. In a taxonomy embedding, a binary output embedding vector is obtained, where each node in the class hierarchy and its ancestors are represented as 1 while non-ancestors are represented as 0. Deng et al. [55] show that classification that takes hierarchies into account can be informative. Mikolov et al. [1] use a skip-gram architecture trained on a large text corpus to learn output embeddings for words and short phrases. These Word2Vec embeddings are used by [56] for large scale image classification and zero-shot learning.

Figure 3.3: Retrieval on the CUB dataset comparing our method SHOE(E) with state-of-the-art hashing techniques. The plots report precision@30, sibling precision@30 and weighted sibling precision@30 for the top 5 sibling classes for bits c = {16, 32, 64, 128, 256}.
Finally, output embeddings can even be learned from the data. For example, [57] exploits co-occurrences of visual concepts to learn classifiers for unseen labels using known classifiers. All these methods use output embeddings for classification and zero-shot learning, but none have used them to learn binary codes for retrieval.

3.3 Method

3.3.1 Preliminaries

Given a training set M = {(x_1, y_1), ..., (x_N, y_N)} of N (image, label) pairs with x_i ∈ X and y_i ∈ Y, let φ : X → X̄ ∈ R^d be the input embedding function and ψ : Y → Ȳ ∈ R^e be the output embedding function. We wish to learn binary codes b_i, b_j of length c (i.e., b_i ∈ {−1, 1}^c) such that for pairs of training images, the Hamming distance between the codes preserves the distance between their class labels (given by their corresponding output embedding vectors). In other words, for a given query image, retrieved results of sibling (unrelated) classes ought to be ranked higher (lower). To this end, we obtain the following objective function:

min_b Σ_{i=1}^{N} Σ_{j=1}^{N} ( (1/c) d_H(b_i, b_j) − d_E(ψ̄(y_i), ψ̄(y_j)) )^2,    (3.1)

where d_H(b_i, b_j) is the Hamming distance between binary codes b_i and b_j and d_E is the Euclidean distance between the normalized output embedding vectors ψ̄(y_i) and ψ̄(y_j).

Method      mAP     Sib mAP   Sib^w mAP
SHOE(E)     0.111   0.250     0.174
KSH         0.113   0.133     0.108
FastHash    0.045   0.062     0.047
ITQ         0.060   0.119     0.084
LSH         0.013   0.044     0.028

Table 3.1: Retrieval on the CUB dataset comparing our method SHOE(E) with state-of-the-art hashing techniques. The table reports mAP, Sibling and Weighted Sibling mAP for c = 64 bits.

For an input image x with input embedding vector φ(x), we obtain a binary code b of length c bits. Each bit is computed using a hash function h_l(x) that takes the form:

h_l(x) = sgn(w_l φ(x)), w_l ∈ R^d.    (3.2)

To learn c such hash functions H = {h_l | l = 1, ..., c}, we learn c projection vectors W = [w_1, w_2, ..., w_c], which we compactly write as H(x) = sgn(W φ(x)), W ∈ R^{c×d}. Without loss of generality, we can assume that φ(x) is a mean-centered feature. This ensures we obtain compact codes by satisfying the balanced property of hashing (i.e., each bit fires approximately 50% of the time) [12]. φ(x) can be an input embedding that maps images to features in either kernelized or unkernelized form. Solving the optimization problem in Equation (3.1) is not straightforward, so we utilize the relation between the inner product of binary codes and Hamming distances [7,58], given as 2 d_H(b_i, b_j) = c − b_i^T b_j, where b_i^T b_j = Σ_{l=1}^{c} h_l(φ(x_i)) h_l(φ(x_j)) is the inner product of the binary codes b_i and b_j. Note that the inner product of binary codes lies between −c and +c, while the Hamming distance ranges from 0 to c, where the distance between the nearest neighbors is 0 and between the farthest neighbors is c. By unit normalizing the output embedding vectors, ‖ψ̄(y)‖ = 1, we exploit the relationship between Euclidean distances and the dot products of normalized vectors, given as d_E(ψ̄(y_i), ψ̄(y_j))^2 = 2 − 2 ψ̄(y_i)^T ψ̄(y_j), and obtain the following objective function:

min_H Σ_{i=1}^{N} Σ_{j=1}^{N} ( (1/c) Σ_{l=1}^{c} h_l(φ(x_i)) h_l(φ(x_j)) − ψ̄(y_i)^T ψ̄(y_j) )^2    (3.3)

Let o_ij = ψ̄(y_i)^T ψ̄(y_j); as a consequence of the unit normalization of ψ̄(y_i), −1 ≤ o_ij ≤ 1, which implies that the similarity between same classes is 1 and
The objective ensures that the learned binary codes preserve the similarity between output embeddings, which is required for supervised hashing and our goal of ranking related neighbors before farthest neighbors. This is similar to the KSH [7] objective function, except that KSH assumes that oij takes only values 1 (−1) for similar (dissimilar) pairs defined with semantic information. Their work also accomodates the definition of the oij = 0 for related pairs but only for metric neighbors. Our work is different from theirs, as we empha- size the learning of binary codes that preserve the similarity between the classes, given by oij. Regardless of the definition of oij, our optimization is similar, and so we employ a similar sequential greedy optimization for minimizing (3.3). We refer the reader to Section 3.3.5 and [7] for further details. 3.3.2 Evaluation Criteria Standard metrics like precision, recall and mAP defined for semantic neighbors are not sufficient to evaluate the retrieval of the sibling class images. To measure this, we define sibling precision, sibling recall and sibling average precision metrics. Let Ry : (yi, yj) → rank return the rank of class yj for a query class yi, 0 ≤ rank ≤ L, where L is the number of classes. Note that, Ry(y, y) = 0 for the same class. The ranking Ry is computed by sorting the distance between the output embedding vectors ψ(y) and ψ(y∗), y∗ ∈ Y \ y. We obtain the weight of the sibling class used for evaluation using the following functions for Sibling(Sibm) and Weighted 43 Sibling(Sibwm) metrics: Sibm : (yi, yj, Ry)→ I(Ry(yi, yj)) ≤ m) Sibwm : (yi, yj, Ry)→ m−Ry(yi, yj) m ∗ I(rank ≤ m) where, m is the number of related classes for each query class and I(.) is the Indi- cator function that returns 0 when rank > m. The Sibling and Weighted Sibling precision@k, recall@k and mAP is defined as: swprecisionq@k = ∑k l=1 Sib w m(yq, yql , Ry) k (3.4) swrecallq@k = ∑k l=1 Sib w m(yq, yql , Ry)∑L p=1Nl ∗ I(Sibwm(yq, yp, R)) (3.5) swAPq = N∑ k=1 sprecisionq@k ×4srecallq@k (3.6) swmAP = ∑Q q=1 sAPq Q (3.7) In the above equations, yql refers to the class of the l th retrieved image. 3.3.3 Preliminary Experiments We evaluate the proposed hashing scheme that takes into account structure of the related classes with the following performance metrics: Precision@k, mAP and their sibling versions previously defined. 44 • Datasets: To test our approach, we use datasets that contain information about class structure. The CUB-2011 dataset [42] contains 200 fine-grained bird categories with 312 attributes and is well suited for our purpose. In this dataset, although both binary and continuous real-valued attributes are available, we use only the mean-centered real-valued attributes as output em- beddings ψ(y). We obtain ranking Ry for each class y based on these attribute embeddings. There are 5994 train and 5774 test images in the dataset. We select a subset of the dataset of size 2000 for training, use the whole train set for retrieval and all test images as queries. Ground truth for a query is defined label-wise and each query class has approximately 30 same class neighbors in the retrieval set. For input embeddings, we extract state-of-the-art 4096 dime- sional Convolution Neural Network (CNN) features from the fc7 layer for each image using the Caffe Deep Learning library developed by [33]. 
We kernelize the CNN features, which take the form Σ_{i=1}^{p} κ(x, x_i), where κ is a radial basis kernel and p is the cardinality of a subset of training samples designated as "anchor points". We refer to these as CNN+K features; they are inherently mean-centered.

• Comparison methods: We compare our method using raw output embeddings, SHOE(E), with the following supervised and unsupervised hashing schemes: KSH [7], FastHash [8], ITQ [28] and LSH [11]. We use their publicly available implementations and set the parameters to obtain the best performance. It is important to note that none of these methods utilize the distribution of class labels in the output embedding space. The closest comparison would be to use KSH, setting the similarity of semantic class neighbors to the value 0. For this purpose, we obtain the top-m related pairs for each class using R_y and set the similarity to 0 for related pairs. To evaluate the unsupervised ITQ and LSH, we zero-center the data and apply PCA to learn the projections. We use CNN+K features with p = 300 for evaluating SHOE(E) and KSH since we learn linear projections in these methods, unlike FastHash, which learns non-linear decision trees on linear features.

• Results: Figure 3.3 shows the precision@30, sibling and weighted sibling precision@30 plots for 5 related classes, i.e. m = 6 (+1 for the same class), obtained by encoding the input embeddings to bits of length c = {16, 32, 64, 128, 256}. Table 3.1 shows the mAP and its sibling variants for 64 bits. We observe that SHOE(E) does better than the baselines on both the sibling and weighted sibling precision metrics for the top-30 retrieved neighbors at all bit lengths, but there is a loss in precision compared to the KSH method. In their work, FastHash [8] shows better performance than KSH. However, it does not perform well here because of the large number of classes and the few training samples available per class.

3.3.4 Analysis

The experiments in Section 3.3.3 show that using similarity directly from output embeddings actually reduces the performance for same classes, while improving it for sibling classes. To analyse this, we obtained the top-m related classes using the ranking R_y and assigned a constant value o_ij = θ (previously, in Equation (3.3), we had defined o_ij = ψ̄(y_i)^T ψ̄(y_j)). θ measures the similarity between a class and a related class. Figure 3.4 shows the performance of our method (SHOE) for varying θ values for 64 bits on the CUB-2011 dataset. The results reveal that, when a fixed similarity is used, we actually gain performance from the sibling class training examples, and this gain is maximized for negative values of θ, i.e. −1 < θ < 0. The intuition behind this is: when θ is close to 1, the learned hash-code would not discriminate well enough between identical classes and sibling classes. For instance, in our "database of animals" example, we would learn hash-codes that nearly equate leopards with jaguars, which is not what we desire. On the other hand, when θ is close to -1, a hash-code for a leopard image will be learned mostly from other training images of leopards, but with slight consideration towards training images of its sibling classes. When θ is assigned -1, the sibling classes aren't considered at all, so our method becomes identical to KSH [7]. We are now interested in learning θ simultaneously with the hash functions during the training phase.
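To make this analysis concrete, here is a small NumPy sketch (the variable and function names are ours, not from the thesis) that builds the pairwise similarity target o_ij used above - 1 for same-class pairs, a constant θ for the top-m sibling pairs and −1 otherwise - and evaluates the KSH-style inner-product objective of Equation (3.3) for a given set of binary codes:

```python
import numpy as np

def similarity_target(labels, class_rank, m=5, theta=-0.5):
    """O[i, j] = 1 (same class), theta (sibling class with rank <= m), -1 (unrelated).

    labels:     (N,) integer class label of each training image
    class_rank: (L, L) matrix with class_rank[a, b] = rank of class b for query class a
                (0 for the class itself), computed from output-embedding distances
    """
    rank = class_rank[labels[:, None], labels[None, :]]    # pairwise class ranks, (N, N)
    O = np.full((len(labels), len(labels)), -1.0)
    O[(rank >= 1) & (rank <= m)] = theta                    # sibling pairs
    O[rank == 0] = 1.0                                      # same-class pairs
    return O

def ksh_style_loss(B, O):
    """||(1/c) B B^T - O||_F^2 for codes B in {-1, +1}^(N x c)."""
    c = B.shape[1]
    return float(np.sum((B @ B.T / c - O) ** 2))
```

This only mirrors the similarity-assignment step; the sequential per-bit optimization described in Section 3.3.5 is what actually learns the projections and, jointly, θ.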
3.3.5 SHOE Revisited

We observe that the objective function that we want to minimize in Equation (3.3) can be split into three parts - for identical classes, sibling classes and unrelated classes, respectively. We also observe from the preceding analysis that the precision and recall metrics improve for negative values of θ. Therefore, we add a regularizer term λ‖θ + 1‖^2 to the objective function, which becomes small when θ lies close to -1. For easier notation, we denote h_{li} = h_l(φ(x_i)). Our modified objective function now becomes:

min_{W,θ} Σ_{i,j ∈ same class} ( (1/c) Σ_{l=1}^{c} h_{li} h_{lj} − 1 )^2 + Σ_{i,j ∈ sibling class} ( (1/c) Σ_{l=1}^{c} h_{li} h_{lj} − θ )^2 + Σ_{i,j ∈ unrelated class} ( (1/c) Σ_{l=1}^{c} h_{li} h_{lj} + 1 )^2 + λ‖θ + 1‖^2    (3.8)

Let H_{ij}^c = (1/c) Σ_{l=1}^{c} h_{li} h_{lj} denote the normalized inner product of the binary codes b_i and b_j of length c. We now compute the derivative of Equation (3.8) w.r.t. θ, set it to 0 and solve for θ. We obtain:

θ = θ_c = ( Σ_{i,j ∈ sibling class} (H_{ij}^c − λ) ) / ( c (n_sib + λ) )    (3.9)

where n_sib is the number of sibling pairs in the training data. We have thus obtained a closed form solution for the optimal θ. However, we cannot calculate θ directly as we do not learn all the bits at once. Therefore, we employ a two-step alternating optimization procedure that first learns the bits and then an approximate θ_l value calculated from the previously learned bits. For the first iteration, we use an initial θ_0 value, computed from the similarity of the output embeddings. The two-step optimization procedure for learning the l-th hash function is:

1. Step 1: We optimize Equation (3.3), keeping θ_{l−1} constant and updating the projection vector W, thus learning the hash-code bits h_l(φ(x_i)).

2. Step 2: We keep the hash-code bits h_l(φ(x_i)) constant and learn θ_l using Equation (3.9).

                 CUB-2011 (2000 train): (pre, re)@100             SUN Attribute (3585 train): (pre, re)@25
Method           pre | re      Sib | re       Sib^w | re          pre | re      Sib | re       Sib^w | re
SHOE(L)          0.108  0.180  0.325  0.091   0.229  0.109        0.042  0.105  0.140  0.058   0.094  0.067
SHOE(E)          0.077  0.128  0.285  0.079   0.190  0.090        0.023  0.056  0.095  0.040   0.060  0.043
KSH              0.089  0.149  0.225  0.062   0.165  0.079        0.035  0.087  0.091  0.038   0.066  0.047
ITQ              0.069  0.115  0.214  0.060   0.147  0.070        0.031  0.078  0.097  0.040   0.066  0.047
FastHash         0.035  0.117  0.097  0.054   0.069  0.065        0.006  0.015  0.019  0.008   0.013  0.009
LSH              0.013  0.042  0.053  0.030   0.034  0.032        0.005  0.012  0.021  0.009   0.013  0.009
with CCA features
SHOE(L)+CCA      0.142  0.474  0.437  0.243   0.308  0.293        0.060  0.150  0.197  0.082   0.135  0.096
SHOE(E)+CCA      0.121  0.402  0.416  0.232   0.284  0.270        0.040  0.099  0.150  0.062   0.098  0.070
KSH+CCA          0.136  0.454  0.322  0.179   0.244  0.232        0.057  0.143  0.138  0.058   0.102  0.073
ITQ+CCA          0.095  0.318  0.203  0.113   0.158  0.150        0.029  0.071  0.067  0.028   0.050  0.035

Table 3.2: Comparing precision, recall and their sibling variants for our methods (SHOE(L) and SHOE(E)) and several baselines for 64 bits. We show results with and without CCA projections for all methods except FastHash and LSH, whose performance decreases with CCA. We see that SHOE performs significantly better than the baselines on the sibling metrics and as well as the baselines on the standard metrics.

Let O denote the inner product of output embedding vectors such that o_ij is assigned 1 for same class pairs, θ for sibling class pairs and -1 for unrelated class pairs, and let H_l denote the l-th bit for all training samples.
Now, we can write equation (3.8) as:

min_{W,θ} ‖ (1/c) Σ_{l=1}^{c} H_l H_l^T − O ‖_F^2 + λ‖θ + 1‖^2    (3.10)

Utilizing the independence constraints [12,59] for learning hashing functions, we rewrite the objective function to greedily solve for one bit at a time, as in [7]:

min_{W_l,θ} ‖ (1/c) H_l H_l^T − (O − Σ_{m=1}^{l−1} H_m H_m^T) ‖_F^2 + λ‖θ + 1‖^2    (3.11)

Let M_{l−1} = O − Σ_{m=1}^{l−1} H_m H_m^T. Replacing H_l with sgn(W_l φ(X)) in equation (3.11) and expanding, we get:

min_{W_l,θ} −sgn(W_l φ(X))^T M_{l−1} sgn(W_l φ(X)) + λ‖θ + 1‖^2    (3.12)

As discussed in the above two-step procedure, we first solve for W_l and then for θ. In the first step, we relax the sgn function and scale the projected vectors to obtain an initial solution W_l, which is the solution of the generalized eigenvalue problem φ(X)^T M_{l−1} φ(X) W_l = σ φ(X)^T φ(X) W_l. Then, we approximate the sgn function with tanh. The gradient of the above equation w.r.t. W_l is now given as:

δg = −φ(X)^T ((M_{l−1} b) ◦ (1 − b ◦ b))    (3.13)

Given the gradient, we solve the optimization using a fast iterative procedure like Nesterov's gradient descent [60] or FASTA [61]. In the second step, we learn θ as given in Equation (3.9). We then update M_{l−1} with the new value of θ and repeat the optimization procedure for the next bit until we have learned the required number of bits.

3.3.6 Supervised Dimensionality Reduction

As our datasets contain class label information and corresponding output embeddings, we have explored the idea of supervised dimensionality reduction of the input embeddings φ(x) ∈ R^d to ω(x) ∈ R^c (c ≪ d), given the output embeddings ψ(y) ∈ R^E. There are many supervised dimensionality reduction techniques available in the literature, for example Canonical Correlation Analysis (CCA) [47] and Partial Least Squares Regression [62]. In particular, we have used CCA [47] (φ → ω) to extract a common latent space from two views that maximizes the correlation between them. [28] also leveraged the label information using CCA to obtain supervised features prior to binary encoding. However, they limit their output embeddings to take the form of one-vs-remainder embeddings: ψ(y) ∈ {0, 1}^L is an L-dimensional binary vector with exactly one bit set to 1, i.e. ψ(y)_y = 1, where L is the number of class labels. On the other hand, we apply CCA to the general form of output embeddings that are real valued continuous attributes capturing structure between the classes. We observe that when supervised features with CCA projections are used, we obtain a significant boost in performance (≈ 100% improvement) for all of our evaluation metrics.

                 CUB-2011 (5000 train)                            SUN Attribute (7000 train)
Method           pre@100 | mAP  Sib | mAP      Sib^w | mAP        pre@25 | mAP   Sib | mAP      Sib^w | mAP
SHOE(L)+CCA      0.211  0.527   0.561  0.467   0.417  0.416       0.114  0.201   0.311  0.239   0.225  0.193
SHOE(E)+CCA      0.183  0.429   0.575  0.533   0.411  0.429       0.078  0.134   0.240  0.205   0.167  0.152
KSH+CCA          0.204  0.526   0.421  0.290   0.335  0.303       0.115  0.220   0.227  0.130   0.178  0.126
ITQ+CCA          0.109  0.256   0.193  0.125   0.157  0.129       0.039  0.070   0.081  0.044   0.062  0.041
FastHash         0.109  0.246   0.192  0.120   0.156  0.124       0.014  0.021   0.029  0.017   0.022  0.014
LSH              0.018  0.017   0.069  0.049   0.045  0.032       0.008  0.009   0.030  0.018   0.019  0.012

Table 3.3: Comparing precision, mAP and their sibling variants for our methods (SHOE(L) and SHOE(E)) and several baselines for 128 bits. We apply CCA projections for all methods except FastHash and LSH, as their performance decreases.
For sibling metrics, SHOE performs significantly better than the baselines and performs as well as the baselines for standard metrics. We use 1000 anchor points for the CUB dataset and 1434 anchor points for the SUN Attribute dataset.

3.4 Experiments

We evaluate our method on the following datasets: the Caltech-UCSD Birds (CUB) dataset [42], the SUN Attribute dataset [43] and the ImageNet ILSVRC2010 dataset [44]. We extract CNN features from [33], as mentioned in Section 3.3.3. In the case of the CUB dataset, we extract CNN features for the bounding boxes that accompany the images. For the CUB and SUN datasets, we create two training sets of different sizes to examine the variation in performance with the number of training examples. For all the datasets, we define the ground truth using class labels.

Datasets: We have described the CUB dataset in Section 3.3.3. The ImageNet ILSVRC 2010 dataset is a subset of ImageNet and contains about 1.2 million images distributed amongst 1000 classes. We uniformly select 2 images per class as a test set and use the rest as the retrieval set. We select 5000 training and p = 3000 anchor point images by uniformly sampling across all classes. We obtain the output embeddings for the ImageNet classes using the method of Tsochantaridis [52]. Each of the 74401 synsets in ImageNet is a node in a hierarchy graph and, using this graph, we obtain the ancestors of each class in the ILSVRC 2010 dataset. We then construct a matrix O_Imagenet ∈ {0, 1}^{1000×74401}, where the j-th column of the i-th row is set to 1 if the j-th class is an ancestor of the i-th class, and 0 otherwise. Thus, the output embedding of each class is represented by a row of O_Imagenet. The SUN Attribute dataset [43] contains 14340 images equally distributed amongst 717 classes, accompanied by annotations of 102 real valued attributes. We partition the dataset into equal retrieval and test sets, each containing 7170 images. We derive two variants from the retrieval set - the first has 3585 (5 per class) training and 1434 (2 per class) anchor point images, while the second has 7000 (10 per class) training and 3000 (4 per class) anchor point images. In this dataset, each test query has 10 same class neighbors and 50 sibling class neighbors in the retrieval set. We compute a per-class embedding by averaging the embedding vectors of all images in the class.

Evaluation Protocol: For binary hash-codes of length c = {16, 32, 64, 128, 256}, we evaluate SHOE using the standard, sibling and weighted sibling flavors of precision, recall and mAP. Two variants of SHOE are used - SHOE(E), which uses raw output embeddings, and SHOE(L), which is learned using the method in Section 3.3.5. For the smaller CUB and SUN subsets, we compute mAP and its sibling versions. For the big subsets, in addition to mAP, we also compute precision. In the case of ImageNet, we compute precision@50, recall@10K, mAP and their sibling variants.

Results: We compare our work to the state-of-the-art methods in the image-hashing literature mentioned in Section 3.3.3 and present the results in Figure 3.5 and Tables 3.2 and 3.3. Figure 3.5 and Table 3.2 represent experiments performed on the smaller subsets, with 2000 training and 1000 anchor point images for CUB and 3585 training and 1434 anchor point images for the SUN Attribute dataset. Table 3.3 shows experiments that use 5000 training and 3000 anchor point images for CUB and 7000 training and 3000 anchor point images for SUN Attribute. For the baselines, we use publicly available implementations and set the parameters to obtain the best performance.
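Before turning to the results, the sibling flavors of precision used throughout this evaluation protocol can be sketched as follows, directly following the definitions in Section 3.3.2 (the function and variable names are ours, not from the thesis):

```python
import numpy as np

def sibling_precision_at_k(retrieved_labels, query_label, class_rank, m=5, weighted=False):
    """Sibling (Sib_m) or weighted sibling (Sib^w_m) precision@k from Section 3.3.2.

    retrieved_labels: (k,) class labels of the top-k retrieved images
    query_label:      class label of the query image
    class_rank:       (L, L) matrix of class ranks derived from output-embedding
                      distances, with class_rank[y, y] = 0 for the class itself
    m:                number of related (sibling) classes per query class
    """
    ranks = class_rank[query_label, retrieved_labels]        # rank of each retrieved class
    if weighted:
        scores = np.where(ranks <= m, (m - ranks) / m, 0.0)  # (m - R)/m weighting, 0 beyond m
    else:
        scores = (ranks <= m).astype(float)                  # indicator I(R <= m)
    return float(scores.mean())
```

Averaging these per-query scores over all test queries gives the sibling and weighted sibling precision@k reported in the tables and figures.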
We use CNN+K features as input embeddings for SHOE and KSH. In Figure 3.5, the rows represent the weighted sibling, sibling and mAP metrics. The first two columns represent experiments without and with CCA projections on the CUB dataset, while the latter two columns represent the same on the SUN Attribute dataset. We notice that without CCA projections, we outperform the baselines on the standard metrics, and comfortably outperform them on the sibling metrics. These results confirm our expectation that with fewer training samples, learning hash codes with SHOE benefits from training samples of related classes. With output-embedded CCA projections, our method performs as well as the best baseline on the standard metrics, and comfortably outperforms it on the sibling metrics. CCA projections significantly improve the performance of SHOE, KSH and ITQ, while they lower the performance of FastHash and LSH. Table 3.2 makes the same comparison with 64 bits and fewer training samples. Once again, we notice that our method with and without CCA, SHOE(L), beats all baselines on both standard and sibling metrics, while SHOE(E), which uses raw output embeddings without the extra optimization step in Section 3.3.5, outperforms comfortably on the sibling metrics but does worse on the standard metrics. Table 3.3 shows similar trends for 128 bits and larger training sets, but we notice that the gap in performance on the standard metrics between SHOE and KSH reduces when CCA projections are applied, though not on the sibling metrics. We think this is because the CCA step utilizes output embeddings and hence improves the performance of both KSH and SHOE. Finally, we conduct experiments on the ImageNet ILSVRC 2010 dataset, which contains 1000 classes. The results in Figure 3.6 show that with CCA features, SHOE surpasses KSH on recall@10K and the sibling precision metrics, while performing as well as KSH on precision@50 and mAP. Qualitative results comparing SHOE and KSH can be found in Figure 3.7.

Method           pre     Sib pre   Sib^w pre
SHOE(L)+CCA      24.5    32.6      29.1
KSH+CCA          24.8    30.4      28.0
ITQ+CCA          6.5     10.8      9.9

Method           mAP     Sib mAP   Sib^w mAP
SHOE(L)+CCA      0.039   0.021     0.022
KSH+CCA          0.036   0.010     0.014
ITQ+CCA          0.005   0.001     0.002

Table 3.4: Retrieval on the ILSVRC2010 dataset comparing SHOE with state-of-the-art hashing techniques. We use 5K training samples and CNN+K+CCA features for all the binary encoding schemes. The table reports precision@50, mAP and their sibling versions for 256 bits.

3.5 Fine-grained Category Classification

In this section, we demonstrate the effectiveness of our proposed codes for fine-grained classification of the bird categories in the CUB-2011 dataset. We propose a simple nearest neighbor pooling classifier that classifies a given test image by assigning it to the most common label among the top-k retrieved images. Let R_x(q, x) → rank give the ranking of the images retrieved based on our binary codes. Thus, rank is 1 for the nearest neighbor and rank is N for the farthest neighbor, where N is the size of the database. Given such a ranking, with M_q denoting the top-k ranked neighbors of a new query q, we define the k-nn pooling classifier as:

class_predict(q) = arg max_y Σ_{x ∈ M_q} I(class(x) == y)    (3.14)

We use the above model to obtain the classification accuracy on the CUB dataset with 200 categories from the top-10 neighbors.
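A minimal NumPy sketch of this k-nn pooling classifier (Equation (3.14)) is given below; the names are ours, and the ranking is assumed to come from Hamming distances on the binary codes:

```python
import numpy as np

def knn_pooling_classify(query_code, db_codes, db_labels, k=10):
    """k-nn pooling classifier of Eq. (3.14): the most common label among the
    top-k neighbors ranked by Hamming distance on the binary codes."""
    dist = np.count_nonzero(db_codes != query_code, axis=1)   # Hamming distance to every database code
    top_k = db_labels[np.argsort(dist, kind="stable")[:k]]    # labels of the k nearest neighbors
    values, counts = np.unique(top_k, return_counts=True)
    return values[np.argmax(counts)]                          # majority label among the top-k
```

For the top-5 accuracy reported below, the text uses k = 50 neighbors; the same routine can then return the labels sorted by their counts rather than only the majority label.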
In particular, we report the following accuracies: top-1 accuracy (top-1) measures whether the predicted class matches the ground truth class, top-5 accuracy (top-5) measures whether one of the top-5 predicted classes matches the ground truth class, and sibling accuracy (sib) measures whether the predicted class is one of the sibling classes of the ground truth class. As a baseline, we train a linear SVM model on the CNN features. We compare our proposed binary codes SHOE(L), KSH and state-of-the-art fine grained classification models that use CNN features. For this experiment, we use bounding box information, but do not use any part-based information available with the datasets. Hence, we make a fair comparison between methods with no part-based information. Table 3.5 shows the classification performance over the 5794 test images, with approximately 30 images for each of the 200 categories.

Features: For each of the binary coding schemes (SHOE, KSH, ITQ), we use the CNN+K+CCA (kernelized CNN with CCA projections) features as input embeddings and the mean-centered attributes as output embeddings. For these experiments, we used only 128-bit codes, while the CNN features are 4096 dimensional vectors.

Method           top-1   top-5   sib     Compression
Baseline (SVM)   50.6    75.6    70.19   1
SHOE(L)+CCA      52.51   77.8    72.4    1024
KSH+CCA          52.48   75.1    69.06   1024
ITQ+CCA          27.5    43.4    37.6    1024
R-CNN [45]       51.5    -       -       1
Part-RCNN [46]   52.38   -       -       1

Table 3.5: Comparing classification accuracies on the CUB dataset. For top-1 and sibling accuracy, we used k = 10 neighbors. To obtain top-5 accuracy, we used k = 50 neighbors. For the binary coding schemes, we used only 5000 of the 5994 train images to obtain 128 bits, while the classification models are trained on the full set. '-' indicates that the information is not available in their work.

Results: We observe that not only do the proposed binary codes obtain a marginal improvement in performance over the complex classification models of [45] and [46], but they also offer an astounding compression ratio of 1024. Also, the training and testing times of the binary coding schemes are significantly smaller than those of the SVM classification models.

3.6 Conclusion

The key idea of our work is to learn binary codes that respect the relationship between classes. We utilize output embeddings to define the similarity between classes and obtain binary codes that preserve class order. To the best of our knowledge, ours is the first work to do so. We devised a method to learn the class similarity jointly with the hash functions, along with new metrics for their evaluation. It is our belief that this scheme improves the overall visual quality of the retrieval system, and we have validated this experimentally with the sibling metrics. Our method, called SHOE, achieves state-of-the-art image retrieval results over multiple datasets for hash codes of varying lengths. Our other innovation was to utilize CCA to learn a projection of features with output embeddings, which resulted in significant gains in both the retrieval and classification experiments. Upon applying this approach to all methods, we perform comparably to or better than all baselines over all datasets.
Figure 3.4: Retrieval on the CUB dataset evaluating the performance of our method (SHOE) for varying θ values and p = 1000. The left and right y-axes show the standard metrics and the sibling metrics respectively.

Figure 3.5: Retrieval on the CUB-2011 (first and second rows) and SUN (third and fourth rows) datasets comparing our methods SHOE(E) and SHOE(L) with state-of-the-art hashing techniques. The plots report mAP, Sibling and Weighted Sibling mAP for the top 5 sibling classes. For the CUB dataset, we used 2000 training samples and 1000 anchor points, while for the SUN Attribute dataset, we used 3585 training samples and 1434 anchor points.

Figure 3.6: Retrieval on the ILSVRC2010 dataset comparing SHOE with state-of-the-art hashing techniques. We use 5K training samples and CNN+K+CCA features for all the binary encoding schemes. The plots report recall@10K and its sibling and weighted sibling versions for the top 5 sibling classes for bits c = {32, 64, 128, 256}.

Figure 3.7: The first query is of an ovenbird. SHOE retrieves more ovenbirds than KSH. The second query is of a Brewer black-bird. Neither SHOE nor KSH retrieve Brewer black-birds. However, SHOE returns ravens, which are sibling classes of Brewer black-birds, whereas KSH retrieves pileated woodpeckers, which are unrelated to black-birds. Here, blue borders represent sibling classes.
Chapter 4: Hierarchical Spherical Hashing for Compressing High Dimensional Vectors

4.1 Overview

Large dimensional descriptors like Fisher Vectors (FV) [63,64], Vector of Locally Aggregated Descriptors (VLAD) [65] and Locally Constrained Linear Codes (LLC) [66] have performed very well in problems like retrieval, classification and object detection [67–69]. A major bottleneck in using such vectors is their high storage requirement. For example, a typical Fisher Vector dimension ranges from 1000 to 0.5 million with 16 to 4096 GMMs, which translates to 8 KB to 2 MB per image. The full ILSVRC2010 dataset of 1.4 million images [44] needs 3 TB of storage when storing FVs over 4096 GMMs. One can adapt dimensionality reduction techniques like PCA to reduce the storage requirements, at the cost of losing structural properties of the original vectors. To preserve the discriminative power of such high-dimensional descriptors without losing structural properties, Perronnin et al [67,68] found that they must be encoded using a large number of bits.

There have been many proposals for compressing high-dimensional descriptors to a small number of bits [11,12,28,65,70–72] by initially applying PCA, but less has been done on compressing to a large number of bits. Perronnin et al [67,68] used sparsity constraints that employ thresholding based methods to encode to a large number of bits. Locality Sensitive Hashing [11] and Product Quantization [70] based methods can be applied to compress large descriptors to large numbers of bits, but at a storage cost. Specifically, to compress a 128K dimensional vector to 64K bits, LSH requires 32 GB for storing the projection matrix, which is a huge storage requirement. PQ encoding methods obtain excellent results for retrieval by partitioning the vectors and learning a codebook for each partition. Binary codes are obtained as a concatenation of the codebook indices to which each sub-vector belongs. Approximate nearest neighbors are then computed efficiently using a symmetric/asymmetric distance between the query and the codebook learned for each partition, obtained by looking up the codebook indices. This method works best when an orthogonal rotation is applied to the data to balance variance, but incurs a huge storage cost for the rotation matrices. Gong et al [73] also proposed a fast bilinear projection method to compress large dimensional vectors without the storage overhead of projection matrices. They accomplish this by formulating the projection operation as the Kronecker product of two small orthogonal matrices. Recently, circulant matrix properties have been used in [74] to obtain a low-storage projection matrix that requires storing only d values for compressing a d dimensional vector to d bits.

We present an alternative method for compressing high dimensional descriptors based on hypersphere hashing functions, inspired by the Spherical Hashing method (SpH) of Heo et al [72]. Hypersphere hashing functions have been shown to outperform hyperplane methods [11,15,28] for low numbers of bits by capturing the non-linearity of the data. We explore the potential of hyperspherical hashing functions for compressing large dimensional descriptors by proposing a hierarchical approach that overcomes the computational challenges of learning spherical hashing functions for high dimensional data. We take advantage of the structural properties of VLAD and FV to split the vectors and learn spherical hashing functions for partitions.
Full spherical hashing functions are constructed from the sub-spherical hashing functions in a hierarchical way using a Random-Select and Adjust (RSA) method, which we also introduce in this work. This work is organized as follows. In Section 4.2, we discuss the Spherical Hashing method, which defines the hypersphere hashing functions and an efficient iterative optimization scheme that achieves the hashing properties for the hashing functions. We also highlight the computational challenges of applying this technique to high dimensional vectors. In Section 4.3 we introduce our Hierarchical Spherical Hashing method for compressing high dimensional descriptors. We further discuss the details of partitioning the vectors, learning sub-spherical hashing functions and constructing full hashing functions using RSA in a divide and conquer fashion, while highlighting the computational advantages of our solution. Our experiments in Section 4.4 demonstrate the performance of our approach (SpH-RSA) in comparison to existing state-of-the-art methods for compressing large dimensional descriptors.

4.2 Spherical Hashing

Heo et al [72] introduce a novel hypersphere-based hashing function to map spatially coherent data points into a binary code. Their approach is motivated by the fact that a single hypersphere can enclose a closed region in d-dimensional space while it requires d + 1 hyperplanes to define the same region, thus requiring fewer bits to encode data points. A spherical hashing function is defined by a center (c) and a threshold (t), and its value is +1/−1 if a point is inside/outside the hypersphere. If c_p, t_p define a hypersphere p, the corresponding hashing function h_p for a feature point x ∈ R^d that generates the p-th binary code is given as sgn(t_p − dist(c_p, x)), where dist(·,·) denotes the Euclidean distance from the center to the point. To obtain k bits, SpH learns k spherical hashing functions with the objective of satisfying the balance and independence properties of hashing. These properties play an important role in producing compact binary codes, as independent hashing functions distribute points in a balanced manner to different binary codes [75,76]. An efficient iterative scheme was proposed to meet these hashing constraints. The algorithm starts with a selection of k centers and chooses k thresholds that satisfy the balance property. With the selected thresholds and centers, the method computes the number of training points inside each hypersphere (o_i) and in each pair of hyperspheres (o_ij). These statistics are used to adjust the centers by applying a repulsive or attractive force based on the deviation from the ideal independence criterion n/4 between every pair of hyperspheres, where n is the number of training samples.

Time (hours)   d = 25600   d = 12800   d = 6400   d = 3200
k = 12800      -           -           -          > 8
k = 6400       > 8         > 8         > 8        3.5182
k = 3200       6.2317      4.1013      3.6981     1.1891
k = 400        0.3718      0.1034      0.0599     0.0318

Table 4.1: Training times (in hours) for learning k = {12800, 6400, 3200, 400} spherical hashing functions for d = {25600, 12800, 6400, 3200} dimensional vectors using [72]. We used 20000 training samples from the ILSVRC2010 (Train) dataset. > 8 means that the learning did not finish within 8 hours. '-' indicates that we could not learn the hashing functions on a single machine with 16 GB RAM, as they did not fit in memory. We see that as the number of hyperspheres to be learned and the dimension of the data increase, the training times increase exponentially.
This process is repeated until the centers converge, but typically some error is allowed to avoid over-fitting. For the detailed algorithm, refer to [72].

4.2.1 Computational Challenges (d is large and k ∼ d)

The time complexity of a single iteration of the above iterative process is O((k^2 + kd)n), where k is the number of bits to be learned, d is the size of the vector and n is the number of training samples. For small numbers of bits (k ≪ d), the learning algorithm is practical and very fast. To learn a large number of bits (k ∼ d with d large), it becomes computationally intractable. There are two computationally intensive steps in the algorithm that inhibit its extension to higher dimensional vectors. The first is computing the distances of all feature points to the k hyperspheres. The second is computing the forces for adjusting the centers to satisfy the independence property. It is also important to keep in mind that the amount of training data required to learn a large number of hash functions is very high. Such large amounts of training data with high dimensional descriptors cannot fit in memory, and the algorithm to adjust the hash functions becomes intractable even for a single iteration, as it suffers from the latency of reading from disk. Table 4.1 gives the training time of spherical hashing for learning varying numbers of hashing functions. Section 4.3.5 will discuss how we overcome these computational challenges for learning a large number of hashing functions.

Figure 4.1: Performance of our method (SpH-RSA) and SpH-Concat on a subset of the ILSVRC2010 Train dataset with 25600-dimensional VLAD vectors for varying partition size m = 3200 to 10. The left and middle plots show recall for 10 and 50 ground truth neighbors while comparing our method with the concatenated bit vectors from SpH-Concat (indicated by C); notice the poor performance of the concatenated bit vectors. The right plot evaluates the results of our method for varying sub-vector sizes.

4.3 Hierarchical Spherical Hashing

We alleviate the problem of learning hyperspheres for large dimensional vectors by partitioning the vectors and learning sub-hyperspheres for each partition. This results in a feasible procedure for learning the hyperspheres in the full space. We refer to this method as Hierarchical Spherical Hashing, as we start by learning sub-hypersphere hashing functions at the lower levels and extend them to higher levels using the Random-Select and Adjust (RSA) algorithm; we satisfy the hashing properties at each level as it is constructed in a divide and conquer style. The subsequent sections discuss the details of the algorithm.

Figure 4.2: Left: the RSA technique randomly selects hyperspheres from the Cartesian product of the subsets and adjusts the hyperspheres to satisfy the hashing properties. Right: the RSA method applied in a divide and conquer fashion to obtain the k full hyperspheres.
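For reference, the basic hypersphere hash of Section 4.2, which is reused for every partition in the hierarchy described below, can be sketched in a few lines of NumPy (names are ours, not from the thesis):

```python
import numpy as np

def spherical_encode(X, centers, thresholds):
    """Encode points with hypersphere hash functions h_p(x) = sgn(t_p - dist(c_p, x)).

    X:          (n, d) data points
    centers:    (k, d) hypersphere centers c_p
    thresholds: (k,)   radii/thresholds t_p
    Returns an (n, k) array of bits in {-1, +1}: +1 inside the hypersphere, -1 outside.
    """
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k) Euclidean distances
    return np.where(thresholds[None, :] - dist >= 0, 1, -1)
```

Choosing each threshold as the median distance of the training points to its center enforces the balance property (each bit fires for roughly half the data); the independence property is what the iterative force-based adjustment of the centers targets.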
4.3.1 Learning Sub-Hypersphere Functions

Data: Let X = {x_1, x_2, ..., x_N}, x_i ∈ R^d, denote the set of N data points of dimension d, where d is large in our case. We want to compress each data point x_i to a binary vector b_i ∈ {0, 1}^k with k ∼ d. Let X_n be the subset of n data points from X that we use for training.

Partitioning the data: To learn k hyperspheres for X_n, we partition each data point x_i into m sub-feature vectors {x_i1, x_i2, ..., x_im}, x_ip ∈ R^s, p = 1..m, where s = d/m denotes the size of the sub-feature vector. We obtain subsets (or components) {X_1, X_2, ..., X_m}, where each subset X_p = {x_1p, x_2p, ..., x_np}, p = 1..m, is the set of all p-th sub-feature vectors of X_n. Note that we use subset or component to refer to X_p.

Learning sub-hypersphere hashing functions: For each subset X_i, we learn k_i hypersphere hash functions as in Section 4.2. We refer to the hypersphere hashing functions learned for each subset as sub-hypersphere hashing functions. To avoid the bias of learning a varying number of hash functions for each component, we learn an equal number of hashing functions for each subset, k_i = κ, i = 1..m. Therefore, training on m components produces mκ sub-hypersphere hash functions. One could concatenate the bits obtained from the sub-hyperspheres of the m subsets to obtain a full bit-vector; we refer to this method as SpH-Concat. However, this is not a good choice: the concatenated bit vector has no global relation to the full feature vector due to the de-correlation of bits between the sub-bit vectors, and the independence and balance properties are only satisfied for the sub-bit vectors, not for the concatenated bit vector. Figure 4.1 compares the performance of our method with SpH-Concat for different sizes of m, indicating poor performance for the latter. We therefore employ a hierarchical approach that uses these learned sub-hyperspheres to obtain k full hyperspheres.

Size of (s, m): The parameters that influence the selection of the partition size (m) are the training time, the memory required and the performance on each subset. We choose s so as to fit the training data of a subset in memory and achieve a reasonable training time and accuracy for learning the sub-hypersphere hash functions using the SpH algorithm. We can learn at most about 4000 sub-hyperspheres for a subset of dimension 4000, typically in under 8 hours on a machine with 16 GB RAM. Figure 4.1 shows the performance of our method on the ILSVRC2010 Train dataset for various m.

4.3.2 Cartesian Product of Pivots/Centers

To this point we have constructed κ sub-hyperspherical hashing functions for each component. Let {C_1, C_2, ..., C_m} be the pivots (centers) learned for the m subsets, where each C_i = {c_i1, c_i2, ..., c_iκ}, ∀i = 1..m, defines the κ sub-hypersphere centers learned for X_i. The idea is to select k pivots for the full space from the Cartesian product of the learned subspace pivots, given as C_1 × C_2 × ... × C_m. From the κ^m combinations of sub-space pivots, we randomly select k pivots in a divide and conquer fashion using the RSA method. It makes sense to select k random pivots from the Cartesian product based on the downward closure property, which is commonly used in subspace clustering algorithms [77]. This property states that "a cluster in the subspace is a candidate for a cluster in the full feature space". In our case, this property suggests that any pivot chosen from the Cartesian product is a candidate pivot in the full space.
We will see later that learning the full spherical hashing functions from the sub-spherical hashing functions gives a large computational advantage, as the hashing properties of the sub-spherical hashing functions are already met.

Similarity to other methods: The idea of using the Cartesian product of sub-pivots for selecting the pivots in the full space is similar to the Product Quantization (PQ) method of Jegou et al. [65] and Cartesian k-means [71]. While both methods use the codebook for compression, there are several differences. First, PQ obtains a full bit vector by concatenating the sub-bit vectors, whereas our full bit vector is encoded using the full spherical hashing functions learned from the sub-hyperspheres. Second, they do not use the bit vectors directly for computing distances, while ours directly computes the Hamming distance between the bit vectors, providing a computational advantage. To facilitate this, they do a lookup from pre-computed tables that contain distances between sub-codebooks, adding an additional storage cost to store these tables. Third, they select exactly one pivot from each subspace to define the full pivot, while we select more than one pivot from each subspace to obtain the final k pivots, increasing the representability. Finally, our pivot is always associated with a threshold, which ensures that the hypersphere hashing functions satisfy the balance and independence properties, a requirement for obtaining compact bit representations. There are no such guarantees for these methods [65, 71].

4.3.3 Random-Select and Adjust (RSA)

The RSA method is a two-step process that learns hashing functions of the combined subsets from the sub-hyperspheres of two components. The first is the Random-Select step, which randomly selects the pivots from the Cartesian product of sub-hypersphere centers, and the second is the Adjust step, which adjusts the pivots to define the hashing functions of the combined components. Figure 4.2 illustrates the RSA algorithm. Let X1 ∈ R^s and X2 ∈ R^s define two subsets of the data, from which we learned C1 and C2, each containing κ pivots. We learn 2κ hyperspheres for the combined set X12 = {(x11, x21), ..., (x1n, x2n)} ∈ R^2s.

1. Random-Select step: This step selects 2κ random pivots from the Cartesian product of C1 and C2. These form the potential candidates for the pivots of the final hashing functions, with the necessity of adjusting the pivots and thresholds.

2. Adjust step: We use a method similar to the one used for learning spherical hashing functions for the selection of thresholds. The balance property is satisfied by selecting the radius that partitions the training data into two equal parts, and the independence property is satisfied by applying an attractive or repulsive force in proportion to the deviation from the ideal independence criterion n/4. We do this iteratively until the hashing functions converge, thus producing 2κ hashing functions that satisfy the hashing properties of the combined set X12.

Convergence: Our goal in learning full hyperspheres from sub-hyperspheres is to obtain a computational advantage, as the hashing properties are already met for the subsequent iterations.
We observe that it typically takes fewer than 2 iterations to perform RSA on the combined subsets, compared to the original SpH algorithm, which takes at least 30 iterations (and only when the training data fits in memory), thus reducing the computational time and providing a tractable solution for learning the centers of the combined set X12.

Figure 4.3: Recall plots showing the performance of Ours (red) and the BPBC (Learned and Random) methods on the ILSVRC2010 Validation dataset for ground truth defined at 10 nearest neighbors, for code sizes of 8000, 16000, 32000, 64000 and 128000 bits. Our method performs better for 8000, 16000 and 32000 bits at all numbers of retrieved images in both distance settings.

4.3.4 Divide and Conquer

We apply the RSA algorithm in a divide-and-conquer style, with a divide step that partitions the data and a conquer step that applies the RSA method to learn the hyperspheres for the combined partition. Figure 4.2 illustrates the process of obtaining the k full hyperspheres from m partitions using RSA in a divide-and-conquer fashion. We empirically observed that the adjust step in the final stage only requires the balance property to be satisfied, which defines the k radii, and that the independence property is naturally satisfied without re-adjustment. Table 4.2 shows the number of iterations for RSA at the lower levels and the total number of iterations needed to meet the convergence criteria for the ILSVRC2010 Validation dataset containing 50K images.

k        #Levels   #Iter(L)   #Iter(T)   avg     std-dev
6400     4         235        242        34.47   43.84
12800    4         239        241        34.35   43.86
25600    4         238        238        33.23   43.7

Table 4.2: Number of iterations at the lower level (Iter(L)) and over all levels (Iter(T)), along with the convergence criteria (avg(|oij − n/4|) and std-dev(oij)), for learning k = {6400, 12800, 25600} hashing functions from 10K training samples of the ILSVRC2010 Validation dataset (50K images, m = 10). Allowing 10% error on avg and 15% tolerance on std-dev, our method meets the convergence criteria for all code sizes in a few iterations. We observe that only an additional 7, 2 and 0 iterations are needed to meet convergence at the higher levels (given ∼20 iterations for each component at the lower level).

4.3.5 Computation time

The computational complexity of our algorithm is O((κ^4 m + log(k/s) k d) n) t, where t is the total sum of the iterations at all levels. The complexity is polynomial in terms of the number of hyperspheres learned at the lowest level, as we typically observe that t is very low. Using our method, we reduce the computational burden of computing the force matrix for large numbers of bits. However, we still need to compute the distances to the k hyperspheres at each iteration. This is only computed on the subsets for learning the m submodels and is relatively fast.
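To make the Random-Select and Adjust procedure of Sections 4.3.3–4.3.4 concrete, the sketch below merges two components by selecting candidate pivots from the Cartesian product of their sub-pivots and then adjusting only the radii to satisfy the balance property (as noted above, the independence re-adjustment is largely unnecessary at the higher levels). It is a simplified illustration under these assumptions, not the exact procedure used in our experiments.

```python
import numpy as np

def rsa_merge(train12, C1, C2, rng):
    """Merge two components with a simplified Random-Select and Adjust step.

    train12 : (n, 2s) training data of the combined set X12
    C1, C2  : (kappa, s) sub-hypersphere centers of the two components
    Returns 2*kappa full pivots in R^{2s} and their radii.
    """
    kappa = C1.shape[0]
    # Random-Select: 2*kappa candidate pivots from the Cartesian product C1 x C2
    i = rng.integers(0, kappa, size=2 * kappa)
    j = rng.integers(0, kappa, size=2 * kappa)
    pivots = np.hstack([C1[i], C2[j]])                      # (2*kappa, 2s)

    # Adjust (balance only): choose each radius as the median distance so that
    # every hypersphere contains half of the training points.
    dists = np.linalg.norm(train12[:, None, :] - pivots[None, :, :], axis=2)
    radii = np.median(dists, axis=0)                        # (2*kappa,)
    return pivots, radii

rng = np.random.default_rng(0)
X12 = rng.standard_normal((1000, 16))                       # toy combined set, s = 8
C1 = rng.standard_normal((4, 8))
C2 = rng.standard_normal((4, 8))
pivots, radii = rsa_merge(X12, C1, C2, rng)                 # 8 pivots in R^16
```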
We notice from Table 4.2 that no additional iterations are required to satisfy the hashing properties when obtaining full hyperspheres from sub-hyperspheres for learning 25600 bits from 25600 VLAD vectors. This emphasizes that our approach succeeded in obtaining hyperspheres that naturally satisfy the independence properties in the full space by learning only sub-hyperspheres in the subspaces. We further observe that excluding the independence step at each conquer step would only require us to compute a single balancing step at the root level for learning the k pivots in the full space, decreasing the computation and storage cost drastically. The computational complexity of computing only the single balance step for the full space is O((κ^2 m + κd) n t_low + kdn), where t_low is the average number of iterations for computing the sub-hypersphere hashing functions at the lowest level. Therefore, we obtain a feasible solution to train the SpH-RSA model with complexity O(κ^2 m n) t_low. The storage cost for such a model with k full hyperspheres is O(dκ + k), where the first term dκ refers to storing the mκ sub-hyperspheres and the second term k corresponds to the k thresholds of the hashing functions. We use this model in our experiments.

4.4 Experiments

We evaluate our method on the following datasets: the ILSVRC2010 Validation and Train datasets, which contain 50,000 and 1.2 million images, and the Holidays dataset with 1M Flickr distractors, containing 1491 and 1 million images respectively. We download these publicly available datasets [44] and [78], which already contain the computed dense SIFT descriptors.

Figure 4.4: Recall plots showing the performance of Ours (red) and several state-of-the-art methods on the Holidays+Flickr 1M dataset for ground truth defined at 10 NNs, for code sizes of 800 to 12800 bits. The top row shows the performance of the methods using Asymmetric Distance and the bottom row using Symmetric Distance. We see that our method consistently performs better for 1600 and 3200 bits at all numbers of retrieved images in both distance settings.

We use VLAD descriptors [65] for applying our compression techniques. We power normalize (α = 0.5) the VLAD vectors and L2-normalize them to obtain the best performance for retrieval [68].
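A small sketch of this preprocessing (power normalization with α = 0.5 followed by L2 normalization), assuming the VLAD descriptors are already computed and stored row-wise in a NumPy array; the function name is ours.

```python
import numpy as np

def normalize_vlad(V, alpha=0.5):
    """Power-normalize (signed power alpha) and L2-normalize VLAD vectors.

    V : (n, d) matrix of raw VLAD descriptors, one per row.
    """
    V = np.sign(V) * np.abs(V) ** alpha              # power normalization
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)              # L2 normalization

vlad = np.random.default_rng(0).standard_normal((4, 12800))  # toy stand-in descriptors
vlad = normalize_vlad(vlad)                                   # rows now have unit norm
```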
We do not need to mean-center the vectors, as the VLAD vectors naturally have a mean of zero.

4.4.1 Datasets and Evaluation Protocol

We apply our method to the ILSVRC2010 Validation and Train datasets, where each image is represented by a VLAD vector of size 128000 and 25600, with a codebook of size 1000 and 200, respectively. For both datasets, we randomly select 500K SIFT descriptors to learn a codebook. We partition the data in two parts: a query partition containing 500 queries and a non-query partition containing the remaining images. We use the non-query partition for retrieval and training purposes. From the non-query partition, we select 10K and 20K training samples to learn the full spherical hashing functions for the Validation and Train datasets respectively.

We also evaluate our method on the Holidays dataset [78] containing 1491 images with 500 predefined queries. We combine the remaining 991 images with 1M Flickr distractor images for retrieval and training (20K training samples). We use the visual dictionaries learned from the independent Flickr60K images for these experiments. Specifically, we compute a VLAD vector of size 12800 (128 x 100) using a codebook of size 100.

We evaluate the performance of our method and the following state-of-the-art methods by calculating recall for up to 100 retrieved images for the top-10 nearest neighbors. For each query, we obtain the ground truth based on the Euclidean distance of their VLAD vectors.

4.4.2 Comparison methods

We compare our method to the following state-of-the-art approaches: BPBC (Random), BPBC (Learned) [73], Product Quantization (PQ) [70] and Spherical Hashing (SpH) [72] for full vectors. CBE [74], a recent state-of-the-art approach, also compresses high dimensional vectors to a large number of bits with a low storage cost for the projection matrix. However, it mimics the performance of BPBC for both high and low numbers of bits, and our intention is to boost the performance for low numbers of bits using compact hyperspherical hashing functions, so we do not include this method in our comparison. In our experiments, we compress the VLAD vectors to k_ILSVRC2010Val = {128000, 64000, 32000, 16000, 8000} bits for ILSVRC2010 Validation, k_ILSVRC2010Train = {25600, 12800, 6400, 3200, 1600} bits for ILSVRC2010 Train and k_Hol+Flickr1M = {12800, 6400, 3200, 1600, 800} bits for the Holidays+Flickr1M dataset.

1. BPBC: For obtaining a compression of k bits, both the learned (BPBC(L)) and random (BPBC(R)) versions learn two bilinear projection matrices R1 of size c x k1 and R2 of size 128 x k2, where k2 = k/k1 and c is the codebook size. To compress a d-dimensional vector to {d, d/2, d/4, d/8, d/16} bits, we chose k1 = {c, 4c/5, c/2, 2c/5, c/4} to ensure the method's best performance, as suggested in [73].

2. PQ: We obtain binary codes by applying Product Quantization (PQ) to both the original and rotated vectors (BR+PQ). We chose to rotate the vectors as per [70] to balance the bias of the vectors. Since rotating the full vectors is an expensive operation in terms of both storage and computing the rotated vectors, we bilinearly rotate the vectors as suggested in [73]. We then apply the PQ methods by choosing the sub-vector size as s = 8 and learning m_c clusters for each sub-vector as in [65]. For all the datasets, we learn codebooks of size m_c = {256, 128, 64, 32, 16} from each subset. For PQ and BR+PQ, we use only the Asymmetric Distance (ASD, the distance between the query and codebook vectors) to obtain the neighbors, as this has been shown to always outperform the Symmetric Distance (SD).
3. SpH: We also learn full spherical hashing functions directly on the full VLAD vectors. Due to the limitation of training a large number of hash functions, we only learn the spherical hashing functions for the two smallest code sizes, 1600 and 800 bits, on the Holidays+Flickr1M dataset. To be consistent, we use the same training set that we used to learn the sub-hypersphere hashing functions.

4.4.3 Results

Parameters for SpH-RSA: The following settings are common to all the datasets. We chose s = 256 and κ = 1280, as suggested by our experiments in Figure 4.1. To select the hypersphere hashing functions for lower numbers of bits, we chose the hashing functions that best satisfy the hashing properties. The results reported are averaged over 5 runs to account for our random selection scheme. Typically, we observed small standard deviation values of 0.0253, 0.048 and 0.0361 for the ILSVRC2010 Validation, Train and Holidays datasets for learning k bits.

Figure 4.5: Recall plots showing the performance of Ours (red) and the BPBC and PQ based methods on the ILSVRC2010 Train dataset for ground truth defined at 10 NNs, for code sizes of 1600 to 25600 bits. The top row shows the performance of the methods using Asymmetric Distance and the bottom row using Symmetric Distance. We see that our method consistently performs better for 3200 and 6400 bits at all numbers of retrieved images in both distance settings.

Symmetric/Asymmetric Distance: As the PQ based methods (PQ and BR+PQ) perform better using the Asymmetric Distance, we report results for both the Symmetric and Asymmetric Distance. The Symmetric Distance (SD) is the traditional Hamming distance between the query bit code and the retrieved image bit code, and can be computed efficiently with an XOR operation followed by a population count (POPCOUNT). The Asymmetric Distance (ASD) is obtained by computing the distance between the projected query vector and the bit code. For SpH and our method, the projected vector of a query q, q_p = [q_p1, q_p2, ..., q_pk], is obtained as q_pl = t_l - dist(c_l, q), where dist(c_l, q) is the Euclidean distance from the query q to the center c_l of hypersphere l with threshold t_l, l = 1..k. For BPBC, the projected vector is obtained as R1^T q R2.
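As a small illustration of the symmetric Hamming distance and the projected query vector just described (a sketch with our own variable names; the byte-level popcount table is our implementation choice):

```python
import numpy as np

_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack_bits(codes):
    """Pack an (n, k) array of 0/1 bits into bytes for fast Hamming distance."""
    return np.packbits(codes.astype(np.uint8), axis=1)

def symmetric_distance(packed_query, packed_db):
    """Hamming distance via XOR + popcount between one packed query code and a packed database."""
    xor = np.bitwise_xor(packed_query[None, :], packed_db)      # (n, k/8) bytes
    return _POPCOUNT[xor].sum(axis=1)                           # per-row popcount

def project_query(q, pivots, radii):
    """Projected query for spherical hashing / SpH-RSA: q_p[l] = t_l - ||c_l - q||."""
    return radii - np.linalg.norm(pivots - q[None, :], axis=1)

# toy usage: 4 database codes of 16 bits and one query code
rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(4, 16))
q_code = rng.integers(0, 2, size=(1, 16))
d_sym = symmetric_distance(pack_bits(q_code)[0], pack_bits(db_codes))  # (4,) Hamming distances
```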
The ASD is then given as dist(q_p, b) = ||q_p||^2 + ||b||^2 - 2 q_p b^T, which requires only the computation of the dot product between the projected query vector q_p and the binary code b, as the norms of the query and of the binary codes are constant.

ILSVRC2010 Validation: Figure 4.3 shows the performance of our method for compressing the 128000-dimensional VLAD vectors to k_ILSVRC2010Val bits. We observe that our method (SpH-RSA), using both the Symmetric and Asymmetric distance, outperforms BPBC (Random and Learned versions) for 8000, 16000 and 32000 bits, and performs very similarly to the learned version of BPBC for 64000 and 128000 bits at recall for the top 100 retrieved images. Our results emphasize that the compact hyperspherical hashing functions we learn significantly improve the performance at lower numbers of bits compared to traditional hyperplane hashing functions (about 10% for 8000 bits). For this dataset, we also calculated the Spherical Hamming Distance (SHD) [72], obtained as the ratio of the Hamming distance to the number of common +1 bits. Contrary to the results in the Spherical Hashing method [72], SHD does not provide a better distance measure in our case and degrades the performance. This may be because all the L2-normalized VLAD vectors lie on the surface of a unit sphere of radius 1, and the ratio of the average maximum distance of the nearest neighbors to that of distant neighbors is close to 1.

Holidays+Flickr 1M: Figure 4.4 shows the performance of our method for compressing the 12800-dimensional VLAD vectors to k_Hol+Flickr1M bits. Our method (SpH-RSA) performs very similarly to the BPBC (Learned) method for all bits, and outperforms PQ and BR+PQ except for 800 bits. Surprisingly, as the size of the codebook increases, the performance of PQ and BR+PQ decreases. This might be because learning too many clusters forms different clusters with similar properties, causing an over-segmentation problem. Gong et al. [73] reported a similar trend for PQ and BR+PQ in their results. We achieve a 2% and 5% improvement over BPBC(L) for SD and ASD when compressing to 800 bits (for both recall@10 and recall@100). We have also compared our method with the Spherical Hashing method (SpH) learned from the full VLAD vectors, and we observed a low recall for these code sizes (1600 and 800 bits) using SpH.

ILSVRC2010 Train: Figure 4.5 shows the performance of our method for compressing the 25600-dimensional VLAD vectors to k_ILSVRC2010Train bits. We observe that our method (SpH-RSA) significantly outperforms the BPBC (Learned and Random) methods for 1600, 3200 and 6400 bits for both the Symmetric and Asymmetric Distance based results. Similar performance to the BPBC (Learned) method is obtained at recall@100 for 12800 and 25600 bits, and ours consistently improves over the random Bilinear Projections method in the two distance settings. Comparing the performance to the PQ based methods (BR+PQ and PQ), our encoding scheme obtains higher recall for all code sizes except 1600 bits. Notice that the performance of the PQ methods does not increase with an increase in the number of codebooks learned, as also seen in [73]. This emphasizes that the parameters of PQ (sub-vector size and codebook size) need to be carefully selected to obtain optimal performance, while our method is robust to these choices. Finally, we report approximately 4% and 14% improvement over the learned BPBC scheme using the Symmetric and Asymmetric Distance respectively (recall@100 for 1600 bits).

Training times: By partitioning the problem, we have proposed a practical solution for learning hypersphere hashing functions for high dimensional descriptors.
Our approach takes approximately 60 and 40 minutes for training 128000 and 25600 hyperspheres on the training sets of the ILSVRC2010 Validation and Train datasets, respectively. Training times are calculated for learning the mκ sub-hyperspheres, m = {200, 100}, on a single machine with 16GB RAM and an Intel Xeon Quad CPU @ 2.66 GHz.

4.5 Conclusion

We have proposed an efficient method for learning hypersphere hashing functions for high dimensional data. Our method employs a hierarchical procedure that learns hash functions for partitions of the data and utilizes the RSA method to obtain the hashing functions for the full space. We have shown a significant improvement of 10% and 14% over the BPBC (Learned) method for a compression ratio of 512 on the ILSVRC2010 Validation and Train datasets. We also report increases in performance over the state-of-the-art methods for compression ratios of 256 and 128, and demonstrate similar performance for compression ratios of 64 and 32. (This work is supported by NSF EAGER grant IIS1359900, Scalable Video Retrieval.)

Chapter 5: Zero-shot Event Detection using Multi-modal Fusion of Weakly Supervised Concepts

5.1 Overview

Popular websites such as YouTube, Google Images, and Flickr contain large volumes of image and video data from a multitude of consumer devices such as digital and cellphone cameras. Technologies that can rapidly analyze such content and detect salient concepts and events have several compelling applications. Significant progress has been made in developing such technologies, and the core of most state-of-the-art methods is based on the bag-of-words model [79]. Here, we first extract low-level features that capture salient gradient [80, 81], color [82], or motion [83, 84] patterns, project them to a pre-trained codebook in the same feature space, and then aggregate the projections to get the final image or video level feature vector. Classifiers, typically kernel support vector machines (SVM), are then trained using labeled data. This approach requires a large number of training examples for each class of interest, and performance decreases rapidly as the training set size decreases.

In this work, we study the problem of video classification using only a textual description of the events of interest, without exemplar videos pertaining to the events. This zero-shot framework, where we perform video classification with zero training samples, goes beyond traditional zero-shot problems such as the one described in [85], where an existing set of classes with training data is adapted to an unseen class. We pose this difficult problem of video classification as a retrieval task, where an event is described as a query defined by a set of concepts, e.g. the event "driving a car" described by the set of concepts "drive, car, road, person, face." We aim to retrieve videos that are most similar to the query, where the similarity score is treated as the confidence of the video belonging to that event.

Our approach to zero-shot learning is to first transform both video and query text to a high-dimensional concept space before computing similarity in that space. For the query, we apply text processing techniques to obtain a vector of salient words and phrases describing the event. For the video, we apply a bank of concept detectors to obtain a textual representation of the video using a vector of detection scores. Since we have no prior knowledge of the events of interest, we need a very large set of generic concept detectors in order to provide semantic coverage of all possible queries.
To address this challenge, we utilize multiple concept detectors from different modalities: visual features, including the video concepts and multiple-query fusion [86] of multiple features described in this work, in addition to off-the-shelf detectors such as Classemes [4], ObjectBank [5] and SUN attributes [43]; audio information from concepts learned on low-level Mel-frequency Cepstrum Coefficient (MFCC) features; and text from video text and speech transcriptions.

Once we represent both query and videos as vectors of concept scores, we can compute similarities to retrieve relevant videos. A key challenge here is the mismatch between query and video concept vocabularies. We utilize a text expansion based method to project query and video concept vectors to a common high-dimensional concept space where they are compared, using the large text corpus Gigaword [87] to learn this projection. Finally, we fuse retrievals from each of the features and modalities using a simple linear combination to exploit the complementary nature of the different modalities and concept vocabularies.

The work is organized as follows: in Section 5.2, we discuss related approaches to similar problems. In Section 5.3, we present an overview of our zero-shot learning framework. Section 5.4 describes the features we extract from video and Section 5.5 outlines the combination of these features. We report experimental results in Section 5.6, and discuss our conclusions in Section 5.7.

5.2 Related Work

Extensive research has been performed in recent years on effective representation and classification of images and videos. The first step in most techniques is to extract low-level features from local spatial or spatio-temporal patches. Popular features include grayscale appearance features such as SIFT [80] and SURF [81], color features such as Color SIFT [82], and motion features such as STIP [83] and dense trajectories [84]. These typically extract thousands to millions of feature vectors per image or video. They are aggregated to a single fixed dimensional representation by a sequence of coding and pooling steps. Possible coding techniques include Hard Quantization [79], Soft Quantization [88], Sparse Coding [89] and Fisher Vectors [90], using a codebook trained in an unsupervised manner from a large set of feature vectors. The coded features are then aggregated, typically using average or max pooling, and classified, typically using support vector machines (SVM).

While this approach has shown strong results given a large training set, performance degrades rapidly as the amount of training data decreases, and the method does not generalize to previously unseen events. Only limited attention has been paid to this challenging problem, and most existing approaches introduce an intermediate layer of semantic concepts, which are then used to describe novel classes. Semantic output codes (SOC) are proposed in [85] to extrapolate novel classes by utilizing a knowledge base of semantic properties of known classes. A large scale ontology is used in [91] to learn visual relationships between objects, while [92] uses knowledge transfer between object classes. An online incremental attribute based zero-shot learning approach is presented in [93], while a max-margin formulation is proposed in [94] for zero-shot multi-label classification where the label correlations on the training set differ significantly from the test set.
A constrained optimization formulation that combines regression and knowledge transfer based functions has recently been proposed in [95]. All of these techniques rely on extrapolating from an existing set of classes and training data.

The more difficult task of performing video retrieval and classification with no prior event knowledge or training data has been addressed only recently. In contrast to [96], we introduce several ways to generate a large visual and audio concept lexicon without prior knowledge of the event classes, and present a simple unified framework for effectively combining visual, audio, and textual information. While we are not able to benchmark our method against [96], since we do not have access to their concept lexicon or data partitions, our results in the TRECVID evaluation (Section 5.6.6) compare favorably to systems using similar approaches.

Video retrieval using semantic similarity has previously been explored in [97, 98]. However, these approaches focus on highly structured broadcast data, where a small 374-concept pool [97] can be adequate. In contrast, we focus on more challenging unconstrained web data, where leveraging multiple modalities and larger concept banks is important to build a robust system. While [97, 98] both use a predefined concept ontology, we demonstrate the benefit of training in-domain detectors in a data driven manner by discovering concepts from free form text descriptions.

There has also been an increasing interest in joint modeling of text and visual features [99], which can then potentially be used to generate a text description of query images [100–103] and videos [104, 105]. A large scale study of the relationship between semantic similarity of classes and confusion between them is presented in [106]. In [107], a large text corpus is used to learn a semantic space using word distributions and a separate model is trained for seen and unseen classes. However, given the training data limitations in our problem, we constrain our focus to attribute mappings produced using off-the-shelf features [4, 5, 43], novel concept banks developed with video-caption pairs similar to [103], and speech and video text output.
Figure 5.1: Overview of the proposed multi-modal zero-shot learning approach.

5.3 Zero-shot Learning Framework

Figure 5.1 displays an overview of our multi-modal zero-shot learning approach, which involves applying C different concept banks to each video v. Let L = {c_l1, . . . , c_lK} be a lexicon defined by K concepts c_lk for concept bank l ∈ [1, . . . , C]. Each concept bank provides a K-dimensional vector of detection scores d_v = [d_l1 . . . d_lK]^T for each video v = 1 . . . V, which is ℓ2-normalized, i.e., ||d_v||_2 = 1. Given a query Q = {c_q1, . . . , c_qN} defined by N concepts c_qn, we aim to retrieve videos that are similar to the query.

5.3.1 Basic Similarity Computation

We first present a direct model to measure video-query similarity. In this model, we compute the similarity score S_Q(v) between a query Q and a video v as a sum of the concept scores of the lexicon that match the query concepts:

S_Q(v) = \frac{1}{K} \sum_{k=1}^{K} d_{l_k} \mathbf{1}_Q(c_{l_k})    (5.1)

where 1_Q(c_lk) is an indicator function of the presence of concept c_lk in query Q. This baseline system is very precise when the concept detectors are reliable. We expect the system to perform well when there is a large overlap between the query and video concepts, but lexicon coverage of the query will limit recall, while noise in the video concept detections will degrade precision.

5.3.2 Expansion-based Similarity Computation

To address the issue of vocabulary mismatch between query and video, we use an alternative model to measure video-query similarity. In this model, concepts are expanded and projected to a common global concept space defined by the lexicon L. The goal is to propagate existing confidence scores to semantically similar concepts, using the knowledge from a text corpus (such as Gigaword) to estimate similarity. Let G : (c1, c2) → s be a text model that measures the similarity s ∈ [0, 1] between two concepts c1 and c2. Let an item I in the database be represented by a set of triplets describing the concept name, its confidence score, and its index in the lexicon L. The expansion-based projection method is given in Algorithm 2. This algorithm obtains the projected vector f of an item I in two steps for each concept. The first step finds the top T similar concepts using the model G. The second step boosts the scores of the similar concepts of an item by the amount of similarity between the concepts. The final feature vector f is then normalized for comparison purposes.

Algorithm 2 Expansion-based projection.
Given an item I = {(c1, s1, i1), . . . , (cN, sN, iN)}.
Let f ∈ R^K be the projected feature vector of item I for L.
Initialization: f_k = 0 for k = 1 . . . K.
for each (c, s, i) in I do
    f_i ← f_i + s
    Find the top T similar concepts of c in G, given as I_T = {(c1, s1, i1), . . . , (cT, sT, iT)}, where s_t = G(c, c_t).
    Update the feature for the similar concepts:
    for each (c_t, s_t, i_t) in I_T do
        f_{i_t} ← f_{i_t} + s · s_t
    end for
end for
Normalize the feature vector: ||f||_2 = 1.

Algorithm 2 is applied to expand both the query and database concepts to a common lexicon space. Query concept confidences are set to 1, while database concept confidences are given by the output of the concept detectors.
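A minimal sketch of this expansion-based projection, assuming a similarity function G(c1, c2) is available (for example, cosine similarity of word vectors trained on Gigaword); the function and variable names, the toy string-overlap similarity and the toy lexicon below are our own:

```python
import numpy as np

def expand_projection(item, lexicon_index, G, K, top_T=5):
    """Expansion-based projection of an item onto the K-concept lexicon.

    item          : list of (concept, score) pairs for a query or a video
    lexicon_index : dict mapping concept name -> index in the lexicon
    G             : callable G(c1, c2) -> similarity in [0, 1]
    """
    f = np.zeros(K)
    concepts = list(lexicon_index.keys())
    for c, s in item:
        if c in lexicon_index:
            f[lexicon_index[c]] += s
        # boost the top-T most similar lexicon concepts by s * similarity
        sims = sorted(((G(c, c2), c2) for c2 in concepts), reverse=True)[:top_T]
        for s_t, c_t in sims:
            f[lexicon_index[c_t]] += s * s_t
    norm = np.linalg.norm(f)
    return f / norm if norm > 0 else f

# toy usage with a trivial string-overlap similarity standing in for G
lexicon = {"car": 0, "road": 1, "drive": 2, "person": 3}
G = lambda a, b: 1.0 if a == b else (0.5 if a[0] == b[0] else 0.0)
f_query = expand_projection([("car", 1.0), ("drive", 1.0)], lexicon, G, K=4, top_T=2)
f_video = expand_projection([("road", 0.7), ("person", 0.2)], lexicon, G, K=4, top_T=2)
similarity = float(f_query @ f_video)   # inner product of the expanded vectors, used as the video-query similarity
```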
Once the expanded feature vectors f_Q ∈ R^K representing the query Q and f_v ∈ R^K representing the video v have been obtained, the similarity between the query Q and the video v is computed as

S_Q(v) = f_Q^T f_v.    (5.2)

Note that other similarity measures may also be considered (e.g., Laplacian or RBF kernels), although in our experiments we find that (5.2) has the best performance.

5.4 Video Feature Extraction

Since existing concept banks are generally trained on out-of-domain data and may not contain a large enough vocabulary to cover possible queries, we propose multiple methods to rapidly learn new concept detectors with easily collected data from readily available in-domain and web sources.

5.4.1 Weakly Supervised Concepts (WSC)

We train a set of WSCs for concept detection in videos using the following steps:

5.4.1.1 Data Collection and Concept Discovery

We collect a set of videos with free-form text descriptions of their content. Such data is widely available online on websites such as YouTube and also in the research set of the considered TRECVID MED dataset. We apply standard natural language processing (NLP) techniques to clean up the annotations, including removal of common stop words and stemming to normalize word inflections. The remaining vocabulary is taken as our concept dictionary.

5.4.1.2 Low-level Feature Extraction

For each video in the collected corpus, we extract the following set of low-level visual and audio features:

D-SIFT [89]: This is a dense version of SIFT where, instead of detecting interest points, the 128-dimensional feature vectors are extracted at uniformly-sampled locations covering the whole image. D-SIFT typically generates 3x the number of points produced by SIFT [80] and has been shown to outperform SIFT for image classification [89].

Dense Trajectories (DT) [84]: This feature represents the video using dense optical flow trajectories. Histograms of oriented gradients (HoG) and motion boundary histograms (MBH) are extracted from the local spatio-temporal neighborhood of each track to capture salient appearance and motion patterns respectively.

MFCC [108]: These popular audio features are extracted from overlapping 29 ms frames at a rate of 100 frames per second. From each frame, we compute 14 mel-frequency warped cepstral coefficients. The resulting 45-dimensional feature vector captures the short-time spectral structure of the audio stream.

For each of the above low-level features, we first apply principal component analysis (PCA) to reduce the dimensionality and whiten the feature vectors. For each video, we then obtain a set X = {x_t ∈ R^D, t = 1 . . . T} of T low-level, low-dimensional feature descriptors. We assume that these features are distributed according to a Gaussian mixture model (GMM) with diagonal covariance matrices:

p(x_t | Λ) = \sum_{k=1}^{K} w_k N(x_t; μ_k, σ_k^2),  for t = 1 . . . T.    (5.3)

The GMM parameters Λ = {w_k ∈ [0, 1], μ_k ∈ R^D, σ_k ∈ R^D, k = 1 . . . K} are learned on a training set through maximum likelihood estimation. We then consider the Fisher vector encoding as proposed in [109] and represent each video by the normalized gradients of the GMM log-likelihood, G^X_{μk} ∈ R^D and G^X_{σk} ∈ R^D, with respect to the Gaussian mean μ_k and standard deviation parameters σ_k, respectively. For
k = 1 . . . K, these D-dimensional normalized gradients are defined as

G^X_{μk} = \frac{1}{T\sqrt{w_k}} \sum_{t=1}^{T} γ_k(x_t | Λ) \left( \frac{x_t - μ_k}{σ_k} \right)    (5.4)

G^X_{σk} = \frac{1}{T\sqrt{2 w_k}} \sum_{t=1}^{T} γ_k(x_t | Λ) \left[ \frac{(x_t - μ_k)^2}{σ_k^2} - 1 \right]    (5.5)

where the posterior probability γ_k(x_t | Λ) = w_k N(x_t; μ_k, Σ_k) / \sum_{l=1}^{K} w_l N(x_t; μ_l, Σ_l) is the soft assignment of the feature descriptor x_t to the k-th Gaussian cluster. The final Fisher vector is the concatenation of the K D-dimensional normalized gradients G^X_{μk} and G^X_{σk}, and is thus of dimension 2KD. Vector multiplications and divisions are element-wise operations here.

5.4.1.3 Classifier Training

For each concept identified in Section 5.4.1.1, we collect all videos for which that concept occurs in the text caption and utilize them as our positive training set, with the remaining videos considered as negatives. We then train RBF kernel-based support vector machine (SVM) classifiers using the Fisher vectors representing the videos. We train a set of concept detectors for each of the low-level features (D-SIFT, DT, MFCC) described in Section 5.4.1.2.

5.4.1.4 Weakly Supervised Concept Feature

Given a video, we produce a compact representation by concatenating the detection scores of our concept detectors. We use this feature vector for event detection and refer to this representation as WSC, for weakly supervised concepts.

5.4.2 Concept Training using Web Data

In addition to the concept detectors trained using the research set described in Section 5.4.1.1, we also train detectors using data downloaded from the web. For each concept identified in Section 5.4.1.1, we downloaded the top 100 retrievals from Google Images and the thumbnails of the top 50 retrievals from YouTube. We then train WSCs with this data using the same approach as described in Section 5.4.1. We call the WSCs trained using the TRECVID research set, Google images and YouTube thumbnails WSC_TRECVID, WSC_Google and WSC_YouTube, respectively.

5.4.3 Concept Distance Features

We also introduce a novel concept distance (CD) based feature. Let C denote the set of concepts identified from the text annotations in Section 5.4.1.1. For each concept c ∈ C, let V_c denote the set of videos in the research set containing the concept. Let x_i denote the low-level feature based vector extracted for video i. Then, we compute the feature vector y_c for the concept c as:

y_c = \frac{1}{|V_c|} \sum_{i ∈ V_c} x_i.    (5.6)

Given a new video v and its low-level feature vector x_v, we obtain the CD feature vector by computing the distance to each y_c in (5.6) and concatenating:

CD_v = [ ||x_v - y_1||_2 . . . ||x_v - y_{|C|}||_2 ]^T.    (5.7)

In our experiments, we use the D-SIFT, DT and MFCC low-level feature vectors. The proposed feature vector builds on the multiple-queries (MQ) [86] and query expansion [110] based techniques proposed previously. While these approaches identify relevant videos at query time and use the retrievals to expand the concept set or training set, we use a static set of concept vectors y_c and compute distances to these vectors at query time.

5.4.4 Off-the-shelf Concept Detectors

We also test three off-the-shelf concept detectors that have been used in recent literature:

Classemes [4]: This is a bank of concept detectors trained on images. The concepts were chosen using a large ontology of visual concepts. Given an image or a video frame, the application of all these detectors yields a 2,659-dimensional vector of detection scores.
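As a concrete illustration of the concept distance (CD) feature of Section 5.4.3 (Eqs. (5.6)–(5.7)), the sketch below builds the per-concept mean vectors from captioned training videos and computes the CD descriptor for a new video; the helper names and the toy data are ours.

```python
import numpy as np

def concept_means(features, captions, concepts):
    """Eq. (5.6): mean low-level feature vector over the videos containing each concept.

    features : (n_videos, D) array of per-video feature vectors
    captions : list of sets of concept words, one per video
    concepts : ordered list of concept names defining the CD dimensions
    """
    means = []
    for c in concepts:
        idx = [i for i, words in enumerate(captions) if c in words]
        means.append(features[idx].mean(axis=0) if idx else np.zeros(features.shape[1]))
    return np.vstack(means)                       # (|C|, D)

def cd_feature(x_v, means):
    """Eq. (5.7): distances from a new video's feature vector to every concept mean."""
    return np.linalg.norm(means - x_v[None, :], axis=1)

# toy usage
rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 16))
caps = [{"car", "road"}, {"car"}, {"dog"}, {"road"}, {"dog", "person"}, {"person"}]
concepts = ["car", "road", "dog", "person"]
Y = concept_means(feats, caps, concepts)          # (4, 16) concept vectors y_c
cd = cd_feature(rng.standard_normal(16), Y)       # 4-dimensional CD descriptor
```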
ObjectBank [5]: Here, we use a spatial pyramid representation of images and produce detection confidence scores at different scales and spatial pyramid levels for each concept. The concept detectors are trained using linear SVMs, and an image is represented by concatenating the detection scores of the different concepts at the different scales and spatial pyramid levels.

SUN Attributes [43]: The SUN attribute set contains detectors for 102 scene attributes that were specified using crowd-sourced human studies.

We apply each of these concept detectors on a set of frames uniformly sampled from a video and then average the detection scores across the video to get the final video-level feature vector.

5.4.5 Automatic Speech Recognition (ASR)

We use GMM-based speech activity detection (SAD) and a hidden Markov model (HMM) based multi-pass large vocabulary ASR system to obtain the speech content in the video, and encode the hypotheses in the form of word lattices.

We first extract MFCC features from the audio stream. Then, the speech segments are identified using a speech activity detection (SAD) system that employs two GMMs, for speech and non-speech observations respectively. The SAD model incorporates video clips with music content to enrich the non-speech model, in order to handle the heterogeneous audio in consumer video. Given the automatically detected speech segments, we then apply a large-vocabulary ASR system to the speech data to produce a transcript of the spoken content. The system is adapted from an ASR system trained on English Broadcast News, and updated with MED 2011 descriptor files [111], related web text data, and the small set of annotated consumer video data. We evaluated the ASR model on a held-out set of 100 video clips and achieved a Word Error Rate (WER) of 35.8%. The system outputs not only the 1-best transcripts but also word lattices with acoustic and language model scores. After basic processing to remove stop words and normalize word inflections, the word lattice posteriors are used to generate the concept score vectors used in the zero-shot projection system.

5.4.6 Optical Character Recognition (OCR)

Our OCR system recognizes text in bounding boxes from a video text detector using an HMM-based multi-pass large vocabulary OCR system. Similar to our ASR system, word lattices are used to encode alternative hypotheses. We leverage a statistically trained video text detector based on SVMs to estimate video text bounding boxes.

Text candidate regions are first selected using Maximally Stable Extremal Regions (MSER) and filtered using an SVM with rich shape descriptors such as Histograms of Oriented Gradients (HoG), Gabor filters, corners and geometrical features. Candidate regions are then grouped to form word boundaries, and detected words are binarized and filtered before being passed to the HMM-based OCR system for recognition. The OCR system finds the sequence of characters that maximizes the posterior, using glyph models (similar to the acoustic models in ASR), a dictionary and N-gram language models. The word precision and recall of our system measured on a small consumer video dataset are 14.7% and 37% respectively.

Since the video text content presents itself in various forms, such as subtitles, markup titles and in-scene text, it is much more challenging than conventional scanned document OCR. To address these challenges, we consider two versions of OCR: one which utilizes the dictionary and N-gram language model, and one which is character-based.
While the language model corrects character-level transcription errors, it also introduces errors when falsely correcting out-of-vocabulary words. For the word-model OCR output, we generate a concept score vector from the word lattice posteriors in the same way as for ASR. For the character-based model, we estimate word posteriors by smoothing character errors across adjacent video frames to produce a concept score vector. In our experiments we find the character model to be slightly better for video than the word model, as detailed in Section 5.6.

5.5 Fusion

State-of-the-art systems for standard event detection with training data have shown fusion of multiple features and modalities to be crucial for improving performance [112]. Fusion is especially important for the zero-shot problem, due to the sparse occurrence of speech and video text content, as well as the limited vocabulary intersection between a given concept bank and query. While we do not have any training data on which to learn parameters for more sophisticated fusion methods, we find that simple score averaging works well to exploit the complementary information in the various systems. We further see some benefit to manually increasing the weights of the higher-precision ASR and OCR systems in fusion, and use a linearly weighted score combination for all fusion experiments below.

5.6 Experiments

We test our approach on the large collection of consumer web videos from the TRECVID MED 13 [3] dataset. The task is to retrieve videos containing one of 20 diverse high-level multimedia events, each described by a short text document of ∼250 words. The dataset provides a research set that contains ∼12,000 background videos and no exemplars of the events of interest. We use this research set to learn our WSC_TRECVID and CD features. We report results on the designated MEDTest set containing ∼25,000 videos. More details of the events and data partitions may be found in [3].

5.6.1 Comparison of Similarity Computation

Table 5.1 compares the two methods of query-video similarity computation discussed in Section 5.3.1 and Section 5.3.2 for the best feature in each modality. We observe that expansion consistently improves over the simple approach. We observed similar gains from using projection based features in fusion, and thus we use the expansion-based approach in all experiments below.

Feature                   Basic (MAP)   Expanded (MAP)
ASR                       3.27%         3.66%
OCR (character)           4.43%         4.72%
CD (MFCC)                 1.04%         1.04%
WSC (D-SIFT, YouTube)     3.42%         3.48%

Table 5.1: Mean average precision (MAP) comparison between the basic (5.1) and expanded (5.2) query-video similarity computation for our single best ASR, OCR, audio, and visual features.

5.6.2 Comparison of Visual Features

In these experiments, we compare our proposed WSC and CD features to several off-the-shelf detectors. Table 5.2 summarizes our results. Here, WSC (D-SIFT, YouTube) refers to the weakly supervised concept features trained using D-SIFT features extracted from pre-downloaded YouTube thumbnails. Overall, the WSC (D-SIFT, YouTube) feature has the strongest performance, while the off-the-shelf detectors are significantly weaker than our proposed approaches. A possible reason for this is the large domain mismatch between the data used for training them and the video data.
The same issue could explain the weaker performance of the WSC_Google features compared to WSC_TRECVID and WSC_YouTube, due to the domain mismatch between images and videos. Moreover, the CD features, which are significantly faster to extract, have comparable performance to the WSC features, which require training expensive SVMs. Finally, the WSC and CD features computed using DT are stronger than the ones using D-SIFT.

Feature                   MAP      AUC
SUN [43]                  0.48%    0.605
ObjectBank [5]            0.77%    0.592
Classemes [4]             0.84%    0.630
CD (D-SIFT)               1.71%    0.770
CD (DT)                   2.28%    0.779
WSC (D-SIFT, TRECVID)     1.92%    0.735
WSC (DT, TRECVID)         2.76%    0.726
WSC (D-SIFT, Google)      1.21%    0.543
WSC (D-SIFT, YouTube)     3.48%    0.729

Table 5.2: Comparison of mean average precision (MAP) and area under the curve (AUC) for visual features.

5.6.3 Comparison of Audio Features

We compare the performance of our WSC and CD features trained using the audio MFCC features. Table 5.3 summarizes the MAP and AUC results. As observed, both of the audio features are weaker than the visual features.

Feature                   MAP      AUC
WSC (MFCC, TRECVID)       0.76%    0.507
CD (MFCC)                 1.04%    0.604

Table 5.3: Comparison of mean average precision (MAP) and area under the curve (AUC) for audio features.

5.6.4 Comparison of Language Features

Table 5.4 compares the performance of our OCR and ASR systems. All of these systems have a higher MAP than the visual and audio features from Tables 5.2 and 5.3. However, note that the AUCs of many visual features outperform those of the language features. This is because although language content, when present, is a highly accurate source of information, its occurrence is sporadic, leading to low recall.

Feature                   MAP      AUC
ASR                       3.66%    0.583
OCR (word)                4.30%    0.636
OCR (character)           4.72%    0.611

Table 5.4: Comparison of mean average precision (MAP) and area under the curve (AUC) for language features.

5.6.5 Comparison of Fusion Systems

We fused each of the individual systems described above, both within each modality as well as across modalities. Table 5.5 compares the performance of the different fusion systems. Note that within the visual system, we found that the off-the-shelf visual features did not improve the fused system, so we only included our CD and WSC features. While none of the individual visual features is stronger than ASR or OCR, the visual system is the single strongest system after fusion, gaining ∼75% relative improvement over the single best visual system. The combined OCR system also outperforms the individual OCR systems, and the full system that combines all modalities more than doubles the performance of any individual modality as measured by MAP.

Feature                   MAP      AUC
ASR                       3.66%    0.583
OCR                       5.87%    0.642
Audio                     1.04%    0.623
Visual (CD + WSC)         6.12%    0.853
Full                      12.65%   0.733

Table 5.5: Comparison of mean average precision (MAP) and area under the curve (AUC) for fusion systems.

5.6.6 TRECVID Performance

The zero-shot event detection task was introduced as a pilot training condition as part of the TRECVID MED 13 evaluations. Independent evaluations were conducted by NIST on a blind dataset of ∼100,000 videos, both for the same 20 events as in our previous experiments (prespecified), as well as for 10 new events given one week before the evaluation (ad hoc). Our zero-shot system achieved highly competitive scores for both the prespecified and ad hoc conditions, placing among the top three of 9 submissions. In particular, our consistent performance between prespecified and ad hoc events demonstrates the robustness of our event-independent approach and its ability to generalize to new queries.
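A minimal sketch of the linearly weighted late fusion described in Section 5.5, assuming each system has already produced a per-video score vector for a query. The per-system min-max normalization and the specific weights shown are our own illustrative choices, not the settings used in our experiments.

```python
import numpy as np

def fuse_scores(system_scores, weights):
    """Linearly weighted late fusion of per-video retrieval scores.

    system_scores : dict name -> (n_videos,) score array from one system
    weights       : dict name -> scalar weight (e.g., larger for ASR/OCR)
    Scores are min-max normalized per system before combination.
    """
    fused = None
    for name, s in system_scores.items():
        span = s.max() - s.min()
        s_norm = (s - s.min()) / span if span > 0 else np.zeros_like(s)
        contrib = weights.get(name, 1.0) * s_norm
        fused = contrib if fused is None else fused + contrib
    return fused / sum(weights.get(n, 1.0) for n in system_scores)

scores = {
    "visual": np.array([0.2, 0.8, 0.5]),
    "asr":    np.array([0.0, 0.9, 0.1]),
    "ocr":    np.array([0.1, 0.7, 0.0]),
}
weights = {"visual": 1.0, "asr": 2.0, "ocr": 2.0}    # up-weight the high-precision text systems
ranking = np.argsort(-fuse_scores(scores, weights))  # videos sorted by fused confidence
```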
5.7 Discussion and Conclusion

Only limited attention has been devoted to the task of video retrieval using only text queries. We present a systematic evaluation of our zero-shot framework for performing high-level multimedia event detection with no training data, given only text descriptions of the events of interest. Our findings and results on the large TRECVID MED dataset can serve as an initial baseline for this challenging task.

We present a general framework for zero-shot learning that utilizes multiple multi-modal features to map a video to an intermediate semantic attribute space, which is then projected to a high-dimensional concept space using statistics learned on a large text corpus. Similarity between the attributes and a text query is computed in this space, and the scores computed from the different attribute sets are combined to get the final score. We demonstrate the effectiveness of this approach for aligning disjoint vocabularies between the query and the various modalities. We describe two simple but effective methods for rapidly training new concept detectors using in-domain as well as web data in the form of images/videos with associated text descriptions. Detailed experimental results show that our concept detectors significantly outperform off-the-shelf detectors for zero-shot retrieval tasks. Exploiting the complementary nature of speech and video text as well as of the different concept banks, we perform multiple rounds of fusion to produce a final system that is significantly better than any individual feature or modality.

Bibliography

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[2] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[3] Paul Over, George Awad, Martial Michel, Jonathan Fiscus, Greg Sanders, Wessel Kraaij, Alan F. Smeaton, and Georges Quénot. TRECVID 2013 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2013. NIST, USA, 2013.

[4] Lorenzo Torresani, Martin Szummer, and Andrew Fitzgibbon. Efficient object category recognition using classemes. In CVPR, 2010.

[5] Li-Jia Li, Hao Su, Eric Xing, and Li Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.

[6] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. Int. J. Approx. Reasoning, 2009.

[7] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In CVPR, 2012.

[8] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, and David Suter. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, 2014.

[9] Tiezheng Ge, Kaiming He, and Jian Sun. Graph cuts for supervised binary coding. In ECCV, 2014.

[10] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In CVPR, 2015.

[11] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions.

[12] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In NIPS, 2009.

[13] Wei Liu, Jun Wang, and Shih-Fu Chang. Hashing with graphs. In ICML, 2011.

[14] Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. Discrete graph hashing. In NIPS, 2014.

[15] Mohammad Norouzi and David J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[16] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, 2014.
[17] Guosheng Lin, Chunhua Shen, David Suter, and Anton van den Hengel. A general two-step approach to learning-based hashing. In ICCV, 2013.
[18] Mohammad Rastegari, Ali Farhadi, and David A. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[19] Qifan Wang, Bin Shen, Shumiao Wang, Liang Li, and Luo Si. Binary codes embedding for fast image tagging with incomplete labels. In ECCV, 2014.
[20] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
[21] Jonathan Masci, Alexander M. Bronstein, Michael M. Bronstein, Pablo Sprechmann, and Guillermo Sapiro. Sparse similarity-preserving hashing. 2014.
[22] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. Deep hashing for compact binary codes learning. In CVPR, 2015.
[23] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multi-label image retrieval. In CVPR, 2015.
[24] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, 2015.
[25] Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen. Deep learning of binary hash codes for fast image retrieval. In CVPR Workshops, 2015.
[26] Weiran Wang and Miguel Á. Carreira-Perpiñán. The role of dimensionality reduction in classification. In AAAI, 2014.
[27] Liang Sun, Shuiwang Ji, and Jieping Ye. Canonical Correlation Analysis for Multilabel Classification: A Least-Squares Formulation, Extensions, and Analysis. IEEE, 2011.
[28] Yunchao Gong and Svetlana Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
[29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[30] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
[31] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[33] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[34] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE, 2008.
[35] Google search by image. http://www.google.com/insidesearch/features/images/searchbyimage.html.
[36] Tineye: Reverse image search. https://www.tineye.com/.
[37] Brian Kulis and Kristen Grauman. Kernelized locality-sensitive hashing for scalable image search. In IEEE International Conference on Computer Vision (ICCV), 2009.
[38] Kilian Q. Weinberger and Olivier Chapelle. Large margin taxonomy embedding for document categorization. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1737–1744. Curran Associates, Inc., 2009.
[39] J. Sivic and A. Zisserman.
Video Google: A text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1470–1477, October 2003.
[40] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the Fisher vector: Theory and practice. 2013.
[41] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[42] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[43] Genevieve Patterson, Chen Xu, Hang Su, and James Hays. The SUN attribute database: Beyond categories for deeper scene understanding. International Journal of Computer Vision, 108(1-2):59–81, 2014.
[44] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[45] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
[46] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based R-CNNs for fine-grained category detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[47] David R. Hardoon, Sandor Szedmak, Or Szedmak, and John Shawe-Taylor. Canonical correlation analysis; an overview with application to learning methods. Technical report, 2007.
[48] Chun-Nam Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning (ICML), 2009.
[49] Lei Zhang, Yongdong Zhang, Jinhui Tang, Xiaoguang Gu, Jintao Li, and Qi Tian. Topology preserving hashing for similarity search. In Alejandro Jaimes, Nicu Sebe, Nozha Boujemaa, Daniel Gatica-Perez, David A. Shamma, Marcel Worring, and Roger Zimmermann, editors, ACM Multimedia, pages 123–132. ACM, 2013.
[50] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors, NIPS, pages 772–780. Curran Associates, Inc., 2009.
[51] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pages 951–958. IEEE, 2009.
[52] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, December 2005.
[53] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, pages 2764–2770, 2011.
[54] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Attribute-Based Classification with Label-Embedding. In NIPS 2013 Workshop on Output Representation Learning, Lake Tahoe, United States, December 2013. Neural Information Processing Systems (NIPS) Foundation.
[55] Jia Deng, Alexander C. Berg, Kai Li, and Li Fei-Fei. What does classifying more than 10,000 image categories tell us? In Proceedings of the 11th European Conference on Computer Vision: Part V, ECCV'10, pages 71–84, Berlin, Heidelberg, 2010. Springer-Verlag.
[56] Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov.
DeViSE: A deep visual-semantic embedding model. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, NIPS, pages 2121–2129, 2013.
[57] Thomas Mensink, Efstratios Gavves, and Cees G. M. Snoek. COSTA: Co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, June 2014.
[58] Yair Weiss, Rob Fergus, and Antonio Torralba. Multidimensional spectral hashing. In ECCV (5), pages 340–353, 2012.
[59] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, San Francisco, USA, June 2010.
[60] Yurii Nesterov. Introductory lectures on convex optimization: A basic course. Applied Optimization. Kluwer Academic Publishers, Boston, Dordrecht, London, 2004.
[61] Tom Goldstein, Christoph Studer, and Richard G. Baraniuk. FASTA: A generalized implementation of forward-backward splitting. CoRR, abs/1501.04979, 2015.
[62] Roman Rosipal and Nicole Krämer. Overview and recent advances in partial least squares. In Craig Saunders, Marko Grobelnik, Steve Gunn, and John Shawe-Taylor, editors, Subspace, Latent Structure and Feature Selection, volume 3940 of Lecture Notes in Computer Science, chapter 2, pages 34–51. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
[63] Florent Perronnin and Christopher R. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[64] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[65] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3304–3311, June 2010.
[66] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas S. Huang, and Yihong Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367, 2010.
[67] Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, pages 3384–3391, 2010.
[68] Jorge Sánchez and Florent Perronnin. High-dimensional signature compression for large-scale image classification. In CVPR, pages 1665–1672, 2011.
[69] Luca Marchesotti, Claudio Cifarelli, and Gabriela Csurka. A framework for visual saliency detection with applications to image thumbnailing. In ICCV, pages 2232–2239. IEEE, 2009.
[70] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis & Machine Intelligence, 33(1):117–128, January 2011.
[71] Mohammad Norouzi and David Fleet. Cartesian k-means. In IEEE Conference on Computer Vision & Pattern Recognition, 2013.
[72] Youngwoon Lee. Spherical hashing. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, pages 2957–2964, Washington, DC, USA, 2012. IEEE Computer Society.
[73] Yunchao Gong, Sanjiv Kumar, Henry Rowley, and Svetlana Lazebnik. Learning binary codes for high dimensional data using bilinear projections. In IEEE Computer Vision and Pattern Recognition, 2013.
[74] F. X. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant Binary Embedding. ArXiv e-prints, May 2014.
[75] A. Joly and O. Buisson. Random maximum margin hashing.
In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '11, pages 873–880, Washington, DC, USA, 2011. IEEE Computer Society.
[76] Junfeng He, Regunathan Radhakrishnan, Shih-Fu Chang, and Claus Bauer. Compact hashing with joint optimization of search accuracy and time. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), oral session, June 2011.
[77] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, SIGMOD '98, pages 94–105, New York, NY, USA, 1998. ACM.
[78] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the 10th European Conference on Computer Vision: Part I, ECCV '08, pages 304–317, Berlin, Heidelberg, 2008. Springer-Verlag.
[79] Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray. Visual categorization with bags of keypoints. In ECCVW, 2004.
[80] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91–110, 2004.
[81] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. CVIU, 110(3):346–359, 2008.
[82] K. van de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene recognition. 32(9):1582–1596, 2010.
[83] Ivan Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[84] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
[85] Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell. Zero-shot learning with semantic output codes. In NIPS, December 2009.
[86] R. Arandjelović and A. Zisserman. Multiple queries for large scale specific object retrieval. In BMVC, 2012.
[87] David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. English Gigaword third edition. In Linguistic Data Consortium, Philadelphia, 2007.
[88] Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, and Jan-Mark Geusebroek. Visual word ambiguity. IEEE PAMI, 32(7):1271–1283, 2010.
[89] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, 2010.
[90] Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[91] Olga Russakovsky and Li Fei-Fei. Attribute learning in large-scale datasets. In ECCV, Crete, Greece, September 2010.
[92] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
[93] Pichai Kankuekul, Aram Kawewong, Sirinart Tangruamsub, and Osamu Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, pages 3657–3664, 2012.
[94] B. Hariharan, S. V. N. Vishwanathan, and M. Varma. Efficient max-margin multi-label classification with applications to zero-shot learning. Machine Learning Journal, 88(1):127–155, 2012.
[95] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013.
[96] Jeffrey Dalton, James Allan, and Mirajkar Pranav. Zero-shot video retrieval using content and concepts.
In ACM Conference on Information and Knowledge Management, 2013.
[97] Yusuf Aytar, Mubarak Shah, and Jiebo Luo. Utilizing semantic word similarity measures for video retrieval. In CVPR, pages 1–8. IEEE, 2008.
[98] Min Young Jung and Sung Han Park. Semantic similarity based video retrieval. In New Directions in Intelligent Interactive Multimedia Systems and Services-2, pages 381–390. Springer, 2009.
[99] Kobus Barnard, Pinar Duygulu, and David A. Forsyth. Clustering art. In CVPR, pages 434–441, 2001.
[100] Ali Farhadi, Seyyed Mohammad Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15–29, 2010.
[101] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, pages 1601–1608, 2011.
[102] Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, pages 444–454, 2011.
[103] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[104] Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, and Sergio Guadarrama. Generating natural-language video descriptions using text-mined knowledge. pages 541–547, 2013.
[105] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013.
[106] Jia Deng, Alexander C. Berg, Kai Li, and Li Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, pages 71–84, 2010.
[107] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, and Andrew Y. Ng. Zero-shot learning through cross-modal transfer. CoRR, abs/1301.3666, 2013.
[108] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. In IEEE ASSP, volume 28, pages 357–366, 1980.
[109] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, ECCV, volume 6314 of Lecture Notes in Computer Science, pages 143–156. Springer Berlin Heidelberg, 2010.
[110] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
[111] Paul Over, George Awad, Martial Michel, Jonathan Fiscus, Wessel Kraaij, Alan F. Smeaton, and Georges Quénot. TRECVID 2011 – An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2011. NIST, USA, 2011.
[112] Pradeep Natarajan, Shuang Wu, Shiv Naga Prasad Vitaladevuni, Xiaodan Zhuang, Stavros Tsakalidis, Unsang Park, Rohit Prasad, and Premkumar Natarajan. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.